TIL: How to get Redshift to access S3 buckets in a different region
While trying to get a Spark EMR job running, I hit an error on a Spark step that copied data from Redshift to S3.
I’ve seen issues in the past where S3 buckets outside of us-east-1 needed to be targeted with region-specific URLs for REST access (https://bucketname.s3.amazonaws.com vs. https://bucketname.s3.us-west-2.amazonaws.com), but I had not seen anything similar for buckets targeted with s3:// URIs.
This got me looking at the Hadoop file system references, none of which are helpful, because EMR rolls its own proprietary file system (EMRFS) for Hadoop S3 access. So Hadoop’s recommended s3a:// (which is fast and resilient, and can discover a bucket’s region on its own) does not work on EMR. Your only option is s3://, which appears to be region-dumb.
The fix turns out to be simple: you just have to pass the bucket region (e.g. us-west-2) to the Spark step as a separate argument (sketched at the end of this post).
… simple, but annoying, because the same steps worked in a pre-prod environment (in a different region), so it wasn’t immediately apparent what was causing the failure; the actual error was buried in the logs.
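
For the record, here’s roughly what the fix can look like. This is a hedged sketch, not the actual job: the script name, the `--s3-region` flag, bucket names, role ARN, and connection details are all placeholders, and the real step most likely goes through a Spark/Redshift connector rather than firing UNLOAD over psycopg2 directly. The point is just where the extra region argument flows: from the EMR step definition into Redshift’s `UNLOAD ... REGION` clause, which is what lets the cluster write to a bucket outside its own region.

```python
# Sketch only: names, ARNs, and connection details below are placeholders.

import argparse

import boto3     # used on the orchestration side to add the EMR step
import psycopg2  # Redshift speaks the Postgres wire protocol


def add_unload_step(cluster_id, bucket_region):
    """Orchestration side: add a Spark step to a running EMR cluster,
    passing the S3 bucket region through as a plain extra argument."""
    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name": "redshift-unload-to-s3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://my-code-bucket/unload_to_s3.py",
                    "--s3-region", bucket_region,  # e.g. "us-west-2"
                ],
            },
        }],
    )


def run_unload(s3_region):
    """Step side: issue UNLOAD with an explicit REGION so Redshift can
    write to a bucket that is not in the cluster's own region."""
    unload_sql = f"""
        UNLOAD ('SELECT * FROM my_schema.my_table')
        TO 's3://my-other-region-bucket/exports/my_table_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
        REGION '{s3_region}'
    """
    conn = psycopg2.connect(
        host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="not-a-real-password",
    )
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute(unload_sql)
    finally:
        conn.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--s3-region", required=True)  # the extra argument
    args = parser.parse_args()
    run_unload(args.s3_region)
```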