AWS SDK Credential Provider Multi-Region Behaviors

In the course of investigating an issue with a multi-region deployment in AWS EKS, I ran into not just some obscure behavior from the application operating in the second region, but straight-up bugs in the AWS SDK for .NET itself. Investigating further across SDKs in other ecosystems revealed interesting region-configuration behaviors in all of them, both good and bad.

Problem

It started on the first deployment of a standard .NET Core containerized app to SPS Commerce’s internal multi-region compute platform based on AWS EKS. I was simply adding some resiliency to my application that had been running in us-east-1 for a long time to now shift some of its traffic to the same app running in us-east-2. Typically this transition is pretty straightforward with the internal deployment mechanisms and infrastructure provided by SPS. This time it started failing with the following behaviors:

  • The application would fail on start in the second region only (i.e. us-east-2).
  • There was zero hardcoded configuration for AWS regions in the codebase; the app's region was driven by the “AWS_REGION” environment variable specified during configuration in each respective region.
  • Secrets from AWS Secrets Manager are loaded and merged with the startup configuration very early in the pipeline.
  • Subsequent AWS services are used depending on the API request coming in, including SQS and DynamoDB.

The initial symptom of failure during deployment to us-east-2 was the Kubernetes readiness probe failing to successfully ping the API's endpoint at “/healthz”. This was because the container was not starting, and the only log output was “STARTING-HOST”. No further logs were provided. The readiness check would eventually fail entirely, stop the container, and start another with the same result.

Through more verbose logging enablement with Serilog in ASP.NET Core, it became obvious that loading secrets from AWS Secrets Manager early in the startup pipeline was stuck in an endless loop attempting to reach the AWS STS service to grab credentials for the AWS Secrets Manager call (provided via OIDC with WebIdentity in EKS).

Take a look at the more verbose logging output:

AmazonSecretsManagerClient 28|2022-04-18T13:38:26.824Z|INFO|AmazonClientException making request GetSecretValueRequest to https://secretsmanager.us-east-2.amazonaws.com/. Attempting retry 1 of 4.
DefaultConfigurationProvider 29|2022-04-18T13:38:27.226Z|INFO|Resolved DefaultConfigurationMode for RegionEndpoint [us-east-1] to [Legacy].
AmazonSecurityTokenServiceClient 30|2022-04-18T13:38:37.232Z|ERROR|An exception of type IOException was handled in ErrorHandler. --> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   --- End of inner exception stack trace ---
   at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
   at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
AmazonSecurityTokenServiceClient 31|2022-04-18T13:38:37.233Z|INFO|IOException making request AssumeRoleWithWebIdentityRequest to https://sts.us-east-1.amazonaws.com/. Attempting retry 1 of 4.

Notice that:

  • It wants to use us-east-2 for requesting the secret.
  • It then seems to switch back to us-east-1.
  • It attempts to call AWS STS in us-east-1 instead of us-east-2: https://sts.us-east-1.amazonaws.com
  • No further error logging results, since retries are hidden unless you configure the SDK to expose them for debugging.
  • This loop continues endlessly until the container orchestrator eventually gives up.

Of course, I scoured the code looking for possible references or indirect references to hardcoded regions. I did not find any. Shelling into the container in the cluster, I discovered that the AWS CLI worked just fine and respected the AWS_REGION environment variable. This told me the platform and infrastructure were configured correctly.

Consultation with our platform team led me to examine the AWS_STS_REGIONAL_ENDPOINTS environment variable, which seemed directly related to what I was experiencing. This environment variable defines which region and behavior the initial AWS STS authorization request will use, and how that might differ from the AWS_REGION configuration. It basically boils down to “legacy” (which uses a global endpoint) and “regional” (which essentially just uses the AWS_REGION environment variable). I had thought that for some reason my app was attempting to use the global endpoint, which depending on the region might redirect to AWS STS in us-east-1 by default. I confirmed it was set to regional and working properly inside the deployed application.
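For reference, a quick diagnostic run inside the container can confirm what the SDK will actually see. This is a minimal sketch; the variable names are the standard AWS ones, but verify them against whatever your platform actually injects:

```csharp
using System;

// Minimal sketch: dump the region-related environment variables that the
// AWS SDK and CLI consult, to confirm what is actually set in the container.
foreach (var name in new[] { "AWS_REGION", "AWS_DEFAULT_REGION", "AWS_STS_REGIONAL_ENDPOINTS" })
{
    Console.WriteLine($"{name} = {Environment.GetEnvironmentVariable(name) ?? "(not set)"}");
}
```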

Diving into the AWS SDK for .NET

The investigation led me to review the AWS SDK for .NET to identify how the region materialized in the cases above.

I updated my demo app to be a single AWS STS call with no dependency injection and a straight-up hardcoded region for simplicity and clarity.

using (var sts = new AmazonSecurityTokenServiceClient(RegionEndpoint.USEast2))
{
    var result = await sts.GetCallerIdentityAsync(new Amazon.SecurityToken.Model.GetCallerIdentityRequest());
    var userId = result.UserId;
}

It appears that the constructor for service instantiation in the AWS SDK properly passes the region along to the config. However, digging deeper, that region is used for the actual request, but to grab the initial identity credentials (requested anonymously) the SDK creates an AWS STS client in which it re-discovers the region:

var configuredRegion = AWSConfigs.AWSRegion;
var region = string.IsNullOrEmpty(configuredRegion) ? _defaultSTSClientRegion : RegionEndpoint.GetBySystemName(configuredRegion);

In this case, the default STS Client Region is hardcoded as us-east-1:

private static readonly RegionEndpoint _defaultSTSClientRegion = RegionEndpoint.USEast1;

Based on this behavior, we can identify this as the likely reason we are being redirected to us-east-1 for our role assumption. But why didn’t AWSConfigs.AWSRegion return the region value that we configured? Largely because AWSConfigs does not use the region fallback chain or look up the AWS_REGION environment variable; it simply checks existing local configuration files. Additionally, this implies that the AWS_STS_REGIONAL_ENDPOINTS environment variable really only works in the SDK for subsequent calls, after credentials are retrieved anonymously.
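To make the gap concrete, here is a hedged sketch contrasting the two lookups (assuming AWSSDK.Core, where FallbackRegionFactory implements the full region resolution chain, while AWSConfigs.AWSRegion is the narrower lookup used by the credential providers):

```csharp
using Amazon;
using Amazon.Runtime;

// AWSConfigs.AWSRegion only reads local configuration (e.g. app.config or a
// value set in code) -- it returns null in a container configured purely via
// environment variables, which is why the STS client falls back to us-east-1.
var fromAwsConfigs = AWSConfigs.AWSRegion;

// By contrast, the SDK's full fallback chain (environment variables, shared
// profile, instance metadata) does resolve AWS_REGION -- but the anonymous
// credential-retrieval path shown above never consults it.
var fromFallbackChain = FallbackRegionFactory.GetRegionEndpoint();
```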

This behavior exists in the credential sources AssumeRoleWithWebIdentityCredentials, AssumeRoleAWSCredentials, and CognitoAWSCredentials.

Fittingly, my investigation finally led me to an existing GitHub issue (“AWS_STS_REGIONAL_ENDPOINTS as environment variables”) for this exact problem (I wish I had found it sooner). Unfortunately, the issue had been active for almost a year by this point without a formal fix.

AWS SDK for .NET Workaround

Based on the investigation above, we know that we can easily work around this problem by setting the AWSConfigs.AWSRegion value before the AWS SDK initializes, in this case before Secrets Manager requests go out during startup in our startup middleware.

// bootstrapping configuration
var builder = WebApplication.CreateBuilder(args);

// MUST manually set region to force us-east-2 in deployed app
// otherwise falls back to us-east-1 STS in all regions in SDK
AWSConfigs.AWSRegion = Environment.GetEnvironmentVariable("AWS_REGION");

This ensures that the initial role-assumption AWS STS calls use the same value provided in the AWS_REGION environment variable. In other words, AWS STS calls will mimic the “regional” setting of AWS_STS_REGIONAL_ENDPOINTS, which is the recommended non-legacy approach anyhow, and exactly what we need here.
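If you want the override to be a no-op where AWS_REGION is not set (for example, local development relying on a shared profile or instance metadata), a slightly more defensive variant of the same workaround might look like:

```csharp
using System;
using Amazon;

// Only override when AWS_REGION is actually present, so environments that
// resolve their region another way (shared profile, EC2 metadata) are untouched.
var awsRegion = Environment.GetEnvironmentVariable("AWS_REGION");
if (!string.IsNullOrEmpty(awsRegion))
{
    AWSConfigs.AWSRegion = awsRegion;
}
```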

It would definitely be great to contribute some enhancements to the AWS SDK for this. Time allowing, I would like to, but getting the code into a locally runnable state with proper testing is likely not trivial, especially across all affected credential providers. Maybe you have some time?

Behaviors in Other AWS SDKs

Given the polyglot ecosystem we have across our applications at SPS Commerce I was interested to see how this behavior looked in other AWS SDKs we might have teams using.

AWS SDK for Java V2: the functionality is definitely different, but closer to what you would expect: credentials are retrieved using the DefaultAwsRegionProviderChain, which does in fact include pulling the region from the AWS_REGION environment variable; essentially what our shim/workaround for .NET does above, but natively through the existing region provider chain. This is how I would expect the AWS SDK for .NET to be updated, and it effectively makes the Java SDK act as if it were set for “regional” STS endpoints. Interestingly, although the Java SDK makes no use of, or reference to, the AWS_STS_REGIONAL_ENDPOINTS environment variable for legacy support, it does fall back to the global STS endpoint when no region is provided.

AWS SDK for Python (Botocore): the functionality appeared to work as you would expect, passing along the configured region for the service client you’re instantiating, including support for the AWS_STS_REGIONAL_ENDPOINTS environment variable. Interestingly, though, the region used is cached from the first client supplied, and cannot be changed by subsequent client requests when retrieving credentials. A note in the source indicates that IAM/STS is really global, so credentials retrieved from one region-specific endpoint are valid in another region. It might seem odd in your application logs to continually see STS credentials garnered from one region for applications accessing multiple regions with different clients. I’m not settled on whether this poses potential resiliency issues if the first-cached regional STS endpoint becomes unavailable and you expect your application to efficiently switch over to a different region. What do you think?
