AWS X-Ray for Distributed Tracing in AWS ECS

What is AWS X-Ray?

X-Ray is AWS's cloud-native service for distributed tracing. It provides real-time visualization of service maps based on the traffic flowing through your applications. Trace data can be both pushed to and pulled from X-Ray, which means you can use it as a distributed tracing mechanism without any kind of application service mesh, such as AWS App Mesh, community offerings like Istio, or a dedicated tracing backend like Jaeger. If you do not plan to use it with a mesh, you will, of course, need to install a few plugins and configure tracing at the application level (we have this integrated into our standard service template that all APIs get by default).

It is fully integrated into the AWS Console, which means a high degree of vendor lock-in, but at the same time it does a great job of "just working" with little setup and has native integrations you can simply "turn on" for AWS services like Lambda (FaaS), Elastic Beanstalk (PaaS), and many more. Whether you're willing to accept that degree of lock-in depends heavily on your existing values and architecture. If you're sold on and baked into AWS services like API Gateway, Lambda, SNS, SQS, Elastic Beanstalk, ECS, and EC2 (to name a few), then the true value-add is that you can have high-quality, contextualized distributed tracing running with a low level of effort (but watch out for costs, depending on your volume and sampling configuration).
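On the cost point, the X-Ray SDKs support local sampling rules that cap how many requests get traced. A minimal sketch of a version-2 local sampling rules file (the specific rule and rates here are illustrative assumptions, not recommendations):

```json
{
  "version": 2,
  "rules": [
    {
      "description": "Keep health checks out of traces",
      "host": "*",
      "http_method": "GET",
      "url_path": "/health",
      "fixed_target": 0,
      "rate": 0.0
    }
  ],
  "default": {
    "fixed_target": 1,
    "rate": 0.05
  }
}
```

With this default, the first request each second is traced, plus 5% of any additional requests beyond that.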

Running X-Ray on AWS ECS Service

While there are many reasons to consider other container orchestrators and patterns such as AWS EKS, you may be in a situation like many others where your organization currently lives in the ECS world. That was our case, and we already had numerous usages of AWS X-Ray across other AWS-native services. Completing the distributed tracing coverage in our containerized ECS apps was the next logical step.

When native AWS App Mesh integration is not an option, AWS already provides great documentation on setting up the X-Ray daemon inside ECS (note: the daemon batches up trace segments and ships them to the cloud). You can use the official "amazon/aws-xray-daemon" container image provided by AWS and run it as a sidecar-style container alongside your apps using "links", or using "awsvpc" network mode. In either case, this is going to require access to modify the port mappings to expose port 2000 (by default) over UDP. All the configuration is provided and straightforward to accomplish. However, for a few different reasons, you may not be ready to run the X-Ray daemon in a separate container…
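As a sketch of that documented sidecar approach (the app container name, image tags, and resource sizes below are illustrative assumptions), the relevant slice of an ECS task definition looks roughly like this:

```json
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "app:latest",
      "links": ["xray-daemon"]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "cpu": 32,
      "memoryReservation": 256,
      "portMappings": [
        { "containerPort": 2000, "protocol": "udp" }
      ]
    }
  ]
}
```

With bridge networking, the app container reaches the daemon through the link; with awsvpc mode, the containers share a network namespace, so the link is unnecessary and the daemon is reachable on localhost:2000.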

Packing into a Single Container

If you are caught in the same ECS scenario we were, where it would be difficult to run a sidecar container with the AWS X-Ray daemon, it is possible to run the daemon inside your existing single container. In our case, our organization's custom internal deployment mechanisms do not yet allow exposing ports over the UDP protocol. You may also have cost concerns about doubling up your container usage (though the daemon is so cheap I would not imagine that to be the case).

While running multiple processes inside a single container is not a best practice, it can dramatically reduce, or at least "shift left", the configuration of the X-Ray daemon. Depending on where the friction lives in your organization or deploy process, this can be highly desirable. Either way, we will use the AWS documentation on running the daemon locally to compose it into our final single-container runtime image.

FROM runtime-sdk:latest AS runtime
WORKDIR /app

# setup aws xray
RUN apt-get update && apt-get install -y --force-yes --no-install-recommends apt-transport-https curl ca-certificates wget && apt-get clean && apt-get autoremove && rm -rf /var/lib/apt/lists/*
RUN wget https://s3.dualstack.us-east-2.amazonaws.com/aws-xray-assets.us-east-2/xray-daemon/aws-xray-daemon-3.x.deb
RUN dpkg -i aws-xray-daemon-3.x.deb
RUN ["touch", "/var/log/xray.log"]
RUN ["chmod", "a+rw", "/var/log/xray.log"]

COPY ./docker-entrypoint.sh /
RUN ["chmod", "a+x", "/docker-entrypoint.sh"]

# using a non-root user is a best practice for security related execution.
RUN useradd --uid $(shuf -i 2000-65000 -n 1) app
USER app

ENTRYPOINT ["/docker-entrypoint.sh"]

This example Dockerfile is based on an Ubuntu distro; it pulls down and installs the 3.x X-Ray daemon package. Additionally, it pre-creates the X-Ray log file and widens its permissions so that the non-root user set up further down can write to it (thanks to XtraKrispi for that; running as a non-root user is a Docker best practice).

It also sets the ENTRYPOINT to a shell script. We will use that script to start multiple processes on entry (something a typical exec-form ENTRYPOINT does not do, for good best-practice reasons).

#!/bin/bash

/usr/bin/xray -f /var/log/xray.log &
dotnet "AspNetCore.Demo.Web.dll"

In this docker-entrypoint.sh (above), the first line starts the X-Ray daemon process in the background, logging to the file we pre-created with permissions earlier. The last line, in this case, starts a .NET Core application (replace it with whatever start command would normally be executed in your ENTRYPOINT).

While this particular pattern is not recommended, it definitely works easily and quickly when you're stuck between a rock and a hard place. We have had some of our services running this pattern for the last 12-18 months with next to zero issues related to the multiple processes or extra Docker build lines. We will be pushing toward more isolated services/processes/patterns in the future, but this pattern has served us well and allowed us to focus on more differentiating work, knowing we'll get to address it later.

Capturing Shutdown Signals

In the example above, we are pushing process startup and execution off into a shell script. As a result of that implementation, you may have noticed that your container no longer routes signals (SIGTERM and SIGKILL) directly to your application. Depending on the type of workload you have, this routing is essential for allowing your application to perform any resource cleanup or shutdown behavior.

Most modern application frameworks will automatically hook these signals and log output accordingly. You can easily attach to these events yourself for custom code execution as well; .NET Core, for example, makes this straightforward. Regardless of any custom code, your application probably logs its shutdown for clarity and posterity, and you can test it out.

Run the .NET Core app via the CLI and let it boot up for a moment. When it's ready, hit CTRL+C at the command line to cancel the command. A shutdown is initiated and logged, and the app stops fairly quickly:

> dotnet run 
CTRL+C
[15:25:41 INF] STARTING-HOST {}
[15:25:48 INF] HTTP GET /swagger/favicon-32x32.png responded 200 in 12.1919 ms
[15:25:48 INF] HTTP GET /swagger/v1/swagger.json responded 200 in 129.5502 ms
[15:25:54 INF] Application is shutting down…

That last line is key: "Application is shutting down…" clearly indicates our app is handling the shutdown signal (CTRL+C sends SIGINT; "docker stop" sends SIGTERM, and both flow through the same graceful shutdown path in the framework).

Let's try it out again, but this time in our container image with the shell script.

> docker run --name app -d -p 8080:5000 app:latest
> docker ps
# grab the id from your docker ps and run a docker stop to send the SIG commands
> docker stop e1929b5d4db0
> docker logs -f e1929b5d4db0

Pulling the logs for our container makes it easy to notice that there is no shutdown message:

[15:25:41 INF] STARTING-HOST {}
[15:25:48 INF] HTTP GET /swagger/favicon-32x32.png responded 200 in 12.1919 ms
[15:25:48 INF] HTTP GET /swagger/v1/swagger.json responded 200 in 129.5502 ms

Also, did you notice that your "docker stop" command took a bit longer than it should have? Without receiving a response from the process, Docker waited out the full SIGTERM grace period (10 seconds by default, configurable with the -t flag) before forcing a SIGKILL on the container.

The reason is that bash is PID 1 inside the container and runs "dotnet" as a child process; bash receives the SIGTERM but does not forward it to its children. We can route around this by ensuring our "dotnet" command is not executed in a sub-shell, changing our docker-entrypoint.sh to this:

#!/bin/bash

/usr/bin/xray -f /var/log/xray.log &
exec dotnet "AspNetCore.Demo.Web.dll"

Adding "exec" replaces the shell with our application process, ensuring the signal is routed directly to it. Executing the above test will now show the application shutting down, and in my applications it made a noticeable difference in how quickly the container stops.
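To see why exec matters even outside of Docker, here is a small self-contained sketch (the file path and "got SIGTERM" message are my own, purely for illustration): a child script traps SIGTERM, and we launch it through "bash -c" with exec so that the child, not the wrapping shell, owns the PID we signal, exactly what "exec dotnet …" achieves in the entrypoint.

```shell
# Hypothetical demo: a child script that handles SIGTERM gracefully.
cat > /tmp/child.sh <<'EOF'
#!/bin/bash
trap 'echo "got SIGTERM"; exit 0' TERM
# Keep the script alive; redirect so the orphaned sleep does not hold stdout open.
sleep 60 > /dev/null &
wait
EOF
chmod +x /tmp/child.sh

# Because of `exec`, /tmp/child.sh replaces the wrapper shell, so the PID we
# capture and signal is the child itself. Without `exec`, the SIGTERM would
# stop at the wrapper shell and the trap would never fire.
out=$(
  bash -c 'exec /tmp/child.sh' &
  pid=$!
  sleep 1
  kill -TERM "$pid"
  wait "$pid"
)
echo "$out"   # prints: got SIGTERM
```

Dropping the `exec` from the `bash -c` line reproduces the container problem in miniature: the wrapper shell eats the signal and the child keeps running until killed.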