.NET Core: The configured user limit (128) on the number of inotify instances has been reached.

If you’re reading this then you no doubt encountered the following error running your .NET Core app:

System.IO.IOException: The configured user limit (128) on the number of inotify instances has been reached, or the per-process limit on the number of open file descriptors has been reached.

This is a fatal application error that results in your host application terminating. Essentially, the error is telling us that we have created too many file watchers (inotify instances)… more than the host OS allows for any single user, or that the process has run out of file descriptors. Some piece or library within our application is consuming more file watchers than we thought.

Turning to our trusty friend “Google” for some help, it would seem there are a couple of workarounds being suggested, depending on where you are encountering the error.

Development File Watchers

Disable Razor View Engine File Watches

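 // ASP.NET Core 2.2: set this on the options instance, e.g. in ConfigureServices
 services.Configure<RazorViewEngineOptions>(o => o.AllowRecompilingViewsOnFileChange = false);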

This may work, I never tried it… don’t really use Razor or views in ASP.NET Core much these days with most of the UIs being built as static SPAs. In my case, this was not the issue.

DOTNET Polling File Watches

export DOTNET_USE_POLLING_FILE_WATCHER=true 

By default, the .NET runtime can watch for file changes by receiving notifications asynchronously when they happen. An example would be the FileSystemWatcher class. You can adjust this behavior so that, instead of placing a file watcher, the application polls for file changes. This change supposedly comes with a small performance hit and a memory increase as a result of polling for the changes.

In either case, I have no custom code or file watching in place for most of my API driven applications.

Windows vs Linux

What was unclear to me at first is that the error message is platform-specific and only occurs on Linux. That makes sense given that the “inotify” user limit is a Linux-specific construct. Additionally, in my case, I never experienced this issue in my local Windows development environment, nor should I expect to.

Local development issues have come up for those running Linux locally. Specifically, some have shared workarounds for the additional file watchers consumed just by developing with vscode and .NET Core.

That being said, I was able to reproduce the issue locally on Windows when running the application inside a Linux container. Whether locally on Linux or in a Linux container, you can work around the issue by increasing the inotify limits. The command below raises the per-user watch limit; the instance limit named in the error message is the separate fs.inotify.max_user_instances setting:

echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p

Build & Runtime Watchers

.NET Core 2.2+ SDK Issue?

While I can use the fix above to increase the file watchers inside my container execution, it didn’t seem to get at the root problem. It is a clear workaround.

Another piece of the puzzle is that I experienced this issue after upgrading to .NET Core 2.2 and above (including .NET Core 3.1). Prior to .NET Core 2.2, we had very few application changes and did not experience these intermittent issues at all. Is this an SDK problem of some kind?

This particular issue plagued multiple applications after upgrading, specifically by causing some of the tests to fail during the build. Unit tests would consistently run without issue, while integration tests would fail quite often. The affected integration test suites all booted up the entire application in memory using the “TestServer” feature. Two key observations can be made based on this behavior:

  • Integration tests failed only intermittently, when file watchers ran low, while executing inside a container on a build host. This indicates to me that the file watcher limit is tied to the base OS somehow (the only variable that would be shifting the number of available file watchers), perhaps as a result of sharing volumes?
  • If this occurs only when starting the web application to execute integration tests, it is likely also happening when a deployed container starts in our PROD environment.

Given that this issue is now not just an intermittent test issue but rather a PROD stability concern, it is suddenly a much bigger priority to resolve.

.NET Core 2.2+ Runtime Issue

Within a few days, we were able to spot an example deployed in our DEV environment where our AWS ECS Container Orchestrator had started one instance but was consistently failing to start a second instance for high availability. The container orchestrator kept trying to place the container on the same host without enough file watchers to allow it to start. The container would subsequently die and another one would try again. This is what we refer to as a “flopping” service stuck in a death loop. The difference is that because the available file watchers varied per host this is very difficult to track down. We likely have many instances where the containers have failed to start and the orchestrator respawns another and it works just fine.

We could begin to scan the logs of our .NET Core services for:

HOST-TERMINATED-UNEXPECTEDLY 

A Simple Fix

After looking for commonalities among the affected services beyond the .NET Core upgrade, there was only one other pattern these applications shared that is associated with file system watchers: the loading of the “appsettings.json” file on application start, via the “AddJsonFile” configuration method. It has an overload that accepts a “reloadOnChange” parameter, which reloads the application settings should the appsettings.json file be modified.

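// requires: using System.IO; using Microsoft.Extensions.Configuration;
IConfiguration configuration = new ConfigurationBuilder()
                     .SetBasePath(Directory.GetCurrentDirectory())
                     // reloadOnChange: false means no file watcher is created for this file
                     .AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
                     .Build();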

The documentation is not great at telling you the default. But looking at the code, if you don’t use that overload, the default for AddJsonFile is “false”. BUT be careful: if you’re relying on the default configuration for your app (for example via CreateDefaultBuilder), it sets reloadOnChange to “true”. Depending on whether you expect updated values in appsettings.json to be picked up on change, the behavior may differ from what you expect.

In the ASP.NET Core world driven inside a container, there are next to zero reasons I can think of for which I would want to enforce reloading of that configuration file. Once my container is built and pushed to its container repository with an immutable version it does not change (only environment variables on deploy modify its behavior). In this straightforward scenario, there is very little reason to turn on “reloadOnChange”. Turning it off led to slightly fewer occurrences of this problem.

It looks like the configuration code propagated internally through copypasta from one app to the next without consideration of that particular flag.

The problem was fully resolved after reviewing a common internal library, used by all the services, that incorrectly loaded the appsettings.json file multiple times with reloadOnChange enabled.

Essentially, “reloadOnChange” was being set to “true” at 3 different points in the same application for the same file. After disabling this, the problem was resolved in its entirety and has not been observed since, either during integration tests or at deployed runtime.
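
To hunt for this kind of duplication in your own services, something like the following rough sketch may help. It assumes the flag is passed as a named argument (as in the snippet above); the function name is made up for illustration. It will not catch positional arguments, nor the implicit “true” that the default host configuration applies.

```shell
#!/bin/sh
# List every C# source line that explicitly passes reloadOnChange: true.
# Hypothetical helper; usage: find_reload_on_change <source-directory>
find_reload_on_change() {
  dir="${1:-.}"
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # being reported as a failure of the whole search.
  find "$dir" -type f -name '*.cs' \
    -exec grep -Hn 'reloadOnChange:[[:space:]]*true' {} + || true
}

# Example: find_reload_on_change ./src
```

Each hit is printed as file:line:text, so three hits pointing at the same appsettings.json would have made our problem obvious much earlier.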

Why?

I still have more questions than answers. Let me know if you have any answers.

  • What change in .NET Core led to this being a problem in 2.2+ (and not in previous versions)? The defaults do not seem to have changed from previous versions.
  • Adding “reloadOnChange” even 3 times for the same file feels as though it should not push file watchers past any limit. Is this a bug in .NET Core? Or a problem in the Linux distribution in my container?
  • What is the relationship between the “inotify” limits on a single host and their use inside containers with shared volumes?

UPDATE: It’s Back

The following was added as an update when the issue returned in a dramatic way in mid-May 2020.

Just when you think you have it solved, the same issue strikes again. The problem seems to have been alleviated for a bit, likely due to the drastic reduction of the file watchers in the AddJsonFile method. However, we began launching more .NET Core services into our AWS ECS cluster, and very quickly the issue not only came back but seemed to have gotten much worse. Instead of sporadic once-in-a-while failures, it was entirely blocking and failing many different unrelated deploys of any .NET Core service. At this point, I decided it best to pull in some more folks from our SRE team to help guide me a bit. Here is a breakdown of where it’s at:

How many watchers does .NET need?

Profiling and looking at the “inotify” usage in a single container shows .NET Core 3.1 uses about 15 on startup (that is with a default web app, no customizations, no app settings file). That is really funny, and I don’t understand why it needs them. Looking at .NET Core 2.1 for comparison actually showed around 40 “inotify” references on start. This invalidates the original assumption that this is getting worse for us as a result of migrating to .NET Core 3. Instead, this is getting worse strictly due to the volume of new services sharing a cluster and ultimately a single host.

# install lsof, then count the inotify related references
apt-get update && apt-get install -y lsof
lsof | grep inotify | wc -l
 
// simplest asp.net core configuration 
// (ditching IIS and other defaults) still shows 
// 15 watchers in the container
Host.CreateDefaultBuilder(args)
  .UseEnvironment("development")
  .ConfigureWebHost(builder =>
  {
    builder
      .UseKestrel()
      .UseStartup<Startup>();
  });
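
If you would rather not install lsof in the container, the same kind of count can be approximated straight from /proc. This is only a sketch, assuming a Linux container with GNU find available; the function names are my own, and the numbers may not match lsof exactly since only processes readable by the current user are visible.

```shell
#!/bin/sh
# Count open inotify instances by inspecting /proc (Linux only).
# Each instance is a file descriptor whose symlink target is
# "anon_inode:inotify". Hypothetical helper names.
count_inotify_instances() {
  { find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null || true; } \
    | wc -l | tr -d ' '
}

# Same data broken down per process: count first, PID second.
inotify_instances_by_pid() {
  { find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null || true; } \
    | cut -d/ -f3 | sort | uniq -c | sort -rn
}

echo "inotify instances visible: $(count_inotify_instances)"
```

The per-PID breakdown is handy for confirming which process inside the container is actually holding the instances.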

Why not just increase the “ulimit” for “nofile” option on the container in your ECS configuration?

In this case, the problem is not a ulimit. Rather than the actual watchers, it’s the number of inotify “instances” that is the problem. I’m a Linux newbie, so this doesn’t completely make sense to me, but at least it gives me a reason why ulimit modifications are not any help.
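
To see the different limits side by side, a small sketch like this can be run on the host or inside the container. It assumes Linux, where the inotify settings live under /proc/sys/fs/inotify; the function name is hypothetical. The “128” in the error message corresponds to max_user_instances.

```shell
#!/bin/sh
# Print the limits discussed above so they can be compared.
# Hypothetical helper name.
show_watch_limits() {
  # Per-process open file descriptor limit (what "ulimit nofile" controls).
  echo "nofile=$(ulimit -n)"
  # Per-user inotify limits (Linux only; "unavailable" elsewhere).
  for name in max_user_instances max_user_watches; do
    path="/proc/sys/fs/inotify/$name"
    if [ -r "$path" ]; then
      echo "$name=$(cat "$path")"
    else
      echo "$name=unavailable"
    fi
  done
}

show_watch_limits
```

Raising nofile changes only the first line of that output, which is why it does nothing for an error about inotify instances.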

Can you change the user in the container?

The short answer is YES. I never previously reviewed the best practice for specifying a user in your container, but in general, there are a lot of reasons to do this from a security standpoint (i.e. so my container does not run as root). Additionally, though, it would appear that the “inotify” limit we are hitting is not namespaced or grouped by the containers themselves, and so when we specify a user in the container it maps to a different per-user limit outside the container on the host kernel. This exact suggestion was provided in this post here: https://github.com/dotnet/aspnetcore/issues/7531

It seems like our usage of the same user across all our pods in the cluster was creating this issue. I rewrote our dockerfiles to have their own unique users and our pods appear to be behaving better (ajamrozek).

I had previously overlooked it given the K8s reference, but also given the implication that somehow the limit is spread across the cluster, which is not the case (I misread that). This issue is only a result of the number of containers that are bin-packed on a single host in the cluster. Our multi-stage Dockerfiles with this update now look like this in full:

FROM mcr.microsoft.com/dotnet/core/sdk:3.1 AS build
WORKDIR /app
 
COPY . ./
RUN dotnet restore
RUN dotnet publish src/Demo.Web -c Release -o /app/out
 
FROM mcr.microsoft.com/dotnet/core/aspnet:3.1 AS runtime
WORKDIR /app
 
COPY --from=build ["/app/out", "./"]
 
# a non-root user lacks permission to bind ports below 1024, so we must use a port above 1024
ENV ASPNETCORE_URLS=http://+:5000
 
# using a non-root user is a best practice for security related execution.
RUN useradd --uid $(shuf -i 2000-65000 -n 1) app
USER app
 
CMD ["dotnet", "Demo.Web.dll"]

The key changes are near the end, in the second (runtime) stage: creating a new user named “app”. To avoid running into the same problem again, we need to ensure each image is built with a somewhat random UID for that user (the “shuf” call runs at build time, so every container started from the same image shares that UID). If all our containers were to switch to the same user we would run into the same limit and the same problem.

Be careful to review internal practices to ensure that the range of UIDs you intend to use does not conflict with a real potential user on the host, as the container user will then inherit all of that user’s permissions.
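
One rough way to sanity-check a candidate UID is to look it up in the host’s passwd file. This is a sketch only: the function name is made up, it has to be run on the host rather than in the image build, and it only sees local accounts in /etc/passwd (not users from LDAP or other directory services).

```shell
#!/bin/sh
# Succeeds (exit 0) if the given numeric UID appears in a passwd file.
# Hypothetical helper; usage: uid_in_use <uid> [passwd-file]
uid_in_use() {
  awk -F: -v uid="$1" '$3 == uid { found = 1 } END { exit !found }' \
    "${2:-/etc/passwd}"
}

if uid_in_use 2000; then
  echo "UID 2000 is already taken on this machine"
else
  echo "UID 2000 looks free in /etc/passwd"
fi
```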

Great now we are back up and running, zero failures… phew. BUT… still so many questions…

Why? Again…

While in general we should have been using USERs within our Dockerfile setup in the first place, at its heart this is still, I believe, just a workaround for a more concerning problem. There are lots of questions and associated behaviors to be aware of that will require further investigation:

  • We have to be careful to ensure our applications are indeed using different UIDs. If not for the “shuf” command I would fully expect to encounter the issue again in a year as a result of copypasta.
  • While we have now partitioned this issue basically by service, there is still a scenario that will cause a failure: scaling a single service to dozens of containers or more. Depending on the placement strategy configured for the service, you may end up with a bunch of the same container on the same host, all using the same user id, and run into this problem again. Given our current limit of 128 on the host and about 15 instances per container, 8 containers fit under the limit (8 × 15 = 120) and the 9th would begin to fail with the error message.
  • The inotify limit is set to 128 right now. Perhaps this in itself needs to be increased on the hosts to alleviate the issues with scaling. I’m not nearly experienced enough in this area to comment, but it feels as though some type of arbitrary increase will be necessary.
  • I’m still wondering whether these inotify watcher instances are really required. What is the intention? Perhaps more discussion is warranted here… https://github.com/dotnet/aspnetcore/issues/7531 as it seems that any .NET Core service would encounter a similar problem. It feels as though there is still another piece of the puzzle. Do you know what it is?
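
The scaling math from the list above is simple enough to script when evaluating a different host limit. A minimal sketch, with a made-up function name, using the 128 limit and the roughly 15 instances per container measured earlier:

```shell
#!/bin/sh
# How many containers sharing one UID fit on a host before the inotify
# instance limit is exhausted? Shell integer division rounds down.
# Hypothetical helper; usage:
#   max_containers_per_uid <instance-limit> <instances-per-container>
max_containers_per_uid() {
  echo $(( $1 / $2 ))
}

max_containers_per_uid 128 15   # prints 8
```

Plugging in a proposed new host limit shows how many same-service containers a single host could take before the error returns.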