Scenario 2: Out of Memory
Setup
Deploy the file app-oom.json:
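A deployment along the following lines should work, assuming the DC/OS CLI is installed and authenticated against the cluster and that app-oom.json is in the current directory:

    # Deploy the app definition to Marathon via the DC/OS CLI
    dcos marathon app add app-oom.json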
Once deployed, when we take a look at the DC/OS web interface, we see some strange results under CPU Allocation:
Figure 1. CPU allocation display
How is it that CPU Allocation is continually oscillating between 0 and 8 percent? Let’s take a look at the application details in the web interface:
Figure 2. Application details
Based on this, the application runs for a few seconds and then fails.
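The restart loop can also be confirmed from the CLI; for instance, something along these lines (the app ID /app-oom is an assumption based on the file name):

    # Each restart shows up as a new task for the same app
    dcos task

    # Inspect the app's current configuration and status
    dcos marathon app show /app-oom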
Resolution
To better understand this unexpected behavior, let's start by looking at the application logs, either in the web interface or via the CLI. In the web interface, you can find the application logs under ‘Output’ in the ‘Logs’ tab of the application:
Figure 3. Application log display
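Via the CLI, the same output can be fetched with something like the following (the task ID is a placeholder to be looked up first with dcos task):

    # Print the stdout of the task (pass 'stderr' as the last argument for stderr)
    dcos task log <task-id> stdout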
The log output “Eating Memory” is a pretty generous hint that the issue might be related to memory. Despite this, there is no direct failure message regarding memory allocation (keep in mind that most apps are not friendly enough to log that they are eating up memory).
As suspected, this looks like an application-related issue, and this application is scheduled via Marathon. So let’s check the Marathon logs using the CLI:
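Assuming a reasonably recent DC/OS CLI, the scheduler logs can be tailed with the dcos service log subcommand:

    # Show the Marathon scheduler logs (add --follow to stream)
    dcos service log marathon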
In that output we find a log entry confirming that the task exceeded the container memory limit previously set in app-oom.json.
If you’ve been paying close attention, you might shout “wait a sec” at this point, because the memory limit we set in the app definition is 32 MB, but the error message mentions 64 MB. The difference comes from DC/OS automatically reserving some memory overhead for the executor, which in this case is another 32 MB, bringing the total limit to 64 MB.
Please note that the OOM kill is performed by the Linux kernel itself, hence we can also check the kernel logs directly:
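For example, after connecting to the agent node that ran the task, a check along these lines could be used (the agent ID is a placeholder to be looked up with dcos node):

    # SSH to the agent node that ran the task
    dcos node ssh --master-proxy --mesos-id=<agent-id>

    # On the agent: search the kernel log for OOM killer entries
    journalctl --no-pager _TRANSPORT=kernel | grep -iE "out of memory|oom"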
The resolution in such cases is either to increase the resource limits for the container, in case they were configured too low to begin with, or, as in this case, to fix the memory leak in the application itself.
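If raising the limit is the right fix, the memory setting can be increased in app-oom.json and the app redeployed, or updated in place; for example (the app ID /app-oom and the 128 MB value are illustrative assumptions):

    # Raise the container memory limit to 128 MB and trigger a new deployment
    dcos marathon app update /app-oom mem=128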
General Pattern
As we are dealing with a failing task, it is good to check the application and scheduler logs (in this case, our scheduler is Marathon). If that is insufficient, it can help to look at the Mesos agent logs and/or to get a shell inside the container: use dcos task exec when using UCR, or, with the Docker containerizer, SSH into the node and use docker exec.
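For reference, the two variants look roughly like this (task ID, agent ID, and container ID are placeholders):

    # UCR: open an interactive shell inside the task's container
    dcos task exec --interactive --tty <task-id> /bin/sh

    # Docker containerizer: SSH to the agent, then exec into the container
    dcos node ssh --master-proxy --mesos-id=<agent-id>
    docker ps                          # find the container ID
    docker exec -it <container-id> /bin/sh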
Cleanup
Remove the application with the DC/OS CLI:
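Assuming the app ID matches the file name, i.e. /app-oom, the following removes the app and its tasks:

    dcos marathon app remove /app-oom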