
AWS ECS: No Space Left On Device


This morning a coworker told me she couldn't get to any of the staging environments to test something and asked if I could take a look. The ECS service events showed:

2019-04-18 11:03:27 -0400 service ox-create-a-cart has started 1 tasks: task 61f7213b-9859-44cc-8ea1-20cfa036d8dd.
2019-04-18 11:03:56 -0400 service ox-create-a-cart is unable to consistently start tasks successfully. For more information, see the Troubleshooting section.

Narrator: he did not check the "troubleshooting section"

I clicked into the task and found the real issue in the ECS admin console:

CannotPullContainerError: write /var/lib/docker/tmp/GetImageBlob256449446: no space left on device

The next thing to do was get onto the server and clean things up myself, so that's what I tried to do. I'd seen a similar problem before and figured that deleting the old images would free up enough space to get everything back up and running.

$ docker images | grep 'mmlafleur/ox' | awk '{print $3}' | xargs docker rm

Error response from daemon: container 3caa2de88e6a3d703aa51ee712576879cb694feb86e06f714659f0a87b9d4ab5: driver "devicemapper" failed to remove root filesystem: failed to remove device f722282d4012ce63d735b6f28f6e685632d72e14861def383b28ad0c4204f279: devmapper: Error saving transaction metadata: devmapper: Error writing metadata to /var/lib/docker/devicemapper/metadata/.tmp334197095: write /var/lib/docker/devicemapper/metadata/.tmp334197095: no space left on device
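
Before blaming Docker specifically, a quick df is a good sanity check that the volume is actually out of blocks rather than out of inodes (both show up as "no space left on device"):

$ df -h /var/lib/docker
$ df -i /var/lib/docker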

Now I needed to figure out where all the space had gone, since there was so little left that Docker couldn't even write a (presumably tiny) metadata file. I checked the /var/lib/docker folder with du to see if Docker itself was the culprit, and found that the /var/lib/docker/containers folder was pretty large:

$ du -d1 -h /var/lib/docker/containers | sort -h

44K /var/lib/docker/containers/14661b2a1c4ee5b43bb29fba8b2d2a5836aefb04dd7776e4dc71aad67ba8a95f
44K /var/lib/docker/containers/1552fe1f260292950ba1eb97ec1a4639af13ae64956f9cb074f8744fe082ce85
44K /var/lib/docker/containers/3caa2de88e6a3d703aa51ee712576879cb694feb86e06f714659f0a87b9d4ab5
44K /var/lib/docker/containers/50fdd6457ef04e2f1a6d9a72bff295545eb9012a1c9e6f567fb166ad0a418e3f
44K /var/lib/docker/containers/579e7c91bd446ab5b43141a4f38ae1cbf2d5408496d79616f79a2bae77a437b4
44K /var/lib/docker/containers/9f1928f90876c3feda66ebe9fb4c27c15d600ef1c8b9f8221385ad82df7d8a4c
44K /var/lib/docker/containers/a2990950eb8b07ff34314a972616278e5ed5d86999579041af61cb0274f184c2
44K /var/lib/docker/containers/b295006421048dfac031af8cba3638b68d0982cd9d59c667fe0e6b2d69a848c4
44K /var/lib/docker/containers/b611dc256699d1a8e2b37674acc4dd5e7219edc7b9ba675e39f4ec220fa6f5f5
44K /var/lib/docker/containers/e7c6219d0164fbe2dfbf613a8a6b7e401d24adadbcb79c193afc008b501114a4
44K /var/lib/docker/containers/f65e0f01f4edd51320969a4b0269c7ffd163d786b992174dc04129c403210bcb
44K /var/lib/docker/containers/f8fe8dc6504bb55a81939d1e19e67ac027a028db47ca7ac308fb482a1c11654e
44K /var/lib/docker/containers/fac571a290b7b91d5777215b19dbd9912df14eeb60e514d0a5aed8fec7987124
44K /var/lib/docker/containers/fea2d8d707bc364bc2e6a0750f340545e35a1aab34c238ed87a2913c8fae7db7
60M /var/lib/docker/containers/4604e64c6c8deee3527e41ab12e315376ef5bd130d0545f26242a2cab15731cf
6.8G /var/lib/docker/containers
6.8G /var/lib/docker/containers/bdfe1ca56f6fb45594526bc61ac34258ffb9184870919dfb83eecfd56d514a62

The disk is only 7.8GB, so it made sense that this was the issue. The next thing to do was figure out why it was happening. I looked inside that directory and found a single 6.8GB log file! After skimming it to make sure it wasn't anything important, I cleared it out:

$ cat /dev/null > bdfe1ca56f6fb45594526bc61ac34258ffb9184870919dfb83eecfd56d514a62-json.log
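
Truncating the file in place (rather than deleting it) matters here: the json-file log driver keeps the file open, so an rm would leave the space allocated until the container or daemon released the handle. coreutils' truncate does the same thing as the redirect, if you prefer:

$ truncate -s 0 bdfe1ca56f6fb45594526bc61ac34258ffb9184870919dfb83eecfd56d514a62-json.log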

Looking at the log contents and the container ID, I saw the container was actually running datadog/agent:latest, which surprised me. I was also surprised that ECS didn't rotate or cap those logs by default, since I set the agent up as a daemon container that had been running healthy for the past six weeks. I wouldn't be surprised if there's a configuration for that somewhere.
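
It turns out there is: the json-file log driver that writes these files accepts max-size and max-file rotation options, either per container or daemon-wide. Something like the following in /etc/docker/daemon.json on each instance (the sizes are just an example, and it only applies to containers created after the daemon restarts) would have kept the Datadog logs capped, and ECS task definitions can set the same options per container through logConfiguration:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}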

The more immediate step was making sure this doesn't happen again, so I wrote a script to automate the cleanup since it clearly wasn't going to happen on its own. I noticed that in the du | sort output the parent directory itself was included, so I added a grep for /containers/ (note the trailing slash) to keep it out of the results:

$ du -d1 -h /var/lib/docker/containers | sort -h | grep '/containers/'

44K /var/lib/docker/containers/14661b2a1c4ee5b43bb29fba8b2d2a5836aefb04dd7776e4dc71aad67ba8a95f
44K /var/lib/docker/containers/1552fe1f260292950ba1eb97ec1a4639af13ae64956f9cb074f8744fe082ce85
44K /var/lib/docker/containers/3caa2de88e6a3d703aa51ee712576879cb694feb86e06f714659f0a87b9d4ab5
44K /var/lib/docker/containers/50fdd6457ef04e2f1a6d9a72bff295545eb9012a1c9e6f567fb166ad0a418e3f
44K /var/lib/docker/containers/579e7c91bd446ab5b43141a4f38ae1cbf2d5408496d79616f79a2bae77a437b4
44K /var/lib/docker/containers/9f1928f90876c3feda66ebe9fb4c27c15d600ef1c8b9f8221385ad82df7d8a4c
44K /var/lib/docker/containers/a2990950eb8b07ff34314a972616278e5ed5d86999579041af61cb0274f184c2
44K /var/lib/docker/containers/b295006421048dfac031af8cba3638b68d0982cd9d59c667fe0e6b2d69a848c4
44K /var/lib/docker/containers/b611dc256699d1a8e2b37674acc4dd5e7219edc7b9ba675e39f4ec220fa6f5f5
44K /var/lib/docker/containers/e7c6219d0164fbe2dfbf613a8a6b7e401d24adadbcb79c193afc008b501114a4
44K /var/lib/docker/containers/f65e0f01f4edd51320969a4b0269c7ffd163d786b992174dc04129c403210bcb
44K /var/lib/docker/containers/f8fe8dc6504bb55a81939d1e19e67ac027a028db47ca7ac308fb482a1c11654e
44K /var/lib/docker/containers/fac571a290b7b91d5777215b19dbd9912df14eeb60e514d0a5aed8fec7987124
44K /var/lib/docker/containers/fea2d8d707bc364bc2e6a0750f340545e35a1aab34c238ed87a2913c8fae7db7
60M /var/lib/docker/containers/4604e64c6c8deee3527e41ab12e315376ef5bd130d0545f26242a2cab15731cf
6.8G /var/lib/docker/containers/bdfe1ca56f6fb45594526bc61ac34258ffb9184870919dfb83eecfd56d514a62

That got me closer, but I only care about clearing the largest log file each time the script runs, so I piped the results into tail:

$ du -d1 -h /var/lib/docker/containers | sort -h | grep '/containers/' | tail -1

6.8G /var/lib/docker/containers/bdfe1ca56f6fb45594526bc61ac34258ffb9184870919dfb83eecfd56d514a62

And once I have that, I want to isolate the directory:

$ du -d1 -h /var/lib/docker/containers | sort -h | grep '/containers/' | tail -1 | awk '{print $2}'

/var/lib/docker/containers/bdfe1ca56f6fb45594526bc61ac34258ffb9184870919dfb83eecfd56d514a62

The last step is to save that path in a variable and empty out the log file inside it by redirecting /dev/null into it:

#!/bin/bash

# Find the container directory using the most space, then empty its json log(s) in place
DIR=$(du -d1 -h /var/lib/docker/containers | sort -h | grep '/containers/' | tail -1 | awk '{print $2}')
for LOG in "$DIR"/*.log; do
  cat /dev/null > "$LOG"
done

I ran the script to make sure it worked, then set up an hourly crontab entry on each of the servers to run it:

0 */1 * * * /root/log-cleanup.sh

It's not a long-term plan, but it should be enough to keep this from happening again for a few more weeks.