Why did this server run out of disk space?

Less than two hours after I logged into the admin area of this website, which is powered by WordPress, the site went offline with a 502 Bad Gateway error. What the…?

So I logged in via ssh and noticed a serious lag between when I typed a letter and when it appeared on the screen. I had a problem. And with latency like that, running top was not going to help. After verifying that all the server applications were running, I checked disk usage with df -h. The output pointed straight at the problem.

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda         40G   38G     0 100% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            992M   12K  992M   1% /dev
tmpfs           201M  320K  200M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none           1002M     0 1002M   0% /run/shm
none            100M     0  100M   0% /run/user

The server was out of disk space! But this is a cloud server with 40 GB of disk space, and it serves only a couple of websites. What could have consumed all of it?
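In hindsight, a short script could have answered that question directly by listing the biggest files under a directory. What follows is only a minimal Python sketch of that idea, not something I ran at the time. The function name `largest_files` and its parameters are made up for illustration, and it reports apparent file sizes, which can differ slightly from the block usage that df counts.

```python
import heapq
import os
import stat


def largest_files(root, count=10):
    """Return the `count` largest regular files under `root` as (size, path) pairs.

    Hypothetical helper: sizes are apparent sizes in bytes, largest first.
    """
    sizes = []
    # os.walk() does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat() looks at the link itself, so symlink targets
                # are not counted a second time.
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                sizes.append((info.st_size, path))
    return heapq.nlargest(count, sizes)
```

Pointing it at the web root or the backup directory (whatever those paths are on a given server) and printing the result should surface multi-gigabyte files within seconds.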

Just to recover some space and see if I could get the site back online, I decided to delete a few zip files that had served their purpose. Another df -h after that revealed that I had recovered 500 MB.

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda         40G   37G  500M  99% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            992M   12K  992M   1% /dev
tmpfs           201M  336K  200M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none           1002M     0 1002M   0% /run/shm
none            100M     0  100M   0% /run/user

That brought the site back online and put a smile on my face. Just at that point I recalled having installed a backup plugin. Perhaps the backups were the culprit. An ls -lh in the backup directory confirmed my suspicion.

-rw-r--r-- 1  93M Feb  8 03:00 database-backup-1417314413.sql
-rw-r--r-- 1  14M Jan 25 03:05 backup-1417314413-complete-2015-01-25-03-00-51.zip
-rw-r--r-- 1  11G Feb  1 03:08 backup-1417314413-complete-2015-02-01-03-00-02.zip
-rw-r--r-- 1 3.4G Feb  8 03:00 backup-1417314413-complete-2015-02-08-03-00-00.zip

Two huge backup files were in the directory, and the latest had a timestamp that matched the time the site went offline. Since I had already downloaded the previous backup (the 11 GB file from Feb 1), I deleted it from the server. Another df -h broadened my smile.
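Deleting old backups by hand works once, but the same cleanup could be automated. Here is a rough Python sketch, assuming the older backups have already been copied somewhere safe, as mine had. The function `prune_backups`, its `keep` and `dry_run` parameters, and the `backup-*.zip` pattern (modelled on the file names in the listing above) are all illustrative assumptions, not features of the backup plugin.

```python
from pathlib import Path


def prune_backups(directory, keep=2, pattern="backup-*.zip", dry_run=True):
    """Remove all but the `keep` newest files matching `pattern`.

    Returns the paths that were removed, or that would be removed
    when dry_run is True (the default, so nothing is deleted by accident).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backups = sorted(
        Path(directory).glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = backups[keep:]  # everything older than the newest `keep` files
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Running it with `dry_run=True` first and reading the returned list is the safer habit; only then would I switch to `dry_run=False`.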

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda         40G   27G   11G  72% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            992M   12K  992M   1% /dev
tmpfs           201M  336K  200M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none           1002M     0 1002M   0% /run/shm
none            100M     0  100M   0% /run/user

Now the site is running like it’s supposed to. Letting the server run out of disk space was an oversight on my part. The main lesson here is that I need a better backup strategy, probably one that sends backups to a cloud storage service instead of piling them up on the server. The other lesson is that I need a script that reports disk usage by email daily, because this should never happen again.
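As a starting point for that daily report, here is a minimal Python sketch using only the standard library. It is an assumption-laden draft rather than a finished tool: the function names, the 90% warning threshold, the example.com addresses, and the idea of a mail server listening on localhost are all placeholders. The percentage is computed as used / (used + available), which is how df arrives at 100% in the first listing even though 38G of 40G is used.

```python
import shutil
import smtplib
import socket
from email.message import EmailMessage


def percent_used(used, free):
    """Percentage in use, computed like df: used / (used + available)."""
    if used + free == 0:
        return 0.0  # empty pseudo-filesystem; avoid dividing by zero
    return used / (used + free) * 100


def build_report(path="/", warn_percent=90):
    """Build an email message summarising disk usage for `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    percent = percent_used(usage.used, usage.free)
    status = "WARNING" if percent >= warn_percent else "OK"
    gib = 2**30
    msg = EmailMessage()
    msg["Subject"] = (
        f"[{status}] {socket.gethostname()}: {path} is {percent:.0f}% full"
    )
    msg.set_content(
        f"Disk usage for {path}\n"
        f"  total: {usage.total / gib:.1f} GiB\n"
        f"  used:  {usage.used / gib:.1f} GiB\n"
        f"  free:  {usage.free / gib:.1f} GiB\n"
    )
    return msg


def send_report(msg, sender, recipient, host="localhost"):
    """Send the report through an SMTP server (assumed to be on localhost)."""
    msg["From"] = sender
    msg["To"] = recipient
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    report = build_report("/")
    print(report["Subject"])
    # To email it instead, uncomment the next line (placeholder addresses):
    # send_report(report, "root@example.com", "admin@example.com")
```

Scheduled once a day from cron, something like this would have flagged the disk as nearly full well before the site went down.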