We had major downtime from Monday 23rd to Tuesday 24th July 2018. All of our services, including our web app, dashboard, and desktop apps, were unavailable. Some people were unable to track time (those who could no longer log in with the app), and there were reports that the time logged in the application did not reflect real time worked.
The issue first manifested on Saturday 21st July 2018, when we also had a minor downtime, but at that time we misidentified the cause as "a heavy SQL query cleaning up a large number of database records, slowing down other database accesses and causing database locks via related triggers / stored procedures". We eventually interrupted the heavy cleanup query ourselves, which did not immediately restore the service -- but a database restart did. (When a database is severely overloaded, it may be unable to recover even after a blocking or heavy query is interrupted, because it is already overloaded and new queries keep arriving. A better approach is to kill some of the in-flight queries, which we tried on the queries hitting the affected table, but that didn't immediately help either -- hence the restart.)
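Killing the offending queries works by inspecting the server's process list and issuing KILL for each long-running connection. A minimal sketch of the selection logic, with assumed names (the `worklog` table and the sample rows are illustrative, not our real schema; in production the rows would come from `information_schema.processlist` over a live connection):

```python
# Hedged sketch of the "kill the blocking queries" approach. The rows
# mimic SELECT id, command, time, info FROM information_schema.processlist;
# here they are hard-coded sample data so the logic is visible.

def kill_candidates(processlist, table, max_seconds=60):
    """Return KILL statements for non-sleeping queries that have run
    longer than max_seconds and touch the given table."""
    stmts = []
    for row in processlist:
        if row["command"] == "Sleep":
            continue  # idle connections are holding no query
        if row["time"] > max_seconds and table in (row["info"] or ""):
            stmts.append(f"KILL {row['id']};")
    return stmts

sample = [
    {"id": 101, "command": "Query", "time": 183, "info": "SELECT ... FROM worklog ..."},
    {"id": 102, "command": "Sleep", "time": 900, "info": None},
    {"id": 103, "command": "Query", "time": 2,   "info": "SELECT ... FROM worklog ..."},
]
print(kill_candidates(sample, "worklog"))  # → ['KILL 101;']
```

In the MySQL client, the equivalent is a SELECT against information_schema.processlist followed by a KILL statement per offending connection id.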
Note that this problem occurred on a Saturday, when our load is lower (fewer people track time on weekends).
Monday is a normal business day, with much more traffic than the weekend, and a similar issue started to show up: a large number of database queries were taking a long time. We identified one query that wasn't really crucial and disabled it in the code, which restored the service for some time. Unfortunately, as the traffic increased, we again started to see a lot of queries taking a lot of time.
What was especially puzzling to us was that we were observing hundreds, if not thousands, of very slow queries (1-2 minutes, some timing out), even when retrieving data from tiny tables (30 MB in size, a handful of records).
This didn't make sense to us, because the server CPU load was moderate (about 25% of all CPUs used), disk I/O load was 0-1%, and the network was nowhere close to being saturated. Moreover, the number of queries sent to the database seemed more or less "as always", as did the number of requests to the web servers (it wasn't a DDoS attack).
We also hadn't made any major code changes or database structure changes that could lead to such severe issues.
Not understanding the initial cause, we tried a number of "blind" (and unsuccessful) approaches to the problem:
Further, we rejected some of the ideas as we assumed they were unlikely to solve the issue:
We also tried getting help from external parties:
Until this downtime, we assumed that the most significant metric telling us whether the database is overloaded is CPU usage, and our database server was using only around 25% of its CPU. The consultant pointed us instead to the number of threads running in the database:
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Threads_running | 2154  |
+-----------------+-------+
1 row in set (0.00 sec)
This is significantly higher than the number of CPUs (40 CPUs at the time the issues started; 96 CPUs after we blindly upgraded, hoping it would solve the problem).
Threads_running represents the number of busy mysqld threads. From the kernel's point of view, a thread is almost the same as a process when it comes to being assigned CPU time.
For example, right now, on a healthy Time Doctor SQL database system, we have around 2,700 mysql threads in total, one per client connection.
However, on a healthy Time Doctor database system we typically have only 10-20 Threads_running. The kernel schedules the first running thread on one CPU, the second running thread on a second CPU, and so on; the other ~2,700 threads are in the Sleep state (PHP clients connected but not sending any queries), sitting in the background and causing no CPU activity:
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Threads_running | 11    |
+-----------------+-------+
1 row in set (0.00 sec)
To simplify: if Threads_running is higher than the number of CPUs, the kernel has to pause the execution of, say, thread 156, let thread 157 run for 0.01 seconds, pause it, let thread 158 run for 0.01 seconds, and so on, for all thousands of threads, constantly switching threads between CPUs. This context switching has high overhead and is extremely inefficient -- more so when Threads_running goes into the thousands. In layman's terms, imagine having to read 50 books at the same time, one sentence from each book at a time.
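To put rough numbers on the time-slicing effect (the 50 ms of CPU work per query is an assumed figure for illustration, not a measured value from our systems):

```python
# Back-of-envelope arithmetic for CPU oversubscription. When the number
# of runnable threads exceeds the number of CPUs, each thread only gets
# a cpus/threads_running share of a CPU; this sketch ignores the
# context-switch overhead itself, so reality is even worse.

def effective_wall_time(cpu_seconds_needed, threads_running, cpus):
    """Wall-clock time for a query when runnable threads share the CPUs."""
    share = min(1.0, cpus / threads_running)  # <= 1 CPU per thread
    return cpu_seconds_needed / share

# Healthy: 11 running threads on 40 CPUs -> each thread gets a whole CPU.
print(effective_wall_time(0.05, 11, 40))              # → 0.05

# During the incident: 2154 running threads on 40 CPUs.
print(round(effective_wall_time(0.05, 2154, 40), 2))  # → 2.69
```

A query needing only 50 ms of CPU time stretches to several seconds of wall-clock time, before even accounting for the switching overhead described above.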
We think it's a mix of three things:
Some queries take significantly longer than 0.001 seconds, some as long as 0.3 seconds or longer, but it's the average that counts, and many queries take less time. If the average time per query grows too long, it causes a cascading effect: the number of running threads increases to the point where, even though the CPUs are not fully utilized, the servers spend all their time switching between threads.
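This cascade follows directly from Little's law: the average number of queries in flight equals the arrival rate times the average latency. A small sketch, assuming a hypothetical 10,000 queries per second of traffic (an illustrative figure, not our real volume):

```python
# Little's law: L = lambda * W
# Threads_running ~= queries_per_second * average_query_seconds

def threads_running(qps, avg_query_seconds):
    """Average number of queries concurrently in flight."""
    return qps * avg_query_seconds

# Healthy: fast average queries keep concurrency low.
print(int(threads_running(10_000, 0.001)))  # → 10

# Degraded: the same traffic at a 0.2 s average needs thousands of
# concurrent threads, which slows queries further (the cascade).
print(int(threads_running(10_000, 0.2)))    # → 2000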
Contributing to this problem, we also think we hit some kind of MySQL bug that resulted in a corrupt index or slow index lookups. It could also be an internal InnoDB data-integrity issue.
We've removed a large number of records from the main large database table, and we are continuing to remove unused records. This was in addition to moving to a larger database instance on AWS (Amazon Web Services); however, the larger database by itself was not initially able to solve the issue.
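Given that the Saturday outage showed how a single huge cleanup query can itself overload the database, a safer pattern is to delete in small batches so each transaction stays short. A sketch with illustrative table and column names (the real job would simply re-run until the server reports zero rows deleted):

```python
# Batched-cleanup sketch. One huge DELETE holds locks for its whole
# runtime; many small LIMIT-ed deletes keep each transaction short.
# `worklog` / `created_at` are hypothetical names, and total_rows stands
# in for the real stop condition (loop until ROW_COUNT() returns 0).

def batched_delete(table, predicate, batch_size=10_000, total_rows=45_000):
    """Yield the DELETE statements a batched cleanup job would run."""
    deleted = 0
    while deleted < total_rows:
        yield f"DELETE FROM {table} WHERE {predicate} LIMIT {batch_size};"
        deleted += batch_size

stmts = list(batched_delete("worklog", "created_at < '2017-01-01'"))
print(len(stmts))  # → 5
```

Five short transactions replace one long-running DELETE, so other queries are never blocked for more than the duration of a single small batch.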
Our sincere apologies to all our customers for this incident. We were aware of several scalability issues with our database and had plans to work on them over the next year, but we anticipated at least one year of runway before hitting a problem like this. We made the mistake of not working on this sooner and not prioritizing scalability. The one-year estimate was obviously wrong, and we are now putting our full attention into everything related to the scalability of our databases.