
25th July 2018

Detailed Description of Time Doctor Downtime (Web App and Dashboard)

Summary

We had major downtime from Monday 23rd to Tuesday 24th July 2018. All of our services, including our web app, dashboard and desktop application, were unavailable. Some people were unable to track time (those who were no longer able to log in with the app), and there were reports that the time logged in the application did not reflect real time worked.

Prelude

The issue started to manifest itself on Saturday 21st July 2018, when we also had a minor downtime. At that time, we misidentified the issue as "a heavy SQL query cleaning up a large number of database records, slowing down other database accesses and causing database locks via related triggers and stored procedures". We eventually interrupted the heavy cleanup query, which didn't immediately restore our service, but a database restart did. When a database is severely overloaded, it may still be unable to recover even after a blocking or heavy query is interrupted, because it is already overloaded and new queries keep arriving. A better approach is to kill some of the running queries, which we tried to do for queries going to the affected database table, but it didn't immediately help, hence the restart.
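
For reference, a minimal sketch of how a blocking query can be identified and killed in MySQL; this is generic syntax, not the exact commands we ran, and the thread ID is illustrative:

-- List the statements currently running and how long they have been running
SHOW FULL PROCESSLIST;

-- Kill only the statement; the client connection itself stays open
KILL QUERY 1234;

-- Or drop the whole connection if the client keeps resubmitting work
KILL CONNECTION 1234;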

Note that this problem occurred on a Saturday, when our load is lower (fewer people track time on weekends).

Major Downtime

Monday is a normal business day, with much more traffic than the weekend. A similar issue started to show up: a large number of database queries were taking a long time. We identified that one of these queries wasn't really crucial and disabled it in the code, which restored the service for some time. Unfortunately, as traffic increased, we again started to see many queries taking a long time.

What was especially puzzling to us was that we were observing hundreds, if not thousands, of very slow queries (1-2 minutes, some timing out), even when retrieving data from tiny tables (30 MB in size, a handful of records).

This didn't make sense to us, because the server CPU load was moderate (about 25% of all CPUs used), IO (disk) load was 0-1%, and the network was nowhere close to being saturated. Moreover, the number of queries sent to the database seemed more or less "as always", and the same was true of web server requests (it wasn't a DDoS attack).
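
For readers who want to run this kind of check themselves, one generic way to see which statement patterns are slow on MySQL 5.6+ is the performance_schema digest table; this is an illustrative query, not the exact tooling we used:

-- Top 10 statement patterns by average execution time
-- (timer columns are in picoseconds, hence the division)
SELECT digest_text,
       count_star,
       avg_timer_wait / 1e12 AS avg_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY avg_timer_wait DESC
LIMIT 10;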

Moreover, we hadn't made any major code or database structure changes that could lead to such severe issues.

Approaches to Solve the Problem

Not understanding the initial cause, we tried a number of "blind" (and unsuccessful) approaches to the problem:

  • Using a different database server
  • Migrating to a much bigger database server
  • Some MySQL configuration changes, including setting the transaction isolation level to READ UNCOMMITTED and other micro-optimizations (see the sketch after this list)
  • Rolling back to some earlier code releases
  • Plenty of low-level system analysis, which confirmed that we had issues but didn't lead to any clues about the nature of the problem.
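
For context, the isolation-level change mentioned above looks roughly like this; a sketch of the standard MySQL syntax, not our exact configuration:

-- Applies to sessions started after the change; existing connections keep their level
SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Or persistently, in my.cnf:
-- [mysqld]
-- transaction-isolation = READ-UNCOMMITTED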

Further, we rejected some of the ideas as we assumed they were unlikely to solve the issue:

  • Using a CDN for the API/dashboard (in case it was a DDoS after all)
  • Migrating the database to Amazon RDS.

We also tried getting help from external parties:

  • Reaching out to AWS business support (they can really only comment if it's a specific AWS hardware issue, and they said "all is fine")
  • Reaching out to several other consultants, but we were not able to find anyone who could help immediately or who had the right skills
  • After some effort, we did find a consultant who was able to give us some help.

Things We Learned

Until this downtime, we assumed that the most significant metric for telling whether the database is overloaded is CPU usage. Our database server was using only around 25% of its CPU. The consultant pointed us to the number of running threads in the database:

+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Threads_running | 2154  |
+-----------------+-------+
1 row in set (0.00 sec)
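
For reference, the Threads_running value above can be checked at any time with the standard MySQL status command:

SHOW GLOBAL STATUS LIKE 'Threads_running';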

This is significantly higher than the number of CPUs (40 CPUs at the time we started experiencing the issues; 96 CPUs after we blindly upgraded, hoping it would solve the problem).

What does a very high Threads_running value mean?

Threads_running represents the number of busy mysqld threads. From the kernel's point of view, a thread is almost the same as a process when it comes to assigning CPU time.

For example, right now, on a healthy Time Doctor database system, we have around 2,700 mysqld threads in total, but typically only 10-20 of them count towards Threads_running. The kernel schedules the first running thread on one CPU, the second running thread on a second CPU, and so on. The other ~2,700 threads are in the Sleep state: PHP is connected but not sending any queries, so they sit in the background and don't cause any CPU activity:

+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Threads_running | 11    |
+-----------------+-------+
1 row in set (0.00 sec) 
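
To see how the connected threads break down between Sleep and active states, a generic query against the process list can be used; this is illustrative and not part of our monitoring:

-- Count threads by what they are currently doing
-- (requires the PROCESS privilege to see all connections)
SELECT command, COUNT(*) AS threads
FROM information_schema.processlist
GROUP BY command
ORDER BY threads DESC;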

To simplify: if Threads_running is higher than the number of CPUs, the kernel has to pause the execution of, say, thread 156, let thread 157 run for 0.01 sec, then pause it, let thread 158 run for 0.01 sec, and so on, for all thousands of threads, constantly switching threads between CPUs. This context switching has high overhead and is extremely inefficient, all the more so when Threads_running goes into the thousands. In layman's terms, think of having to read 50 books at the same time, one sentence from each book at a time.

Why Did the Issue Happen Now?

We think it's a mix of three things:

  • How many queries are being sent to the database server every second
  • How long it takes the database to answer all of these queries
  • The number of CPUs

Depending on the time of day, we're processing some 30-40k queries every second. With a database server with 40 CPUs, that's around 1,000 queries per CPU every second, which translates to "each CPU can afford to spend about 0.001 sec per query, on average, before we get overloaded".

Some queries take significantly longer than 0.001 seconds, some as long as 0.3 seconds or more, but it's the average that counts, and there are many queries which take less time. However, if the average time per query is too long, it causes a cascading effect: the number of running threads grows to the point where, even though the CPUs are not fully utilized, the server spends all its time switching between threads.
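
One generic way to surface the queries that blow past such a per-query budget is MySQL's slow query log; the threshold and file path below are illustrative values, not our actual settings:

-- Log every statement slower than 100 ms (the default threshold is 10 seconds);
-- the global long_query_time applies to connections opened after the change
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';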

Contributing to this problem, we also think we hit some kind of MySQL bug which resulted in a corrupt index or slow index lookups. It could also be an internal InnoDB data integrity issue.

How We Resolved the Issue

We removed a large number of records from the main large database table, and we are continuing to remove unused records. This was in addition to moving to a larger database instance on AWS (Amazon Web Services); however, the larger database by itself was not able to solve the issue.
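
Removing rows in one huge statement can hold locks for a long time, so bulk clean-ups like this are usually done in small batches. A minimal sketch, assuming a hypothetical worklogs table with a created_at column (not our real schema):

-- Delete old rows in small chunks so each statement holds locks only briefly;
-- repeat from application code or a script until 0 rows are affected
DELETE FROM worklogs
WHERE created_at < '2017-01-01'
ORDER BY created_at
LIMIT 10000;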

Further Measures We Are Taking to Make Sure This Doesn’t Happen Again

  • We are planning to remove older data from the main database and then send it to a separate database
  • We have hired an external system administration consulting company to work with us on an ongoing basis and provide additional backup during any issues with our servers. This is in addition to our two internal system administrators
  • Further clean ups in the main MySQL database
  • We plan to move older records (older than January 2017) to a separate database (see the sketch after this list). It will take some time to connect the data in the new database to our front end, which may mean that our customers do not have access to data prior to January 2017 for some time while we move it
  • Caching more queries
  • Optimising queries.
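
The archiving step mentioned above can be sketched roughly as follows, assuming a hypothetical archive_db schema on the same server and the same illustrative worklogs table as earlier; this is not our actual migration plan:

-- Copy old rows into the archive schema
INSERT INTO archive_db.worklogs
SELECT * FROM app_db.worklogs
WHERE created_at < '2017-01-01';

-- Once the copy is verified, remove the old rows from the live table
-- (again in small batches, as in the earlier clean-up sketch)
DELETE FROM app_db.worklogs
WHERE created_at < '2017-01-01'
LIMIT 10000;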

Conclusion

Our sincere apologies to all our customers for this incident. We were aware of several issues with the scalability of our database and had plans to work on them over the next year, but we anticipated at least one year of runway before hitting a problem like this. We made a mistake in not working on this sooner and not prioritizing scalability. The one-year estimate was obviously wrong, and we are now putting our full attention into everything related to the scalability of our databases.