an image alt text

MySql is still widely used in application development as a stable, fairly performant and reliable relational database. People tend to use languages, technologies and tools they feel most comfortable with which is expected. We have got to the point where we generate more data daily than we can store or process, not to mention getting valuable information from it. I have worked on a project for one client where they had been storing time series data into MySql and after a year or two their database started to choke and indexing was falling apart. What’s even worse they didn’t even use that data properly but just stored it and displayed it with a couple of different queries.

The client’s requirement was to have better performance than it already was, with the ability to continue storing their event data. The problem with MySql was that they were getting web request timeouts before retrieving data from database. Queries were not complex at all but the sheer amount of data was too much for MySql. We had a competition with another company that went with Hadoop and we were going to measure up against them. Knowing that Hadoop didn’t have the speed that Cassandra has, this was no contest. On the other hand, Hadoop platform had more functionalities besides storage and would provide us with a way to analyze stored data but we had our eyes on Spark for that.

The first step was to define all the ways we need to query data in order to create a suitable data model. This is a crucial step with any database and especially with Cassandra because it has a query based data model.

After a few sessions together with engineers with domain knowledge we got to the point where all queries were covered and data model looked good. After initial implementation and testing, we had to make some minor changes to the data model to get the best performance possible. One of the issues when migrating from Cassandra is that the request timeout is probably much less by default than the database you are coming from. In Cassandra the default is 10 seconds and we opted for keeping that value. It sounds like a lot, but for some heavy queries MySql it took up to a minute or more. Since our events have a lot of information and tend to be bigger than the average time series data, query responses were easily measured in megabytes. Pulling that amount of data from replica nodes, merging it into a response on a coordinator, pushing it over the network and packing it up into application models has to take some time. We weren’t targeting millisecond or even sub millisecond responses but rather staying under that 10 seconds timeout.

When implementing our solution, we used all bells and whistles that .net driver has to offer. Starting out with a 5 node cluster, we were able to leverage the parallel execution of partition based queries. There is no benefit in executing parallel queries on the same partition since they will all get executed on the same replica nodes. When executing on different partitions there is a high chance that results will be handled by different nodes thus having the response time of the slowest node.

One of the major problems was having query limits while executing parallel queries on multiple partitions. We ended up having some overhead in these queries since you don’t know how many events will get queried from each partition because they are unevenly populated. Investigation proved that these queries made up a small percentage of total queries so we decided to leave it at that. There are different ways of handling this and squeezing more performance but all in all, we were happy with the results.

The end result was being able to query events for a one-year timespan with reasonable response time which was not even possible with mysql, and prioritized queries were several times faster than mysql. The benefits of this migration are mainly performance and scaling but we have also enabled the client to execute the analysis of their data which can help them improve their business.