
Developing a product usually means that requirements and technologies will change during development, especially if it is an R&D type of project. Cassandra is known for great performance, but that performance comes from the fact that data modeling is query based, which means it is not as simple to extend functionality as it is with a traditional relational database where the data model is entity based. We were working on our second R&D project when we felt the need for something that would help us apply changes to the schema while keeping the existing data. The early stages of development and prototyping are easy because you can always drop and recreate the database to reflect the new schema, but this was not an option for us since some clients were already testing the implemented functionality and evaluating the system we were working on. Their feedback was crucial for our product because it helped us shape new functionalities and make changes to existing ones.

The first steps towards a schema migration tool involved a bit of research. There were a few projects aimed at solving this issue, but none was really what we needed: a lightweight framework that would allow us to make changes to the schema and take care of the data. A big bonus would be the ability to execute the tool with different parameters in different stages of deployment. We started our investigation with Pentaho with Kettle, Talend and CloverETL. These are heavyweight ETL projects that are generally well supported and have great communities, but that comes at the price of significant overhead, complexity and more features than we needed. They also come in free and paid versions, which means differences in the functionality set. The next one was the Ruby Cassandra migration tool, and it looked great: truly lightweight and easy to install using gems. It had some drawbacks, though. Application logic for calculations and transformations would have to be implemented in both Ruby and Java, and we could not benefit from the Java object mapper.

The last one was Mutagen Cassandra, a really small framework built around an idea similar to the one we had in mind, but it was a one-man repository with no fresh commits, and it required Netflix's Astyanax driver, which would have needed some changes before we could use it.

We decided to build a simple tool that executes database schema transformations and keeps track of the schema version in the database itself. This is required if we want to be able to change the schema while the database is operational, and to make those changes through code so that we can test them before running them in production. Simple schema changes are easy because Cassandra allows adding or removing columns, changing types (with some restrictions) and similar small updates, but we also needed to make drastic changes to some tables. Since we were working on a prototype that was serving clients at the same time, we couldn't just drop and recreate the schema; we had to keep all the data. Being able to write a schema transformation and execute it in a unit test is something you want on a live system.
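
To illustrate the idea, here is a minimal sketch of how such a runner could look with the DataStax Java driver 3.x. The `Migration` interface, the `schema_version` table, the `AddUserEmail` migration and the keyspace name are hypothetical and only for illustration; the actual tool's API is different.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class MigrationSketch {

    // Hypothetical migration abstraction; the real tool's API differs.
    interface Migration {
        int getVersion();              // schema version this migration produces
        void execute(Session session); // the actual transformation
    }

    // A simple additive change: Cassandra allows adding columns on a live table.
    static class AddUserEmail implements Migration {
        public int getVersion() { return 2; }
        public void execute(Session session) {
            session.execute("ALTER TABLE users ADD email text");
        }
    }

    // Keeps the applied versions in the database itself so reruns are idempotent.
    static void run(Session session, Migration migration) {
        session.execute("CREATE TABLE IF NOT EXISTS schema_version ("
                + "version int PRIMARY KEY, applied_at timestamp)");

        boolean alreadyApplied = !session
                .execute("SELECT version FROM schema_version WHERE version = " + migration.getVersion())
                .isExhausted();
        if (alreadyApplied) {
            return;
        }

        migration.execute(session);
        session.execute("INSERT INTO schema_version (version, applied_at) "
                + "VALUES (" + migration.getVersion() + ", toTimestamp(now()))");
    }

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_keyspace"); // assumed keyspace
        run(session, new AddUserEmail());
        cluster.close();
    }
}
```

The same `run` call can be invoked from a unit test against an embedded or test Cassandra instance, which is exactly what makes the transformations testable before they hit production.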

The migration tool started as just a runner for migration implementations, but then we realized there were two stages where we wanted to execute migrations. The first stage is when the application code is built and we want to deploy it to the server; for the application to run, the schema needs to be updated. In this step we execute schema type migrations: the schema gets changed and everything is set up so that the application can consume data. But there were cases where, after updating the schema, we needed to handle large amounts of already existing data. For this purpose we defined data type migrations. These are typically used when we create a new table for an existing application model in order to serve new queries, but a certain amount of data is already in the database. Executing this in the pre-deploy stage would take time and keep the application down. We wanted the application up and running as soon as possible, so this work had to be done afterwards. Newly implemented queries wouldn't return all results until the data migration finished executing, but we could still serve requests and handle new incoming data. Here uptime won over consistency, and the application had minimal downtime.
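
As a rough sketch of that split, reusing the hypothetical `Migration` interface from above: the schema migration is the quick, structural pre-deploy step, while the data migration is the long-running post-deploy backfill. The `users_by_country` table and its columns are made up for illustration.

```java
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Pre-deploy schema migration: fast, and the new application version depends on it.
class AddUsersByCountryTable implements MigrationSketch.Migration {
    public int getVersion() { return 3; }
    public void execute(Session session) {
        session.execute("CREATE TABLE IF NOT EXISTS users_by_country ("
                + "country text, username text, email text, "
                + "PRIMARY KEY (country, username))");
    }
}

// Post-deploy data migration: backfills existing rows into the new table
// while the application is already up and serving traffic.
class BackfillUsersByCountry implements MigrationSketch.Migration {
    public int getVersion() { return 4; }
    public void execute(Session session) {
        Statement select = new SimpleStatement("SELECT country, username, email FROM users")
                .setFetchSize(500); // page through the old table instead of loading it all
        for (Row row : session.execute(select)) {
            session.execute(
                    "INSERT INTO users_by_country (country, username, email) VALUES (?, ?, ?)",
                    row.getString("country"), row.getString("username"), row.getString("email"));
        }
    }
}
```

In the post-deploy window, queries against `users_by_country` may return partial results until the backfill completes, which is the uptime versus consistency trade-off described above.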

For future work, we plan to leverage schema update events from the database so that we are 100% sure a change has been propagated to all nodes and there is no need for a waiting mechanism in the migration implementation. Here is a JIRA ticket that addresses this issue.
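
Until then, the waiting mechanism can be as simple as polling the schema agreement check that the DataStax Java driver already exposes. A small sketch; the retry count and sleep interval are arbitrary:

```java
import com.datastax.driver.core.Cluster;

class SchemaAgreementWaiter {
    // Polls cluster metadata until all reachable nodes report the same schema
    // version, or gives up after roughly 30 seconds.
    static boolean waitForSchemaAgreement(Cluster cluster) throws InterruptedException {
        for (int attempt = 0; attempt < 30; attempt++) {
            if (cluster.getMetadata().checkSchemaAgreement()) {
                return true;
            }
            Thread.sleep(1000);
        }
        return false;
    }
}
```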

Of course, there is no perfect tool for all possible use cases, but this one helped us a lot and made our lives much easier by handling constant schema updates on a live prototype system. If this sounds like a viable solution for your problem, head over to our GitHub page and try it out. Please send us feedback, report any issues you run into, or contribute to the project if you have an idea for improvement.