Over the last four years we have been using Flyway to migrate our databases in a large client project. The project involves 100+ people in multiple sites. Over time the application and the data grew rapidly. As consequence, the database must evolve to cope with the changes of the application. There were some challenges while we were migrating the databases in the staging environments. In this blog post I like to share how we migrate our databases with Flyway.
Database migration with Flyway
Develop migration scripts as code
As startup we define the domain model in a set of JPA entities. Also we use JPA annotation to constraint the data type for not nullable, max. length, etc. Then we use the JPA mapping-tool to generate the DDLs corresponding to the JPA entities and setup the initial Database with the DDLs. This set of DDLs are checked into Git repository together with the source code. When a developer makes a change to database, he must generate the DDLs again, then creates a change set to the database. He can see the changes in Git the differences between the current DDLs and the previous ones. Based on the differences the developer creates some scripts for database migration. The scripts migrate the database schema and the existing data as well.
We develop in scrum and deliver the release package (I will name it the increment) at the end of the sprint. The following diagram illustrates the scripts in different increment.
When a sprint ends, our jenkins release build packs the Flyway scripts into the new release package. Usually the scripts are placed in directories which we have discussed and agreed with the DevOps. The DevOps configure the Flyway to scan in the predefined directories. Before the deployment of the new application, they run Flyway to execute the found scripts. In this way the database migration works in a controlled manner.
Our workflow to migrate database with Flyway
We have a classic staging process to promote our increment to the PROD (production) stage. The following diagram is a simplified version of our development process from DEV to PROD. As in the diagram we have three stages, DEV, TEST and PROD.
At the end of the sprint we deliver our new increment to the lowest testing environment (DEV). The DevOps receive the delivery and deploy the increment on DEV for testing. After a while if this increment is tested and approved, then they deploy it to TEST. We do not wait for the test result of the out-going increments, instead we continue to develop a new version, then deliver it in the next sprint. If an issue is found by tests in the staging environments, we will fix it in the same git branch. Based on the fix we build a new increment, and deliver it as hotfix to the same stage.
Diagnosis of the database with Flyway schema history
After the Flyway migration is finished, we can see the change history in the schema history table. The schema history records all important information of the executed Flyway scripts. The following diagram shows the simplified schema history of the environment DEV, TEST, and PROD.
When you get a broken database and you are supposed to repair it, the information in the schema history can be extremely helpful. Since we automate the database migration with Flyway, we can easily find out what and when happened to the database, then quickly identify the irregularity. Sometimes, it can be tricky if there was previously hotfix deployed on the environment, because the schema history contains the extra hotfix scripts. In the next section, I will explain the treatment for hotfix migration.
Hotfix with Flyway
From time to time we find issues on a certain staging environment. As mentioned above we fix issues by hotfixes. We confronted some challenges in hotfix development. One challenge for instance was to keep the database schema history of each stage environment in the right chronological order.
Generally we begin to fix issues in the highest stage, then try to populate the database changes to the other staging environments. We transform the scripts with some techniques to avoid the side effect of repeated execution.
Fix on the issue environment first
Usually if an issue appears on a stage environment, we solve that issue on the same git branch, then deliver hotfix to that environment. It’s the easiest and the most time efficient way for us.
Assuming that we detect an database relevant issue on PROD with version 7, then we take care of it with a hotfix v7.1 which contains some scripts. By the time application increment of version 8 reaches the PROD, Flyway can recognize that the script version in v8 is higher than version 7.1 on PROD and execute the scripts of v8 on top of v7.1. For Flyway it is important the script version should always be ascending, and not descending.
Merging Flyway scripts from hotfix back to main stream
After deployment of the hotfix v7.1 we have done a database change on PROD, which is absent on DEV and TEST. How can we make the DEV and TEST DB adherent to PROD? The short answer is to merge the scripts in v7.1 back to main stream.
The diagram above visualizes the merge from hotfix v7.1 to main stream which is v10. At first glance it seems simple, since we can cherry-pick the hotfix commit and move the scripts into main stream. Actually, a simple copy-paste will not work, because Flyway only executes the scripts on top of the highest version. We can recall that the top version in database schema history in TEST is version 8 and in DEV version 9. Flyway will notice that the v7.1 scripts have lower version than v8 or v9, thus ignore them on both DEV and TEST. Again, the default behavior of Flyway is executing the scripts in an ascending order.
Now let me explain how we merge the scripts.
- At first we merge the scripts in hotfix to the development branch with a higher version number
- Secondly we rewrite the scripts so that they are rerunnable and idempotent. Specifically the merged scripts are capable of discovering whether the same thing has been done on the database, then skip themselves correspondingly. When the new increment is afterwards deployed on the PROD, these “new” scripts will skip themselves, as the hotfix has been deployed there previously.
Zero downtime while migrating database with Flyway
As our application is 24×7 online, there is no downtime on the production environment permitted. It means when we are migrating the database, the application must stay live.
Blue green deployment for zero downtime
The strategy that we follow is so called blue-green-deployment. That is to say we maintain 2 servers on every stages simultaneously. The following diagram describes how blue-green-deployment works. As showed in the diagram the two servers share a database.
Before the deployment, the router reroutes the requests to the blue server with a newer version, while the green server with a older version stands by as backup. When a new version is released and delivered to this stage, we deploy the new version to the green server. We generally migrate the database before the deployment.
After the migration on the database and the deployment on the green server are finished, we reroute the requests to the green server. Then this server begins actively serving the requests and we call it now the active server. Next time when another new release is delivered to this stage, it will be deployed to the blue server. We repeat the same process over and over.
2-phase database migration to support blue and green server
Since we let the blue and green servers share a database, this database must support both servers after each migration. That way, we can switch back to the server with older version, if the new deployed version does not work out. We apply a so called 2-phase migration technique for this use case. For instance, say we want to drop a column PHONE from a table USER. The diagram beneath illustrates what we do:
- At phase one, we remove the references to the column in application v6. There is no schema change. The records inserted by v6 will not contain value for colum PHONE. In case we need to switch back to v5, we create a trigger, which fills the column PHONE with some value. So the server with v5 will still work with the newly inserted data.
- At phase two, we drop the column PHONE from the table USER. The database schema changes. But it is totally fine, because the both servers (v7 and v8) have no references on the column.
Results of the migration with the zero-downtime requirement
- The blue and green server share a database on each stage, while this database supports both servers at any time.
- We have always one server which serves request actively, and the other server stands by.
- In case of failure of new application, we can switch the passive server back online. But we do not undo the database migration.
Environment specific database migration
Normally we set up our databases on every environments using the same migration scripts. Sometimes, this will not work properly, while certain configuration of database table differs from environment to environment. For instance we have a column LOCALE in table CUSTOMER for language preference. Then we write the initial migration script with the default value ‘EN’ for that column, which is correct for the environment in western countries. But it is not always true for the environments out side of Europe. On the environments of non-English speaking countries such as China the default value for that column shall be ‘CN’. The dilemma is that it will be overkill to design a script to work differently from environment to environment.
Luckily we have an option from Flyway for that situation. We copied the original script in a different directory and adjusted it according to the target location, then configured the Flyway in China environment to scan in this directory for scripts. That way the migration scripts in that directory can set the column LOCALE correctly depending on the environments.
This solution with environment specific directories enables Flyway to resolve the local dependency in database schema. However this approach should only be used sparingly, because it can lead to different schema history in different environments, and the differences in schema histories will complicate the analysis on database.
Migrating large table with Flyway
Some of our tables in the production database contains tens of millions records. Those large tables make the migration more complicated than we expected. In short, it takes a longer time to migrate the large tables. Since each Flyway script runs in a transaction, we need bigger disk space to enable the rollback. When we are migrating the large tables, the other accesses (from the application) to that large tables are blocked. In worst case the accesses reach their time-out and fail, before the migration is finished.
Migrating data of large table
In the practice we will analyze the large table, before we migrate the data. We always organize a test run on a copy of the real PROD database, in order to see how long the migration takes. If a migration to one large table is a long running process, say over 30 seconds, we just take that script out of Flyway, optimize it and execute it manually. Additionally, we must apply the divide and conquer principle. That is to partition the data in proper portions, then migrate each portion in a commit. We define the threshold as 30 seconds, because the application transaction times out after 30 seconds.
Migrating schema of large table
It is a little bit scary to touch a large table, since you are afraid of making damage. Nevertheless, we have to alter the schema or it can not support the application change. If the alteration is a REORG pending operation on database, then we must do a REORG on that table. No DBA would approve such a REORG on a large table of the PROD database, because it will completely block the access to the table for a long time, which violates the zero downtime policy.
At the time being, we are practicing a concept to solve this problem. What we do is to rename the large table say USER to USER_OLD, then immediately create a new empty table USER with the same name and the proper schema we wanted. After that we incrementally transport the old data from table USER_OLD to the new table USER. The trick is that we must do the rename and creation in a single transaction. That way the schema migration is gracefully done, and the application will not even notice the table has changed.
Flyway helps us manage the complex database migration on all staging environments in a controllable way. We develop the migration scripts and include them in the delivery package. Then, the DevOps take care of the rest on the staging environments. While we still have to work with the DBA very closely, it is not a burden any more for a developer to take care of the database migration. If an database issue shows up, the Flyway schema history is a convenient instrument to check the database status.
Besides the routine database migration with Flyway, the following points are the adjustments due to the challenges in our project.
- The merged hotfix scripts must be idempotent, while merging the hotfix back to main stream.
- We apply the blue-green deployment and 2-phase database migration so that the application has zero-downtime.
- Migrate the large tables separately