
Friday, March 15, 2013

Data migration strategies and design patterns

Data migration is an extremely common operation in software design and development. Whenever a new system is introduced or a legacy system is redesigned, existing data has to be moved from the legacy system to the new target system.

From an interview perspective, you should know and understand the following well-understood design patterns and guidelines, which make the process of data migration smoother, faster, and more accurate, with the fewest surprises at go-live.



  1. Develop with Production Data: Use real data from the production system for tests during the development of the migration code.
  2. Migrate along Domain Partitions: Divide and conquer the migration effort by migrating largely independent parts of the domain model one after another.
  3. Measure Migration Quality: Implement code that collects and stores all sorts of information about the outcome of the migration during every run.
  4. Periodic Quality Reports: Generate detailed reports about the measured quality of the migrated data and make them available to all affected stakeholders.
  5. Robust Processing: To prevent the migration process from halting on unexpected failures, apply extensive exception handling to cope with all kinds of problematic input data.
  6. Data Cleansing: To prevent the new application from being swamped with useless data right from the start, enhance your transformation processes with data cleansing mechanisms.
  7. Incremental Transformation: Perform an initial data migration before the new application goes live. Migrate data that has changed since then immediately before the new application is launched. 
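Pattern 7, Incremental Transformation, is commonly implemented with a "watermark" query: after the initial full migration, each run copies only the rows changed since the previous run. A minimal sketch, assuming an illustrative `orders` table with an `updated_at` column (names are placeholders, not from any specific system):

```python
import sqlite3

def migrate_increment(source, target, watermark):
    """Copy rows changed since the last run and return the new watermark."""
    rows = source.execute(
        "SELECT id, customer, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    for row in rows:
        # Upsert so re-running the same increment is safe (idempotent).
        target.execute(
            "INSERT OR REPLACE INTO orders (id, customer, updated_at) "
            "VALUES (?, ?, ?)",
            row,
        )
    target.commit()
    # The highest timestamp seen becomes the watermark for the next run.
    return max((r[2] for r in rows), default=watermark)
```

Run this repeatedly during the cut-over window; the final run, executed immediately before launch, picks up only the last few changes and therefore completes quickly.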
 A few other salient points to keep in mind are:
  1. During migration (extraction and load), the processes should log the following items to facilitate monitoring, debugging and verification:
    • Global and per entity start time
    • Global and per entity end time
    • Number of entities to process
    • Entity
    • Source table(s) and schema
    • Destination schema
    • Number of records processed
    • Errors/Warnings encountered
  2.  Sensitive data attributes should be anonymized before test data is processed by the "load" systems. The anonymization process should use well-defined values that the test team is aware of. At a minimum, anonymize:
    • Email
    • Phone numbers
    • Passwords
    • IP Addresses 
  3.  Build automated tools for analyzing the logs generated during migration. Logging, combined with automated log analysis, has several advantages:
    • Potential migration issues are discovered earlier and can be fixed sooner, reducing cost and effort
    • Data inconsistency issues can be fixed sooner
    • It facilitates coordination between teams and provides a common language for analyzing and debugging
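Point 2 above, anonymizing with well-defined values the test team knows, can be sketched as follows. The field names and replacement conventions here are illustrative choices, not prescribed by any standard: emails use a reserved test domain, IPs use the TEST-NET-1 block (RFC 5737), and values are deterministic per row index so testers can predict exactly what the load system should contain.

```python
def anonymize(record, index):
    """Replace sensitive fields with deterministic, well-known test values.

    Keying the replacements by row index makes the output predictable,
    so the test team can verify the loaded data against expected values.
    """
    safe = dict(record)  # never mutate the source record
    if "email" in safe:
        safe["email"] = f"user{index}@test.invalid"
    if "phone" in safe:
        safe["phone"] = f"555-01{index:02d}"           # fictional number range
    if "password" in safe:
        safe["password"] = "TEST_PASSWORD"             # never migrate real credentials
    if "ip_address" in safe:
        safe["ip_address"] = f"192.0.2.{index % 255}"  # TEST-NET-1 (RFC 5737)
    return safe
```

Run this in the transformation step, before any record reaches a load system used for testing.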
Lastly, you might be asked to sketch on a whiteboard how you would model an ETL (Extract-Transform-Load) strategy. Here is one model that has worked for many common data migration scenarios. Since data migration is a one-time operation, the goal should be to minimize the number of transformation (and load) processes. C can be used selectively when multiple passes are required to transform the data.
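A minimal ETL skeleton that also records the per-entity metrics listed earlier (start/end times, record counts, errors) might look like this; the entity name and the extract/transform/load callables are placeholders you would supply per domain partition:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("migration")

def run_etl(entity, extract, transform, load):
    """Run one entity's migration pass, logging the metrics needed for verification."""
    start = time.time()
    rows = extract()
    log.info("%s: start, %d records to process", entity, len(rows))
    processed, errors = 0, []
    for row in rows:
        try:
            load(transform(row))
            processed += 1
        except Exception as exc:  # robust processing: one bad row never halts the run
            errors.append((row, exc))
            log.warning("%s: failed row %r: %s", entity, row, exc)
    log.info("%s: end, %d processed, %d errors, %.1fs elapsed",
             entity, processed, len(errors), time.time() - start)
    return processed, errors
```

Because each entity (domain partition) gets its own `run_etl` call, a failure in one partition never blocks the others, and the logs it emits feed directly into the automated analysis tools described above.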



What other strategies have you used? What has worked well for you in the past? What did not go well? The readers and I would love to hear your thoughts and experiences.
