Data Migration: Moving the Actual Data

There are two key elements to the exchange of any information between databases. One is the data structure used for the exchange, and the other is the transformation required to reach those structures.

Some of these are driven by the source database, others by the target database. For example, when moving data from RDBMS to NoSQL database generally requires constructing documents from what might be tabular, or joined-tabular data. This may involve both join elements on the relational side, as well as formatting on the NoSQL side. The eventual aim is to ensure that the data reaches the target database in both the right format, and without corruption, and also in a format that is most appropriate or efficient. That ultimately depends on what you are using the transferred data for.

The other aspect is the difference between source and target data types – that is the format and construction of the individual fields or components of the data. Document databases and Big Data stores tend not to care about the data type, whereas RDBMS cannot live without them.

Some important considerations for how we use this information:

  • The data structure must be retained (i.e., we must be able to identify the fields, columns or other elements of the data).
  • The data format and integrity must be maintained (the data should not be corrupted, shortened or reduced in any way).
  • The data must be able to be efficiently transferred (sending a 1GB file that only contains 15KB of valid information is not efficient).

In this chapter we’ll examine some of the key differences and problems with transferring data that transcend the mechanics of the process, and how to deal with them effectively. Although the three primary aspects, basic formatting, structural comparisons and data type limitations are handled separately here, there are few occasions when you can truly treat these elements individually. We’ll see some examples of this as we go through.

Basic Interchange Formats

When you start to move data efficiently between the different database types that you exist you will find a number of different possible interchange formats, and the primary issue with all of them is exactly how efficiently, and more importantly accurately, they enable the information to be exchanged.

First, the data will need to have been encoded with some kind of encapsulation format. This format is what describes the individual structure, and is entirely dependent on the underlying data that is being exchanged. For example, if the data has a very rigid structure then that will obviously normally enforce the format of the information. Row-based data, for example, can normally be encoded using CSV or even a fixed-width record format.

The second aspect is the encoding and formatting of the information itself. Exchanging data using CSV is fine, providing that you can correctly identify the format of the text itself. Should it be encoded in UTF-8? Or UTF-32? Is plain ASCII better? What if it contains a mix of these characters, should UTF-8 be used as the standard and the actual encoding handled by the database target when the data is imported?

In fact, many of the principles about translating information between different databases also rely on basic best practice for how you design and structure the data in the target system to begin with. Normalisation of the data structure for information can normally be abpplied to any database, even those that might have a loose or undefined structure have conventions. It’s unlikely, for example, that you will call a recipe title field ‘title’ in one record and ‘recipename’ in another record of the same database because your application will be a mess.

Of course, there are times when you may be merging, combining or otherwise consolidating data from a wide variety of different documents, records or blocks of information. There it is up to you to ultimately pick a standardisation for it to be useful to you once it’s been moved into the target system.

As a rough guide for the types of operation and translation that might take place, the following table highlights the kind of structural transformation and changes you might need to make when moving between some of the most common database environments.

Table 2-1: Structural Mappings between database environments

RDBMS Columnar Store Document Database Freetext/unstructured data store
RDBMS Vendor specific only Vendor specific only Field mappings only Application specific
Columnar Store Vendor specific only Vendor specific only Field mappings only Application specific
Document Database Field mappings only Field mappings only Vendor specific only Application specific
Freetext/unstructured data store Application specific Application specific Application specific Application specific

Notes:

  • Vendor specific only changes are those that are directly related to the capabilities of the source or target database. MySQL for example supports the ENUM and SET compound field types, whereas Oracle, PostgreSQL and SQL Server do not. Moving from one to the other may required changes.
  • Field mappings only refers to how you map the source fields or columns to the target fields/columns. Depending on the target this may include compound and/or JOIN based translation. For example, when moving from a document database to an RDBMS you might convert a compound field into a single field, or a lookup table. When translating from an RDBMS to a document store, the data might be combined using a JOIN into a single target field.
  • Application specific changes are those that will entirely depend on how you to use the information. Translating document data into freetext databases is unlikely to require any changes. But converting freetext info into an RDBMS format is going to require some significant identification and translation.

Let’s dig into some more of the specific challenges.

Row-Based Data

For row-based data, the information can generally represented and formatted as one of the regularly used and displayed formats, such as Character Separated Values (i.e. CSV), or in a fixed width format. Row-based data (which includes the column-based data used in big data stores) is probably one of the easiest formats of data to exchange. In nearly all cases the list of columns is usually pretty well fixed and the format of the data is well known because the columns are fixed.

Character Separated Values (not Comma-separated values) is one of the oldest methods of exchanging fixed format data like this, it was often used as the only available method for exchanging information in a reliable fashion. Historically most tabulated data like this tended to be financial information, and so the content and format of the information was relatively simple. As such, the most common format was to use carriage-returns (or the operating system equivalent, which could be newlines or carriage-return and newline characters) to separate the records, while the individual fields in each row of data were separated by a comma (hence Comma-Separated Values as the CSV).

For example:

1085,Creamy egg and leek special,,4,1,0,0,0.0,0.0,0
87,Chakchouka,A traditional Arabian and North African dish and often accompanied with slices of cooked meat                      ,4,1,0,0,0.0,0.0,0
347,Mozzarella and olive pizza,A very simple pizza base made without yeast topped with traditional Italian ingredients. Look out for a low fat type of Mozzarella or Cheddar cheese if a low fat diet is being followed.,4,1,0,0,0.0,0.0,0
720,Savoury pancakes,Basic pancake recipe. Allow 30 min standing time for batter.,8,1,0,0,0.0,0.0,0
477,Seafood filling for pancakes,,8,1,0,0,0.0,0.0,0

The problem with commas and carriage-return characters is that, as computers got more complex, and the data they stored got equally more complex, how do you determine between a comma in some text, and a comma separating a field? What if you transfer a text string that contains a newline or carriage return. You don’t want that interpreted as the end of the record if it happens to part of the field. The initial solution is to use some kind of further delimiter. For example, using double-quotes:

"1085","Creamy egg and leek special","","4","1","0","0","0.0","0.0","0"
"87","Chakchouka","A traditional Arabian and North African dish and often accompanied with slices of cooked meat                      ","4","1","0","0","0.0","0.0","0"
"347","Mozzarella and olive pizza","A very simple pizza base made without yeast topped with traditional Italian ingredients. Look out for a low fat type of Mozzarella or Cheddar cheese if a low fat diet is being followed.","4","1","0","0","0.0","0.0","0"
"720","Savoury pancakes","Basic pancake recipe. Allow 30 min standing time for batter.","8","1","0","0","0.0","0.0","0"
"477","Seafood filling for pancakes","","8","1","0","0","0.0","0.0","0"

This doesn’t fix the problem, it just diverts your attention for long enough to realize that now what happens if one of the delimiting characters needs to be used in the text? We could escape it, by prefixing it with a backslash:

"700","Garlic mushroom kebabs","The longer you leave these mushrooms to marinate, the better they will taste.\nGood for barbecue.","8","1","0","0","0.0","0.0","0"

But now we’re getting complex, both to read and write the information, the level of complexity is increasing to point of introducing further possible methods of corrupting the data as it gets transferred.

The alternative is to use a different delimiter that is unlikely to be used within the text in any form. Hadoop in fact follows this model, using the hex characters, 0x01 and 0x0A to delimit records and fields. As binary characters these are unlikely to be used in what is normally human-readable text. Of course, once you start transferring binary data, you need to find another method, such as hex-encoding binary data.

"700","Garlic mushroom kebabs",VGhlIGxvbmdlciB5b3UgbGVhdmUgdGhlc2UgbXVzaHJvb21zIHRvIG1hcmluYXRlLCB0aGUgYmV0dGVyIHRoZXkgd2lsbCB0YXN0ZS5cbkdvb2QgZm9yIGJhcmJlY3VlLgo=,"8","1","0","0","0.0","0.0","0"

The other alternative is to use a fixed width format. This has the advantage that providing you know the exact widths of the individual fields, encoding errors are eliminated because the characters are no longer significant in the format of the information.

The downside of the fixed-width format is that the size of the fields, records, and ultimately files, can become prohibitively large if you are exchanging potentially large or unlimited fields. For example, BLOB types in most databases can be MB or GB in size; expressing that in fixed width format is obviously not a practical solution.

Record-Based Data

Record based is information that may not necessarily be identifiable or resolvable by an easy to use row specific format or structure such as that used in CSV exchange. Complex table data, or information that that is made up of a combination of fixed fields and BLOB fields, for example, is unlikely to reliably, or efficiently, transferred. The problems of character and structural formats will ultimately make using that information difficult or computationally expensive when actively sharing the information – for example by making the file sizes too big to be practically exchanged.

A very typical example of record based information is either information from a document-based database, a free texrt database, or where the information that makes up the majority of the content is in fact really an attachment or noit inline field and database data. Think of an email message; the message, the address, from, subject are all examples of easily identifiable and classifiable database information. But what do you do with an attachment that might have been sent along with the recored?

How about documents generally? Metadata about those documents could be represented in a typical RDBMS row, but not the document itself. But the combination of the two – the metadata and the actual document together make up a ‘record’ that you may want to effectively share all or part of with another database.

When transferring record-based data, think first about what constitutes the record and how that can be represented in the different environments. Then move on to understand how the fields and individual data points can be translate into a format suitable for the target database. With record-based data, it may be that you have a massive volume of data and compound records that when move from a document store to a transactional RDBMS require 20, 30 or 200 rows of data to be represented properly; this is not a problem, providing you find a method for identifying all of the row data that refers to the record is handled correctly.

In general with a record based database the easiest approach is to actually translate the data at the source into something that can be imported directly into the target format. For example, from a record-based environment you can generate four different blocks of CSV import data, one for each table and portion of the source information.

The opposite is actually also true; when converting data from a column or table store into a record based format, it normally makes sense to do this on the basis of translating the key data into the new structure before doing the transfer. As a general rule, either use the native target format if you can, or make use of representable formats such as JSON to do the actual encapsulation of the information. Many record or document-based environments already use JSON, or a format similar to this.

{
    "title": "Fried chilli potatoes",
    "preptime": "5"
    "servings": "4",
    "totaltime": "10",
    "subtitle": "A new way with chips.",
    "cooktime": "5",
    "ingredients": [
        {
            "ingredtext": "chilli powder",
            "ingredient": "chilli powder",
            "meastext": "3-6 tsp"
        },
        {
            "ingredtext": "potatoes, peeled and cut into wedges",
            "ingredient": "potatoes",
            "meastext": "900 g"
        },
        {
            "ingredtext": "vegetable oil for deep frying",
            "ingredient": "vegetable oil for deep frying",
            "meastext": ""
        }
    ],
}

One final consideration is those situations where there is no structure – or the structure is so ephemeral or complex that there is no way to map information. You cannot, indeed should not, either impose a structure just for the sake of one, or inversely rely on sub-standard features in the target database just because it makes your life easier.

Some examples of this include trying to extract key fields or information from unstructured data that are complicated either to identify, or to map back to the original. Extracting a telephone number from a massive text string just because you can does not mean that the telephone number you have extracted is really the one that should be associated with this record in your database. Equally, relying on full-text searching engines within RDBMS environments can be problematic.

Is that a Column or a Field?

Not all fields and columns are created equal, and many of these difficulties come from the complexities or abilities of the database environment being used. Specifically, at which point do you treat a fragment of the data that you are dealing with as a column, or a field, or even just a uniquely identifiable piece of information?

As you move between different environments, the differences become more pronounced or more difficult to identify. True columnar stores, for example, tend to have a significantly reduced number of datatypes and support, and that often limits your ability to store certain values and information. For example, compound values, or specialist types, such as XML, GeoData and spatial points may be storable in one database but not another.

Consider this record, taken from a CouchDB (JSON document) database:

{
   "title" : "Chicken Curry",
   "Keywords" : [
      "Indian",
      "Chicken",
      "Okra",
      "Spicy"
   ],
   "id" : "8023754"
}

Now is the ‘Keywords’ compound object in the document a field, or is it a column? In MySQL we could translate this into a SET datatype, a special datatype, providing we knew what all the possible values for that column are. In Oracle, a field that has multiple possible values like this would normally either be split into separate columns as a bit or boolean value, or it would converted to a lookup table, as in the diagram below.

 

Depending on your use case, within a strict columnar environment such as Cassandra or HP Vertica you might actually consider going the other way and repeating the information with the keyword data in individual rows, like this:

dbadmin=> select * from recipes_kwbytext;
   id    |        title         |     kw
---------+----------------------+------------
 8023754 | Chicken Curry        | Indian
 8023754 | Chicken Curry        | Chicken
 8023754 | Chicken Curry        | Okra
 8023754 | Chicken Curry        | Spicy
(4 rows)

With a column store this can be more efficient if what you are looking for is patterns in the data, because repeated data like this is easy to extract and identify. In this case, what we’ve done is convert something that is a compound field in a document store into multiple rows with the same ID in a column store. This solution can also be used in environments where there are no JOIN operations, or a JOIN is expensive, but where the information is still required at each level. Good examples here are many of the document stores and structureless environments such as Hadoop.

Can you Bypass Datatypes?

Another temptation when translating data between very different database formats is simply to ignore the formatting, and especially the very strict datatypes, that might normally define the data being stored.

This is particularly a problem within those environments where there may be a very limited set of datatypes to work with and can be application specific. For example, the Hive database environment within Hadoop is reqally little more than a thin veneer over a text format used to store the actual data. When you define a table within Hive and then select the rows from the table, Hive parses each row and uses that to display the value in the corresponding format.

This can cause problems for certain data, for example, numbers that are too big, dates that don’t match the very limited set of date formats supported by the Hive parser. In the long term, this causes corruption of the data that you have transferred.

For this reason, some people choose to create tables within Hive that use the Text datatype to display the information rather than the true underlying Integer or Floating Point value as it ensuires the raw value, not the interpreted value will be used.

The same process can be used when moving data; extract the ‘raw’ value rather than hope the source or target database will interpret, store and display the information in the right format.

If you are only sharing or displaying the information in the new target database then there is probably no reason to worry. If you start processing or actively using the data, this is where corruption can occur if you are not storing the information correctly. For example, if an integer is stored and then incremented, you want 10,000 to become 10,001, not 100001.

The bottom line, you can bypass the datatype, but probably shouldn’t if you hope to use the information in the format in which you’ve decided to store it. If you have a datatype, and can identify it from the source material, then use it if the target environment can handle and interpret it correctly. See the notes later in this chapter on limitations in different environments.

Optimization and Performance

Irrespective of the the database environment and the reason for you moving the data, the end goal should always be to move data into a target in a format that will be efficient to use at the target end.

In many document or unstructured systems, or those with very flexible storage mechanism such as Hadoop, the performance will often be predicated not on the structure of the information, but what information is closest to you, or what can be pre-mapped or organized through a map reduce or index generation exercise.

Conversely, RDBMS require highly structured and organized data structures both with and without indexing to provide the best performance. Columnar stores are often much more efficient if you can logically group or sort information together. Some will handle this automatically for you, otherwise are more efficient if you can pre-determine the distribution of the data on which you are most likely to sort and query on. That might mean that when you transfer the data, you sort the generated file by that column or columns before loading. In some extreme examples it may be that you load the data in an unordered format and then move again into a new table with the right column structure.

Don’t be afraid of making the wrong decision, because you can often sort this structure out during a secondary or tertiary stage, but equally don’t ignore it. Having to parse or process large bodies of data a second or third time will be impractical if you are sharing or replicating data compared to single, isolated, dumps.

Ensure Two-way Validity

Without good reason, you should always try and avoid making any kind of translation of format that cannot be either reversed, undone, or translated back into it’s original format, even if that might make the process a little more complicated. Remember that data is often a living organism when being actively used and employed. Therefore doing too much to format, combine, extract or otherwise manipulate the information can then make it difficult to be used again elsewhere.

Note that this is not about normalization. Normalization in typical database parlance means finding the right, fixed, database type for the field data, making it the most efficient choice, and understanding the limits and structure of the data so that you can decide whether a field should be 10 bytes or 12 bytes long. Doing this normally results in identifying the data structure, lookup tables, relations and joins so that you have the right structure. In this context, normalization is really about making the data look like a typical table structure in an RDBMS; normalization for document databases is entirely different. Normalisation for data interoperability is another level still, and we’ve already seen a number of different examples of that.

Instead, think about the quality of the data and how it should be used, while keeping in mind that the structure required for efficiency in an RDBMS may be completely different to the efficient storage of the same information in a document DB, or when resolved down to a text fragment in an unstructured data store.

To return to a well-trodden example, in the section ‘Is that a Column or a Field?’ we looked at the translation of compound values into single or multiple fields. Using a comma to separate the potential values means that we could split the value back out. If the field had been correctly translated either to boolean columns or a linked table is easier to translate back again into a whole variety of formats.

When representing a compound type, think about how you would reverse the structure so that it could be used the other way round. For example, if you decide to dump the information out to another table or column structure, make sure that you add identifiers so that you can track the record it came from (which you’ll probably need anyway), and can reformat it back into that format. If you’ve split it out into multiple rows in a columnar store, make sure you know how to combine it back and deduplicate the information again if you need to send it the other way.

The compound types are the most complex single field type here because there are so many opportunities for you to mess up the translation, but the same is also true for basic structural information, and even more so if you decide that you only want to transfer a smaller number of fields of data from one database to another. Either transfer everything, or transfer what you need and include information (like a unique identifier) so that you can associate it back with the data you extracted it from. Once you’ve lost the context of the information, it can be impossible to get it back and the result is a useless dataset.

Database Metadata

We’ve concentrated very heavily on the actualy data you are storing and want to work with, but what about metadata:

  • Sequences, current auto-increment values?
  • Whether you transfer or record these is going to depend on exactly how you want to use the information when it reaches your destination (or comes back).
  • How about the definition of the structure that you are using? Do you want to be able to share and use that? What happens when the structure changes. Do you want to track and identify those changes?
  • When doing a one-time export of information from one database to another you can be sure about the structure and what you expect to get from it. But what happens when you repeat the same export multiple times? Or when replicating?
  • If you are sharing data between two different systems and integrating them, knowing the sequence number may be irrelevant unless you can synchronize the generation of the number so that it can be used by the multiple databases in an effective manner. Perhaps your databases could use a better unique identification method, rather than relying on a monotonically increasing sequence number, such as UUIDs or using a central identifier registry?

How to address these different problems will be covered in later chapters, but it’s important to think about it here as it has a knock on effect to other areas. For example, when moving unstructured or document based databases into multiple separate tables, you need to identify and tie that information together, where a UUID is important, and it therefore becomes a critical part of the data structure that you swap.