Data Migration: Mapping the Data

Posted by

When moving the data between different databases the primary considering is what that’s going to look like so that it can be used in the target environment. Later chapters are going to dig deeper into this topic, but let’s fly over some of the key considerations here.

Mapping Columns to Tables

If we were replicating this data from our existing RDBMS into another, the most obvious method is for us to simply move the tables wholesale from one environment to the other. If they both support table structure, then there is no reason not to duplicate this structure on the other side.

 

But, always be conscious of how the data is going to be handled over on the other side. If your target database does not support joins between tables, as some Hadoop alternatives do not, then you will need to determine whether you are better to merge the table data together, either into a bigger table and either duplicate the information, or hide it.

For example, if you wanted to analyse which recipe has the most contents made of chicken, then you could combine the tables together from the transactional side into a pre-joined table that contains the detail. For example, converting our three tables, ingredients, recipe ingredients, and recipes, into something like this:

|     3006 | Tabboule            | <olive oil>         |
|     3006 | Tabboule            | <salt>              |
|     3006 | Tabboule            | <onion>             |
|     3011 | Minted pea puree    | <olive oil>         |
|     3011 | Minted pea puree    | <frozen peas>       |
|     3011 | Minted pea puree    | <seasoning>         |
|     3011 | Minted pea puree    | <butter>            |
|     3012 | Lamb patties        | <ground coriander>  |
|     3012 | Lamb patties        | <ground cumin>      |
|     3012 | Lamb patties        | <ground turmeric>   |

Now we have the information from the ingredients table merged with the recipe ID and title from the souce database. Over in our destination store, performing a count or summation operation will now be an easier way to enable us to do the query. In fact, with a single table structure like this some operations are quicker and provide the information we want. For example, find all the recipes with frozen peas in them is a single (possibly indexed) table:

+----------+--------------------+---------------+
| recipeid | title              | description   |
+----------+--------------------+---------------+
|      984 | Waffle fish pie    | <frozen peas> |
|      633 | Vegetable korma    | <frozen peas> |
|       27 | Spicy tuna stew    | <frozen peas> |
|     1261 | Seafood paella     | <frozen peas> |
|      902 | Choux au gratin    | <frozen peas> |
|      866 | Tomato baked rice  | <frozen peas> |
|     1971 | Spicy risotto star | <frozen peas> |
|     2741 | Cheat's jambalaya  | <frozen peas> |
|     2750 | Spicy sausage rice | <frozen peas> |
|     2778 | Quick jambalaya    | <frozen peas> |
|     3011 | Minted pea puree   | <frozen peas> |
+----------+--------------------+---------------+

In a columnar store this can be orders of magnitude faster than a join across the three source tables, and still provides us with the core of information we want to display – the recipe id and title.

Mapping Columns to Documents

Moving the data over verbatim as tables is unlikely to work as efficiently as you think. For example, in a NoSQL database, joins are normally either impossible, or computationally expensive, and so expecting to be able to reconstitute the structure in the destination database. You also want to make sure that you are using the most efficient method for the database system. Simply putting a JSON representation of each row from an existing table or column store is probably not going to work, especially as JOINs are typically unavailable. A better solution is merge the data from multiple tables itno a single document that contains all of the information you need.

Doing this for table data to be inserted into a document store normally requires the composition of a document from the constituent elements, in the case of our recipe database, the recipe, method, ingredients, and nutritional information needs to be combined together into one big document. There are a variety of ways to do this, but an obvious solution is to logically group ‘objects’ together. That is, an object that might be represented by a collection of tables. Like this:

Fig0201.png

Within our recipe data, for example, in a document store the use case is for us to extract or remove the entire recipe – base data, ingredients, and methods – as a single document that contains all the information we need. This puts all of the information in one document, and makes it easy to update and format as that entire recipe at a time. We can actually see a sample of this, first by looking at the diagrammatic example, here with some dummy data, but you can see how the tables on the right can be mapped to fragments of the document on the left.

Document and Table Mapping

We can also look at a simple script that performs this operation for me, here collecting the recipe object (which queries the underlying database) and then converting that into a JSON structure for writing into the database:

use JSON;
use lib 'cheffy.com';
use Foodware;
use Foodware::Public;
use Foodware::Recipe;

my $fw = Foodware->new();

my $recipes = $fw->{_dbh}->get_generic_multi('recipe','recipeid',{ active => 1});


foreach my $recipeid (keys %{$recipes})
{
    my $recipe = new Foodware::Recipe($fw,$recipeid,{ measgroup => 'Metric',
                                                 tempgroup => 'C',});

    my $id = $recipe->{title};
    $id =~ s/[ ',\(\)]//g;
    my $record = {
       _id => $id,
       title => $recipe->{title},
       subtitle => $recipe->{subtitle},
       servings => $recipe->{servings},
       cooktime => $recipe->{metadata_bytag}->{totalcooktime},
       preptime => $recipe->{metadata_bytag}->{totalpreptime},
       totaltime => $recipe->{metadata_bytag}->{totaltime},
       keywords => [keys %{$recipe->{keywordbytext}} ],
       method => $recipe->{method},
    };

    foreach my $ingred (@{$recipe->{ingredients}})
    {
       push(@{$record->{ingredients}},
            {
               meastext => $ingred->{'measuretext'},
               ingredient => $ingred->{'ingredonly'},
               ingredtext => $ingred->{'ingredtext'},
            }
           );
    }

    print to_json($record);
}

Finally, let’s look at how this is all translated into a full JSON representation of the same information:

{
   "subtitle" : "A good use for bananas when they're going soft.",
   "keywords" : [
      "diet@vegetarian",
      "diet@corn-free",
      "occasion@entertaining",
      "diet@citrus-free",
      "occasion@kids' parties",
      "diet@demi-veg",
      "special collections@lunchbox",
      "meal type@cakes, biscuits, sweets",
      "special collections@classic recipe",
      "diet@peanut-free",
      "cook method.hob, oven, grill@oven",
      "special collections@store cupboard",
      "diet@yeast-free",
      "special collections@budget",
      "occasion@prepare-ahead entertaining",
      "main ingredient@fruit",
      "occasion@picnic",
      "diet@shellfish-free",
      "special collections@cheffy recommended"
   ],
   "preptime" : "20",
   "servings" : "8",
   "cooktime" : "45",
   "method" : [
      {
         "_sort" : "4",
         "recipeid" : "2035",
         "step" : "4",
         "altgroup" : "0",
         "methodstep" : "Spoon into the loaf tin. Spoon the top. Bake for 45-50 min or until well risen and cooked through. ",
         "text_formatted" : "Spoon into the loaf tin. Spoon the top. Bake for 45-50 min or until well risen and cooked through. "
      },
      {
         "text_formatted" : "Slowly beat in the egg. Add the banana. Fold in the flour and bicarbonate of soda. Add the dried fruit. ",
         "methodstep" : "Slowly beat in the egg. Add the banana. Fold in the flour and bicarbonate of soda. Add the dried fruit. ",
         "_sort" : "3",
         "recipeid" : "2035",
         "altgroup" : "0",
         "step" : "3"
      },
      {
         "recipeid" : "2035",
         "_sort" : "2",
         "step" : "2",
         "altgroup" : "0",
         "text_formatted" : "Cream the butter and sugar together until pale and fluffy. ",
         "methodstep" : "Cream the butter and sugar together until pale and fluffy. "
      },
      {
         "recipeid" : "2035",
         "_sort" : "1",
         "altgroup" : "0",
         "step" : "1",
         "text_formatted" : "Preheat oven to 180&deg;C. Grease a 900 g loaf tin.",
         "methodstep" : "Preheat oven to 180.C. Grease a 900g loaf tin."
      },
      {
         "text_formatted" : "Turn out onto a wire tray. Leave to cool. ",
         "methodstep" : "Turn out onto a wire tray. Leave to cool. ",
         "_sort" : "5",
         "recipeid" : "2035",
         "altgroup" : "0",
         "step" : "5"
      }
   ],
   "totaltime" : "65",
   "_id" : "Bananacake",
   "title" : "Banana cake",
   "ingredients" : [
      {
         "ingredtext" : "butter",
         "meastext" : "75 g",
         "ingredient" : "butter"
      },
      {
         "ingredtext" : "bicarbonate of soda",
         "meastext" : "[sup]1[/sup]/[sub]2[/sub] tsp",
         "ingredient" : "bicarbonate of soda"
      },
      {
         "meastext" : "2",
         "ingredient" : "bananas",
         "ingredtext" : "ripe bananas, peeled and mashed"
      },
      {
         "ingredtext" : "white self-raising flour, sifted",
         "ingredient" : "white self-raising flour",
         "meastext" : "200 g"
      },
      {
         "ingredtext" : "salt",
         "meastext" : "1 pinch",
         "ingredient" : "salt"
      },
      {
         "ingredtext" : "egg, beaten",
         "ingredient" : "egg",
         "meastext" : "1"
      },
      {
         "ingredient" : "dried mixed fruit",
         "meastext" : "100 g",
         "ingredtext" : "dried mixed fruit"
      },
      {
         "ingredtext" : "caster sugar",
         "ingredient" : "caster sugar",
         "meastext" : "100 g"
      }
   ]
}

Looking at the output, you can see how the structure of document has been merged together into something more usable. We have a block for ingredients, keywords, method, and you can see how the ‘ingredient’ field in the ingredient block could be used as a searchable element.

The primary questions about the format come will come down to how the data will be used. In CouchDB and Couchbase for example a map/reduce like process will be used to create an index on the information. Choosing the right structure is therefore about understanding the structure and how it will be used. When I come to build an index on this information, and I want to build an index so that I can query by ingredient, is this an effective method to format the data? Is processing all of that data to determine the occurrences of turkey the most efficient method?

How about if I realize that all my vegetarian recipes are really vegan, will that change the structure? We’ll return to these questions later in the book.

Handling Compound Assignments

While we’re talking about constructing documents from tables and vice versa, I want to consider those pesky compound values again. Compound values are hugely complex when it comes to data translations because there are differences within a database type as well as between types that should be considered. Always with a compound type you should start by asking three basic questions:

  • How will it be queried in the target database?
  • Can I have multiple values within the same field?
  • Is the order of those multiple values significant?

One translation you want to avoid is to convert this relatively structured format into something that ultimately becomes hard to process. For example, the temptation is to convert this ‘field’ from the source CouchDB environment into a format that looks similar to the original, for example, by using a comma-separated list:

Indian,Chicken,Okra,Spicy

There are so many dangers with this, it’s hard not to see how this is a bad idea. Some obvious problems are:

  • How do we cope with an ever-increasing number of potential options? We’re showing just four here, what if there were 20? 30? 60?
  • What happens if more data than the width of the column are defined? If it’s truncated we lose information. Using a wider column only works for so long.
  • Indexes become unmanageable from a database adminisitration perspective.
  • How is integrity enforced? Should the options be sorted alphabetically? What happens when they get updated?
  • What happens if we want to change a value? What happens if ‘red’ becomes ‘scarlet’, do we change every row?

The biggest problem with this approach is the usability once that value is used in a transactional or columnar store, the data now becomes difficult and potentially expensive to process. It takes a lot more computational effort for a database to search even with the use of indexes a text string that contains a sequence of letters, against either a join or boolean set of columns matching the same structure.

You can actually test this quite effectively by creating sample tables that emulate this basic functionality. MySQL is being used here to do the comparisons, but the general effects are quite clear. Here’s a search against a single text column using commas:

SELECT id,title FROM recipes WHERE keywords LIKE '%CHICKEN%';

Now here’s the same query where the colours are represented a boolean columns:

SELECT id,title FROM recipes WHERE kw_chicken = 1;

A boolean index in the latter example will be much quicker than the LIKE search in the former.

Adventures in recipes

When developing an application, and particularly the database that will support it, there comes a time when you realize you may not have planned and identified all of the problems you could, and that’s when you notice that something doesn’t quite work.

The use of free-text for ingredients in the recipe database was one such moment. Searching for a recipe that contains ‘Chicken’ is fine if you look at the ingredients, you get to pick up everything from ‘whole chicken’ to ‘chicken breasts’ within the search. Unfortunately, you also pick up ‘chicken stock’. When a user searches for chicken recipes, chicken stock is used in a surprising number of recipes that otherwise contain no chicken of any variety whatever.

When migrating data around, you can see the same principles at work; if you’d relied on using a text field to store the value over separate, stricter, fields, the quality of the data is ruined. Keep this in mind when moving data around.

The recommendations for how to work out which is the best method are actually comparatively straightforward:

  • If the list of possible values is small (for example, four or five different values) and there are few of them within the overall ‘table’ that they are been moved to, use fields/boolean values.
  • If the list is larger, and likely to grow and expand with new options, then use a lookup table.

When applying the same principles the other way round, you should use the most appropriate format for the corresponding target database in a way that makes the data usable. In document databases, for example, it’s efficient to translate a compound value into an array of sub-values, just as in our original.

Mapping Documents to Columns

When mapping data from a document-based structure back into columns, the process is significantly harder. You are specifically going from a flexible multi-field format where the number and type of fields could be entirely different into one where the format is rigid and inflexible.

There are two choices available:

  • Transform the data back into a table structure that as close as possible matches the overall data structure of the underlying information. This is great if the document is relatively flat. But if, as we’ve seen, we have compound data, or variable length compound information, this method doesn’t work as well.
  • Transform the data back out into a multi-table format that has the capability for joins. This is basically the reverse of the process we just examined for converting the table-based recipe data into documents. You must remember to use a unique identifier for the parent record so that it can be linked back properly when a JOIN is used.

The flatter the document, the easier the conversion.