Upcoming Webinar, 19th July: What is New in Tungsten Replicator 5.2 and Tungsten Clustering 5.2?

Continuent Tungsten 5.2 is just around the corner. This is one of our most exciting Tungsten product releases for some time!

In this webinar we’re going to have a look at a host of new features in the new release, including:

  • Three new replication applier targets (Kafka, Cassandra and Elasticsearch)
  • Improvements to our core command-line tools trepctl and thl
  • New foundations for our filtering services
  • Improvements to the compatibility between replication and clustering

This webinar is going to be a packed session in which we’ll give an overview of all the exciting new features, with more in-depth follow-up sessions in the coming weeks.

 

You’ll also learn about more exciting changes coming in the upcoming Tungsten releases (5.2.1 and 5.3), and in our major Tungsten 6.0 release due out by the end of the year.

So come and join us to get the lowdown on everything related to Tungsten Replicator 5.2 and Tungsten Clustering 5.2 on Wednesday, July 19, 2017, 9:00 AM – 9:30 AM PDT. Register at https://attendee.gotowebinar.com/register/4108437731342545667

New Continuent Webinar Wednesdays and Training Tuesdays

We are just starting to get into the swing of setting up our new training and webinar schedule.

Initially, there will be one webinar session (typically on a Wednesday) and one training session (on a Tuesday) every week from now on. We’ll be covering a variety of different topics in each.

Typically, our webinars will cover products and features and comparisons with other products, mixed in with product news (new releases, new features) and individual sessions based on what is going on at Continuent and in the market in general.

Our training, by comparison, is going to be a hands-on, step-by-step sequence covering all of the different areas of our products. So we’ll cover everything from the basics of how the products work, how to deploy them, and typical functionality (switching, start/stop, etc.), through to troubleshooting.

All of the sessions are going to be recorded, and we’ll produce a suitable archive page so that you can go and view the past sessions. Need a refresher on re-provisioning a node in your cluster? There’s going to be a video for it and documentation to back it up.

Our first webinar is actually next Thursday (the Wednesday rule wouldn’t be a good one without an exception) and is all about MySQL Multi-Site/Multi-Master Done Right:

In this webinar, we discuss what makes Tungsten Clustering better than other alternatives (AWS RDS, Galera, MySQL InnoDB Cluster, and XtraDB Cluster), especially for geographically distributed multi-site deployments, both for disaster recovery and multi-site/multi-master needs.

If you want to attend, please go ahead and register using this link: http://go.continuent.com/n0JI04Q03EAV0RD0i000Vo4

Keep your eyes peeled for the other upcoming sessions. More details soon.

Kafka Replication from MySQL and Oracle

Hello again everybody.

Well, I promised it a couple of weeks ago, and I’m sorry it has been so long (I’ve been working on other fun stuff in addition to this). But I’m pleased to say that we now have a fully working applier that takes data from an incoming THL stream, whether that is Oracle or MySQL, and converts that into a JSON document and message for distribution over a Kafka topic.

Currently, the configuration is organised with the following parameters:

  • The topic name is set according to the incoming schema and table. You can optionally add a prefix. So, for example, if you have a table ‘invoices’ in the schema ‘sales’, your Kafka topic will be sales_invoices, or, if you’ve added a prefix, ‘myprefix_sales_invoices’.
  • Data is marshalled into a JSON document as part of the message, and the structure is to have a bunch of metadata and then an embedded record. You’ll see an example of this below. You can choose what metadata is included here. You can also choose to send everything on a single topic. I’m open to suggestions on whether it would be useful for this to be configured on a more granular level.
  • The msgkey is composed of the primary key information (if we can determine it), or the sequence number otherwise.
  • Messages are generated one row of source data at a time. There were lots of ways we could have handled this, especially for large single dumps/imports and multi-million-row transactions, but per-row generation is the most sensible approach. It may mean we get duplicate messages into Kafka, but these are potentially easier to handle than trying to send a massive 10GB Kafka message.
  • Since Zookeeper is a requirement for Kafka, we use Zookeeper to record the replicator status information.

Side note: One way I might consider mitigating that last item (and which may also apply to some of our other upcoming appliers, such as the ElasticSearch applier) is to actually change the incoming THL stream so that it is split into individual rows. This sounds entirely crazy, since it would separate the incoming THL sequence number from the source (MySQL binlog, Oracle, er, other upcoming extractors), but it would mean that the THL on the applier side contains a single row of data per event. That means we would have a THL seqno per row of data, and it would also mean that in the event of a problem, the replicator could restart from that one row of data, rather than restarting from the beginning of a multi-million-row transaction.

Anyway, what does it all look like in practice?

Well, here’s a simple MySQL instance and I’m going to insert a row into this table:

mysql> insert into sbtest.sbtest values (0,100,"Base Msg","Some other submsg");

OK, in MySQL the row looks like this:

mysql> select * from sbtest.sbtest where k = 100;
+--------+-----+----------+-------------------+
| id     | k   | c        | pad               |
+--------+-----+----------+-------------------+
| 255759 | 100 | Base Msg | Some other submsg |
+--------+-----+----------+-------------------+

Over in Kafka, let’s have a look at what the message looks like. I’m just using the console consumer here:

{"_meta_optype":"INSERT","_meta_committime":"2017-05-27 14:27:18.0","record":{"pad":"Some other submsg","c":"Base Msg","id":"255759","k":"100"},"_meta_source_table":"sbtest","_meta_source_schema":"sbtest","_meta_seqno":"10130"}

And let’s reformat that into something more useful:

{
   "_meta_committime" : "2017-05-27 14:27:18.0",
   "_meta_source_schema" : "sbtest",
   "_meta_seqno" : "10130",
   "_meta_source_table" : "sbtest",
   "_meta_optype" : "INSERT",
   "record" : {
      "c" : "Base Msg",
      "k" : "100",
      "id" : "255759",
      "pad" : "Some other submsg"
   }
}

 

Woohoo! Kafka JSON message. We’ve got the metadata (and those field names/prefixes are likely to change), but we’ve also got the full record details. I’m still testing other data types and ensuring we get the data through correctly, but I don’t foresee any problems.
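If you’d rather consume these messages programmatically than through the console consumer, here’s a minimal sketch using the kafka-python client (the topic name follows the schema_table convention described above, and the broker address is an assumption for a local test setup):

from kafka import KafkaConsumer   # kafka-python client
import json

# 'sbtest_sbtest' follows the schema_table topic naming described above;
# localhost:9092 is an assumed local broker address.
consumer = KafkaConsumer('sbtest_sbtest',
                         bootstrap_servers='localhost:9092',
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))

for message in consumer:
    doc = message.value
    print(doc['_meta_optype'], doc['_meta_seqno'], doc['record'])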

There are a couple of niggles still to be resolved:

  • The Zookeeper interface, which is used to store state data, needs addressing; although it works fine, there are occasional issues with key/path collisions.
  • The Zookeeper and Kafka connections are not fully checked, so it’s possible for the replicator to appear to be up and running when no connection is available.
  • Some further tweaking of the configuration would be helpful – for example, setting or implying specific formats for the message key and the embedded data.

I may add further configuration options for other items, especially since, longer term, we might have a Kafka extractor and want to use it to distribute records. In that case we might want to track other information, like the additional metadata and configuration (SQL mode, etc.) currently held within the THL. I’ll keep thinking about that, though.

Anything else people would like to see here? Please email me at mc.brown@continuent.com and we’ll sort something out.

 

Replicating into ElasticSearch

So here at Continuent we are working on multiple new targets for applying data using Tungsten Replicator. There are so many potential targets out there where people want to replicate data directly into a specific system, sometimes just for a specific data set, table, database, or requirement.

Yesterday afternoon, I started working on ElasticSearch – this morning I have it finished!

As with all solutions, the same basic principles apply – want to pull out of MySQL or Oracle and into something else? That’s fine. Want to replicate to HDFS and ElasticSearch? We do that too!

So what does it look like?

Installation operates just like our normal appliers – you just specify the datasource type (ElasticSearch) and the ElasticSearch host name and port:

tools/tpm configure alpha \
--datasource-type=elasticsearch \
--install-directory=/opt/continuent \
--master=ubuntuheterosrc \
--members=elasticsearch \
--replication-host=localhost \
--replication-password=root \
--replication-port=9200 \
--replication-user=root

There are some configurable options, but I’ll get to those later. For right now, let’s just see what happens when you insert some data. Here’s a simple table in MySQL:

mysql> describe mg;
+-------+----------+------+-----+---------+----------------+
| Field | Type     | Null | Key | Default | Extra          |
+-------+----------+------+-----+---------+----------------+
| id    | int(11)  | NO   | PRI | NULL    | auto_increment |
| msg   | char(80) | YES  |     | NULL    |                |
+-------+----------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

And let’s insert some data:

mysql> insert into mg values (99999,"Hello ElasticSearch");
Query OK, 1 row affected (0.10 sec)

Now let’s have a look at what happens to that when it gets into ElasticSearch:

{
   "_id" : "99999",
   "_type" : "mg",
   "found" : true,
   "_version" : 2,
   "_index" : "msg",
   "_source" : {
      "msg" : "Hello ElasticSearch",
      "id" : "99999"
   }
}

Yay! A nice clean record in ElasticSearch, so that we can now search for the data it contains.
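For reference, you can pull that document straight back out over the ElasticSearch REST API. A minimal sketch in Python (the host and port match the tpm options above, and the index/type follow the schema/table convention described below):

import requests

# GET /index/type/id – index 'msg' and type 'mg' come from the schema and table names
resp = requests.get('http://localhost:9200/msg/mg/99999')
doc = resp.json()

print(doc['found'])      # True
print(doc['_source'])    # {'msg': 'Hello ElasticSearch', 'id': '99999'}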

Incidentally, the information was written in using a document ID made up of the primary key (more on that in a minute), and written into an index and type based on the schema and table.

Obviously we’re writing in a full record here – but keep in mind that this is the replicator, so we could have filtered out columns or even tables from the generated content. We’re trying to keep to the operational perspective of writing everything over to the target that we’ve been asked to.

Also be aware that we do this on a per-row basis. That is, every single row updated/inserted is written as a single entry into the ElasticSearch index.

That said, there are quite a few things that we can control:

  • By default, we treat the incoming schema name as the ElasticSearch ‘index’ and the incoming table name as the ElasticSearch ‘type’. So, for example, with the schema ‘blog’ and the table ‘posts’ you are going to get data written into /blog/posts/ID. You can change this behaviour by setting an explicit index and/or type name – this obviously writes everything into the target with those specific values, regardless of the incoming schema or table name, but maybe you just want one big index of all the data. So, by setting an explicit index of ‘allmybigdata’ and a type ‘rawtext’, everything gets written to /allmybigdata/rawtext/ID.
  • The difficulty with the above approach is that it limits your ability to search based on some other values. Maybe the incoming data is from multiple blogs, but you still want to be able to search by source; for that, there’s also an option to embed the schema name and table name into the data too:
    {
     "_source" : {
        "id" : "9",
        "source_table" : "mgg",
        "msg" : "Barneyrubble",
        "source_schema" : "msg",
        "idb" : "5",
        "committime" : "2017-05-11 11:30:04.0"
     },
     "_id" : "95",
     "_version" : 1,
     "found" : true,
     "_index" : "msg",
     "_type" : "mgg"
    }
  • You can also see in the above that we embed a ‘committime’ if asked to, in case you want to search on that too.
  • Incidentally, one other thing about the above record: it has a compound primary key on the MySQL side – you can see that there are two ID fields, ‘id’ and ‘idb’, and the ElasticSearch _id is ‘95’.
  • The format of the document ID is configurable (see the short sketch after this list), so you can use:
    • The primary key (including compound ones), with everything combined into a single string, i.e. key (9,5) becomes 95.
    • The primary key using underscores, i.e. (9,5) becomes 9_5.
    • The schema, table and primary key, i.e. (9,5) in msg.mg becomes msgmg95.
    • The schema, table and primary key with underscores, i.e. (9,5) in msg.mg becomes msg_mg_9_5.
  • Updates work exactly as you expect – they update the record directly, as we do a *proper* update, so the _version is updated appropriately.
  • Deletes work as expected too.
  • Document IDs can be configured so that an ElasticSearch auto-generated value is used in place of an incoming primary key. However, be aware that if you use this, then we are unable to do deletes or updates, because we cannot track the generated ID, and lookups would be expensive.
  • Fortunately, you can configure the applier to ignore errors when performing a delete or update to avoid the problem.
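To make the four document ID formats above concrete, here’s a small, purely hypothetical helper that mimics them (the real applier builds the ID internally; this is just an illustration of the combinations):

def es_doc_id(schema, table, pkey, include_names=False, underscores=False):
    """Hypothetical illustration of the four documented ID formats.

    (9, 5)                 -> '95'       or '9_5'
    msg.mg with key (9, 5) -> 'msgmg95'  or 'msg_mg_9_5'
    """
    parts = ([schema, table] if include_names else []) + [str(v) for v in pkey]
    return ('_' if underscores else '').join(parts)

print(es_doc_id('msg', 'mg', (9, 5)))                                        # 95
print(es_doc_id('msg', 'mg', (9, 5), underscores=True))                      # 9_5
print(es_doc_id('msg', 'mg', (9, 5), include_names=True))                    # msgmg95
print(es_doc_id('msg', 'mg', (9, 5), include_names=True, underscores=True))  # msg_mg_9_5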

These are all configured through the usual properties, and the defaults look like this:

replicator.applier.dbms.ignoreDeleteErrors=false 
replicator.applier.dbms.ignoreUpdateErrors=false 
replicator.applier.dbms.docIdFormat=pkey 
replicator.applier.dbms.selfGeneratedId=false 
replicator.applier.dbms.useSchemaAsIndex=true 
replicator.applier.dbms.indexName= 
replicator.applier.dbms.useTableAsType=true 
replicator.applier.dbms.typeName= 
replicator.applier.dbms.embedSchemaTable=true 
replicator.applier.dbms.embedCommitTime=true

Currently, all of these are global settings  – I’m toying with the idea of using these as defaults, and then having a separate JSON configuration file that would be able to set these values on a per schema/table basis. I’d be interested to hear if anybody would find this useful. While I like this approach, it would add some processing overhead we might want to avoid. In reality, the better way to do this would be to configure separate services in the replicator to handle that process.

Some things that I am still checking and investigating:

  • Performance – Currently I’m seeing about 125 rows per second into ElasticSearch. This is in a VM with just 2 CPUs and 2GB RAM. I suspect we could increase this.
    I also have not in any way tested a more random workload, like Sysbench, or checked compatibility with our own multi-threaded/parallel apply.
  • Latency – Latency right now is down in the µs, about where you’d expect. Obviously, it depends on the incoming data, but worth looking at.
  • Start/Stop/Restart – This first version contains *complete* restart ability as you would expect with the replicator. However, I haven’t added support to some of our other tools, like dsctl. I’ll address that in a future release.
  • Datatype Support – I’ve done only a few tables, and nothing substantial like textual or logging data.
  • Currently, we send individual rows as individual REST requests; I don’t use an open channel with regular submissions (which might improve performance), or any kind of batching. These would mostly benefit large data loads and dumps, rather than more traditional streaming replication.

So there’s still some work to do, but the basic process is currently perfectly serviceable.

More important, as far as I’m concerned, is that this basic applier is done and ready to be released to the public in our upcoming Tungsten Replicator 5.2.0, which is due at the end of June. That gives us about a month to complete testing and address some of the above issues.

If you would like to test out the new applier for ElasticSearch, please email me (mc.brown@continuent.com). I’m interested to get as much input and testing as possible.

Percona Live 2017

So glad to have had a successful Percona Live last week. Continuent were Diamond Sponsors, and now that we are back to being an independent company and not part of VMware, we have a little more freedom to get back into the MySQL community.

I had two primary sessions, both on the replicator. One looked specifically at the replicator and how we get data into Big Data targets, the other at the general problems of replicating between heterogeneous sources. After the first of those, David from Percona interviewed me to understand a bit more about what I was talking about.

I was also on the keynote panel where we discussed a variety of different topics and you can see the full video of that through the link.

 

Data Migration: Handling Data Types

Handling the structure, and deciding which fields, data types, or data points relate to which fields or target elements within the destination database, is one thing. Normally you can make an intelligent decision about the information that is being transferred and how that structural information should be handled.

The actual data can be a completely different problem. There are so many potential problems here with simply parsing and understanding the data that we need to look at some of the more common issues and how they should be addressed.

In most cases, handling the content and structure of the individual types is about understanding the two dimensions of the problem:

  • Supported types – that is, whether the target database understands, or even identifies, the underlying type. For example, within most RDBMS environments, most data types are very strictly enforced, but in NoSQL or data warehouse environments the corresponding engines may either not care, or only care in specific situations.
  • Parseable information – the information may be readable or storable, but not to the same precision or definition. For example, with time values, some systems store hours, minutes and seconds, some store hours and minutes, and a whole variety store different lengths of fractions of seconds, from one to as many as 15 digits of precision. Similarly, some dates are represented by seconds, others by a string in various formats.

Let’s look at the finer details of understanding these different types and how they can be stored, exchanged and ultimately parsed.

Basic Types

There are four basic types that make up all data and that, on the whole, get represented within a database system. Identifying these four types will help us to set the groundwork for how we handle, identify and treat the rest of the data:

  • Numeric – basically any kind of number or numeric value, from integers through to floating point and even ‘Big’ numbers.
  • Strings – Strings seem like a very simple list of characters, but they aren’t as straightforward as you might think.
  • Dates – Dates are a special type all of their own. Although they look like the worst parts of numbers and strings combined, ensuring that they are identifiable as a date in the target database relies on understanding the different formats, separators and structures.
  • Binary – any type of data that does not fall into the previous three groups is binary data. Here the issue is one of identity and the ability to interpret the content once it reaches the destination database.

Remember that in all of these situations, we are not only talking about one-time formatting of the information. We could be dealing with frequent and regular exchanges of this information between the databases, and even need to perform these conversions repeatedly if the data is integrated across multiple database environments.

When combining, for example, MongoDB data with Oracle information for the processes of reporting, you need to do more than change the format once. It needs to be in a common representable format for both databases throughout the life of the data, while simultaneously ensuring that the information is best stored within each database to get the performance you need.

Strict and Relaxed Translation

Whenever you are moving data between different databases you need to determine whether the type and structure of information is important enough, or critical enough, that it must be represented in its native format.

That may sound like a completely nonsensical approach to data – surely the quality of the data is absolutely critical and should be represented as such everywhere? In theory it should, but different database storage environments treat and handle the data in different ways according to the basic type in question.

To understand why this is important, we need to look back, both historically and technically, at why information was stored in the strict formats we described in the last section.

In any older, and particularly RDBMS-based, database solution, data was stored in fixed types and with fixed lengths so that the record could be manipulated as efficiently as possible. We saw some examples of this in Chapter 1. For numerical values, it is much more efficient to store a 32-bit integer as just 4 bytes of data than it is to store the string 2147483647 (which would take 10 bytes).

Similarly, with string types, the primary consideration has always been to minimize the amount of storage reserved for the string, because handling bigger strings, or bigger potential blocks for strings, was more expensive in computing time, memory space, and disk space. Back when databases ran on machines with 512KB of RAM, devoting massive blocks of memory to space that was allocated but never used to store data just wasn’t an option. This is why 8-character filenames and two- or three-letter codes for a variety of databases and storage methods became common.

In modern systems of course, we actually have almost the opposite problem. Data sizes are now regularly so large that we need to be prepared to handle massive blocks of information whereas before that might have been impossible. This is fine when we are moving data from a traditional RDBMS to say Hadoop, because we move from a strict environment to a very relaxed one. But when moving in the opposite direction this is not true.

To make matters worse, in many Big Data environments, including most of the Hadoop database layers like Hive, the datatype is only significant at the time the data is queried. Within Hive you can load a CSV file that contains a variety of different types, but unless you explicitly tell Hive that the fifth column is a string or a number, Hive doesn’t really care, and lets you export and query the data as normal.

For example, here’s a table with a mixture of different column types, here using the strict datatypes to define the actual content for each column:

hive> select * from stricttyping;
OK
1      Hello World   2014-10-04 12:00:00   439857.34
1      Hello World   2014-10-04 12:00:00   157.3457
1      Hello World   2014-10-04 12:00:00   4.8945796E7

Now we can view the same data, this time loaded into a table where each column has been defined as a string:

hive> select * from relaxedtyping;
OK
1      Hello World   2014-10-04 12:00:00   439857.345
1      Hello World   2014-10-04 12:00:00   157.3457
1      Hello World   2014-10-04 12:00:00   48945797.3459845798475

The primary differences are in the handling of floating-point values – in the strict table the first value loses precision (the column was a FLOAT), and the last value is rendered in scientific notation with a loss of precision digits. In fact, within Hive, the data is not parsed into the corresponding type until the data is used or displayed. If you examine the raw CSV file, the full values are still there:

$ hadoop fs -cat stricttyping/sample_copy_1.csv
1,Hello World,2014-10-04 12:00:00,439857.345
1,Hello World,2014-10-04 12:00:00,157.3457
1,Hello World,2014-10-04 12:00:00,48945797.3459845798475

In fact, many people deliberately don’t explicitly load the data into fixed-type columns; they define the column types as strings, import the data, and ultimately ignore the real type until they have to parse or understand it for some reason.

Similarly, in NoSQL environments, the data types may really only exist for explicit representation requirements, and have no effect on the ability to either store or display and query the information. Even in a traditional RDBMS, there is no requirement to explicitly store certain values in certain column types, but certain operations may be limited. For example, most RDBMSs will not perform a SUM() operation on a string column.

The bottom line is that you will need to think about whether to explicitly make use of these columns because you need them as specific types in the target database, or whether to ignore them completely.

  • Strict transformations – Use strict datatypes when the information you want to store must be correctly interpreted within the target database, and it provides some form of performance advantage, unless doing so reduces the validity or precision of the information.
  • Relaxed transformations – Use relaxed transformations whenever the processing or target system does not support the required precision, or in those cases where the processing of the information is irrelevant. Most transfers to NoSQL and Big Data environments fit this model automatically.

With these options in mind, let’s look at some of the more specific types available.

Handling Numeric Values

Simple, plain, integers are supported by nearly all databases as explicit and identifiable types. Even document databases such as MongoDB and Couchbase understand the significance of a numeric value over a string representation.

However, if you are transferring big integers, be conscious of the limitations of the target database. Some environments explicitly support very large integers. Hive, for example, supports the BIGDECIMAL datatype, which holds numbers with up to 10 to the power of 308. Others do not.

Floating Point Issues

The biggest problem with floating-point values is one of precision and storage capability. There are large variations between the supported types, how much is stored, and how precise it can be. Furthermore, some databases specifically differentiate between decimal and floating-point values and have different rules for how these should be represented and stored, and the two are not necessarily compatible.

For floating-point values, the main issues are:

  • Representation – float values are generally displayed as a decimal value, for example:
3247234.32434
  • There are no specific rules for this, but the representation is often based on the operating system’s own data type support. On modern 32-bit (and 64-bit) systems, single-precision floating-point values tend to have about 7 significant digits of precision. This is due to the nature of the structure used to store and define them internally. A double has twice the precision, up to 15 or even 16 significant digits.
  • Parsing – parsing these values properly is critical if you are storing the data; unfortunately, rounding errors, introduced both when the data is output and when it is parsed back, are notoriously difficult to control, and not always religiously honoured.
    For this reason, some databases explicitly support a DECIMAL type. Unlike the float, the DECIMAL type works more like two integers, one either side of the decimal point.

Processing these values reliably, and storing them in a target database, may lead to problems if the target system doesn’t support the datatype size, precision, or structure properly. Moving the data may lose precision or content. A simple movement of the data in an export/import environment might parse and store it correctly, or it may lose or truncate the precision entirely.

If you are replicating and regularly exchanging data from one database to the other and back again, these precision errors can build up until they convert a number from one value into one that is statistically significantly different. If the double type within the database environment does not support the complexity or precision of the values involved, consider using one of the big integer types and a suitable multiplier.

Finally, if the target database does not support the precision and detail that you need, consider moving the data using relaxed methods, for example by importing the data into a string rather than a numerical type, so that it can be reliably stored.
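To see how a single-precision column can silently change a value, here is a minimal round-trip sketch using only the Python standard library (the value is the one from the Hive example earlier):

import struct

value = 48945797.3459845798475          # the 'relaxed' value from the Hive example

# Round-trip through a 32-bit float, as a FLOAT column would store it
as_float32 = struct.unpack('f', struct.pack('f', value))[0]

print(repr(value))        # ~48945797.34598458 (a double keeps ~15-16 significant digits)
print(repr(as_float32))   # 48945796.0         (a float keeps only ~7 significant digits)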

Base-N Numbers

If you are exchanging numbers in other than base 10, for example, octal, hexadecimal, or others, ensure that the target database supports the required number format. If an explicit number format is not supported, either translate the number to decimal and handle the output and formatting of the data as the corresponding type within the target database and application, or use the relaxed method and keep it as a string.
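For example, a minimal sketch of converting a hexadecimal or octal string to decimal (and back) before loading it into a target that has no native base-N type:

raw_hex = '1f4'
raw_octal = '0764'

as_decimal = int(raw_hex, 16)      # 500
print(as_decimal)
print(int(raw_octal, 8))           # 500
print(format(as_decimal, 'x'))     # '1f4' - back to hex for display in the application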

Strings and Character Encoding

More problems are created by strings than you might think, and the most significant is usually the character set used to store, transfer, and import the data. Character sets used to refer to the difference between the byte-level encoding for things like EBCDIC and ASCII. Today, they span a much wider array of issues as the number of character sets and the use of a wider range of languages, characters, and ideographs increases.

The best way to encode strings when moving the data between databases is to use either UTF-8 (which encodes Unicode characters in 8-bit bytes) or one of the wider encodings if your data requires it. For example, if you are specifically storing foreign-language, katakana, or Chinese characters, using UTF-16 or UTF-32 may be more reliable, if not necessarily more efficient. UTF-8 can be used for a very wide range of different Unicode characters and is rarely a hindrance.

Also be aware that some databases identify character encoding capabilities and data types differently. For example, the VARCHAR2 type within Oracle can be used to store strings with an optional size (byte or character) declaration, but the NVARCHAR2 type is the Unicode (byte) sized datatype. The definition of the column and size can also be different. In Amazon RedShift for example, the size of VARCHAR column is defined in bytes, but in MySQL it’s defined in characters, so a VARCHAR(20) in MySQL has to be a VARCHAR(80) in RedShift. Sneaky.
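A quick way to see why byte-sized and character-sized columns diverge is to measure the same string under different encodings (the string here is just an arbitrary CJK example):

s = '四川麻婆豆腐'                 # 6 characters

print(len(s))                      # 6  - character count (how MySQL sizes a VARCHAR)
print(len(s.encode('utf-8')))      # 18 - bytes in UTF-8 (how RedShift sizes a VARCHAR)
print(len(s.encode('utf-16-le')))  # 12 - bytes in UTF-16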

A secondary issue is one of storage space. Different database environments support different representations of the different character storage types, and sometimes have wildly different performance characteristics for these different types.

Within NoSQL or Big Data environments, the length (or size) of the string is rarely a problem, as they don’t have fixed or strict datatypes. However, for most RDBMS environments there are specific lengths and limits. Oracle supports only 4000 bytes in a VARCHAR2, for example; MySQL supports 255 characters in a CHAR, or 65535 bytes in a VARCHAR.

Finally, when transferring the information you may need to pay attention to any delimiters. Using CSV, for example, and using quotes to define the field limits only works when there aren’t quotes in the field content.

Dates and Times

With all the different data types we have covered up to now, there have been problems with understanding and parsing the values because of differences in type, format, or structure, but all of them were largely manageable within some simple limits and structure.

Dates are another problem entirely. In particular:

  • Date precision and format
  • Time precision and format
  • Dates or Epochs?
  • Time Zones

All of these go together to make dates one of the most complicated of all the types to support when transferring data, because there are so many places where it can go wrong.

Epochs

Epoch values are those where the data is represented as an integer counting, usually, the seconds from a specific reference point in time, from which the current date can be calculated. For example, Unix-based epoch times are represented as the number of seconds that have elapsed since Jan 1st 1970 at 00:00:00 (12:00am) GMT. Counting forward from this enables you to represent a date. For example, the value:

1413129365

Is in fact 12th October 2014.

There are two issues that arise from epoch dates: time drift and date limits.

Time drift occurs if the date has been stored as an epoch that is relative to the current timezone. This can actually happen more frequently than you realize if dates are reconstituted back into an epoch from a local-time-based value. For example, some libraries that parse a date without an explicit timezone will assume that the date is within the current timezone of the system.

This is a particularly thorny problem when you realize that epochs have no timezones of their own. This means that the Epoch value:

1413129365

Is 15:56 BST, but 07:56 PST. If you now transfer a PST-based epoch to GMT and then use it without conversion, all your times will be out by 8 hours. If you ran batch jobs on the imported data at 1am, that time would actually refer to a completely different day.

If you must use epoch values, ensure that you either know what the timezone was, or adjust the value so that it is against GMT and you can translate to the appropriate timezone when you need to. Also see the section on timezones below.
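A minimal sketch of anchoring an epoch to UTC before it is moved around, using only the standard library (the value is the one from the example above):

from datetime import datetime, timezone, timedelta

epoch = 1413129365

# Anchor the epoch to UTC when exchanging it between systems
print(datetime.fromtimestamp(epoch, tz=timezone.utc))
# 2014-10-12 15:56:05+00:00

# The same instant rendered against a fixed UTC-8 offset is a different wall-clock time
print(datetime.fromtimestamp(epoch, tz=timezone(timedelta(hours=-8))))
# 2014-10-12 07:56:05-08:00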

The secondary problem is date limits. Traditionally epoch values were stored as 32-bit integers, which limits the date between 1970 and 2038. While this is fine for current times (at least for the next 24 years or so), for any future dates, this can be an issue.

If you are porting epoch values to a target that only supports 32-bit dates, and the date is beyond 2038, don’t transfer it using the epoch value, translate it into an explicit date that can be parsed and stored in whatever local format is required for the target environment. For example, within MySQL you can use the FROM_UNIXTIME() function to translate your epoch date into something more usable.

Date Formats

When transferring dates, use a format that is unambiguous and supported by the target system. Different locations and systems have different ways of representing dates, including the different separators that are used and the different orders of the components. Even the use of the prefix for some numbers differs between regions. Some examples are shown in the table below.

Location   Format Example
USA        Month.Day.Year
Japan      Year-Month-Day
Europe     Day.Month.Year
UK         Day/Month/Year

Different locations and date formats

The best format to use is usually the ISO format:

YEAR-MONTH-DAY

With a zero prefix added to each value to pad it to the correct number of characters. For example, the 1st of January:

2014-01-01

Or the year 1:

0001-01-01

The ISO format is not only readable on just about every platform, it also has the advantage of being sortable both numerically and by ASCII code, making it a practical way of exporting and loading data in date order without having to explicitly order data by dates.
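The sortability claim is easy to check: a plain lexical sort of zero-padded ISO dates gives the same order as a chronological sort:

dates = ['2014-01-01', '0001-01-01', '2013-12-31', '2014-10-12']

# Plain string sort == chronological sort for zero-padded ISO dates
print(sorted(dates))
# ['0001-01-01', '2013-12-31', '2014-01-01', '2014-10-12']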

Time Formats

Time is usually restricted to a fairly obvious format, that of:

Hours:Minutes:Seconds

Or in some regions and standards:

Hours.Minutes.Seconds

Aside from the timezone issue, which we will look at next, the other problem is the level of precision. Some databases do not support any precision beyond seconds. For example, within Oracle you can store fractional seconds up to 9 decimal places. Amazon RedShift supports only 6 digits of precision.

Also be aware that some environments may not support explicit date and time types, but only a unified datetime or timestamp type. In this case, the structure can be even more limited. For example, within Amazon RedShift, the timestamp datatype is actually formatted as follows:

YYYYMMDD HH:MM:SS

With the date in ISO format but without explicit date separators.

Time Zones

Every time represented everywhere is actually a time within a specific timezone, even if that timezone is UTC (Coordinated Universal Time). The problem with timezones is that the timezone must either be explicitly stored, shared, and represented, or it must be stated or understood between the two systems that the time is within a specific known timezone. Problems occur either when the timezone is not correctly represented, or when assumptions are made.

For example, the following time:

2014-09-03 17:14:53

Looks clear enough. But if this has come from the BST (British Summer Time) timezone and gets imported into a database running in IST (India Standard Time), then the time will be stored incorrectly if the timezone is not explicitly specified.

Another issue is when there are timezone differences when data is transferred, not because of the physical time difference, but because of the effect of daylight savings time. Transferring data from, say, GMT to PST is not a problem if you know the timezone. Transfer the data over during a daylight savings time change, and you can hit a problem. This is especially true for timezones that have different dates for when daylight savings time changes.

Finally, be prepared for databases that simply do not support the notion of timezones at all. To keep these databases in synchronization with other databases with which you might be sharing information, the easiest method is to use GMT.

In general, the easiest solution for all timezone related data is to store, transfer, and exchange the data using the UTC timezone and let the target database handle any translation to a localized timezone. If you need to explicitly record the timezone – perhaps because the data refers to a specific timezone as part of the day – then use the time type that supports it, or store a second field that contains the timezone information.
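A minimal sketch of attaching an explicit zone to the value from the example above and then normalising it to UTC or another zone (zoneinfo requires Python 3.9+; the zone names are assumptions):

from datetime import datetime
from zoneinfo import ZoneInfo   # Python 3.9+

# The bare value is ambiguous until a zone is attached
naive = datetime(2014, 9, 3, 17, 14, 53)
in_london = naive.replace(tzinfo=ZoneInfo('Europe/London'))   # BST on this date

print(in_london.astimezone(ZoneInfo('UTC')))           # 2014-09-03 16:14:53+00:00
print(in_london.astimezone(ZoneInfo('Asia/Kolkata')))  # 2014-09-03 21:44:53+05:30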

Compound Types

We’ve already looked at some of the issues in terms of the structural impact of compound types. Even if you have the ability to represent a value as a compound structure within your target data, you need to understand the limitations and impact of compound types, as not all systems are the same. Some of these limitations and effects are database specific, others are implementation specific.

For example, within MySQL and PostgreSQL, the ENUM type enables you to store a list of fixed string-like values that can be chosen from a fixed list. For example:

ENUM(‘One’,’Two’,’Three’,’Four’)

The benefit of this from a database perspective is that internally each string can be represented by a single number, but only reconstituted into the string during output. For targets that do not support it, therefore, the solution is to translate what was an ENUM column in MySQL into a string in the target database.

MySQL also supports the SET type, which is similar to ENUM, except that the value can refer to multiple options. For example:

SET(‘Mon’,’Tue’,’Wed’,’Thu’,’Fri’,’Sat’,’Sun’)

The SET type enables you to record not only the specific day, but maybe a group of days, for example:

INSERT INTO table VALUES (‘Mon,Wed,Fri’)

Again, internally this information is represented, this time as a set of bits, and so the actual data is implied and displayed as a compound type.

When translating to a database that doesn’t support the type, you may either want to create an additional column for each value to store it, or, if you are using a document database, you may want the set field converted to an array or hash of values:

{
    ‘Mon’ => 1,
    ‘Wed’ => 1,
    ‘Fri’ => 1
}
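A minimal sketch of that translation, using the field and values from the example above:

def set_to_document_field(set_value):
    # 'Mon,Wed,Fri' -> {'Mon': 1, 'Wed': 1, 'Fri': 1}
    return {item: 1 for item in set_value.split(',') if item}

print(set_to_document_field('Mon,Wed,Fri'))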

Always consider how the data was used and will be searched on either side of the database transfer. In an SQL RDBMS queries normally take the form:

SELECT * FROM TABLE where FIND_IN_SET(‘Mon’,days)>0;

That is, return all the values where the field contains the value ‘Mon’. In a database that supports searching or indexing on individual values (MongoDB, Couchbase), the key-based transfer works better: we effectively set each member of a hash to a meaningless value so that we can do key-based lookups. We’ll examine this in more detail when we examine the environments of these databases.

Serialized and Embedded Formats

For a whole variety of different reasons, some people store more complex formats in their database so that they can be more easily manipulated and used within the deeper elements of their application.

For example, serializing an internal structure, such as a Perl object or a Java object, so that it can be stored into a field or BLOB within the database is a good way of making use of complex internal structures while still having the ability to store and manipulate the more complex data within the application environment.

If all you want is to transfer the serialized format from one database to another, then the basics are unlikely to change. You may need to use the binary translation methods in the next section to get the data over into the new database reliably, but otherwise, the transfer should be straightforward.

However, there is also the possibility that you want to be able to query or extract data that may have been embedded into the serialized object.

In that case, you need to change the way that you use and manipulate the information as part of the data migration process. You may want to take the information and expand the data to expose the embedded fields as transferable data.

Or, you may more simply want to change the content of the information from its serialized format into a more universal format, such as JSON.

Binary and Native Values

Binary data, that is, data that is explicitly stored and represented in a binary format, is difficult to process when moving data between systems.

  • Binary means that single-character delimiters become useless, even control characters. 0x01 is just as likely to come up in binary data as it is when used as a field separator.
  • Pure, native binary data suffers from the problem of ‘endianness’, that is, the byte order of the individual bytes. Text and numerical translations don’t tend to suffer from this, because systems know how to parse text. When exchanging raw binary data, however, the endianness of the source and target systems matters.

Binary data can also be affected by any translation or migration process that is expecting a string representation of information. For example, it is not uncommon for UTF8 to be used when reading binary data, which leads to interpretation and storage problems.

In general, the best advice for true binary information is for the data to be encoded into one of the many forms of binary-to-text translation formats. This can include solutions such as raw hex conversion, where the data is quite literally expanded to a two-character hex string for each binary byte. For example, we can translate any byte string into hex values with tools like Perl:

$ perl -e "print unpack('H*','Hello World')"
48656c6c6f20576f726c64

Or use uuencode:

begin 666 HelloWorld
,2&5L;&\@5V]R;&0*
`
end

Or use the MIME64 (Base64) standard that is employed in many modern email and Web environments for transferring attachments, as it ensures that even multi-byte characters are effectively transferred.
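The same translations are available in most languages’ standard libraries; a minimal sketch in Python, matching the Perl hex example above:

import base64
import binascii

data = b'Hello World'

print(binascii.hexlify(data).decode())   # 48656c6c6f20576f726c64 - same as the Perl unpack('H*') example
print(base64.b64encode(data).decode())   # SGVsbG8gV29ybGQ= - Base64/MIME encoding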

All of these solutions can be easily processed on the other side back into the binary format according to the endianess of the host involved.

Data Migration: Mapping the Data

When moving data between different databases, the primary consideration is what that data is going to look like so that it can be used in the target environment. Later chapters are going to dig deeper into this topic, but let’s fly over some of the key considerations here.

Mapping Columns to Tables

If we were replicating this data from our existing RDBMS into another, the most obvious method is for us to simply move the tables wholesale from one environment to the other. If they both support table structure, then there is no reason not to duplicate this structure on the other side.

 

But always be conscious of how the data is going to be handled over on the other side. If your target database does not support joins between tables, as some Hadoop alternatives do not, then you will need to determine whether you are better off merging the table data together into a bigger table, and either duplicating the information or hiding it.

For example, if you wanted to analyse which recipes make the most use of chicken, then you could combine the tables from the transactional side into a pre-joined table that contains the detail. For example, converting our three tables, ingredients, recipe ingredients, and recipes, into something like this:

|     3006 | Tabboule            | <olive oil>         |
|     3006 | Tabboule            | <salt>              |
|     3006 | Tabboule            | <onion>             |
|     3011 | Minted pea puree    | <olive oil>         |
|     3011 | Minted pea puree    | <frozen peas>       |
|     3011 | Minted pea puree    | <seasoning>         |
|     3011 | Minted pea puree    | <butter>            |
|     3012 | Lamb patties        | <ground coriander>  |
|     3012 | Lamb patties        | <ground cumin>      |
|     3012 | Lamb patties        | <ground turmeric>   |

Now we have the information from the ingredients table merged with the recipe ID and title from the source database. Over in our destination store, performing a count or summation operation is now much easier. In fact, with a single table structure like this, some operations are quicker and still provide the information we want. For example, finding all the recipes with frozen peas in them is a single (possibly indexed) table query:

+----------+--------------------+---------------+
| recipeid | title              | description   |
+----------+--------------------+---------------+
|      984 | Waffle fish pie    | <frozen peas> |
|      633 | Vegetable korma    | <frozen peas> |
|       27 | Spicy tuna stew    | <frozen peas> |
|     1261 | Seafood paella     | <frozen peas> |
|      902 | Choux au gratin    | <frozen peas> |
|      866 | Tomato baked rice  | <frozen peas> |
|     1971 | Spicy risotto star | <frozen peas> |
|     2741 | Cheat's jambalaya  | <frozen peas> |
|     2750 | Spicy sausage rice | <frozen peas> |
|     2778 | Quick jambalaya    | <frozen peas> |
|     3011 | Minted pea puree   | <frozen peas> |
+----------+--------------------+---------------+

In a columnar store this can be orders of magnitude faster than a join across the three source tables, and still provides us with the core of information we want to display – the recipe id and title.

Mapping Columns to Documents

Moving the data over verbatim as tables is unlikely to work as efficiently as you think. For example, in a NoSQL database, joins are normally either impossible or computationally expensive, so expecting to be able to reconstitute the relational structure in the destination database is unrealistic. You also want to make sure that you are using the most efficient method for the database system. Simply putting a JSON representation of each row from an existing table or column store is probably not going to work, especially as JOINs are typically unavailable. A better solution is to merge the data from multiple tables into a single document that contains all of the information you need.

Doing this for table data to be inserted into a document store normally requires the composition of a document from the constituent elements; in the case of our recipe database, the recipe, method, ingredients, and nutritional information need to be combined into one big document. There are a variety of ways to do this, but an obvious solution is to logically group ‘objects’ together, that is, to treat as one object something that might be represented by a collection of tables. Like this:

[Figure: Document and Table Mapping]

Within our recipe data, for example, in a document store the use case is for us to extract the entire recipe – base data, ingredients, and method – as a single document that contains all the information we need. This puts all of the information in one document, and makes it easy to update and format the entire recipe at a time. We can actually see a sample of this, first by looking at the diagrammatic example, here with some dummy data; you can see how the tables on the right can be mapped to fragments of the document on the left.

Document and Table Mapping

We can also look at a simple script that performs this operation for me, here collecting the recipe object (which queries the underlying database) and then converting that into a JSON structure for writing into the database:

use JSON;
use lib 'cheffy.com';
use Foodware;
use Foodware::Public;
use Foodware::Recipe;

my $fw = Foodware->new();

my $recipes = $fw->{_dbh}->get_generic_multi('recipe','recipeid',{ active => 1});


foreach my $recipeid (keys %{$recipes})
{
    my $recipe = new Foodware::Recipe($fw,$recipeid,{ measgroup => 'Metric',
                                                 tempgroup => 'C',});

    my $id = $recipe->{title};
    $id =~ s/[ ',\(\)]//g;
    my $record = {
       _id => $id,
       title => $recipe->{title},
       subtitle => $recipe->{subtitle},
       servings => $recipe->{servings},
       cooktime => $recipe->{metadata_bytag}->{totalcooktime},
       preptime => $recipe->{metadata_bytag}->{totalpreptime},
       totaltime => $recipe->{metadata_bytag}->{totaltime},
       keywords => [keys %{$recipe->{keywordbytext}} ],
       method => $recipe->{method},
    };

    foreach my $ingred (@{$recipe->{ingredients}})
    {
       push(@{$record->{ingredients}},
            {
               meastext => $ingred->{'measuretext'},
               ingredient => $ingred->{'ingredonly'},
               ingredtext => $ingred->{'ingredtext'},
            }
           );
    }

    print to_json($record);
}

Finally, let’s look at how this is all translated into a full JSON representation of the same information:

{
   "subtitle" : "A good use for bananas when they're going soft.",
   "keywords" : [
      "diet@vegetarian",
      "diet@corn-free",
      "occasion@entertaining",
      "diet@citrus-free",
      "occasion@kids' parties",
      "diet@demi-veg",
      "special collections@lunchbox",
      "meal type@cakes, biscuits, sweets",
      "special collections@classic recipe",
      "diet@peanut-free",
      "cook method.hob, oven, grill@oven",
      "special collections@store cupboard",
      "diet@yeast-free",
      "special collections@budget",
      "occasion@prepare-ahead entertaining",
      "main ingredient@fruit",
      "occasion@picnic",
      "diet@shellfish-free",
      "special collections@cheffy recommended"
   ],
   "preptime" : "20",
   "servings" : "8",
   "cooktime" : "45",
   "method" : [
      {
         "_sort" : "4",
         "recipeid" : "2035",
         "step" : "4",
         "altgroup" : "0",
         "methodstep" : "Spoon into the loaf tin. Spoon the top. Bake for 45-50 min or until well risen and cooked through. ",
         "text_formatted" : "Spoon into the loaf tin. Spoon the top. Bake for 45-50 min or until well risen and cooked through. "
      },
      {
         "text_formatted" : "Slowly beat in the egg. Add the banana. Fold in the flour and bicarbonate of soda. Add the dried fruit. ",
         "methodstep" : "Slowly beat in the egg. Add the banana. Fold in the flour and bicarbonate of soda. Add the dried fruit. ",
         "_sort" : "3",
         "recipeid" : "2035",
         "altgroup" : "0",
         "step" : "3"
      },
      {
         "recipeid" : "2035",
         "_sort" : "2",
         "step" : "2",
         "altgroup" : "0",
         "text_formatted" : "Cream the butter and sugar together until pale and fluffy. ",
         "methodstep" : "Cream the butter and sugar together until pale and fluffy. "
      },
      {
         "recipeid" : "2035",
         "_sort" : "1",
         "altgroup" : "0",
         "step" : "1",
         "text_formatted" : "Preheat oven to 180&deg;C. Grease a 900 g loaf tin.",
         "methodstep" : "Preheat oven to 180.C. Grease a 900g loaf tin."
      },
      {
         "text_formatted" : "Turn out onto a wire tray. Leave to cool. ",
         "methodstep" : "Turn out onto a wire tray. Leave to cool. ",
         "_sort" : "5",
         "recipeid" : "2035",
         "altgroup" : "0",
         "step" : "5"
      }
   ],
   "totaltime" : "65",
   "_id" : "Bananacake",
   "title" : "Banana cake",
   "ingredients" : [
      {
         "ingredtext" : "butter",
         "meastext" : "75 g",
         "ingredient" : "butter"
      },
      {
         "ingredtext" : "bicarbonate of soda",
         "meastext" : "[sup]1[/sup]/[sub]2[/sub] tsp",
         "ingredient" : "bicarbonate of soda"
      },
      {
         "meastext" : "2",
         "ingredient" : "bananas",
         "ingredtext" : "ripe bananas, peeled and mashed"
      },
      {
         "ingredtext" : "white self-raising flour, sifted",
         "ingredient" : "white self-raising flour",
         "meastext" : "200 g"
      },
      {
         "ingredtext" : "salt",
         "meastext" : "1 pinch",
         "ingredient" : "salt"
      },
      {
         "ingredtext" : "egg, beaten",
         "ingredient" : "egg",
         "meastext" : "1"
      },
      {
         "ingredient" : "dried mixed fruit",
         "meastext" : "100 g",
         "ingredtext" : "dried mixed fruit"
      },
      {
         "ingredtext" : "caster sugar",
         "ingredient" : "caster sugar",
         "meastext" : "100 g"
      }
   ]
}

Looking at the output, you can see how the structure of the document has been merged into something more usable. We have blocks for ingredients, keywords, and method, and you can see how the ‘ingredient’ field in the ingredient block could be used as a searchable element.

The primary questions about the format will come down to how the data will be used. In CouchDB and Couchbase, for example, a map/reduce-like process will be used to create an index on the information. Choosing the right structure is therefore about understanding the structure and how it will be used. When I come to build an index on this information, and I want to build an index so that I can query by ingredient, is this an effective way to format the data? Is processing all of that data to determine the occurrences of turkey the most efficient method?

How about if I realize that all my vegetarian recipes are really vegan, will that change the structure? We’ll return to these questions later in the book.

Handling Compound Assignments

While we’re talking about constructing documents from tables and vice versa, I want to consider those pesky compound values again. Compound values are hugely complex when it comes to data translations, because there are differences within a database type as well as between types that should be considered. With a compound type, you should always start by asking three basic questions:

  • How will it be queried in the target database?
  • Can I have multiple values within the same field?
  • Is the order of those multiple values significant?

One translation you want to avoid is to convert this relatively structured format into something that ultimately becomes hard to process. For example, the temptation is to convert this ‘field’ from the source CouchDB environment into a format that looks similar to the original, for example, by using a comma-separated list:

Indian,Chicken,Okra,Spicy

There are so many dangers with this, it’s hard not to see how this is a bad idea. Some obvious problems are:

  • How do we cope with an ever-increasing number of potential options? We’re showing just four here, what if there were 20? 30? 60?
  • What happens if more data than the width of the column is defined? If it’s truncated, we lose information. Using a wider column only works for so long.
  • Indexes become unmanageable from a database administration perspective.
  • How is integrity enforced? Should the options be sorted alphabetically? What happens when they get updated?
  • What happens if we want to change a value? If ‘red’ becomes ‘scarlet’, do we change every row?

The biggest problem with this approach is usability: once that value is used in a transactional or columnar store, the data becomes difficult and potentially expensive to process. Even with indexes, it takes a lot more computational effort for a database to search a text string that contains a sequence of letters than to use a join or a set of boolean columns matching the same structure.

You can actually test this quite effectively by creating sample tables that emulate this basic functionality. MySQL is being used here to do the comparisons, but the general effects are quite clear. Here’s a search against a single text column using commas:

SELECT id,title FROM recipes WHERE keywords LIKE '%CHICKEN%';

Now here’s the same query where the keywords are represented as boolean columns:

SELECT id,title FROM recipes WHERE kw_chicken = 1;

A boolean index in the latter example will be much quicker than the LIKE search in the former.

Adventures in recipes

When developing an application, and particularly the database that will support it, there comes a time when you realize you may not have planned and identified all of the problems you could, and that’s when you notice that something doesn’t quite work.

The use of free text for ingredients in the recipe database was one such moment. Searching for a recipe that contains ‘Chicken’ is fine if you look at the ingredients: you pick up everything from ‘whole chicken’ to ‘chicken breasts’ within the search. Unfortunately, you also pick up ‘chicken stock’, which is used in a surprising number of recipes that otherwise contain no chicken of any variety whatsoever.

When migrating data around, you can see the same principles at work; if you rely on a free-text field to store the value instead of separate, stricter fields, the quality of the data suffers. Keep this in mind when moving data around.

The recommendations for how to work out which is the best method are actually comparatively straightforward:

  • If the list of possible values is small (for example, four or five different values) and there are few of them within the overall ‘table’ that they are being moved to, use fields/boolean values.
  • If the list is larger, and likely to grow and expand with new options, then use a lookup table.

When applying the same principles the other way round, you should use the most appropriate format for the corresponding target database in a way that makes the data usable. In document databases, for example, it’s efficient to translate a compound value into an array of sub-values, just as in our original document.

Mapping Documents to Columns

When mapping data from a document-based structure back into columns, the process is significantly harder. You are specifically going from a flexible multi-field format where the number and type of fields could be entirely different into one where the format is rigid and inflexible.

There are two choices available:

  • Transform the data back into a table structure that matches the overall data structure of the underlying information as closely as possible. This is great if the document is relatively flat. But if, as we’ve seen, we have compound data, or variable-length compound information, this method doesn’t work as well.
  • Transform the data back out into a multi-table format that has the capability for joins. This is basically the reverse of the process we just examined for converting the table-based recipe data into documents. You must remember to use a unique identifier for the parent record so that it can be linked back properly when a JOIN is used.

The flatter the document, the easier the conversion.