Intelligent Linking and Indexing in DocBook

One of the issues I have with DocBook XML is that the links are a little forced and manual. 

By that, I mean that if I have a command, like trepctl, and I used it in a sentence or description, if I want to link trepctl back to the corresponding trepctl page, I have to manually add it like this:

<link linkend="cmdline-tools-trepctl"><command>trepctl</command></link>

Not only is that a mouthful to say, it’s a lot of keys to type. 

So I’ve fixed that. 

What I do instead is add a <remark> block into the documentation for that command-line page:

<section id="cmdline-tools-trepctl">
<title>The trepctl Command</title>
<remark role="index:canonical" condition="command:trepctl"/>
...
</section>

The ‘role’ attribute specifies the index entry, and that this is canonical, i.e., this is the canon page for <command>trepctl</command>

That’s what is defined in the ‘condition’ attribute. 

This means that when I put <command>trepctl</command>, during post-processing, the command is automatically linked to that page without me having to manually do that. 

You can see the effect of this at the top of this page.

It works for anything, and it works for longer fragments too, so I can do ‘Use the <command>trepctl status</command> command’, and the post-processing will automatically link to the canonical page for the trepctl status command. On that same page, you can see links to the field names in the output. 

This uses an extension to the original index reference format: 

<remark role="index:canonical" condition="parameter:appliedLastSeqno:thl"/>

That third argument to the condition attribute gives a ‘hint’ as to what it might apply to. This means that we can link using a commonly used DocBook element, such as parameter, and tag it to link to this canonical page, just by adding a condition attribute to the parameter element, like this:

<parameter condition="thl">appliedLastSeqno</parameter>

OK, so it’s still long, but it’s less complex than writing out <link> or <xref> elements, and it means that I don’t have to know the ID where the information is held, that’s entirely driven from marking up the content with the canonical index entry. 

Finally, and perhaps the most important point, is that you can go to any of the Continuent documentation pages and type either the partial or full command, and it will take you to the canonical page for that command, option, etc, which means not only is the content heavily linked (making it more useful), but it also makes it more easily searchable to the right place. 

MC at Percona Live San Francisco 2014

Now I’m back in the MySQL fold, I’ve got the opportunity to speak at Percona Live again. I’ve always enjoyed speaking at this conference (back when it was known by another name…), although I need to up my game and do the 6 talks I did back in 2009.

On the Tuesday afternoon, tutorials day, I’m running a half-day session with my replication colleague Linas Virbalas. This will be similar to the session I did at Percona Live London, and cover some of the more advanced content on replication, including, but not limited to:

  • Filters
  • JavaScript Filtering
  • Some fun and practical filters
  • Heterogeneous replication from MySQL out to MongoDB, Vertica, Oracle and Hadoop

I might even choose to demo Real-Time Replication from MySQL to Cassandra if we have time for it.

Then on Thursday Linas and I will be taking an even closer look at the Hadoop applier, demonstrating how it works, and how the fundamental principles of the applier can be mapped to other databases the same basic process.  Our key aim here is to show how easy and straightforward the process is, and how practical it makes the data transfer. We will be doing a live demo, and looking at why this is a better alternative than Sqoop.

Of course, I’m not the only Continuent person there, Giuseppe Maxia will be doing a tutorial session on Advanced Replication techniques, one on multi-master, and another on partitioning for performance. And of course Robert Hodges will be doing his keynote, and a talk on Cassandra, another on MongoDB, and a guide to Cloud systems and MySQL. Jeff and Neil, meanwhile, will be talking about geo-clustering, and Neil has talks on Puppet and using MySQL in the cloud.

 

MySQL to Hadoop Step-By-Step

We had a great webinar on Thursday about replicating from MySQL to Hadoop (watch the whole thing). It was great, but one of the questions at the end was ‘is there an easy way to test’.

Sadly we can’t go giving out convenient ready-to-run downloads of these things because of licensing and and other complexities, so I want to try and make it as simple and straightforward as possible by giving you the directions to complete. I’m going to be point to the Continuent Documentation every now and then so this is not too crowded, but we should get through it pretty easily.

Major Decisions

For this to work: 

  • We’ll setup two VMs, one the master (running MySQL), the other the slave (Running Cloudera)
  • The two VMs must be able to reach each other on the network. It doesn’t matter whether they are running Internal, NAT, or bridge-mode network, they just need to be able to ping and SSH each other. Switch off firewalls to prevent port weirdness.
  • For convenience, update your /etc/hosts to have a host1 (the master) and host2 (the slave)
  • The master must have followed the prereqs; for the slave it’s optional, but highly recommended

With that in mind, let’s get started.

Step 1: Setup your Master Host

There are a number of ways you can do this. If you want to simplify things and have VirtualBox, try downloading this VM. It’s a 1.5GB download containing and OVF VM, and is a Ubuntu host, with our prerequisites followed. To use this :

  1. Uncompress the package.
  2. Import the VM into your VirtualBox.
  3. If you want, change the network type from Internal to a bridged or NAT environment.

Using internal networking, you can login to this using:

shell> ssh -p2222 tungsten@localhost

Passwords are ‘password’ for tungsten and root.

If you don’t want to follow this but want your own VM:

  1. Create a new VM with 1-2GB of RAM, and 8GB or more of disk space
  2. Install your OS of choice, either Ubuntu or CentOS
  3. Follow our prerequisite instructions
  4. Make sure MySQL is setup and running, and that the binary logging is enabled
  5. Make sure it has a valid IP address

Step 2: Setup Tungsten Replicator

Download the latest Tungsten Replicator binary from this page

Unpack the file:

shell> tar zxf tungsten-replicator-3.0.tar.gz

Change into the directory:

shell> cd tungsten-replicator-3.0

Create a new replicator installation, this will read from the binary log into THL:

shell> ./tools/tpm install alpha \
--install-directory=/opt/continuent \
--master=host1 \
--members=host1 \
--java-file-encoding=UTF8 \
--java-user-timezone=GMT \
--mysql-enable-enumtostring=true \
--mysql-enable-settostring=true \
--mysql-use-bytes-for-string=false \
--svc-extractor-filters=colnames,pkey \
--property=replicator.filter.pkey.addColumnsToDeletes=true \
--property=replicator.filter.pkey.addPkeyToInserts=true \
--replication-password=password \
--replication-user=tungsten \
--skip-validation-check=HostsFileCheck \
--skip-validation-check=ReplicationServicePipelines \
--start-and-report=true

For a full description of what’s going on here, see this page and click on the magnifying glass. You’ll get the full description of each option.

To make sure everything is OK, you should get a status from trepctl generated. If it’s running and it shows the status as online, we’re ready.

Step 3: Get your Cloudera Host Ready

There are lots of ways to get Cloudera’s Hadoop solution installed. The ready-to-run VM is the simplest by far.

  1. Download the Cloudera VM quick start host from here; there are versions for VirtualBox and VMware and KVM.
  2. Set the networking type to match the master.
  3. Start the host
  4. Set the hostname to host2
  5. Update the networking to an IP address that can talk to the master.
  6. Update /etc/hosts to add the IP address of host1 and host2 e.g.:

192.168.0.2 host1

Add a ‘tungsten’ user which we will use to install Tungsten Replicator.

Step 4: Install your Hadoop Slave

Download the latest Tungsten Replicator binary from this page

Unpack the file:

shell> tar zxf tungsten-replicator-3.0.tar.gz

Change into the directory:

shell> cd tungsten-replicator-3.0

Create a new replicator installation, this will read the information from the master (host1) and apply it to this host (host2)

shell> ./tools/tpm install alpha \
--batch-enabled=true \
--batch-load-language=js \
--batch-load-template=hadoop \
--datasource-type=file \
--install-directory=/opt/continuent \
--java-file-encoding=UTF8 \
--java-user-timezone=GMT \
--master=host1 \
--members=host2 \
'--property=replicator.datasource.applier.csv.fieldSeparator=\\u0001' \
--property=replicator.datasource.applier.csv.useQuotes=false \
--property=replicator.stage.q-to-dbms.blockCommitInterval=1s \
--property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 \
--replication-password=secret \
--replication-user=tungsten \
--skip-validation-check=DatasourceDBPort \
--skip-validation-check=DirectDatasourceDBPort \
--skip-validation-check=HostsFileCheck \
--skip-validation-check=InstallerMasterSlaveCheck \
--skip-validation-check=ReplicationServicePipelines \
--start-and-report=true

For a description of the options, visit this page and click on the second magnifying glass to get the description.

As before, we want everything to be running and for the replicator to be online, run:

shell> trepctl status

This should tell you everything is running – if you get an error about this not being found, source the environment to populate your PATH correctly:

shell> source /opt/continuent/share/env.sh

We want everything to be online and running. If it isn’t, use the docs to help determine the reason, or use our discussion group to ask questions.

Step 5: Generating DDL

For your chosen MySQL database schema, you need to generate the staging and live table definitions for Hive.

A tool, ddlscan, is provided for this. You need to run it and provide the JDBC connect string for your database, and your user and password. If you followed the prereqs, use the one for the tungsten user.

First create the live table DDL:

shell> ddlscan -user tungsten -url 'jdbc:mysql://host1:3306/test' -pass password -template ddl-mysql-hive-0.10.vm -db test > schema.sql

Now apply it to Hive:

shell> cat schema.sql | hive

To create Hive tables that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template on the same database:

shell> ddlscan -user tungsten -url 'jdbc:mysql://host:3306/test' -pass password -template ddl-mysql-hive-0.10-staging.vm -db test > schema-staging.sql

Now apply it to Hive again:

shell> cat schema-staging.sql | hive

Step 6: Start Writing Data

Hopefully by this point you’ve got two VMs, one running MySQL and the master replicator extracting info from the MySQL binary log. On the other, you have a basic Cloudera instance with a slave replicator writing changes. Both replicator should be online (use ‘trepctl status’ to check).

All you need to do is start writing data into the tables you selected when creating the DDL. That should be it – you should see data start to stream into Hadoop.

Real-Time Replication from MySQL to Cassandra

Earlier this month I blogged about our new Hadoop applier, I published the docs for that this week (http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html) as part of the Tungsten Replicator 3.0 documentation (http://docs.continuent.com/tungsten-replicator-3.0/index.html). It contains some additional interesting nuggets that will appear in future blog posts.

The main part of that functionality that performs the actual applier for Hadoop is based around a JavaScript applier engine – there will eventually be docs for that as part of the Batch Applier content (http://docs.continuent.com/tungsten-replicator-3.0/deployment-batchloading.html). The core of this system is that it    takes the information from the data stream of the THL and the CSV file that was written by the batch applier system, and runs the commands necessary to load it into Hadoop and perform any necessary merges.

I wanted to see how easy it would be to use the same system to use that same flexible system and bend it to another database system, in my case, I chose Cassandra.

For the record, it took me a couple of hours to have this working, and I’m guessing another hour will file down some of the rough edges.

Cassandra is interesting as a database because it mixes a big distributed key/value store with a close-enough to SQL like interface in the form of CQL. And that means we can make use of the CQL to help us perform the merging into the final tables in a manner not dissimilar to the method we use for loading into Vertica.

Back to the Javascript batch loader, the applier provides five different implementable functions (all are technically optional) that you can use at different stages of the applier process. These are:

  • prepare() – called once when the applier goes online and can be used to create temporary directories or spaces
  • begin() – called at the start of each transaction
  • apply() – called at the end of the transaction once the data file has been written, but before the commit
  • commit() – called after each transaction commit has taken place; this where we can consolidate info.
  • release() – called when the applier goes offline

We can actually align these functions with a typical transaction – prepare() happens before the statements even start, begin() is the same as BEGIN, apply() happens immediately before COMMIT and commit() happens just after. release() can be used to do any clean up afterwards.

So let’s put this into practice and use it for Cassandra.

The basic process for loading is as follows:

  1. Write a CSV file to load into Cassandra
  2. Load the CSV file into a staging table within Cassandra; this is easy through CQL using the ‘COPY tablename FROM filename’ CQL statement.
  3. Merge the staging table data with a live table to create a carbon copy of our MySQL table content.

For the loading portion, what we’ll do is load the CSV into a staging table, and then we’ll merge the staging table and live table data together during the commit stage of our batch applier. We’ll return to this in more detail.

For the merging, we’ll take the information from the staging table, which includes the sequence number and operation type, and then write the ‘latest’ version of that row and put it into the live table. That gives us a structure like this:

Cassandra Loader

Tungsten Replicator is going to manage this entire process for us – all we need to do ins install the replicators, plug in these custom bits, and let it run.

As with the Hadoop applier, what we’re going to do is use the batch applier to generate only insert and delete rows; UPDATE statements will be converted into a delete of the original version and insert of the new version. So:

INSERT INTO sample VALUES (1,’Message’)

Is an insert…

DELETE sample WHERE id  = 1

Is a delete, and:

UPDATE sample SET message = ’Now you see me’ WHERE id = 1

is actually:

DELETE sample WHERE id  = 1
 INSERT INTO sample VALUES (1,’Now you see me’)

This gets round the problem of doing updates (which in big data stores are expensive, particularly Hadoop which doesn’t support updating existing data), into a more efficient delete and insert.

In the CSV data itself, this is represented by prefix every row with three fields:

optype, sequence number, unique id

Optype is ‘D’ for a delete and ‘I’ for an insert and is used to identify what needs to be done. The sequence number is the unique transaction ID from the replicator THL. This number increases by one for every transaction, and this means we can always identify the ‘latest’ version of a row, which is important to us when processing the transaction into Cassandra. the unique ID is the primary key (or compound key) from the source data. We need this to ensure we update the right row. To replicate data in this way, we must have a primary key on the data. If you don’t have primary keys, you are probably in a world of hurt anyway, so it shouldn’t be a stretch.

One difficulty here is that we need to cope with an idiosyncracy of Cassandra, which is that by default, Cassandra orders fields in the ‘tables’ (really collections of key/values) so that integers and numbers appear first in the table, and text appears last. This is an optimisation that Cassandra makes that complicates things for us, but only in a very small way. For the moment, we’ll handle it by assuming that we are loading only one table with a known format into Cassandra. We could handle multiple tables by using a simple IF statement in the JS and using different formats for that, or we could actually extract the info from the incoming data; I’m going to skip that because it keeps us away from the cool element of actually getting the data in.

Within Cassandra then we have two tables, the table we are loading data into, and the staging table that we load the CSV data into. For our sample, the live schema is ‘sample’, the live table is ‘sample’ and the staging table is ‘staging_sample’.

The definitions for these in Cassandra are for the sample live table:

 CREATE TABLE sample (
 id int,
 message text,
 PRIMARY KEY (id)
 ) WITH
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 index_interval=128 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 default_time_to_live=0 AND
 speculative_retry='99.0PERCENTILE' AND
 memtable_flush_period_in_ms=0 AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'LZ4Compressor'};

And for the staging_sample table:

CREATE TABLE staging_sample (
 optype text,
 seqno int,
 fragno int,
 id int,
 message text,
 PRIMARY KEY (optype, seqno, fragno, id)
 ) WITH
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 index_interval=128 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 default_time_to_live=0 AND
 speculative_retry='99.0PERCENTILE' AND
 memtable_flush_period_in_ms=0 AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'LZ4Compressor'};

I’ve put both tables into a ‘sample’ collection.

Remember that that idiosyncrasy I mentioned? Here it is, a bare table loading from CSV will actually order the data as:

seqno,uniqno,id,optype,message

This is Cassandra’s way of optimising integers over text to speed up lookups, but for us is a minor niggle. Right now, I’m going to handle it by assuming we are replicating only one schema/table and we we not what the structure of that looks like. Longer term, I want to pull it out of the metadata, but that’s a refinement.

So let’s start by having a look at the basic JS loader script, it’s really the component that is going to handle the core element of the work, managing the CSV files that come in from the batch engine and applying them into Cassandra. Remember, there are five functions that we can define, but for the purposes of this demonstration we’re going to use only two of them, apply(), which will load the CSV file into Cassandra, and the commit() function, which will perform the steps to merge the stage data.

The apply() function does two things, it identifies the table and schema, and then runs the command to load this data into Cassandra through the cqlsh command-line tool. We actually can’t run CQL directly from this command line, but I wrote a quick shell script that pipes CQL from the command-line into a running cqlsh.

The commit() function on the other hand is simpler, although it does a much more complicated job using another external script, this time written in Ruby.

So this gives us a cassandra.js script for the batch applier that looks like this:

function apply(csvinfo)
{
   sqlParams = csvinfo.getSqlParameters();
   csv_file = sqlParams.get("%%CSV_FILE%%");
   schema = csvinfo.schema;
   table = csvinfo.table;
  runtime.exec("/opt/continuent/share/applycqlsh.sh " + schema + ' "copy staging_' + table + " (optype,seqno,uniqno,id,message) from '" + csv_file + "';\"");
}

function commit()
{
  runtime.exec("/opt/continuent/share/merge.rb " + schema);
}

So, the apply() function is called for each event as written into the THL from the MySQL binary log, and the content of the CSV file generated at that point contains the contents of the THL event; if it’s one row, it’s a one-row CSV file; if it’s a statement or transaction that created 2000 rows, it’s a 2000 row CSV file.

The csvinfo object that is provided contains information about the batch file that is written, including, as you can see here, the schema and table names, and the sequence number. Note that we could, at this point, pull out table info, but we’re going to concentrate on pulling a single table here just for demo purposes.

The CQL for loading the CSV data is:

COPY staging_tablename (optype,seqno,uniqno,id,message) from ‘FILENAME’;

This says, copy the the specific columns in this order from the file into the specified table.  As I mentioned, currently this is hard coded into the applier JS, but would be easy to handle for more complex schemas and structures.

The commit() function is even simpler, because it just calls a script that will do the merging for us – we’ll get to that in a minute.

So here’s the script that applies an arbitrary CQL statement into Cassandra:

 #!/bin/bash
SCHEMA=$1;shift
echo "$*" |cqlsh -k $SCHEMA tr-cassandra2

Really simple, but gets round a simple issue.

The script that does the merge work is more complex; in other environments we might be able to do this all within SQL, but CQL is fairly limited with no sub-queries. So we do it long-hand using Ruby. The basic sequence is quite simple, and is in two phases:

  1. Delete every row mentioned in the staging table with an optype of D with a matching unique key
  2. Insert the *last* version of an insert for each unique ID – the last version will be the latest one in the output. We can pick this out by just iterating over every insert and picking the one with the highest Sequence number as generated by the THL transaction ID.
  3. Delete the content from the staging table because we’ve finished with it. That empties the staging table ready for the next set of transactions.

That file looks like this:

#!/usr/bin/ruby

require 'cql'

client = Cql::Client.connect(hosts: ['192.168.1.51'])
client.use('sample')

rows = client.execute("SELECT id FROM staging_sample where optype = 'D'")

deleteids = Array.new()

rows.each do |row|
puts "Found ID #{row['id']} has to be deleted"
deleteids.push(row['id'])
end

deleteidlist = deleteids.join(",")

client.execute("delete from sample where id in (#{deleteidlist})");
puts("delete from sample where id in (#{deleteidlist})");
rows = client.execute("SELECT * FROM staging_sample where optype = 'I'");

updateids = Hash.new()
updatedata = Hash.new()

rows.each do |row|
id = row['id']
puts "Found ID #{id} seq #{row['seqno']} has to be inserted"
if updateids[id]
if updateids[id] < row['seqno']
updateids[id] = row['seqno']
row.delete('seqno')
row.delete('fragno')
row.delete('optype')
updatedata[id] = row
end
else
updateids[id] = row['seqno']
row.delete('seqno')
row.delete('fragno')
row.delete('optype')
updatedata[id] = row
end
end

updatedata.each do |rowid,rowdata|
puts "Should update #{rowdata['id']} with #{rowdata['message']}"
collist = rowdata.keys.join(',')
colcount = rowdata.keys.length
substbase = Array.new()
#  (1..colcount).each {substbase.push('?')}
rowdata.values.each do |value|
if value.is_a?(String)
substbase.push("'" + value.to_s + "'")
else
substbase.push(value)
end
end

substlist = substbase.join(',')

puts('Column list: ',collist)
puts('Subst list: ',substlist)
cqlinsert = "insert into sample ("+collist+") values ("+substlist+")"
puts("Statement: " + cqlinsert)
client.execute(cqlinsert)
end

client.execute("delete from staging_sample where optype in ('D','I')")

Again, currently, this is hard coded, but I could easily of got the schema/table name from the JS batch applier – the actual code is table agnostic and will work with any table.

So, I’ve setup two replicators – one uses the cassandra.js rather than hadoop.js but works the same way, and copied the applycqlsh.sh and merge.rb into /opt/continuent/share.

And we’re ready to run. Let’s try it:

mysql> insert into sample values (0,'First Message’);
Query OK, 1 row affected (0.01 sec)

We’ve inserted one row. Let’s go check Cassandra:

cqlsh:sample> select * from sample;

id  | message
-----+---------------
489 | First Message

Woohoo – data from MySQL straight into Cassandra.

Now let’s try updating it:

mysql> update sample set message = 'Updated Message' where id = 489;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

And in Cassandra:

cqlsh:sample> select * from sample;

id  | message
-----+-----------------
489 | Updated Message

Bigger woohoo. Not only am I loading data directly into Cassandra, but I can update it as well. Now I can have a stream of update and information within MySQL replicated over to Cassandra for whatever analysis or information that I need without any issues.

Cool huh? I certainly think so (OK, but I’m biased).

Now I haven’t tested it, but this should just as easily work from Oracle; I’ll be testing that and let you know.

Any other database destinations people would like to see for replicating into? If so, let me know and I’ll see what I can do.

Getting Data into Hadoop in real-time

Moving data between databases is hard. Without ever intending it, I seem to have spent a lifetime working on solutions for getting data into and out of databases, but more frequently between. In fact, my first job out of university was migrating data from BRS/Text, a free-text database (probably what we would call a NoSQL) into a more structured Oracle.

Today I spend some of my time working in Big Data, more often than not, migrating information from existing data stores into Big Data so that they can be analysed, something I covered in more detail here:

http://www.ibm.com/developerworks/library/bd-sqltohadoop1/index.html
http://www.ibm.com/developerworks/library/bd-sqltohadoop2/index.html
http://www.ibm.com/developerworks/library/bd-sqltohadoop3/

The problem with the current techniques, Sqoop included, is that they rely on a relatively manual, even basic, transfer process. Dump your data out, reload it back again into Hadoop.

Even with Sqoop, although it automates much of the process, it is not entirely reliable, especially if you want to do more than simple dump and load. Serial loading, or incrementally transferring data from MySQL or Oracle, is fraught with problems, not least of which is that it requires adding a timestamp to your data structure to get the best results out of it.

Perhaps worse though is that Sqoop is an intermittent, even periodic transfer system. Incremental loading works by copying all the changed records since a specific point in time. Running it too frequently is counter productive, which means you end up using a 15-minute or every-couple-of-hour period, depending on your database activity.

Most databases have some kind of stream of changes that enables you to see everything that has happened on the database. With MySQL, that’s the binary log. And with the Open Source Tungsten Replicator tool we take advantage of that so that we can replicate into MySQL, and indeed into Oracle, MongoDB and Vertica, among others.

51c6a7b5abcca6c30f7d79ea8eba17f0

Reading the data out from MySQL is lightweight since the master just reads the contents of the binary log; especially compared to Sqoop, which uses read locks and SELECT * with and without LIMIT clauses.

Right now we’re working on an applier that writes that data into Hadoop in real time from MySQL. Unlike Sqoop, we provide a continuous stream of changes from MySQL into the immutable store of Hadoop.

But the loading and nature of Hadoop presents some interesting issues, not least of which (if you’ve been following my other articles) is the fact that data written into Hadoop is immutable. For data that is constantly changing, an immutable store is not the obvious destination.

We get round that by using the batch loading system to create CSV files that contain the data, changes and sequence numbers, and then loading that information into Hadoop. In fact, Robert has updated the batch loader to use a new JavaScript based system (of which more in a future blog post) that simplifies the entire process, without requiring a direct connection or interface to Hadoop (although we can write directly into HDFS).

For example, the MySQL row:


| 3 | #1 Single | 2006 | Cats and Dogs (#1.4) |

Is represented within the staging files generated as:


I^A1318^A3^A3^A#1 Single^A2006^ACats and Dogs (#1.4)

That’s directly accessible by Hive. In fact, using our ddlscan tool, we can even create the Hive table definitions for you:


ddlscan -user tungsten -url 'jdbc:mysql://host1:13306/test' -pass password \
-template ddl-mysql-hive-0.10.vm -db test

Then we can use that record of changes to create a live version of the data, using a straightforward query within Hive. In fact, Hive provides the final crucial stage of the loading process by giving us that live view of the change data, and we simplify that element by providing the core data, and ensuring that the CSV data is in the right format for Hive to use the files without changes.

The process is quite remarkable; speed-wise for direct dumps, Tungsten Replicator is comparable to Sqoop, but when it comes to change data, the difference is that we have the information in real time. You don’t have to wait for the next Sqoop load, or for the incremental loading and row selection of Sqoop, instead, we just apply the changes written into the binary log.

Of course, we can fine tune the intervals of the writes of the CSV change data into Hadoop using the block commit properties (see http://docs.continuent.com/tungsten-replicator-2.2/performance-block.html). For example, this means by default we commit into Hadoop every 10s or 10,000 rows, but we can change it to commit every 5s or 1,000 rows if your data is critical and busy.

We’re still optimising and improving the system, but I can tell you that in my own tests we can handle GB of change data and information in a live fashion, both across single-table and multi-table/multi-schema datasets. What’s particularly cool is that if you are using Hadoop as a concentrator for all of your MySQL data for analysis, we can transfer from multiple MySQL servers into Hadoop simultaneously and take advantage of the multi-node Hadoop environment to cope with the load.

Anonymizing Data During Replication

If you happen to work with personal data, chances are you are subject to SOX (Sarbanes-Oxley) whether you like it or not.

One of the worst aspects of this is that if you want to be able to analyse your data and you replicate out to another host, you have to find a way of anonymizing the information. There are of course lots of ways of doing this, but if you are replicating the data, why not anonymize it during the replication?

Of the many cool features in Tungsten Replicator, one of my favorites is filtering. This allows you to process the stream of changes that are coming from the data extracted from the master and perform operations on it. We use it a lot in the replicator for ignoring tables, schemas and columns, and for ensuring that we have the correct information within the THL.

Given this, let’s use it to anonymize the data as it is being replicated so that we don’t need to post-process it for analysis, and we’re going to use JavaScript to do that.

For the actual anonymization, we’re going to use a simple function that devolves the content into an anonymizer. For this, I’m going to use the md5 in JavaScript function provided by Paul Johnston here: http://pajhome.org.uk/crypt/md5, although others are available. The main benefit of using md5 is that the same string of text will always be hashed into a consistent value. This is important because it means that we can still run queries with joins between the data, knowing that, for example, the postcode ‘XQ23 1LD’ will be based into ‘d41d8cd98f00b204e9800998ecf8427e’ in every table. Joins still work, and data analysis is entirely valid.

Better still, because it’s happening during replication I always have a machine with anonymised information that I can use to query my data without worrying about SOX tripping me up.

Within the JS filter environment there are two key functions we need to use. One is prepare(), which is called when the replicator goes online, and the other is filter() which processes each event within the THL.

In the prepare() function, I’m going to identify from the configuration file which fields from the configuration file we are going to perform the actual hashing operation on. We do that by creating a hash structure within JavaScript that maps the schema, table name, and field. For example, to anonymise the field ‘postcode’ in ‘address’ in the schema ‘customers’:

stcspec=customers.address.postcode

For this to work, we must have the colnames filter enabled.

The function itself just splits the stcspec parameter from the configuration file into a hash of that combo:

var stcmatch = {};
function prepare()
{
  logger.info("anonymizer: Initializing...");  

    stcspec = filterProperties.getString("stcspec");
    stcarray = stcspec.split(",");
    for(i=0;i<stcarray.length;i++) { 
        stcmatch[stcarray[i]] = 1;
    }

}

The filter() function is provided one value, the event object from the THL. We operate only on ROW-based data (to save us parsing the SQL statement), and then supply the event to an anonymize() function for the actual processing:

function filter(event)
{
  data = event.getData();
  if(data != null)
  {
    for (i = 0; i < data.size(); i++)
    {
      d = data.get(i);

      if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.StatementData)
      {
          // Ignore statements
      }
      else if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.RowChangeData)
      {
          anonymize(event, d);
      }
    }
  }
}

Within the anonymise event, we extract the schema and table name, and then look at each column, and if it exists in our earlier stcspec hash, we change the content of the THL on the way past to be the hashed value, in place of the original field value. To do this we iterate over the rowChanges, then over the columns, then over the individual rows:

function anonymize(event, d)
{
  rowChanges = d.getRowChanges();

  for(j = 0; j < rowChanges.size(); j++)
  {
    oneRowChange = rowChanges.get(j);
    var schema = oneRowChange.getSchemaName();
    var table = oneRowChange.getTableName();
    var columns = oneRowChange.getColumnSpec();

    columnValues = oneRowChange.getColumnValues();
    for (c = 0; c < columns.size(); c++)
    {
      columnSpec = columns.get(c);
          columnname = columnSpec.getName();

      rowchangestc = schema + '.' + table + '.' + columnname;

      if (rowchangestc in stcmatch) {

        for (row = 0; row < columnValues.size(); row++)
        {
            values = columnValues.get(row);
            value = values.get(c);
            value.setValue(hex_md5(value.getValue()));

        }
      }
    }
  }
}

Append the md5() script from Paul Johnston (or indeed whichever md5 / hashing algorithm you want to use) to the end of the entire script text:

var stcmatch = {};
function prepare()
{
  logger.info("anonymizer: Initializing...");  

    stcspec = filterProperties.getString("stcspec");
    stcarray = stcspec.split(",");
    for(i=0;i<stcarray.length;i++) { 
        stcmatch[stcarray[i]] = 1;
    }

}

function filter(event)
{
  data = event.getData();
  if(data != null)
  {
    for (i = 0; i < data.size(); i++)
    {
      d = data.get(i);

      if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.StatementData)
      {
          // Ignore statements
      }
      else if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.RowChangeData)
      {
          anonymize(event, d);
      }
    }
  }
}

function anonymize(event, d)
{
  rowChanges = d.getRowChanges();

  for(j = 0; j < rowChanges.size(); j++)
  {
    oneRowChange = rowChanges.get(j);
    var schema = oneRowChange.getSchemaName();
    var table = oneRowChange.getTableName();
    var columns = oneRowChange.getColumnSpec();

    columnValues = oneRowChange.getColumnValues();
    for (c = 0; c < columns.size(); c++)
    {
      columnSpec = columns.get(c);
          columnname = columnSpec.getName();

      rowchangestc = schema + '.' + table + '.' + columnname;

      if (rowchangestc in stcmatch) {

        for (row = 0; row < columnValues.size(); row++)
        {
            values = columnValues.get(row);
            value = values.get(c);
            value.setValue(hex_md5(value.getValue()));

        }
      }
    }
  }
}

Hint

Depending on your configuration, datatypes and version, the return from getValue() is a byte array, not a character string; in that case, add this function:

function byteArrayToString(byteArray) 
{ 
   str = ""; 
   for (i = 0; i < byteArray.length; i++ ) 
   { 
      str += String.fromCharCode(byteArray[i]); 
   } 
   return str; 
}

And change:

value.setValue(hex_md5(value.getValue()));

to:

value.setValue(hex_md5(byteArrayToString(value.getValue())));

That will correctly convert it into a string.

If the error hits, Tungsten Replicator will just stop on that event, not apply bad data. Putting it ONLINE again after changing the script will re-read the event, re-process it through the filter, and then apply the data.

end Hint

Now we need to manually update the configuration. On a Tungsten Replicator slave, open the static-SERVICENAME.properties file in /opt/continuent/tungsten/tungsten-replicator/conf and then add the following lines within the filter specification area (about 90% of the way through):

replicator.filter.anonymize=com.continuent.tungsten.replicator.filter.JavaScriptFilter 
replicator.filter.anonymize.script=/opt/continuent/share/anonymizer.js 
replicator.filter.anonymize.stcspec=customers.address.postcode

The first line defines a filter called anonymize that uses the JavaScriptFilter engine, the second line specifies the location of the JavaScript file, and the third contains the specification of which fields we will change, separated by a comma.

Now, find the line containing “replicator.stage.q-to-dbms.filters” around about line 200 or so and add ‘anonymize’ to the end of the filter list.

Finally, make sure you copy the anonymizer.js script into the /opt/continuent/share directory (or wherever you want to put it that matches the paths specified above).

Now restart the replicator:

$ replicator restart

On your master, make sure you have the colnames filter-enabled. You can do this in master’s static-SERVICENAME.properties like this:

replicator.stage.binlog-to-q.filters=colnames,pkey

Now restart the master replicator:

$ replicator restart

Double check that the replicator is online using trepctl; it will fail if the config is wrong or the JavaScript isn’t found. If everything is running, go to your master and make sure you enable row-based logging (my.cnf: binlog-format=’ROW’), and then try inserting some data into the table that we are anonymizing:

mysql> insert into address values(0,'QX17 1LG');

Now check the value on the slave:

| 711 | dc889465b382 |

Woohoo!

Anonymized data now exists on the slave without having to manually run the process to clean the data.

If you want to extend the fields this is applied to, add them to the stcspec in the configuration, separating each one by a comma, and make sure you restart the replicator.

Developing Applications for use with Continuent Tungsten and Tungsten Replicator in SDJ

I’ve just had a new article published with the Software Developers Journal talking about how you can write applications to take full advantage of Continuent Tungsten and Tungsten Replicator.

As a developer of an application there really isn’t a problem better than finding that you have to scale up the application and the database that supports it to handle the increased load. The main bottleneck to most expansion is the database server and in many modern environments that replication is based around MySQL. Application servers are easy to add on to the front-end of your environment.

Read: Qt5 – How to Become a Professional Developer- RELEASED | News | Magazine for software developers, programmers and designers – Software Developers Journal (registration/purchase required)

Percona Live 2013, MySQL, Continuent and an ever-healthy Ecosystem

I’m sitting here in the lounge at SFO thinking back on the last week, the majority of which has been spent meeting my new workmates and attending the Percona MySQL conference.

For me it has been as much of a family reunion as it has been about seeing the wonderful things going on in MySQL.

Having joined Continuent last month after an ‘absence’ in NoSQL land of almost 2.5 years, joining the MySQL community again just felt like coming home after a long absence. And that’s no bad thing. On a very personal level it was great to see so many of my old friends, many of whom were not only pleased to see me, but pleased to see me working back in the MySQL fold. Evidently many people think this is where I belong.

What was great to see is that the MySQL community is alive and well. Percona may be the drivers behind the annual MySQL conference that we have come to know, but behind the name on the passes and over the doors, nothing has changed in terms of the passion behind the core of the project.

Additionally, it’s great to see that despite all of the potential issues and tragedies that were predicted when Oracle took over the reins of MySQL, as Baron says, they are in fact driving and pushing the project forward. The features in 5.6 are impressive and useful, rather than just a blanket cycling of the numbers. I haven’t had the time to look at 5.7, but I doubt it is just an annual increment either. When I left Oracle, people were predicting MySQL would be dead in two years as an active project at Oracle, but in fact what seems to have happened is that the community has rallied round it and Oracle have seen the value and expertly steered it forward.

It’s also interesting to me – as someone who moved outside the MySQL fold – to note that other databases haven’t really supplanted the core of the MySQL foothold. Robert Hodge’s Keynote discussed that in more depth, and I see no reason to disagree with him.

I’m pleased to see that my good friend Giuseppe had his MySQL Sandbox when application of the year 2013 – not soon enough in my eyes, given that as a solution for running MySQL it has been out there for more years than I care to remember.

I’m also delighted of course that Continuent won Corporate contributor of the year. One of the reasons I joined the company is because I liked what they were doing. Replication in MySQL is unnecessarily hard, particularly when you get more than one master, or want to do clever things with topologies beyond the standard master/slave. I used Federated tables to do it years ago, but Tungsten makes the whole process easier.  What Continuent does is provide an alternative to MySQL native replication which is not only more flexible, but also fundamentally very simple. In my experience, simple ideas are always very powerful, because their simplicity makes them easy to adapt, adopt and add to.

Of course, Continuent aren’t the only company producing alternatives for clustering solutions with MySQL, but to me that shows there is a healthy ecosystem willing to build solutions around the powerful product at the centre. It’s one thing to choose an entirely different database product, but another to use the same product with one or more tools that produces an overall better solution to the problem. *AMP solutions haven’t gone away, we’ve just extended and expanded on top of them. That must mean that MySQL is just as powerful and healthy a core product as it ever was.

That’s my key takeaway from this conference – MySQL is alive and well, and the ecosystem it produced is as organic and thriving as it ever has been, and I’m happy to be back in the middle of that.

Joining Continuent

I’ve just completed my first month here at Continuent, strangely back into the MySQL ecosystem which I have been working in for some time before I joined CouchOne, and then Couchbase, two and half years ago. Making the move back to MySQL is both an experience, and somehow, comfortable…

Continuent produce technology that makes for easier replication between MySQL servers and, more importantly, more flexible solutions when you need to scale out by providing connector and management functionality for your MySQL cluster. That means that you can easily backup, add slaves, and create complex replication scenarios such as multi-master, and even multiple-site, multiple-master topologies. This functionality is split over two products, Continuent Tungsten, which is the cluster management product, and the open source Tungsten Replicator, which provides the basic replication functionality.

Those who know me well will know that I am no fan of the native MySQL replication, and that’s almost entirely because of the complexities of first of all getting it to work, followed by keeping it working, and ultimately because of the variability of the replication in the first place. There’s no reliable way to know if the replication stream has successfully been applied to the slaves or, from a client perspective, how far behind slaves are so that you can make an educated guess about which slave you should be talking to. Let’s not even get into the complexities of having to handle the read/write splitting required to make the master/slave relationship work in the first place.

Continuent solves this problem by using the binary log stream from MySQL, but handling the transfer and application of those bin log entries. Using this method enables Continuent to monitor and manage the replication process. Continuent knows when a statement has been applied to a slave, and it can work to ensure that the application of the changes is applied. With Continuent Tungsten, we go a stage further and provide a connector that sits between your application servers and your MySQL servers. Because Continuent Tungsten handles the replication, it knows the cluster topology, and can redirect queries to the master or the slave, and handle failures by directing the client queries to working slaves. Like all good software, it’s simple, but very very effective, and ergo very powerful.

So what am I doing at Continuent? Building out the documentation and helping users to help themselves, in combination with working with the developers to make sure that the software is as easy, intuitive, and foolproof to use as possible. In the short term, that means ensuring we have the core documentation required to get Continuent working for you.

If there’s more information you need, or something you specifically want in the documentation, let me know.