Nicolas Tobias has written an awesome guide to setting up replication from MariaDB to Hadoop/HDFS using Tungsten Replicator, in Spanish! He’s planning more of these so if you like what you see, please let him know!
Semana santa y yo con nuevas batallas que contar.
Me hayaba yo en el trabajo, pensando en que iba a invertir la calma que acompa;a a los dias de vacaciones que libremente podemos elegir trabajar y pense: No seria bueno terminar esa sincronizacion entre los servidores de mariaDB y HIVE?
Ya habia buscado algo de info al respecto en Enero hasta tenia una PoC montada con unas VM que volvi a encender, pero estaba todo podrido: no arrancaba, no funcionba ni siquiera me acordaba como lo habia hecho y el history de la shell er un galimatias. Decidi que si lo rehacia todo desde cero iba a poder dejarlo escrito en un playbook y ademas, aprenderlo y automatizarlo hasta el limite de poder desplegar de forma automatica on Ansible.
The session is at 7pm Tuesday night – look for rooms D135 and/or D137/138.
Correction: We are now in E144 on Tuesday with the Hadoop get together first at 7pm, and the Data Migration to follow at 8pm.
I’m actually going to be joined by Gwen Shapira from Cloudera, who has a BoF session on Hadoop next door at the same time, along with Eric Herman from Booking.com. We’ll use the opportunity to talk all things Hadoop, but particularly the ingestion of data from MySQL and other databases into the Hadoop datastore.
As always, it’d be great to meet anybody interested in Hadoop at the BoF, please come along and introduce yourselves, and hopefully I’ll see you next week!
I’m pleased to say that Continuent will be at the Hadoop Summit in San Jose next week (3-5 June). Sadly I will not be attending as I’m taking an exam next week, but my colleagues Robert Hodges, Eero Teerikorpi and Petri Versunen will be there to answer any questions you have about Continuent products, and, of course, Hadoop replication support built into Tungsten Replicator 3.0.
If you are at the conference, please go along and say hi to the team. And, as always, if there are any questions please let them or me know.
An article about moving data into Hadoop in real-time has just been published over at DBTA, written by me and my CEO Robert Hodges.
In the article I talk about one of the major issues for all people deploying databases in the modern heterogenous world – how do we move and migrate data effectively between entirely different database systems in a way that is efficient and usable. How do you get the data you need to the database you need it in. If your source is a transactional database, how does that data get moved into Hadoop in a way that makes the data usable to be queried by Hive, Impala or HBase?
So I’ve submitted my talks for the Tech14 UK Oracle User Group conference which is in Liverpool this year. I’m not going to give away the topics, but you can imagine they are going to be about data translation and movement and how to get your various databases talking together.
I can also say, after having seen other submissions for talks this year (as I’m helping to judge), that the conference is shaping up to be very interesting. There’s a good spread of different topics this year, but I know from having talked to the organisers that they are looking for more submissions in the areas of Operating Systems, Engineered Systems and Development (mobile and cloud).
If you’ve got a paper, presentation, or idea for one that you think would be useful, please go ahead and submit your idea.
I’m also pleased to say that I’ll be at OSCON in Oregon in July, handling a Birds of a Feather (BOF) session on the topic of exchanging data between MySQL, Oracle and Hadoop. I’ll be there with my good friend Eric Herman from Booking.com where we’ll be providing advice, guidance, experiences, and hoping to exchange more ideas, wishes and requirements for heterogeneous environments.
It’d be great to meet you if you want to come along to either conference.
A new article on has been published on IBM developerWorks, looking at the basics of processing machine data using Hadoop, from extracting the core data, storing it, and then determining the baselines and trigger points required to identifying worrying trends and points. From the intro:
Machine data can come in many different formats and quantities. Weather sensors, fitness trackers, and even air-conditioning units produce massive amounts of data, which begs for a big data solution. But how do you decide what data is important, and how do you determine what proportion of that information is valid, worth including in reports, or valuable in detecting alert situations? This article covers some of the challenges and solutions for supporting the consumption of massive machine data sets that use big data technology and Hadoop.
One of the key platforms I’ve been testing on for the MySQL to Hadoop replication has been Cloudera, largely driven by customer requirements, but it’s also one of the easiest way to get started with Hadoop.
What I’m even more pleased about is the fact that we are proud to announce that Tungsten Replicator 3.0 is certified for use on the new Cloudera Enterprise 5 platform. That means that we’re sure that replicating your data from MySQL to Cloudera 5 and have it work without causing problems or difficulties on the Hadoop loading and materialisation.
Cloudera is a great product, and we’re very happy to be working so effectively with the new Cloudera Enterprise 5. Cloudera certainly makes the core operation of managing and monitoring your Hadoop cluster so much easier, while still providing core functionality from the Hadoop family like Hive, HBase and Impala.
What I’m really interested in is the support for Spark, which will allow much easier live-querying and access to data. That should make some data processing and live data views much easier to build and query further down the line.
Hopefully by now you have already seen that we are working on Hadoop replication. I’m happy to say that it is going really well. I’ve managed to push a few terabytes of data and different data sets through into Hadoop on Cloudera, HortonWorks, and Amazon’s Elastic MapReduce (EMR). For those who have been following my long association with the IBM InfoSphere BigInsights Hadoop product, and I’m pleased to say that it’s working there too. I’ve had to adapt Robert’s original script to work with the different versions of the underlying Hadoop tools and systems to make it compatible. The actual performance and process is unchanged; you just use a different JS-based batchloader script to work with different tools.
Robert has also been simplifying some of the core functionality, such as configuring some fixed pre-determined formats, so you no longer have to explicitly set the field and record separators.
I’ve also been testing the key feature of being able to integrate the provisiong of information using Sqoop and merging that original Sqooped data into Hadoop, and then following up with the change data that the replicator is effectively transferring over. The system works exactly as I’ve just described – start the replicator, Sqoop the data, materialise the view within Hadoop. It’s that easy; in fact, if you want a deeper demonstration of all of these features, we’ve got a video from my recent webinar session:
To follow-up and describe some of the methods and techniques behind replicating into Hadoop from MySQL in real-time, and how this can be combined into your data workflow, Continuent are running a webinar with me presenting that will go over the details and provide a demo of the data replication process.
Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now mechanisms for moving data between MySQL and Hadoop have been rather limited. The new Continuent Tungsten Replicator 3.0 provides enterprise-quality replication from MySQL to Hadoop. Tungsten Replicator 3.0 is 100% open source, released under a GPL V2 license, and available for download at https://code.google.com/p/tungsten-replicator/. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems as well as Hadoop itself.
This webinar covers the following topics:
– How Hadoop works and why it’s useful for processing transaction data from MySQL
– Setting up Continuent Tungsten replication from MySQL to Hadoop
– Transforming MySQL data within Hadoop to enable efficient analytics
– Tuning replication to maximize performance.
You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten.