A new article has been published on IBM developerWorks, looking at the basics of processing machine data using Hadoop: extracting the core data, storing it, and then determining the baselines and trigger points required to identify worrying trends and data points. From the intro:
Machine data can come in many different formats and quantities. Weather sensors, fitness trackers, and even air-conditioning units produce massive amounts of data, which begs for a big data solution. But how do you decide what data is important, and how do you determine what proportion of that information is valid, worth including in reports, or valuable in detecting alert situations? This article covers some of the challenges and solutions for supporting the consumption of massive machine data sets that use big data technology and Hadoop.
One of the key platforms I’ve been testing on for the MySQL to Hadoop replication has been Cloudera, largely driven by customer requirements, but it’s also one of the easiest ways to get started with Hadoop.
Even better, we are proud to announce that Tungsten Replicator 3.0 is certified for use on the new Cloudera Enterprise 5 platform. That means we’re confident that you can replicate your data from MySQL to Cloudera 5 and have it work without problems or difficulties during loading and materialisation on the Hadoop side.
Cloudera is a great product, and we’re very happy to be working so effectively with the new Cloudera Enterprise 5. Cloudera certainly makes managing and monitoring your Hadoop cluster much easier, while still providing core functionality from the Hadoop family such as Hive, HBase and Impala.
What I’m really interested in is the support for Spark, which will allow much easier live-querying and access to data. That should make some data processing and live data views much easier to build and query further down the line.