Building flexible apps from big data sources

My article on how to build flexible apps on top of the BigInsights platform has been published. It demonstrates a cool way to combine some client-side JavaScript and existing technologies to build a Big Data query interface without developing a specialised application for the purpose.

It’s no secret that a significant proportion of the needs for big data have come from the explosion in Internet technologies. Up until 10-20 years ago, the idea of a public-facing application having more than a few million users was unheard of. Today, even a modest website can have millions of users, and if it’s active, can generate millions of data items every day. The irony is that the very infrastructure and systems that create big data can also work in reverse, and provide some of the better ways to integrate and work with that data. Usefully, InfoSphere® BigInsights™ comes with support for managing and executing data jobs through a simple REST API. And through the Jaql interface, we can run queries and get information directly from a Hadoop cluster. This article looks at how these systems work together to give you a rich basis for capturing data and provide an interface to get the information back out again.

Read: Building flexible apps from big data sources

SQL to Hadoop and back again, Part 1: Basic data interchange techniques

I’ve got a new article, the first in a new three-part series, on moving data between SQL and Hadoop, covering both exporting to Hadoop and importing processed content back into an SQL store.

In this first one, we look at the basic mechanics and considerations before you start the migration of data, such as the data format, content, and export techniques.
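As a rough illustration of the format considerations above, here is a minimal sketch in Python that dumps rows from an SQL table into tab-delimited text of the kind Hadoop tools typically consume. It uses an in-memory SQLite database as a stand-in for the real SQL store. The table name `sales`, the helper names `clean_field` and `export_rows`, and the choice of `\N` as the NULL marker are my assumptions for the example, not something taken from the article.

```python
import io
import sqlite3

NULL_MARKER = "\\N"  # assumed NULL representation; adjust for your target system


def clean_field(value):
    """Turn one column value into text that is safe for a one-line record."""
    if value is None:
        return NULL_MARKER
    text = str(value)
    # Embedded tabs and newlines would break the record layout, so flatten them.
    for ch in ("\t", "\n", "\r"):
        text = text.replace(ch, " ")
    return text


def export_rows(rows, out):
    """Write each row as one tab-delimited line; return the number of rows written."""
    count = 0
    for row in rows:
        out.write("\t".join(clean_field(v) for v in row) + "\n")
        count += 1
    return count


# Demo with a hypothetical table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "ok"), (2, None), (3, "two\nlines")],
)
buf = io.StringIO()
written = export_rows(conn.execute("SELECT id, note FROM sales ORDER BY id"), buf)
```

The point of the sketch is that NULLs and embedded delimiters need a decision before the export starts, because a line-oriented file gives you no second chance to tell a real newline from a record boundary.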

Read: SQL to Hadoop and back again, Part 1: Basic data interchange techniques

Developing Applications for use with Continuent Tungsten and Tungsten Replicator in SDJ

I’ve just had a new article published with the Software Developers Journal talking about how you can write applications to take full advantage of Continuent Tungsten and Tungsten Replicator.

As a developer of an application, there really isn’t a better problem to have than finding that you need to scale up the application and the database that supports it to handle increased load. The main bottleneck to most expansion is the database server, and in many modern environments that server is based around MySQL. Application servers are easy to add on to the front end of your environment.

Read: Qt5 – How to Become a Professional Developer | Software Developers Journal (registration/purchase required)

Percona Live 2013, MySQL, Continuent and an ever-healthy Ecosystem

I’m sitting here in the lounge at SFO thinking back on the last week, the majority of which has been spent meeting my new workmates and attending the Percona MySQL conference.

For me it has been as much of a family reunion as it has been about seeing the wonderful things going on in MySQL.

Having joined Continuent last month after an ‘absence’ in NoSQL land of almost 2.5 years, rejoining the MySQL community just felt like coming home. And that’s no bad thing. On a very personal level it was great to see so many of my old friends, many of whom were not only pleased to see me, but pleased to see me working back in the MySQL fold. Evidently many people think this is where I belong.

What was great to see is that the MySQL community is alive and well. Percona may be the drivers behind the annual MySQL conference that we have come to know, but behind the name on the passes and over the doors, nothing has changed in terms of the passion behind the core of the project.

Additionally, it’s great to see that despite all of the potential issues and tragedies that were predicted when Oracle took over the reins of MySQL, as Baron says, they are in fact driving and pushing the project forward. The features in 5.6 are impressive and useful, rather than just a blanket cycling of the numbers. I haven’t had the time to look at 5.7, but I doubt it is just an annual increment either. When I left Oracle, people were predicting MySQL would be dead in two years as an active project at Oracle, but in fact what seems to have happened is that the community has rallied round it and Oracle have seen the value and expertly steered it forward.

It’s also interesting to me – as someone who moved outside the MySQL fold – to note that other databases haven’t really supplanted the core of the MySQL foothold. Robert Hodges’ keynote discussed that in more depth, and I see no reason to disagree with him.

I’m pleased to see that my good friend Giuseppe had his MySQL Sandbox win application of the year 2013 – not soon enough in my eyes, given that as a solution for running MySQL it has been out there for more years than I care to remember.

I’m also delighted of course that Continuent won Corporate contributor of the year. One of the reasons I joined the company is because I liked what they were doing. Replication in MySQL is unnecessarily hard, particularly when you get more than one master, or want to do clever things with topologies beyond the standard master/slave. I used Federated tables to do it years ago, but Tungsten makes the whole process easier. What Continuent does is provide an alternative to MySQL native replication which is not only more flexible, but also fundamentally very simple. In my experience, simple ideas are always very powerful, because their simplicity makes them easy to adapt, adopt and add to.

Of course, Continuent aren’t the only company producing alternatives for clustering solutions with MySQL, but to me that shows there is a healthy ecosystem willing to build solutions around the powerful product at the centre. It’s one thing to choose an entirely different database product, but another to use the same product with one or more tools that produce an overall better solution to the problem. *AMP solutions haven’t gone away, we’ve just extended and expanded on top of them. That must mean that MySQL is just as powerful and healthy a core product as it ever was.

That’s my key takeaway from this conference – MySQL is alive and well, and the ecosystem it produced is as organic and thriving as it ever has been, and I’m happy to be back in the middle of that.

Joining Continuent

I’ve just completed my first month here at Continuent, strangely back in the MySQL ecosystem, which I had been working in for some time before I joined CouchOne, and then Couchbase, two and a half years ago. Making the move back to MySQL is both an experience, and somehow, comfortable…

Continuent produce technology that makes for easier replication between MySQL servers and, more importantly, more flexible solutions when you need to scale out by providing connector and management functionality for your MySQL cluster. That means that you can easily back up, add slaves, and create complex replication scenarios such as multi-master, and even multiple-site, multiple-master topologies. This functionality is split over two products: Continuent Tungsten, which is the cluster management product, and the open source Tungsten Replicator, which provides the basic replication functionality.

Those who know me well will know that I am no fan of the native MySQL replication, and that’s almost entirely because of the complexities of first of all getting it to work, followed by keeping it working, and ultimately because of the variability of the replication in the first place. There’s no reliable way to know if the replication stream has successfully been applied to the slaves or, from a client perspective, how far behind slaves are so that you can make an educated guess about which slave you should be talking to. Let’s not even get into the complexities of having to handle the read/write splitting required to make the master/slave relationship work in the first place.
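To show what that educated guess tends to look like in application code, here is a minimal sketch in Python. It assumes you have already collected the `Seconds_Behind_Master` value that `SHOW SLAVE STATUS` reports for each slave (it is NULL, so `None` here, when replication is not running). The function name `pick_slave`, the host names, and the five-second threshold are hypothetical choices for the example.

```python
from typing import Dict, Optional


def pick_slave(lag_by_host: Dict[str, Optional[int]], max_lag: int = 5) -> Optional[str]:
    """Return the slave reporting the smallest lag, or None if none is usable.

    lag_by_host maps a host name to its reported lag in seconds, or None
    when the slave is not replicating at all.
    """
    usable = {
        host: lag
        for host, lag in lag_by_host.items()
        if lag is not None and lag <= max_lag
    }
    if not usable:
        return None  # caller has to fall back, typically to the master
    return min(usable, key=usable.get)


# Hypothetical lag readings gathered from three slaves.
choice = pick_slave({"db2": 3, "db3": None, "db4": 1})
```

Even this small helper is only as good as the lag figure it is fed, which is an estimate rather than a guarantee that a given change has been applied, and that is exactly the weakness described above.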

Continuent solves this problem by using the binary log stream from MySQL, but handling the transfer and application of those bin log entries itself. Using this method enables Continuent to monitor and manage the replication process. Continuent knows when a statement has been applied to a slave, and it can work to ensure that the changes are applied. With Continuent Tungsten, we go a stage further and provide a connector that sits between your application servers and your MySQL servers. Because Continuent Tungsten handles the replication, it knows the cluster topology, and can redirect queries to the master or the slave, and handle failures by directing the client queries to working slaves. Like all good software, it’s simple, but very, very effective, and ergo very powerful.
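The routing idea can be sketched in a few lines of Python. To be clear, this is a toy model of the concept and not how the Continuent Tungsten connector is implemented: the `Connector` class, its method names, and the list of read-only verbs are all hypothetical, and it does not attempt master failover.

```python
READ_VERBS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}  # assumed read-only statements


class Connector:
    """Toy topology-aware router: writes go to the master, reads to live slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self.offline = set()
        self._next = 0  # round-robin position for reads

    def mark_failed(self, host):
        self.offline.add(host)

    def mark_recovered(self, host):
        self.offline.discard(host)

    def route(self, sql):
        """Return the host that should receive this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb not in READ_VERBS:
            return self.master
        live = [s for s in self.slaves if s not in self.offline]
        if not live:
            return self.master  # no working slave, so the master serves reads
        host = live[self._next % len(live)]
        self._next += 1
        return host


# Hypothetical three-node cluster.
connector = Connector("db1", ["db2", "db3"])
write_target = connector.route("INSERT INTO t VALUES (1)")
read_target = connector.route("SELECT * FROM t")
```

Because the routing decision lives in one place, the application can keep talking to a single endpoint while slaves come and go behind it.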

So what am I doing at Continuent? Building out the documentation and helping users to help themselves, in combination with working with the developers to make sure that the software is as easy, intuitive, and foolproof to use as possible. In the short term, that means ensuring we have the core documentation required to get Continuent working for you.

If there’s more information you need, or something you specifically want in the documentation, let me know.

Developing with Couchbase Server

I’ve just completed my latest book, this time looking at the development side of using Couchbase Server for building applications. The book goes through the basics of the Couchbase Server data store, the mechanics of storing and using data, the API and operations available, and a quick overview of the different client libraries available for building applications.

With the core details out of the way, I move on to building a sample application using the PHP client library as the base, showing the different operations in context, and then looking at the indexing and query system for searching for data from Couchbase Server.

You can read more, and get the table of contents and description here: Developing with Couchbase Server

Data Mining in a Document World

As databases evolve, learning how to get the best out of the different solutions out there is the key to understanding and extracting the data in the way you need from your required data store. Document databases, like MongoDB, CouchDB, Couchbase Server and many others provide a completely different model and set of problems for interfacing and extracting data.

You need to be able to understand your structure, how you can query the information, and how to perform different data mining techniques on what is very obviously a completely different structure of information.
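One way to start understanding the structure is simply to measure it. The following is a minimal sketch in Python, assuming the documents have already been loaded as dictionaries (for example from JSON); the function names `field_paths` and `field_coverage` and the sample documents are hypothetical. It reports what fraction of documents contains each field, using dotted paths for nested fields.

```python
from collections import Counter


def field_paths(doc, prefix=""):
    """Yield a dotted path for every non-dict value in a nested document."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_paths(value, path + ".")
        else:
            yield path


def field_coverage(docs):
    """Return {path: fraction of documents that contain that path}."""
    counts = Counter()
    for doc in docs:
        counts.update(set(field_paths(doc)))
    return {path: n / len(docs) for path, n in counts.items()}


# Hypothetical documents with differing structure.
docs = [
    {"name": "Alice", "address": {"city": "Leeds"}},
    {"name": "Bob"},
]
coverage = field_coverage(docs)
```

A field that appears in only a fraction of the documents tells you straight away that any query or mining step built on it has to cope with its absence.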

In this article, I try to take you through the basics of data mining when using a document database.

Read: Data mining in a document world

Document Databases in Predictive Modeling

My latest article on performing predictive modeling using document databases is now available on IBM developerWorks. The abstract:

Predictive analytics relies on processing, analyzing data from many different sources, collating, and then processing that through several stages into usable data. This involves recording and storing data in different formats, and may require translating information into PMML. Despite the complexities and structure of the information, and the sources often involving data from traditional RDBMS data sources, other solutions offer some advantages. We can use the recent range of document-based NoSQL databases to help collate the information in a structured format, while coping with the flexible structure of the individual data points. Many NoSQL environments also provide support for extensive map reduce type queries and processing that makes them ideal for processing large volumes of data into a summary format. In this article, we’ll look at the transfer, exchange, and formatting of information in NoSQL environments.
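To give a feel for the map reduce style of summarising mentioned in the abstract, here is a minimal in-memory sketch in Python. It is not the API of any particular NoSQL database: the `map_reduce` helper, the `map_doc` and `reduce_values` functions, and the `type` and `amount` fields are all hypothetical, and a real database would run the equivalent steps across the stored documents for you.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit (key, value) pairs; documents missing the fields are skipped."""
    if "type" in doc and "amount" in doc:
        yield doc["type"], doc["amount"]


def reduce_values(values):
    """Collapse all values emitted for one key into a single summary."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


# Hypothetical data points with a flexible structure.
docs = [
    {"type": "sale", "amount": 10},
    {"type": "sale", "amount": 20},
    {"type": "refund", "amount": 5},
    {"note": "no amount recorded"},
]
summary = map_reduce(docs, map_doc, reduce_values)
```

The map step is where the flexible structure is absorbed, since each document decides for itself whether it has anything to emit.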

Read: Document databases in predictive modeling