Speaking at CommunityOne West

Sorry for the (relatively) short notice, but I will be talking at Sun’s CommunityOne conference in San Francisco on June 1st. I’ll be talking about, and demonstrating, the DTrace probes we have put into MySQL, in a joint presentation with Robert Lor, who will be doing the same for Postgres.


CommunityOne West Badge
Our presentation is on the Monday afternoon. Check out the CommunityOne West Conference Site for more details and registration.

ZFS Replication for MySQL data

At the European Customer Conference a couple of weeks back, one of the topics was the use of DRBD. DRBD is a kernel-based block device that replicates the data blocks of a device from one machine to another. The documentation I developed for using it with MySQL is available here. Fundamentally, with DRBD, you set up a physical device, configure DRBD on top of that, and write to the DRBD device. In the background, on the primary, the DRBD device writes the data to the physical disk and replicates the changed blocks to the secondary, which in turn writes the data to its own physical device. The result is a block-level copy of the source data. In an HA solution, this means that you can switch over from your primary host to your secondary host in the event of a system failure and be pretty certain that the data on the primary and secondary are the same. In short, DRBD simplifies one of the more complex aspects of the typical HA solution by copying the data needed during the switch.

Because DRBD is a Linux kernel module, you can’t use it on other platforms, like Mac OS X or Solaris. But there is another solution: ZFS. ZFS supports filesystem snapshots. You can create a snapshot at any time, and you can create as many snapshots as you like.

Let’s take a look at a typical example. Below I have a simple OpenSolaris system running with two pools, the root pool and another pool I’ve mounted at /opt:

Filesystem             size   used  avail capacity  Mounted on
rpool/ROOT/opensolaris-1
                       7.3G   3.6G   508M    88%    /
/devices                 0K     0K     0K     0%    /devices
/dev                     0K     0K     0K     0%    /dev
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   465M   312K   465M     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap1.so.1
                       4.1G   3.6G   508M    88%    /lib/libc.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                   466M   744K   465M     1%    /tmp
swap                   465M    40K   465M     1%    /var/run
rpool/export           7.3G    19K   508M     1%    /export
rpool/export/home      7.3G   1.5G   508M    75%    /export/home
rpool                  7.3G    60K   508M     1%    /rpool
rpool/ROOT             7.3G    18K   508M     1%    /rpool/ROOT
opt                    7.8G   1.0G   6.8G    14%    /opt

I’ll store my data in a directory on /opt. To help demonstrate some of the basic replication stuff, I have other things stored in /opt as well:

total 17
drwxr-xr-x  31 root     bin           50 Jul 21 07:32 DTT/
drwxr-xr-x   4 root     bin            5 Jul 21 07:32 SUNWmlib/
drwxr-xr-x  14 root     sys           16 Nov  5 09:56 SUNWspro/
drwxrwxrwx  19 1000     1000          40 Nov  6 19:16 emacs-22.1/
lrwxrwxrwx   1 root     root          48 Nov  5 09:56 uninstall_Sun_Studio_12.class -> SUNWspro/installer/uninstall_Sun_Studio_12.class

To create a snapshot of the filesystem, you use zfs snapshot, and then specify the pool and the snapshot name:

# zfs snapshot opt@snap1

To get a list of snapshots you’ve already taken:

# zfs list -t snapshot
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
opt@snap1                                       0      -  1.03G  -
rpool@install                               19.5K      -    55K  -
rpool/ROOT@install                            15K      -    18K  -
rpool/ROOT/opensolaris-1@install            59.8M      -  2.22G  -
rpool/ROOT/opensolaris-1@opensolaris-1       100M      -  2.29G  -
rpool/ROOT/opensolaris-1/opt@install            0      -  3.61M  -
rpool/ROOT/opensolaris-1/opt@opensolaris-1      0      -  3.61M  -
rpool/export@install                          15K      -    19K  -
rpool/export/home@install                     20K      -    21K  -

The snapshots themselves are stored within the filesystem metadata, and the space required to keep them will vary as time goes on because of the way the snapshots are created. The initial creation of a snapshot is really quick, because instead of taking an entire copy of the data and metadata required to hold the entire snapshot, ZFS merely records the point in time and the metadata of when the snapshot was created.

As you make more changes to the original filesystem, the size of the snapshot increases, because more space is required to keep the record of the old blocks. Furthermore, if you create lots of snapshots, say one per day, and then delete the snapshots from earlier in the week, the size of the newer snapshots may also increase, as the changes that make up the newer state have to be included in the more recent snapshots, rather than being spread over the seven snapshots that make up the week.

The result is that creating snapshots is generally very fast, and storing snapshots is very efficient. As an example, creating a snapshot of a 40GB filesystem takes less than 20ms on my machine. The only issue, from a backup perspective, is that snapshots exist within the confines of the original filesystem. To get a snapshot out into a format that you can copy to another filesystem, tape, etc., you use the zfs send command to create a stream version of the snapshot. For example, to write out the snapshot to a file:

# zfs send opt@snap1 >/backup/opt-snap1

Or to tape, if you are still using it:

# zfs send opt@snap1 >/dev/rmt/0

You can also write out the incremental changes between two snapshots using zfs send:

# zfs send -i opt@snap1 opt@snap2 >/backup/opt-changes

To recover a snapshot, you use zfs recv, which applies the snapshot information either to a new filesystem or to an existing one. I’ll skip the demo of this for the moment, because it will make more sense in the context of what we’ll do next.

Both zfs send and zfs recv work on streams of the snapshot information, in the same way as cat or sed do. We’ve already seen some examples of that when we used standard redirection to write the information out to a file. Because they are stream based, you can use them to replicate information from one system to another by combining zfs send, ssh, and zfs recv. For example, let’s say I’ve created a snapshot of my opt filesystem and want to copy that data to a new system, into a pool called slavepool:

# zfs send opt@snap1 |ssh mc@slave pfexec zfs recv -F slavepool

The first part, zfs send opt@snap1, streams the snapshot; the second, ssh mc@slave, pipes that stream to the slave machine; and the third, pfexec zfs recv -F slavepool, receives the streamed snapshot data and writes it to slavepool. In this instance, I’ve specified the -F option, which forces the snapshot data to be applied and is therefore destructive. This is fine, as I’m creating the first version of my replicated filesystem. On the slave machine, if I look at the replicated filesystem:

# ls -al /slavepool/
total 23
drwxr-xr-x   6 root     root           7 Nov  8 09:13 ./
drwxr-xr-x  29 root     root          34 Nov  9 07:06 ../
drwxr-xr-x  31 root     bin           50 Jul 21 07:32 DTT/
drwxr-xr-x   4 root     bin            5 Jul 21 07:32 SUNWmlib/
drwxr-xr-x  14 root     sys           16 Nov  5 09:56 SUNWspro/
drwxrwxrwx  19 1000     1000          40 Nov  6 19:16 emacs-22.1/
lrwxrwxrwx   1 root     root          48 Nov  5 09:56 uninstall_Sun_Studio_12.class -> SUNWspro/installer/uninstall_Sun_Studio_12.class

Wow – that looks familiar! Once you’ve taken the first snapshot, to synchronize the filesystem again I just need to create a new snapshot, and then use the incremental feature of zfs send to send the changes over to the slave machine:

# zfs send -i opt@snap1 opt@snap2 |ssh mc@slave pfexec zfs recv slavepool

Actually, this operation will fail. The reason is that the filesystem on the slave machine can currently be modified, and you can’t apply the incremental changes to a destination filesystem that has changed. What’s changed? The metadata about the filesystem, like the last time it was accessed – in this case, it will have been our ls that caused the problem. To fix that, set the filesystem on the slave to be read-only:

# zfs set readonly=on slavepool

Setting readonly means that we can’t change the filesystem on the slave by normal means – that is, I can’t change the files or the metadata (modification times and so on). It also means that operations that would normally update metadata (like our ls) silently perform their function without attempting to update the filesystem state. In essence, the slave filesystem is nothing but a static copy of the original filesystem. However, even when set to readonly, a filesystem can still have snapshots applied to it. Now that it’s read-only, re-run the initial copy:

# zfs send opt@snap1 |ssh mc@slave pfexec zfs recv -F slavepool

Now we can make changes to the original and replicate them over. Since we’re dealing with MySQL, let’s initialize a database on the original pool. I’ve updated the configuration file to use /opt/mysql-data as the data directory, and now I can initialize the tables:

# mysql_install_db --defaults-file=/etc/mysql/5.0/my.cnf --user=mysql

Now, we can synchronize the information to our slave machine and filesystem by creating another snapshot and then doing an incremental zfs send:

# zfs snapshot opt@snap2

Just to demonstrate the efficiency of the snapshots, the size of the data created during initialization is 39K:

# du -sh /opt/mysql-data/
 39K        /opt/mysql-data

If I check the size used by the snapshots:

# zfs list -t snapshot
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
opt@snap1                                     47K      -  1.03G  -
opt@snap2                                       0      -  1.05G  -

The size of the snapshot is 47K. Note, by the way, that the 47K is attributed to snap1, because snap2 should currently be more or less equal to our current filesystem state. Now, let’s synchronize this over:

# zfs send -i opt@snap1 opt@snap2 |ssh mc@slave pfexec zfs recv slavepool

Note we don’t have to force the operation this time – we’re synchronizing the incremental changes from what are identical filesystems, just on different systems. And double check that the slave has it:

# ls -al /slavepool/mysql-data/

Now we can start up MySQL, create some data, and then synchronize the information over again, replicating the changes. To do that, you create a new snapshot, then do the send/recv to the slave to synchronize the changes. The rate at which you do this is entirely up to you, but keep in mind that if you have a lot of changes, doing it as frequently as once a minute may leave your slave lagging behind because of the time taken to transfer the filesystem changes over the network – taking the snapshot itself, even with MySQL running in the background, still takes comparatively little time. To demonstrate that, here’s the time taken to create a snapshot midway through a 4-million-row insert into an InnoDB table:

# time zfs snapshot opt@snap3
real    0m0.142s
user    0m0.006s
sys     0m0.027s

I told you it was quick :) However, the send/recv operation took a few minutes to complete, with about 212MB of data transferred over a very slow network connection, while the machine was busy writing those additional records.

Ideally you want to set up a simple script that handles that sort of snapshot/replication for you, and run it from cron to do the work. You might also want to try ready-made tools like Tim Foster’s ZFS replication tool, which you can find out about here. Tim’s system works through SMF to handle the replication and is very configurable; it even handles automatic deletion of old, synchronized snapshots.

Of course, all of this is useless unless, once replicated from one machine to another, we can actually use the databases. Let’s assume that there was a failure and we needed to fail over to the slave machine. To do this:

  1. Stop the script on the master, if it’s still up and running.
  2. Set the slave filesystem to be read/write:
    # zfs set readonly=off slavepool
  3. Start up mysqld on the slave. If you are using InnoDB, Falcon or Maria you should get auto-recovery, if it’s needed, to make sure the table data is correct, as shown here when I started up from our mid-INSERT snapshot:
    InnoDB: The log sequence number in ibdata files does not match
    InnoDB: the log sequence number in the ib_logfiles!
    081109 15:59:59  InnoDB: Database was not shut down normally!
    InnoDB: Starting crash recovery.
    InnoDB: Reading tablespace information from the .ibd files...
    InnoDB: Restoring possible half-written data pages from the doublewrite
    InnoDB: buffer...
    081109 16:00:03  InnoDB: Started; log sequence number 0 1142807951
    081109 16:00:03 [Note] /slavepool/mysql-5.0.67-solaris10-i386/bin/mysqld: ready for connections.
    Version: '5.0.67'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)

Yay – we’re back up and running. With MyISAM, or other table types, you need to run REPAIR TABLE, and you might even have lost some information, but it should be minor. The point is, a mid-INSERT ZFS snapshot, combined with replication, could be a good way of supporting a hot backup of your system on Mac OS X or Solaris/OpenSolaris. Probably the most critical part is finding the sweet spot between the snapshot replication time and how up to date you want to be in a failure situation. It’s also worth pointing out that you can replicate to as many different hosts as you like, so if you wanted to replicate your ZFS data to two or three hosts, you could.
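To make the cron-driven snapshot/replication cycle concrete, here is a minimal sketch of the kind of script I mean. The pool and slave names follow the examples above, but the state file and snapshot naming scheme are my own assumptions, and the zfs/ssh commands are echoed so you can see exactly what would run – drop the echo prefixes to replicate for real.

```shell
#!/bin/sh
# Sketch of a cron-driven replication cycle.
# Assumptions: pool "opt", slave "mc@slave", last-snapshot name
# remembered in a state file between runs.
POOL=opt
SLAVE=mc@slave
STATE=/tmp/zfs-last-snap

prev=`cat $STATE 2>/dev/null`
new="snap-`date +%Y%m%d%H%M%S`"

# Take the new snapshot (echoed here for illustration).
echo "zfs snapshot $POOL@$new"

if [ -n "$prev" ]; then
    # Incremental send of just the changes since the last snapshot.
    echo "zfs send -i $POOL@$prev $POOL@$new | ssh $SLAVE pfexec zfs recv slavepool"
else
    # First run: send the full stream, forcing it onto the slave pool.
    echo "zfs send $POOL@$new | ssh $SLAVE pfexec zfs recv -F slavepool"
fi

# Remember this snapshot for the next incremental run.
echo "$new" > $STATE
```

Run it from cron at whatever interval matches the sweet spot discussed above; each run sends only the blocks changed since the previous one.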

Podcast Producer Variables

The first of a new series of articles on using and extending the functionality of Apple’s Podcast Producer has just been published (see Podcast Producer: Anatomy of a Workflow). One of the things that you might find useful when working with Workflows in Podcast Producer is the set of properties that are defined automatically when a podcast is submitted for processing. These runtime properties are used to specify information such as the source file name and job name. You need these within the action specification to select the input file, Podcast title, description, and other parameters needed to process the content.

Standard properties and job-specific properties are combined into a file called properties.plist, which becomes part of the Workflow specification that is submitted for processing. Because global properties can change or be modified, copying them into the properties.plist file during assembly ensures that the configuration in force at the time the podcast was submitted is the one used. This helps to prevent problems if the job gets queued and the configuration changes between submission and processing. The dynamic properties submitted as part of the job will differ depending on the submission type, but the main properties generated are shown below.

Base Directory
    The base directory for the Podcast submission. The directory is automatically created within the shared filesystem when a new job is submitted to Podcast Producer. A new universally unique ID (UUID) is created and used as the directory name. All of the resources for the submission are then placed into that directory. This information is required so that actions can access the raw contents.

Content File Basename
    The basename (filename without extension) of the source content.

Content File Extensions
    The extension of the source content.

Content File Name
    The full filename (basename and extension) of the source content.

Date_YYYY-MM-DD
    The date of submission for the podcast. The property name demonstrates the format of the date (year, month, day).

Global Resource Path
    The path to the global resources for this instance of Podcast Producer. The directory holds all of the global resources (such as organization-specific videos, preambles, and introductions) that can be used during processing.

Podcast Producer URL
    The URL of the Podcast Producer server. This is used when communicating information back to the Podcast Producer instance.

Recording Started At
    The date/time when the Podcast recording was started. This information is represented as the number of seconds since the epoch.

Recording Stopped At
    The date/time when the Podcast recording was stopped. This information is represented as the number of seconds since the epoch.

Server UUID
    The UUID of the server to which the podcast was submitted.

Shared Filesystem
    The base directory that holds all Podcast information. This directory is set within the General Settings portion of the Podcast Producer section of Server Admin.

Title
    The title of the podcast, as set by the user when the job was submitted using Podcast Capture.

User Full Name
    The full name of the user that submitted the job. When a job is submitted by Podcast Capture, the user must log in to the Podcast Capture application. It is these credentials that are used to identify the user.

User Home Directory
    The home directory configured for the user.

User ID
    The user ID of the user that submitted the podcast.

User Short Name
    The short name (login) of the user that submitted the podcast.

Workflow Bundle Path
    The path to the Workflow Bundle that was selected when the job was submitted. This will be one of the Workflows configured in the system and selected at the point of submission from within Podcast Capture.

Workflow Resource Path
    The path to the Resources directory for the Workflow selected when the job was submitted.

Xgrid Job Name
    The Xgrid job name. By default, the job name is a combination of the job title, the user’s full name, and the name of the Workflow that was selected. You can control this within the individual workflow, but often the standard configuration is enough for you to be able to identify the job as it progresses through the Xgrid processing stage.

These dynamic properties are vital to the execution of an individual action, as they provide the unique values required to process each individual podcast request.
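As a rough illustration, a fragment of a generated properties.plist might look something like the following. The key names here simply mirror the list above, and the values are invented examples – the exact identifiers and layout Podcast Producer writes may differ:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
    "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Hypothetical sample values, for illustration only -->
    <key>Title</key>
    <string>Staff Meeting</string>
    <key>Content File Name</key>
    <string>recording.mov</string>
    <key>User Short Name</key>
    <string>mc</string>
    <key>Date_YYYY-MM-DD</key>
    <string>2008-11-09</string>
</dict>
</plist>
```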

Acorn, Pixelmator and Iris alternatives to Photoshop

I’ve had Acorn in my list of applications to review for months, and I’ve only just got round to it. I wish I’d got there earlier. Acorn is quick and powerful, partly because it employs your GPU to do some of the processing, and it includes a number of filters (based on OS X’s Core Image interface), all of which is wrapped up into a nice little application. If you can’t find what you want, there are Objective-C and Python plugin interfaces, but I haven’t investigated them yet.

Of the alternatives, the most talked about is Pixelmator, closely followed by Iris. Pixelmator is a closer approximation to the way that Photoshop operates, and in some respects I would prefer the functionality and feel of Pixelmator if I were looking for a Photoshop replacement, but there are other elements I don’t appreciate. The flashy graphics and animations when you do different operations seem superfluous to me. There are nice touches in both applications – the stamp tool in Pixelmator is particularly good (although I prefer Photoshop’s), while in Acorn the crop and select tools provide much better feedback during the select operation than even Photoshop.

Iris is less polished, but shows some promise. There are some annoying oddities (I used 1.0b2, 367), like the image opening at pixel resolution rather than being scaled to the screen size, and the lack of specialized cursors can make identifying what you are doing, and the potential effects of that process, difficult. But the image editing and manipulation is very quick (particularly on stamp and touch-up operations). It is, however, a bit memory hungry at the moment.

Any of these solutions would make a good alternative to Photoshop and Photoshop Elements if you don’t want to go down the Adobe route. Of these I currently prefer Acorn – it’s small and lightweight, and the interface feels much more polished and easy to use. Certainly I’d consider it as an alternative to the larger packages on a laptop if you wanted something while you were traveling.
I can’t get by without Photoshop because of the image scanning and editing I do, but occasionally I want something more extensive than Preview when I’m on the move. Of course, this could change – all of these tools are being actively developed and so it’s likely that there will be some leapfrogging along the way.

Mysterious crashes? – check your temporary directory settings

Just recently I seem to have noticed an increased number of mysterious crashes and terminations of applications. This is generally on brand new systems that I’m setting up, or on existing systems where I’m setting up a new or duplicate account. Initially everything is fine, but then all of a sudden, as I start syncing over my files, shell profile, and so on, applications will stop working. I’ve experienced it with MySQL, and more recently when starting up Gnome on Solaris 10 9/07. Sometimes the problem is obvious; other times it takes me a while to realize what is happening and causing the problem. But in all cases it’s the same problem – my TMPDIR environment variable points to a directory that doesn’t exist.

That’s because, for historical reasons (mostly related to HP-UX, bad permissions, and global tmp directories), I’ve always set TMPDIR to a directory within my home directory. It’s just one of those things I’ve had in my bash profile for as long as I can remember – probably 12 years or more at least. This can be counterproductive on some systems – on Solaris, for example, the main /tmp directory is actually mounted on the swap space, which means that RAM will be used if it’s available, which can make a big difference during compilation. But any setting is counterproductive if you point to a directory that doesn’t exist and then have an application that tries to create a temporary file, fails, and then never prints out a useful trace of why it had a problem (yes, I mean you, Gnome!). I’ve just reset my TMPDIR in .bash_vars to read:

case $OSTYPE in
    (solaris*) export TMPDIR=/tmp/mc
               mkdir -m 0700 -p $TMPDIR
               ;;
    (*)        export TMPDIR=~/tmp
               mkdir -m 0700 -p $TMPDIR
               ;;
esac

Now I explicitly create a directory in a suitable location during startup, so I shouldn’t experience those crashes anymore.
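The guard can be exercised on its own, too. Here’s a minimal sketch that simulates the failure case and applies the same fix – the directory name is just an example, not anything from my real profile:

```shell
# Simulate the failure case: TMPDIR points at a directory that
# does not exist yet (example name, chosen for this demo).
export TMPDIR=/tmp/mc-demo-tmpdir

# The fix from the profile above: create it, private to us, if missing.
if [ ! -d "$TMPDIR" ]; then
    mkdir -m 0700 -p "$TMPDIR"
fi
```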

MySQL and DBD::mysql on Mac OS X

I’ve been investigating some recent issues with installations of MySQL on Mac OS X and the Perl DBD::mysql module for accessing MySQL from within Perl through DBI. The problem exists only with binary distributions of MySQL and is related to the installation location of the libraries for the MySQL client that DBD::mysql uses. By default these are installed into /usr/local/mysql/lib, but the dynamic libraries are configured to be located within /usr/local/mysql/lib/mysql. It’s possible for DBD::mysql to build and link correctly, but trying to use the library will fail because it can’t find the library in the latter directory, even though it linked to the library in the former location. To get round this, the easiest method is to create a link within the directory that points to the parent. For example:

$ cd /usr/local/mysql/lib
$ ln -s . mysql

That should fix the problem whether you run the commands before the DBD::mysql build or after it.
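You can see how the link resolves without touching a real MySQL install; here’s the same trick against a scratch directory (the /tmp paths and the dummy library name are mine, purely for illustration):

```shell
# Recreate the layout in a scratch area; /tmp/mysql-lib stands in
# for /usr/local/mysql/lib.
mkdir -p /tmp/mysql-lib
touch /tmp/mysql-lib/libmysqlclient.15.dylib   # dummy stand-in for the dylib
cd /tmp/mysql-lib
rm -f mysql
ln -s . mysql   # the same link as above

# The library is now reachable through both paths – the one the
# link step used, and the lib/mysql path the loader looks in:
ls /tmp/mysql-lib/libmysqlclient.15.dylib
ls /tmp/mysql-lib/mysql/libmysqlclient.15.dylib
```

Because the link points at “.”, anything added to the lib directory later is automatically visible under lib/mysql as well.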

Setting up the developer stack issues

There’s a great post on Coding Horror about Configuring the Stack. Basically the gripe is with the complexity of installing the typical developer stack, in this case on Windows, using Visual Studio. My VS setup isn’t vastly different to the one Jeff mentions, and I have similar issues with the other stacks I use. I’ve just set up the Ultra3 mobile workstation again for building MySQL and other stuff on, and it took about 30 packages (from Sun Freeware) just to get the basics like gcc, binutils, gdb, flex, bison, and the rest set up. It took the best part of a day to get everything downloaded, installed, and configured, and I haven’t even started on modules for Perl yet.

The Eclipse stack is no better. On Windows you’ll need the JDK of your choice, plus Eclipse. Then you’ll have to update Eclipse. Then add in the plugins and modules you want. Even though some of that is automated (and, annoyingly, some of it is not although it could be), it generally takes me a few hours to get stuff installed. Admittedly on my Linux boxes it’s easier – I use Gentoo and copy around a suitable make.conf with everything I need in it, so I need only run emerge, but that can still take a day or so to get everything compiled.

Although I’m sure we can all think of easier ways to create the base systems – I use Parallels, for example, and copy VM folders to create new environments for development – even the updating can take a considerable amount of time. I suggest the new killer app is one that makes the whole process easier.

Setting a remote key through ssh

One of the steps I find myself doing a lot is distributing an ssh key so that I can log in and use different machines automatically. To help in that process I created a small function in my bash profile script (actually, for me it’s in .bash_aliases):

function setremotekey
{
    OLDDIR=`pwd`
    if [ -z "$1" ]
    then
        echo Need user@host info
        return 1
    fi
    cd $HOME
    if [ ! -e "./.ssh/id_rsa.pub" ]
    then
        ssh-keygen -t rsa
    fi
    cat ./.ssh/id_rsa.pub | ssh $1 'mkdir -p -m 0700 .ssh && cat >> .ssh/authorized_keys'
    cd $OLDDIR
}

To use, whenever I want to copy my public key to a remote machine I just have to specify the login and machine:

$ setremotekey mc@narcissus

Then type in my password once, and the function does the rest. How? Well, it checks to make sure I’ve entered a user/host (or actually just a string of some kind). Then, if I haven’t created a public key before (which I might not have on a new machine), it runs ssh-keygen to create one. Once the key is in place, it outputs the key text and uses ssh to pipe and append that to the remote authorized_keys file, creating the directory along the way if it doesn’t exist. Short and sweet, but it saves me a lot of time.

Geekbench results for iMac 24

I’ve just completed running Geekbench on my 24″ iMac (3GB, Intel T7600, 2.33GHz) and the Sun Ultra 20M2 I have on test (4GB, AMD Opteron 1200, 2.8GHz). The overall scores are interesting:

iMac: 246*
U20M2: 273.5*

The U20M2 is slightly faster in the benchmark, although in use I think it’s much faster. I’m still completing some tests on the U20M2 under different operating systems to see whether a different OS gives it an advantage.

*: The iMac is updated to the latest BIOS and latest updates, with other applications not running.
*: The U20M2 is updated to the latest BIOS and drivers (from the 1.4 driver update CD), with other applications not running.

Controlling OS X volume through Cron

One of the biggest annoyances of working from home is that, with the computers in the room next door, the volume of your machines can cause a problem if someone suddenly calls you on Skype, or your backup software suddenly kicks in and starts beeping. I never remember to mute the volume, so I started looking for a way to do this automatically through cron at specific times. I also wanted to be sure that, rather than setting a specific volume (and having to remember it), I could just use the OS X mute function. The solution is to combine AppleScript, which you can run from the command line using the osascript command, with cron. There are three components: the two AppleScript scripts that mute and unmute the volume, and the lines in a crontab to run the scripts. To mute the volume with AppleScript:

set volume with output muted

To unmute:

set volume without output muted

Save both of these as compiled AppleScript files (use Script Editor so they are compiled). Then we can just set the scripts to execute when required:

0 9 * * * osascript /usr/local/mcslp/volume-unmute.scpt
0 19 * * * osascript /usr/local/mcslp/volume-mute.scpt
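If you’d rather not keep compiled script files around, osascript can also take the statement inline with -e, so the crontab alone carries everything. This is a variant I haven’t deployed myself, shown for illustration:

```
0 9 * * * osascript -e 'set volume without output muted'
0 19 * * * osascript -e 'set volume with output muted'
```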

I’ve set this on the three machines and now we get a silent night!