How I Use Jug & IPython Notebooks

Having just released Jug 1.0 and having recently started using IPython notebooks for data analysis, I want to describe how I combine these two technologies:

  1. Jug is for heavy computation. This should run in batch mode so that it can take advantage of a computer cluster.
  2. The IPython notebook is for visualization of the results. From the notebook, I will load the results of the jug run and plot them.

As an example, I am going to use the sort of work I did for my Bioinformatics paper last year: classifying images with local features. That code did not use the IPython notebook, but it already split heavy computation from plotting [1].

I write a jugfile.py with my heavy computation, in this case, feature computation and classification [2]:

from jug import TaskGenerator
from features import computefeatures
from ml import crossvalidation

# computefeatures takes an image path and returns features
computefeatures = TaskGenerator(computefeatures)

# crossvalidation returns a confusion matrix
crossvalidation = TaskGenerator(crossvalidation)

images, labels = load_images()  # This loads all the images (load_images is assumed to be defined elsewhere)
features = [computefeatures(im) for im in images]
results = crossvalidation(features, labels)

This way, if I have 1000 images, the computefeatures step can be run in parallel and use many cores.
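To make the idea concrete, here is a toy sketch of why this works: wrapping a function turns each call into a recorded task instead of an immediate computation, so the tasks can be handed out to whoever is free. The names `ToyTask`, `toy_task_generator`, and `resolve` are made up for illustration; this is not jug's implementation, which also stores results on disk so that several processes can cooperate.

```python
# Toy stand-in for the idea behind a task generator (NOT jug's real code).

class ToyTask:
    """A recorded function call that has not been executed yet."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def run(self):
        # Arguments may themselves be tasks (or lists of tasks): compute them first.
        return self.func(*[resolve(a) for a in self.args])


def resolve(obj):
    """Recursively replace tasks by their computed values."""
    if isinstance(obj, ToyTask):
        return obj.run()
    if isinstance(obj, list):
        return [resolve(o) for o in obj]
    return obj


def toy_task_generator(func):
    """Wrap func so that calling it records a ToyTask instead of running func."""
    def make_task(*args):
        return ToyTask(func, args)
    return make_task


@toy_task_generator
def square(x):
    return x * x


@toy_task_generator
def total(values):
    return sum(values)


# Four independent tasks; nothing has been computed at this point.
squares = [square(i) for i in range(4)]
# One task that depends on all four.
result = total(squares)

print(resolve(result))
```

In this sketch, the four `square` tasks do not depend on each other, which is the property that lets the per-image `computefeatures` calls above be spread over many cores.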

§

When the computation is finished, I will want to look at the results and display them, for example by plotting a confusion matrix.

The only non-obvious trick is how to load the results from jug:

from jug import value, set_jugdir
import jugfile
set_jugdir('jugfile.jugdata')
results = value(jugfile.results)

And, boom!, results is a variable in our notebook with all the data from the computations (if the computation is not finished, an exception will be raised). Let’s unpack this line by line:

from jug import value, set_jugdir

Imports from jug. Nothing special. You are just importing jug in an IPython notebook.

import jugfile

Here you import your jugfile.

set_jugdir('jugfile.jugdata')

This is the important step! You need to tell jug where its data is. Here I assumed you used the defaults; otherwise, just pass a different argument to this function.

results = value(jugfile.results)

You now use the value function to load the value from disk. Done.

Now, use a second cell to plot:

from matplotlib import pyplot as plt
from matplotlib import cm

plt.imshow(results, interpolation='nearest', cmap=cm.OrRd)

[Figure: confusion matrix]
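Alongside the plot, a quick number can be a useful sanity check. The sketch below assumes the confusion matrix is square and holds raw counts, so the diagonal contains the correctly classified items; the helper name `accuracy_from_confusion` and the example counts are made up for illustration and are not from the real data.

```python
def accuracy_from_confusion(cmatrix):
    """Overall accuracy: correctly classified items (diagonal) over all items.

    Works on nested lists; indexing as cmatrix[i][i] also works on NumPy arrays.
    """
    total = sum(sum(row) for row in cmatrix)
    correct = sum(cmatrix[i][i] for i in range(len(cmatrix)))
    return correct / total


# Made-up 3-class confusion matrix of counts (illustration only).
example = [
    [50, 2, 3],
    [4, 45, 1],
    [6, 0, 39],
]

accuracy = accuracy_from_confusion(example)
print(round(accuracy, 3))
```

In the notebook, the same function could be applied to results in a cell of its own, next to the plot.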

§

I find that this division of labour maximizes the value of each tool: jug does well for long computations and ensures that the results are consistent while making it easy to use the cluster; IPython is nicer for exploring the results and tweaking the graphical outputs.

[1] I would save the results from jug to a text file and load it from another script.
[2] This is a very simplified form of what the original actually looked like. I started to write this post trying to make it realistic, but the complexity was too much. The plot is from real data, though.