How I Use Jug & IPython Notebooks

Having just released Jug 1.0 and having recently started using IPython notebooks for data analysis, I want to describe how I combine these two technologies:

  1. Jug is for heavy computation. This should run in batch mode so that it can take advantage of a computer cluster.
  2. The IPython notebook is for visualization of the results. From the notebook, I will load the results of the jug run and plot them.

As an example, I am going to use the sort of work I did for my Bioinformatics paper last year: classifying images with local features. That code did not use the IPython notebook, but I already used a split between heavy computation and plotting [1].

I write a jugfile.py with my heavy computation, in this case, feature computation and classification [2]:

from jug import TaskGenerator
from features import computefeatures
from ml import crossvalidation

# computefeatures takes an image path and returns features
computefeatures = TaskGenerator(computefeatures)

# crossvalidation returns a confusion matrix
crossvalidation = TaskGenerator(crossvalidation)

images, labels = load_images() # This loads all the images (a sketch of this helper is below)
features = [computefeatures(im) for im in images]
results = crossvalidation(features, labels)

This way, if I have 1000 images, the computefeatures step can be run in parallel and use many cores.
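
Note that load_images is not part of jug; it is whatever code enumerates your data. A minimal sketch, assuming the images live under images/ and that each label can be read off the file name (both assumptions for illustration, not from the original code):

from glob import glob

def load_images():
    # Hypothetical helper: return image paths and one label per image,
    # here read off the file name (e.g. 'images/classA-0001.png' -> 'classA')
    images = sorted(glob('images/*.png'))
    labels = [im.split('/')[-1].split('-')[0] for im in images]
    return images, labels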

§

When the computation is finished, I will want to look at the results and display them; for example, to graphically plot a confusion matrix.

The only non-obvious trick is how to load the results from jug:

from jug import value, set_jugdir
import jugfile
set_jugdir('jugfile.jugdata')
results = value(jugfile.results)

And, boom!, results is a variable in our notebook with all the data from the computations (if the computation is not finished, an exception will be raised). Let’s unpack this line by line:

from jug import value, set_jugdir

Imports from jug. Nothing special. You are just importing jug in a Python notebook.

import jugfile

Here you import your jugfile.

set_jugdir('jugfile.jugdata')

This is the important step! You need to tell jug where its data is. Here, I assume you used the defaults when running jug; otherwise, just pass a different argument to this function (an example follows).
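
For example, if the computation was run with a custom data directory, you would point set_jugdir at that same path (the directory name below is hypothetical):

# If you ran `jug execute --jugdir=mydata jugfile.py`,
# tell jug to look in the same place:
set_jugdir('mydata')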

results = value(jugfile.results)

You now use the value function to load the value from disk. Done.

Now, use a second cell to plot:

from matplotlib import pyplot as plt
from matplotlib import cm

plt.imshow(results, interpolation='nearest', cmap=cm.OrRd)
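
Since this lives in a notebook cell, tweaking the output is cheap. For instance, a couple of standard matplotlib touches one might add in the same cell (nothing here is jug-specific):

plt.colorbar()                # show the color scale
plt.xlabel('Predicted class')
plt.ylabel('True class')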

[Image: the resulting confusion matrix plot]

§

I find that this division of labour maximizes the value of each tool: jug does well for long computations and ensures that the results are consistent, while making it easy to use the cluster; IPython is nicer for exploring the results and tweaking the graphical outputs.

[1] I would save the results from jug to a text file and load it from another script.
[2] This is a very simplified form of what the original actually looked like. I started to write this post trying to make it realistic, but the complexity was too much. The plot is from real data, though.

Jug 1.0-release candidate 0

I’ve put out a new release of jug, which I’m calling 1.0-rc0. This is a release candidate for version 1.0: if no bugs turn up in the next few days, I’ll just call it 1.0!

There are few changes from the previous version, but the package has now reached maturity.

For the future, I want to start developing the hooks interface and use hooks for more functionality.

In the meantime, download jug from PyPI, watch the video, or read the tutorials or the full documentation.

Jug now outputs metadata on the computation

This weekend, I added new functionality to jug (see the previous related posts). Jug can now output the final result of a computation, including metadata on all the intermediate inputs!

For example:

from jug import TaskGenerator
from jug.io import write_metadata # <---- THIS IS THE NEW STUFF

@TaskGenerator
def double(x):
    return 2*x

x = double(2)
x2 = double(x)
write_metadata(x2, 'x2.meta.yaml')

When you execute this script (with jug execute), the write_metadata function will write a YAML description of the computation to the file x2.meta.yaml!

This file will look like this:

args:
- args: [2]
  meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
  name: jugfile.double
meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
name: jugfile.double

It tells you that a computation named jugfile.double was computed at 18:31 on Saturday, August 3. It also gives the same information recursively for all the intermediate results.
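
Since the output is plain YAML, it is also easy to inspect programmatically. A minimal sketch, assuming PyYAML is installed:

import yaml

# Load the metadata written by write_metadata
with open('x2.meta.yaml') as f:
    meta = yaml.safe_load(f)

print(meta['name'])               # jugfile.double
print(meta['meta']['completed'])  # when the final task finished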

§

This is the result of a few conversations I had at BOSC 2013.

§

This is part of the new release, 0.9.6, which I put out yesterday.

Building Machine Learning Systems with Python

I wrote a book. Well, only in part. Willi Richert and I wrote a book.

It is called Building Machine Learning Systems With Python and is now available from Amazon (or Amazon.co.uk), although it has already been partially available directly from the publisher for a while (in a form where you get chapters as editing is finished).

§

The book is an introduction to using machine learning in Python.

We mostly rely on scikit-learn, which is the most complete package for machine learning in Python. For my own projects, I do prefer my own package, milk, but it is not as complete. It does have some things that scikit-learn does not (and some things that scikit-learn has, correctly, appropriated).

We try to cover all the major modes in machine learning and, in particular, have:

  1. classification
  2. regression
  3. clustering
  4. dimensionality reduction
  5. topic modeling

and also, towards the end, three more applied chapters:

  1. classification of music
  2. pattern recognition in images
  3. using jug for parallel processing (including in the cloud).

§

The approach is tutorial-like, without much math but with lots of code examples.

This should get people started, and it will be more than enough if the problem is easy (and there are still many easy problems out there). With good features (which are problem-specific, anyway), knowing how to run an SVM will very often be enough.

Lest you fear we are giving people just enough knowledge to be dangerous, we stress correct evaluation of the results throughout the book. We warn repeatedly against mixing up your training and testing data. This simple principle is, unfortunately, still often disregarded in scientific publications [1].
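
To make this concrete, here is a minimal sketch of the kind of evaluation hygiene we have in mind, written against scikit-learn’s current API (the data below is a synthetic stand-in; this is an illustration, not code from the book):

import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data: 100 samples, 50 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(100, 50)
y = rng.randint(0, 2, 100)

# Feature selection goes *inside* the pipeline, so it is re-fit on each
# training fold and never sees the matching test fold
clf = Pipeline([
    ('select', SelectKBest(k=10)),
    ('svm', SVC()),
])
print(cross_val_score(clf, X, y, cv=5).mean())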

§

There is an aspect that I really enjoyed about this whole process:

Before starting the book, I had already submitted two papers, neither of which is out yet (even though, after some revisions, both have been accepted). In the meanwhile, the book has been written, edited (only a few minor issues are still pending), and people have been able to buy parts of it for a few months now.

I now have renewed confidence in the choice to stay in science (also because I moved from a place where things are completely absurd to a place where they work very well). But the delay in publication that is common in the life sciences is an emotional drag. In some cases, the bulk of the work was finished years before the paper finally came out.

Update (July 26 2013): Amazon is now shipping the book! I changed the wording above to reflect this.

[1] It is rare to see somebody just report training accuracy and claim their algorithm does well. In fact, I have never seen it in a recent paper. However, performing feature selection or parameter tuning on the whole data prior to cross-validating on the selected features with the tuned parameters is still pretty common today (there are other sins of evaluation too: “we used multiple parameters and report the best”). This leads to inflated results all around. One of the problems is that, if you do things correctly in this environment, you risk that reviewers of your work will say “looks great, but so-and-so got better results” because so-and-so tuned on the testing set and seems to have “beaten” you. (Yes, I’ve had this happen, multiple times; but that is a rant for another day.)

Segmenting Images In Parallel With Python & Jug

On Friday, I posted an introduction to Jug. The usage there was very basic, however; this post shows a slightly more advanced example.

Let us imagine you are trying to compare two image segmentation algorithms based on human-segmented images. This is a completely real-world example, as it was one of the projects where I first used jug [1].

We are going to build this up piece by piece.

First a few imports:

import mahotas as mh
from jug import TaskGenerator
from glob import glob

Here, we test two thresholding-based segmentation methods, called method1 and method2. They both (i) read the image, (ii) blur it with a Gaussian, and (iii) threshold it [2]:

@TaskGenerator
def method1(image):
    # Read the image, keeping a single channel
    image = mh.imread(image)[:,:,0]
    # Blur lightly, then threshold at the mean
    image = mh.gaussian_filter(image, 2)
    binimage = (image > image.mean())
    labeled, _ = mh.label(binimage)
    return labeled

@TaskGenerator
def method2(image):
    # Read the image, keeping a single channel
    image = mh.imread(image)[:,:,0]
    # Blur more heavily, contrast-stretch, then threshold with Otsu
    image = mh.gaussian_filter(image, 4)
    image = mh.stretch(image)
    binimage = (image > mh.otsu(image))
    labeled, _ = mh.label(binimage)
    return labeled

Just to make sure you see what we are talking about, here is one possible input image:

[Image: example input image of cell nuclei]

What you see are cell nuclei. The very bright areas are noise or unusually bright cells. The result of method 1 looks like this:

[Image: segmentation produced by method1, with each region in a different color]

Each color represents a different region. You can see this is not very good, as many cells are merged. The reference (human-segmented) image looks like this:

[Image: human-segmented reference image]

Looping over all the images looks exactly like plain Python:

results = []
for im in glob('images/*.jpg'):
    m1 = method1(im)
    m2 = method2(im)
    ref = im.replace('images','references').replace('jpg','png')
    v1 = compare(m1, ref)
    v2 = compare(m2, ref)
    results.append( (v1,v2) )
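
The compare function is not defined in this post. A minimal sketch of what it could look like, scoring each segmentation against the human reference with the adjusted Rand index [3] (the milk function used below is my best reconstruction, an assumption rather than verbatim code from the project):

@TaskGenerator
def compare(labeled, ref):
    # Load the human-segmented reference and score the proposed
    # segmentation against it (adjusted Rand index)
    from milk.measures.cluster_agreement import rand_arand_jaccard
    ref = mh.imread(ref)
    return rand_arand_jaccard(labeled.ravel(), ref.ravel())[1]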

But how do we get the results out?

A simple solution is to write a function which writes to an output file:

@TaskGenerator
def print_results(results):
    import numpy as np
    r1, r2 = np.mean(results, 0)
    with open('output.txt', 'w') as out:
        out.write('Result method1: {}\nResult method2: {}\n'.format(r1, r2))

print_results(results)

§

Except for “TaskGenerator”, this would be a pure Python file!

With TaskGenerator, we get jugginess!

We can call:

jug execute &
jug execute &
jug execute &
jug execute &

to get 4 processes going at once.

§

Note also the line:

print_results(results)

results is a list of Task objects. This is how you define a dependency: jug picks up on the fact that, in order to call print_results, it needs all of the results values, and it behaves accordingly. print_results will only run once every comparison has finished, and it receives the computed values, not the Task objects.

Easy as Py.

§

You can get the full script above, including the data, from GitHub.

§

Reminder

Tomorrow, I’m giving a short talk on Jug for the Heidelberg Python Meetup.

If you miss it, you can hear it in Berlin at the BOSC2013 (Bioinformatics Open Source Conference) in July (19 or 20).

[1] The code in that repository still uses a pretty old version of jug; this was 2009, after all. TaskGenerator had not been invented yet.
[2] This is for demonstration purposes; the paper had better methods, of course.
[3] Again, you can do better than Adjusted Rand, as we show in the paper; but this is a demo. This way, we can just call a function in milk.

Introduction to Jug: Parallel Tasks in Python

Next Tuesday, I’m giving a short talk on Jug for the Heidelberg Python Meetup.

If you miss it, you can hear it in Berlin at the BOSC2013 (Bioinformatics Open Source Conference) in July. I will take this opportunity to write a couple of posts about jug.

Jug is a cross between the venerable make and Python. In the make tradition, you write a jugfile.py. Perhaps this is best illustrated by an example.

We are going to implement the dumb algorithm for finding all primes under 100. We write a function to check whether a number is prime:

def is_prime(n):
    from time import sleep
    # Sleep a little bit so that this does not run ridiculously fast
    sleep(1.)
    for j in xrange(2,n-1):
        if (n % j) == 0:
            return False
    return True

Then we build tasks out of this function:

from jug import Task
primes100 = [Task(is_prime, n) for n in range(2, 101)]

Each of these tasks is of the form “call is_prime with argument n”. So far, we have only built the tasks; nothing has been executed. One important point to note is that the tasks are all independent.

You can run jug execute on the command line and jug will start executing tasks:

jug execute &

The nice thing is that it is fine to run multiple of these at the same time:

jug execute &
jug execute &
jug execute &
jug execute &

They will all execute in parallel. We can use jug status to check what is happening:

jug status

Which prints out:

Task name                                    Waiting       Ready    Finished     Running
----------------------------------------------------------------------------------------
primes.is_prime                                    0          74          20           5
........................................................................................
Total:                                             0          74          20           5

74 is_prime tasks are still in the Ready state, 5 are currently running (which is what we expected, right?) and 20 are done.

Wait a little bit and check again:

Task name                                    Waiting       Ready    Finished     Running
----------------------------------------------------------------------------------------
primes.is_prime                                    0           0          99           0
........................................................................................
Total:                                             0           0          99           0

Now every task is finished. If we now run jug execute, it will do nothing, because there is nothing for it to do!

§

The introduction above has a severe flaw: this is not how you should compute all primes smaller than 100. Also, I have not shown how to get the prime values. On Monday, I will post a more realistic example.

It will also include a processing pipeline where later tasks depend on the results of earlier tasks.
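
In the meantime, here is a minimal sketch of one way to read the prime values back after the run. It uses jug’s value and set_jugdir (described in the notebook section at the top of this page) and assumes the jugfile is named primes.py, so that jug put its data in primes.jugdata:

from jug import value, set_jugdir
import primes

# Point jug at the data directory created by `jug execute`
set_jugdir('primes.jugdata')

# value() recursively loads the results behind a list of Task objects
flags = value(primes.primes100)
print([n for n, flag in zip(range(2, 101), flags) if flag])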

§

(Really weird thing: as I am typing this, WordPress suggests I link to posts on feminism and Australia. Probably some Australian reference that I am missing here.)