Interview with João Carriço

Happy New Year! I am starting what will hopefully become a (semi-)regular feature: interviews! The first one is with João Carriço.

I’ve known João Carriço for a few years, but last year we were both at the Instituto de Medicina Molecular in Lisbon, got to interact more, and talked a lot of science and non-science.

https://i0.wp.com/kdbio.inesc-id.pt/~jcarrico/JAC2.png

Below is a little conversation we had on science, twitter, and bioinformatics. My questions are bolded:

You started at the bench and moved to the laptop. I’d say that this is the opposite of the typical path for a computational biologist. Do you agree?

As I see it, there are two possible paths to becoming a computational biologist: you come either from a Computational Science/Physics background or from a Biology/Biochemistry background. I am closer to the second category, since my degree was in Applied Chemistry (Biotechnology). However, at this point, I don’t believe there is a “typical” path. If you like computer science and biology, you can become a computational biologist with more or less work in either (or both) areas.

Does this impact the way you work when compared with someone who came in the other way, from pure computer science?

Indeed, it does. There are huge differences in how these two communities tackle problems. Coming from a biology-like background, I focus on how I can get the answer and visualize it, and try not to worry too much about how the algorithm is implemented. My first main goal is to get an answer out and only then worry about the details of the algorithm.

On a related note, how do you see computational biology evolving? Will it remain “interdisciplinary” or become itself a discipline, so that people are trained as “computational biologists” as college students?

I believe the second scenario you put forward is already happening. I can only hope that it won’t become the only way the field takes in new people, but it will for sure be the largest contributor.

Is this a good development or would losing the interdisciplinary flavour be a loss?

There will be losses for sure. Fresh ideas from both fields will take longer to get integrated into bioinformatics/computational biology, as the majority of people will be focused on being taught the ropes of two different fields and how they can be knotted together. Taking something seemingly unrelated and, at first look, very specific to one field and somehow applying it to the other will become more difficult. On the other hand, it will create the much-needed practical problem solvers for everyday tasks.

Have you already seen changes in how institutions handle computational biology (and bioinformatics)? How do you rate these changes?

Unfortunately, I can’t say that I have, at least not with a serious level of commitment. Now, with the new next-generation sequencing approaches, some people are starting to realize the need for specialised personnel and infrastructure, but the institutions (at least the ones oriented towards biological research) still don’t recognise the scientific merit or the research needs of the field.

Let’s switch tracks a bit and talk about science per se, namely your projects: your work seems to be a mixture of method development and applications that are very close to the clinic. Is this a fair characterisation?

You can say that. Not as close to the clinical practice as I would like though.

I work basically on data analysis and management for microbial typing methods, which are used to identify bacterial pathogens at the strain level. This is important since we know that, for several bacteria, only a few lineages within a species cause the majority of disease, while others are commensal to us. The aim of my research is to understand the best way to perform this classification with the available methods, and how it can be used to track and predict the appearance of problematic strains, i.e., the ones that can spread faster or are resistant to antibiotics.

How does one go about managing a project where you may have someone who is a bench biologist, another person who lives inside the command line, and a medical doctor?

Well, as always, the first step is finding a common language and trying to restrict our conversations to it. In the beginning, we don’t want a computer scientist explaining algorithmic details to an MD, or an MD explaining in detail how an infection progresses in a patient to a computer scientist, as both will be at a loss. Eventually this will come: the questions will arise and the common language will grow to encompass them. The important thing is to remind everyone to keep an open mind about the subjects being discussed.

You always have to be the “hinge” person, as someone told me some time ago. For the biologists I am the computer scientist and for the computer scientists I am the biologist, thus being the hinge between the two fields. But in the end I can say that we get very interesting and productive results!

Do you think hospitals will start using next-generation-sequence-based tools to either analyse samples or monitor for antibiotic resistant outbreaks? If so, what’s the timeline: in 5 years or 20 years?

This will happen for sure in less than 5 years. The problem in that example is that the presence of an antibiotic resistance gene doesn’t always mean the strain is non-susceptible to that antibiotic. So in the next 2–5 years, if the price of sequencing continues to drop, I can see NGS being done routinely as a first approach: you can get results in less than 12 hours, and that could eventually guide the antimicrobial therapy that should be prescribed, or even detect relevant virulence factors/toxins that the strain can produce.

Tell us about an interesting recent project of yours (published or unpublished).

In the last years I have been working a lot with ontologies and RESTful interfaces applied to microbial typing and NGS.

As I usually say, I’m not a big fan of this type of work, as I prefer working on algorithms and visualisation, but I see it as the base for all my other work. Without a common language with which databases can communicate, or which simply reduces the overhead of data integration, having the best algorithms and result visualisation can be meaningless if you don’t get the data integrated correctly, or don’t get the data at all. It is the part of the big puzzle whose necessity most people recognise but which they shy away from doing. I decided to bite the bullet and push (slowly) forward in this area, and I’m being rewarded with some good collaborations and interest from the community. Hopefully, next year we will have a couple of publications to show the results.

You are an active science twitterer [João is @jacarrico]. Is this something that’s an addition to your work, a distraction, or both?

Being a very important addition to my work largely outweighs the distraction! Twitter gives me the ability to stay in contact with the worldwide bioinformatics community, since most of them are active twitterers/bloggers. It also lets me post a question and, often minutes later, get a couple of answers that save me several hours of reading through software manuals, FAQs or papers!

Do you think this is a fad that’ll go away or something that will stay? Perhaps in a modified form, but would you say that the idea of fast and unfiltered science “gossip” is here to stay?

The science gossip, with all its juicy bits, is here to stay on twitter. As I see it, the filtering of twitter gossip happens at the blog level, as tweets get blogged and commented upon, so the good parts stay afloat above the noisy background.

Thank You!

João Carriço is a researcher at the Molecular Microbiology and Infection Unit of the Instituto de Medicina Molecular. He can be found on twitter as @jacarrico. A full list of his publications is available in his Google Scholar profile.

Paper Review: Automated prior knowledge-based quantification of neuronal patterns in the spinal cord of zebrafish

Automated prior knowledge-based quantification of neuronal patterns in the spinal cord of zebrafish by Johannes Stegmaier, Maryam Shahid, Masanari Takamiya, Lixin Yang, Sepand Rastegar, Markus Reischl, Uwe Strähle, and Ralf Mikut. in Bioinformatics (2013) [DOI]

It’s been a while since I’ve had a paper review, even though one of my goals is to give more space to bioimage informatics. So, I will try to make up for it in the next few weeks. This is a paper which is not exactly hot off the press (it came out two months ago), but still very recent.

The authors are working with zebrafish. Unfortunately, I am unable to evaluate the biological results as I do not know much about zebrafish, but I can appreciate the methodological contributions. I will illustrate some of the methods based on a figure (Fig. 2) from the paper:

Figure 2

The top panel is the data (a fish spinal cord, cropped out of a larger field); the next two show a binarization of the same data and a line fit (in red). Finally, the bottom panel shows the effect of straightening the image to a line. This allows for comparison between different images by morphing them all onto a common template. The alignment is performed on only one of the channels, while the others can carry complementary information.
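To make the straightening step concrete, here is a minimal sketch of one way such a band-straightening can be implemented (my own reconstruction, not the authors’ code; the polynomial degree and the band half-width are arbitrary illustration values):

```python
import numpy as np
from scipy import ndimage

def straighten(img, mask, half_width=20, degree=3):
    """Resample img along normals to a curve fitted to the foreground of mask."""
    ys, xs = np.nonzero(mask)
    centre = np.poly1d(np.polyfit(xs, ys, degree))   # fitted centre line y(x)
    slope = centre.deriv()
    offsets = np.arange(-half_width, half_width + 1)
    columns = []
    for x in range(img.shape[1]):
        dy = slope(x)
        norm = np.hypot(1.0, dy)
        nx, ny = -dy / norm, 1.0 / norm               # unit normal to the curve at x
        sample_y = centre(x) + offsets * ny
        sample_x = x + offsets * nx
        columns.append(ndimage.map_coordinates(img, [sample_y, sample_x], order=1))
    # rows: position across the band (offset 0 is the fitted centre line);
    # columns: position along the fitted curve
    return np.array(columns).T
```

As described above, the fit is computed on a single (binarized) channel and the resulting transform can then be applied to the other channels.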

§

This is very similar to work that has been done in straightening C. elegans images (e.g., Peng et al., 2008) in both intent and some of the general methods (although there you often morph the whole space instead of just a band of interest). It is a bit unfortunate that the bioimage informatics literature sometimes aggregates by model system when many methods can profitably be used across problems.

§

Finally, I really like this visualization, but I need to give you a bit of background to explain it (if I understood it correctly). Once a profile has been straightened (panel D in the figure above), you can summarize it by averaging along the horizontal dimension to get the average intensity at each location (where zero is the centre of the spinal cord) [1]. You can then stack these profiles (analogously to what you’d do to obtain a kymograph) as a function of your perturbation (in this case, a drug concentration):

Figure 6

This is Figure 6 in the paper.

The effect of the drug (and its saturation) becomes obvious.
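If I understood the construction correctly, it boils down to something like the following sketch (my own illustration, assuming the straightened band is oriented as in the `straighten` sketch above, with the centre of the cord at offset zero):

```python
import numpy as np

def band_profile(straightened):
    # average along the length of the straightened band (its columns) to get the
    # mean intensity as a function of the position across the band
    return straightened.mean(axis=1)

def profile_stack(straightened_bands):
    # one row per condition (e.g. increasing drug concentration), kymograph-style
    return np.vstack([band_profile(b) for b in straightened_bands])
```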

§

As a final note, I’ll leave you with this quote from the paper, which validates some of what I said before (namely, that the quality of human evaluation is consistently over-estimated):

Initial tests unveiled intra-expert and inter-expert variations of the extracted values, leading to the conclusion that even a trained evaluator is not able to satisfactorily reproduce results.

[1] The authors average a different marker than the one used for straightening, but since I know little about zebrafish biology, I focus on the methods.

FAQ: How Many Clusters Did You Use?

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advance Access] [Previous discussion on this blog]

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220  [Previous discussion on this blog]

Both of my Bioinformatics papers above use the concept of bag of visual words. The first for classification, the second for pattern unmixing.

Visual words are formed by clustering local appearance descriptors. The descriptors may have different origins (see the papers above and the references below) and the visual words are used differently, but the clustering is a common intermediate step.
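For readers unfamiliar with the pipeline, here is a minimal generic sketch of the bag of visual words construction (a scikit-learn based illustration, not the code behind the papers; the local descriptor extraction is assumed to have happened already):

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(per_image_descriptors, n_clusters=256, random_state=0):
    """per_image_descriptors: list of (n_local_features, n_dims) arrays, one per image."""
    # 1. cluster all local descriptors: each cluster centre is a "visual word"
    codebook = KMeans(n_clusters=n_clusters, random_state=random_state)
    codebook.fit(np.vstack(per_image_descriptors))
    # 2. represent each image as a normalised histogram of word assignments
    features = []
    for descs in per_image_descriptors:
        words = codebook.predict(descs)
        counts = np.bincount(words, minlength=n_clusters).astype(float)
        features.append(counts / counts.sum())
    return np.array(features), codebook
```

The resulting per-image histograms are what the downstream classifier (or unmixing model) actually sees.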

A common question when I present this work is how many clusters do I use? Here’s the answer: it does not matter too much.

I used to just pick a round number like 256 or 512, but for the local features paper, I decided to look at the issue a bit closer. This is one of the panels from the paper, showing accuracy (y-axis) as a function of the number of clusters (x-axis):

[Figure: classification accuracy (y-axis) as a function of the number of clusters (x-axis)]

As you can see, if you use enough clusters, you’ll do fine. If I had extended the results rightwards, then you’d see a plateau (read the full paper & supplements for these results) and then a drop-off. The vertical line shows N/4, where N is the number of images in the study. This seems like a good heuristic across several datasets.

One very interesting result is that choosing clusters by minimising AIC can be counter-productive! Here is the killer data (remember, we would be minimizing the AIC):

[Figure: accuracy plotted against AIC]

Minimizing the AIC leads to lower accuracy! AIC was never intended to be used in this context, of course, but it is often used as a criterion to select the number of clusters. I’ve done it myself.

Punchline: if doing classification using visual words, minimising AIC may be detrimental; try using N/4 instead (N = number of images).
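For reference, this is the kind of select-k-by-AIC procedure I am arguing against. The sketch below fits a Gaussian mixture simply because scikit-learn exposes an AIC for it directly; it illustrates the criterion, not the exact clustering used in our papers:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def aic_by_k(per_image_descriptors, ks=(64, 128, 256, 512)):
    # fit one mixture per candidate number of clusters and record its AIC;
    # choosing k = argmin AIC is the criterion that can be counter-productive here
    X = np.vstack(per_image_descriptors)
    return {k: GaussianMixture(n_components=k, covariance_type='diag',
                               random_state=0).fit(X).aic(X)
            for k in ks}
```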

Other References

This paper (reviewed before on this blog) presents supporting data too:

Noa Liscovitch, Uri Shalit, & Gal Chechik (2013). FuncISH: learning a functional representation of neural ISH images Bioinformatics DOI: 10.1093/bioinformatics/btt207

Unsupervised subcellular pattern unmixing. Part II

On Friday, I presented the pattern unmixing problem. Today, I’ll discuss how we solved it.

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220

The first step is to extract objects. For this, we use a combination of global & local thresholding: a pixel is on if it is both above a global threshold (which separates the cells from the background) and above a local threshold (which identifies subcellular objects [1]).
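In code, the combined thresholding looks roughly like the sketch below (simplified; the neighbourhood size and the exact local statistic are illustrative values, not necessarily the ones used in the paper):

```python
import mahotas as mh
from scipy import ndimage

def detect_objects(img, local_size=16):
    # a pixel is "on" only if it passes BOTH thresholds:
    #  - global (Otsu): separates the cells from the background
    #  - local (neighbourhood mean): picks out subcellular objects within cells
    # img is assumed to be an unsigned integer fluorescence image
    global_t = mh.otsu(img)
    local_t = ndimage.uniform_filter(img.astype(float), size=local_size)
    binary = (img > global_t) & (img > local_t)
    labelled, n_objects = mh.label(binary)
    return labelled, n_objects
```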

We then group the objects found using k-means clustering. Here is what we obtain for a lysosomal picture (different colours mean different clusters) [2]:

[Figure: detected objects in a lysosomal image, coloured by cluster]

and the equivalent for the mitochondrial image:

[Figure: detected objects in a mitochondrial image, coloured by cluster]

You will see that the mitochondrial image has many green objects and fewer dark purple ones, but both the mitochondrial and the lysosomal images contain all of the groups. Now (and this is an important point): we do not attempt to classify each individual object, only to estimate the mixture.

§

Of course, if we had the identity of each object, the mixture would be trivially estimated. But we do not need to identify each object. In fact, to attempt to do so would be a gross violation of Vapnik’s Dictum (which says do not solve, as an intermediate step, a harder problem than the one you are trying to solve). It is easier to just estimate the mixtures [3].

In this formulation it might not even matter much that some of the objects we detect correspond to multiple biological objects!

§

How do we solve the mixture problem? Latent Dirichlet allocation or basis pursuit. The details are in the paper, but I will jump to the punchline.
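To give a flavour of the simplest possible version of this estimation step, here is a sketch that uses non-negative least squares instead of the LDA/basis pursuit formulations of the paper, and, unlike the paper, assumes the pure patterns are known (i.e., the supervised variant):

```python
import numpy as np
from scipy.optimize import nnls

def estimate_mixture(test_hist, pure_hists):
    """test_hist: object-cluster histogram of the condition to unmix.
    pure_hists: one column per fundamental pattern (e.g. pure mitochondrial,
    pure lysosomal), each a normalised object-cluster histogram."""
    coeffs, _ = nnls(pure_hists, test_hist)
    return coeffs / coeffs.sum()    # mixture fractions, summing to one
```

The unsupervised method in the paper goes further: it learns the base patterns themselves from the mixed data.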

We tested the method using a dataset where we had manipulated the cell tagging so we know the ground truth (but the algorithm, naturally, does not see it). On the graph below, the x-axis is the (hidden) truth and the y-axis is the automated estimate. In green, the perfect diagonal; and each dot represents one condition:

[Figure: hidden true mixture fraction (x-axis) versus automated estimate (y-axis); the green line is the perfect diagonal]

§

I will note that each individual dot in the above plot represents several images from each condition. On a single image (or single cell) level the prediction is not so accurate. Only by aggregating a large number of objects can the model predict well.

This also points out why it may be very difficult for humans to perform this task (nobody has tried to do it, actually).

[1] A global threshold did not appear to be sufficient for this because there is a lot of in-cell background light (auto-fluorescence and out-of-focus light).
[2] For this picture, I used 5 clusters to get 5 different colours. The real process used a larger number, obtained by minimising BIC.
[3] Sure, we can then reverse engineer and obtain a probability distribution for each individual object, but that is not the goal.

Old Work: Unsupervised Subcellular Pattern Unmixing

Continuing down nostalgia lane, here is another old paper of mine:

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220

I have already discussed the subcellular location determination problem. This is: given images of a protein, can we assign it to an organelle?

This is, however, a simplified version of the world: many proteins are present in multiple organelles. They may move between organelles in response to a stimulus or as part of the cell cycle. For example, here is an image of mitochondria in green (nuclei in red):

[Image: mitochondria in green, nuclei in red]

Here is one of lysosomes:

[Image: lysosomes]

And here is a mix of both!

[Image: a mixture of both markers]

This is a dataset constructed for the purpose of this work, so we know what is happening, but it simulates the situation where a protein is present in two locations simultaneously.

Thus, we can move beyond the simple assignment of a protein to an organelle and assign it to multiple organelles. In fact, some work (both from the Murphy group and others) has looked at subcellular location classification using multiple labels per image. This, however, is still not enough: we want to quantify the mixture.

This is the pattern unmixing problem. The goal is to go from an image (or a set of images) to something like the following: this is 30% nuclear and 70% cytoplasmic, which is very different from 70% nuclear and 30% cytoplasmic. The basic organelles can serve as the base patterns [1].

Before our paper, there was some work approaching this problem from a supervised perspective: given examples of the different organelles (i.e., of markers that localise to a single organelle), can we automatically build a system which, when given images of a protein distributed across multiple organelles, can figure out which fraction comes from each organelle?

Our paper extended this work to the unsupervised case: can you learn the mixture when you do not know what the basic patterns are?

References

Tao Peng, Ghislain M. C. Bonamy, Estelle Glory-Afshar, Daniel R. Rines, Sumit K. Chanda, and Robert F. Murphy. “Determining the distribution of probes between different subcellular locations through automated unmixing of subcellular patterns.” PNAS 107 (7): 2944–2949 (2010). DOI: 10.1073/pnas.0912090107

T. Zhao, M. Velliste, M. V. Boland, and R. F. Murphy. “Object type recognition for automated analysis of protein subcellular location.” IEEE Transactions on Image Processing 14 (9): 1351–1359.

[1] This is still a limited model because we are not sure even how many base patterns we should consider, but it will do for now.

Paper review: Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis

Paper review:

Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis by L. Shamir in Journal of Microscopy, 2011 [DOI]

This is an excellent simple paper [1]. I will jump to the punchline (slightly edited by me for brevity):

This paper demonstrates that microscopy images that were previously used for developing and assessing the performance of bioimage classification algorithms can be classified even when the biological content is removed from the images [by replacing them with white squares], showing that previously reported results might be biased, and that the computer analysis could be driven by artefacts rather than by the actual biological content.

Here is an example of what the author means:

[Figure: example images in which the biological content has been replaced by white squares]

Basically, the author shows that even after modifying the images by drawing white boxes where the cells are, classifiers still manage to do apparently well. Thus, they are probably picking up on artefacts instead of signal.

This is (and this analogy is from the paper, although not exactly in this form) like a face recognition system which seems to work very well because, in all of the images it has of me, I am wearing the same shirt. It can perform very well on the training data, but will be fooled by anyone who wears that same shirt.

§

This is very important work, as it points to the fact that many previous results were probably overinflated. Looking at the dates, this work was probably done at the same time that I was working on my own paper on the evaluation of subcellular location determination (it just took a while for that one to appear in print).

I expect that my proposed stricter protocol for evaluation (train and test on separate images) would be more protected against this sort of effect [2]: we are now modeling the real problem instead of a proxy problem.

§

I believe two things about image analysis of biological samples:

  1. Computers can be much better than humans at this task.
  2. Some (most? much of?) published literature overestimates how well computers do with the method being presented.

Note that there is no contradiction between the two, except that point 2, if widely believed, can make it harder to convince people of point 1.

(There is also a third point, which is that most people overestimate how well humans do.)

[1] Normally, I’d review recent papers only, but not only had this one escaped my attention when it came out (in my defense, it came out just when I was trying to finish my PhD thesis), it also deals with themes I have blogged about before.
[2] I tried a bit of testing around here, but it is hard to automate the blocking of the cells. Automatic thresholding does not work because it depends on the shape of the signal! This is why the author of this paper drew squares by hand.

Seeing is Believing. Which is Dangerous.

One of the nice things about being at EMBL is that, if you just wait, eventually you can hear the important people in your field speak. Today, I’m quite excited about the Seeing is Believing conference.

But ever since I saw this advertised, I have disliked the name Seeing is Believing.

[Image: the grey square optical illusion]

  1. Seeing is believing. This is unquestionable.
  2. But seeing is not always justified believing. Our seeing apparatus will often lead us astray. This is especially true on images which do not look like the ones we evolved for (and grew up looking at).
  3. The fact that seeing is believing is actually often a cognitive problem which needs to be overcome!

§

I can no longer find who said it at BOSC, but someone pointed out, insightfully, that a visualization is already an interpretation of the data, and it may be wrong.

When I show you a picture of a cell, it is rarely raw data. The raw data is a big pixel array. By the time I’m showing it to you, I’ve done the following:

  1. Chosen an example to show.
  2. Often projected the data from 3D to a 2D representation
  3. Tweaked contrast.

Point 1 is the biggest culprit here: the selection of which cell to image and show can be an incredibly biased process (even unconsciously biased, of course).

However, even tweaks to the way the projection is performed and to the contrast can highlight or hide important details (as someone with a lot of experience playing with images, I can tell you that there is a lot of room for “highlighting what you want to show”). With the newer methods (super-resolution type methods), this is even worse: the “picture” you see is already the output of a big processing pipeline.
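To make points 2 and 3 from the list above concrete, here is a minimal sketch of what those steps typically look like (the percentile values are illustrative; every choice in this code changes what the viewer ends up “seeing”):

```python
import numpy as np

def project_and_stretch(stack, low_pct=1.0, high_pct=99.0):
    # stack: 3D image (z, y, x); a maximum-intensity projection collapses it to 2D
    proj = stack.max(axis=0)
    # percentile-based contrast stretch: the chosen percentiles already decide
    # which details will be visible in the final "picture"
    lo, hi = np.percentile(proj, [low_pct, high_pct])
    return np.clip((proj - lo) / (hi - lo), 0.0, 1.0)
```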

§

I’m not even thinking about the effects of the tagging protocols, which introduce their own artifacts. But we humans often make the mistake of saying things like “this is an image of protein A in cell type B” instead of “this is an image of a chimeric protein which includes the sequence of A, expressed with a strong promoter, in cell type B”.

§

We know that these artifacts and biases are there, of course. But we believe the images. And this can be a problem because humans are not actually all that great at image analysis.

Seeing is believing, which too often means that we suspend our disbelief (or, as we scientists like to say, we suspend our skepticism). This is not a recipe for good science.

Update: On twitter, Jim Procter (@foreveremain) points out a great example: the story of the salmon fMRI. We can see it, but we shouldn’t believe it.

Is Cell Segmentation Needed for Cell Analysis?

Having just spent some posts discussing a paper on nuclear segmentation (all tagged posts), let me ask the question:

Is cell segmentation needed? Is this a necessary step in an analysis pipeline dealing with fluorescent cell images?

This is a frequent question whenever I give a talk on my work that does not use segmentation, for example, using local features for classification (see the video). It is a FAQ because, for many people, it seems obvious that the answer is yes, you need cell segmentation. So, when they see me skip that step, they ask: shouldn’t you have segmented the cell regions?

Here is my answer:

Remember Vapnik’s dictum [1]: do not solve, as an intermediate step, a harder problem than the problem you really need to solve.

Thus the question becomes: is your scientific problem dependent on cell segmentation? In the case of subcellular location determination, for example, it is not: all the cells in the same field display the same phenotype, and your goal is to find out what it is. Therefore, you do not need an answer for each cell, only for the whole field.

In other problems, you may need a per-cell answer: for example, in some kinds of RNAi experiments, only a fraction of the cells in a field display the RNAi phenotype, while the others did not take up the RNAi. There, segmentation may be necessary. Similarly, if a measurement such as the distance of fluorescent bodies to the cell membrane is meaningful by itself (as opposed to being used as a feature for classification), then you need segmentation.

However, sometimes you can get away without segmentation.

§

An important point to note is the following: while it may be good to have access to perfect classification, imperfect classification (i.e., the type you actually get) may not help as much as the perfect kind.

§

Just to be clear, I was not the first person to notice that you do not need segmentation for subcellular location determination. I think this is the first reference:

Huang, Kai, and Robert F. Murphy. “Automated classification of subcellular patterns in multicell images without segmentation into single cells.” Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on. IEEE, 2004. [Google scholar link]

[1] I’m quoting from memory, so it may be a bit off. It sounds obvious when you put it this way, but it is still often not respected in practice.

To reproduce the paper, you cannot use the code we used for the paper

Over the last few posts, I described my nuclear segmentation paper.

It has a reproducible research archive.

§

If you now download that code, it is not the code that was used for the paper!

In fact, the version that generates the tables in the paper does not run anymore, because it only runs with old versions of numpy!

In order for it to perform the computations in the paper, I had to update the code. In order to run the code as it was used for the paper, you need to get old versions of the software.

§

To some extent, this is due to numpy’s frustrating lack of forward compatibility [1]. The issue at hand was the changed semantics of the histogram function.

In the end, I think I completely avoided that function in my code for a few years as it was toxic (when you write libraries for others, you never know which version of numpy they are running).
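A defensive pattern (my suggestion here, not what the code for the paper did) is to never rely on the defaults: pass explicit bin edges and be explicit about normalisation, so the meaning of the result cannot silently change between numpy versions. A small sketch:

```python
import numpy as np

values = np.random.random(10000)
# explicit bin edges and explicit normalisation, instead of defaults whose
# semantics have changed between numpy versions
edges = np.linspace(0.0, 1.0, 11)            # 10 bins over [0, 1]
counts, _ = np.histogram(values, bins=edges, density=False)
```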

§

But as much as I can gripe about numpy breaking code between minor versions, they would eventually be justified in changing their API with the next major version change.

In the end, the half-life of code is such that each year, it becomes harder to reproduce older papers even if the code is available.

[1] I used to develop for the KDE Project, where you did not break users’ code, ever, and so I find it extremely frustrating to have to explain that you should not change an API on aesthetic grounds between minor versions.

Why Pixel Counting is not Adequate for Evaluating Segmentation

Let me illustrate what I was trying to say in a comment to João Carriço:

Consider the following three shapes:

[Figure: a red reference shape and two candidate segmentations, green and blue]

If the top (red) image is your reference and green and blue are two candidate solutions, then pixel counting (which forms the basis of the Rand and Jaccard indices) will say that green is worse than blue: green differs from the reference by 558 pixels, while blue differs by only 511 pixels.

However, the green image is simply a fatter version of red (with an extra boundary of circa 2 pixels). Since boundaries cannot really be drawn at pixel level anyway (the border between background and foreground is fuzzy), this is not an important difference. The blue image, however, has an extra blob and so is qualitatively different.

The Hausdorff distance or my own normalized sum of distances, on the other hand, would say that green is very much like red, while blue is more different. Thus they capture the important differences better than pixel counting. I think this is why we found that these are better measures than Rand or Jaccard (or Dice) for evaluation of segmentation.
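For completeness, here is a small sketch of both kinds of measure computed on binary masks (generic numpy/scipy, not the code from the paper):

```python
import numpy as np
from scipy import ndimage

def jaccard(reference, candidate):
    # pixel-counting agreement: size of the intersection over size of the union
    r, c = reference.astype(bool), candidate.astype(bool)
    return (r & c).sum() / float((r | c).sum())

def hausdorff(reference, candidate):
    # distance-based agreement: the largest distance from a pixel of one mask
    # to the nearest pixel of the other, computed via distance transforms
    r, c = reference.astype(bool), candidate.astype(bool)
    dist_to_r = ndimage.distance_transform_edt(~r)
    dist_to_c = ndimage.distance_transform_edt(~c)
    return max(dist_to_r[c].max(), dist_to_c[r].max())
```

In the example above, the fat green contour adds many disagreeing pixels, but they all sit close to the reference boundary, so its Hausdorff distance stays small; the extra blue blob is far from any reference pixel and is penalised accordingly.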

(Thanks João for prompting this example. I used this when I gave a talk or two about this paper, but it was lost in the paper because of page limits.)

Reference

Nuclear segmentation in microscope cell images: a hand-segmented dataset and comparison of algorithms by Luis Pedro Coelho, Aabid Shariff, and Robert F. Murphy, in Biomedical Imaging: From Nano to Macro, 2009 (ISBI ’09), IEEE International Symposium on. DOI: 10.1109/ISBI.2009.5193098 [PubMed Central open access version]