Why Python is Better than Matlab for Scientific Software

Why Python is Better than Matlab for Scientific Software

This is an argument I made at EuBIAS when arguing for the use of mahotas and the Python numpy-stack for bioimage informatics. I had a few conversations around this and decided to write a longer post summarizing the whole argument.

0. My argument mostly applies for new projects

If you have legacy code in MATLAB, then it may make sense to just continue using it. If it works, don’t fix it. However, if your Matlab code keeps causing you pain, Python might be a solution.

Note too that porting code is not the same as writing from scratch. You can often convert code from MATLAB to Python in a small fraction of the time it would take you to start from scratch.

1. Python has caught up with Matlab and is in the process of overtaking it.

This is my main argument: the momentum is in Python’s direction. Even two or three years ago, Python was behind. Now, it’s sailing past Matlab.

nr_lines_python

This graph shows the number of lines of code in some important projects for bioimage informatics/science in general (numpy, matplotlib, mahotas, skimage, and sklearn). As you can see, the base projects on the top (numpy and matplotlib) have been stable for some years, while the more applied packages at the bottom have exploded in recent years.

Depending on what you are doing, Python may even better support it. It is now, Matlab which is playing catch-up with open source software (for example, Matlab is now introducing their own versions of Dataframe, which Python has through Pandas [ itself, a version of R’s Dataframe object]).

The Python projects are also newer and tend, therefore, to be programmed in a more modern way: it is typical to find automated testing, excellent and complete documentation, a dedicated peer-reviewed publication, &c. This ain’t your grandfather’s open source with a dump on sourceforge and a single README file full of typos.

As an example of the large amount of activity going on in the Python world, just this week, Yhat released ggplot for Python [1]. So, while last week, I was still pointing to plotting as one of the weakneses of Python, it might no longer be true.

2. Python is a real programming language

Matlab is not, it is a linear algebra package. This means that if you need to add some non-numerical capabilities to your application, it gets hairy very fast.

For scientific purposes, when writing a small specialized script, Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage; perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.

In every area, specialized languages are the best choice, but Python is the second best in more areas [2].

3. Python can easily interface with other languages

Python can interfact with any language which can be interacted through C, which is most languages. There is a missing link to some important Java software, but some work is being done to address that too. Technically, the same is true of Matlab.

However, the Python community and especially the scientific Python community has been extremely active in developing tools to make this as easy as you’d like (e.g., Cython).  Therefore, many tools/libraries in C-based languages are already wrapped in Python for you. I semi-routinely get sent little snippets of R code to do something. I will often just import rpy2 and use it from Python without having to change the rest of my code.

4. With Python, you can have a full open-source stack

This means that you are allowed to, for example, ship a whole virtual machine image with your paper. You can also see look at all of the code that your computation depends on. No black boxes.

5. Matlab Licensing issues are a pain. And expensive.

Note that I left the word expensive to the end, although in some contexts it may be crucial. Besides the basic MATLAB licenses, you will often need to buy a few more licenses for specific toolboxes. If you need to run the code on a cluster, often that will mean more licenses.

However, even when you do have the money, this does not make the problem go away: now you need admin for the licensing. When I was at CMU, we had campus-wide licenses and, yet, it took a while to configure the software on every new user’s computer (with some annoying minor issues like the fact that the username on your private computer needed to match the username you had on campus), you couldn’t run it outside the network (unless you set up a VPN, but this still means you need network access to run a piece of software), &c. Every so often, the license server would go down and stop everybody’s work. These secondary costs can be as large as the licensing costs.

Furthermore, using Python means you can more easily collaborate with people who don’t have access to Matlab. Even with a future version of yourself who decided to economize on Matlab licenses (or if the number of users shrinks and your institution decides to drop the campus licensing, you will not be one of the few groups now forced to buy it out-of-pocket).

By the way, if you do want support, there are plenty of options for purchasing it [3]: larger companies as well as individual consultants available in any city. The issue of support is orthogonal to the licensing of the software itself.

§

Python does have weaknesses, of course; but that’s for another post.

[1] Yhat is a commercial company releasing open-source software, by the way; for those keeping track at home.
[2] I remember people saying this about C++ back in the day.
[3] However, because science is a third-world economy, it may be easier to spend 10k on Matlab licenses which come with phone support to spending 1k on a local Python consultant.
Advertisement

The False Hope of Usable Data Analysis

I changed the regular schedule of the posts because I wanted to write down these ideas.

A few days ago, in a panel at EuBIAS, I argued again that scientists should learn how to programme. I also argued that usability of bioimage analysis was a false hope.

Now, to be sure: usability is great, but usability does not mean usable without programming skills. Good usable programming environments can be the most usable way to achieve something[1]. I find the Python environment one of the most usable for data analysis currently, although there is still a lot of work which could improve it.

§

road_signs

We can build communication systems without words, but only if the vocabulary is very limited. Otherwise, people need to learn how to read [2]. I think this a good analogy for non-programming environments.

§

The problem is that image analysis (or data analysis) is not a closed goal. Whatever we are doing today, will probably be packaged into simple-to-use tools, but the problems will grow in size and complexity.

For a fixed target, like sending email or writing a blog, we can build nice tools that don’t require programming. Any modern email client basically does email well enough. There is probably only a small set of behaviours we want our blogs to do (like scheduling a post) and I think we can get a small set of features that covers 95%+ of uses. There might be a need for a few hundred plugins, but not constant innovation. There is no constant pressure to do 10 times more.

But data analysis is not in the same category as sending email. It’s an open-ended problem, which will grow continuously, which has been growing continuously. Only a full-blown artificial intelligence system will be able to deal with the sort of analyses that we will want to do in 10 years. There are even analyses that we already want to do, but do not yet have the right code and tools.

§

If anything, as time has passed, I have felt more and more of a need to think in low-level terms [3].

A few years ago, push-button analysis was sufficient for most problems. Load your data into Excel, select the rows, and plot. Fit a line, compute some stats. STATA gave you a bit more power if Excel did not suffice. Now, the problems grew and push-button solutions do not scale. Not only do we have more data, we have more complex, more unstructured data.

Afew years ago, pointing out that Excel can only handle 1 million rows would have made you seem like a technically-obsessed weirdo, now it is a serious limitation.

A few years ago, people were writing things like feel free to use interpreted languages, it doesn’t matter that you’re losing performance compared to C; computers are super-fast, waste them. Now, there is much more interest in building implementations that are as fast as C (normally using Just-in-time compilation).

This will not get better and just saying that tools should be easier for non-programmers is missing the point.

§

Programming is like writing: a general purpose technological skill which transforms all activities. And this means that, eventually, it becomes useful (or even necessary) for many activities which are outside the core of programming (who’d have thought a salesperson would have to know how to read and write? A firefighter?).

Almost any job that does not require programming is one which can be done by a robot. Except entertainment and those jobs that Tyler Cowen, for lack of a better word, calls marketing. Tyler calls them marketing, but prostitution might be just as accurate, as it is about providing not a specific service or product, which could be provided by a machine, but the general positive feeling that comes from human contact [4].

Related

Bayes and Big Data by Cosma Shalizi

The Average is Over by Tyler Cowen

[1] If you wish, read scripting for programming. I never cared much for this division.
[2] If you google for traffic signs you’ll see that actually most images have at least one sign with words or images.
[3] The need to managing parallelism (as our cores multiply, but not get faster) and memory access patterns as data grows faster than RAM have forced me to think about exactly what is happening in my machines.
[4] Obviously, Tyler is right to use the word marketing even if it’s not a good fit. Prostitution has a strong negative charge..

Mahotas and the Python Scientific Ecosystem for Bioimage Analysis

This week, I’m in Barcelona to talk about mahotas at the EuBIAS 2013 Meeting.

You can get a preview of my talk here. The title is “Mahotas and the Python Scientific Ecosystem for Bioimage Analysis” and you’ll see it does not exclusively talks about mahotas, but the whole ecosystem. Comments are welcome (especially if they come in the next 24 hours).

In preparation, I released new versions of mahotas (and the sister mahotas-imread) yesterday:

There are a few bugfixes and small improvements throughout.

*

If you like and use mahotas, please cite the software paper.