The Ecosystem of Unix and the Difficulty of Teaching It

Plos One published an awful paper comparing Word vs LaTeX where the task was to copy down a piece of text. Because Word users did better than LaTeX users at this task, the authors conclude that Word is more efficient.

First of all, this fits perfectly with my experience: Word [1] is faster for single-page documents where I don’t care about precise formatting, such as a letter. The result says nothing about how Word performs on large documents which are edited over months (or years). The typical Word failure modes are “you paste some text here and your image placement is now screwed up seven pages down” or “how do I copy & paste between these two documents without messing up the formatting?” This does not happen so much with a single-page document.

Of course, the authors are not happy with the conclusion that Word is better for copying down a short piece of predefined text and instead generalize to “that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems.” This is a general failure mode of psychological research: here is a simple, well-defined experimental result in a very artificial setting. Now, let me completely overgeneralize to the real world. The authors of the paper actually say this in their defense: “We understand that people who are not familiar with the experimental methods of psychology (and usability testing) are surprised about our decision to use predefined texts.” That is to say, in our discipline, we are always sort of sloppy, but reviewers in the discipline do the same, so it’s fine.

§

Now, why waste time bashing a Plos One paper in usability research?

Because one interesting aspect of the discussion is that several people have pointed out that Word is better for collaboration because of the Track Changes features. For many of us, this is laughable because one of the large advantages of LaTeX is that you can use version control on the files. You can easily compare the text written today with a version from two months ago, it makes it easier to have multiple people working, &c.[2] In Word, using Track Changes is still “pass the baton” collaboration, whereby you email stuff around and say “now it’s your turn to edit it” [3].

However, this is only valid if you know how to use version control. In that case, it’s clear that using a text-based format is a good idea and it makes collaboration easier. In the same way, I actually think that some of the problems the test subjects in the paper had with LaTeX came simply from not using an editor with a spell-checker.

The underlying concept is that LaTeX works in an ecosystem of tools working together, which is a concept that we do not, in general, teach people. I have been involved with Software Carpentry, and even before that I was trying to teach people who are not trained in computers about these sorts of tools, but we do not do a great job of teaching this concept of the ecosystem. It is abstract and it is not directly clear to students why it is useful.

Spending a few hours going through the basic Unix commands seems like a brain-dead activity when people cannot connect this to their other knowledge or pressing needs.

On the other hand, it is very frustrating when somebody comes to me with a problem they have been struggling with for days and, in a minute, I can give them a solution because it’s often “oh, you can grep in extended mode and pipe it to gawk” (or worse, before they finish the description, I’ll say “run dos2unix and it will fix it” or “the problem you are describing is the exact use case of this excellent Python package, so you don’t need to code it from scratch”). Then they ask “how could I learn that? Is there a book/course?” and I just don’t have an answer better than “do this for 10 years and you’ll slowly get it”.

It’s hard to teach the whole ecosystem at once, which means that it’s hard to teach the abstractions behind it. Or maybe, I just have not yet figured out how it would be possible.

§

Finally, let me just remark that LaTeX is a particularly crappy piece of software. It is so incredibly bad that it only survives because the alternatives manage to be even worse. It’s even sadder when you realise that LaTeX is now over 30 years old, while Word is an imitation of even older technology. We still have not been able to come up with something that is clearly better.

§

This flawed paper probably had better altmetrics than anything I’ll ever write in science, again showing what a bad idea altmetrics are.

[1] Feel free to read “Word or Word-like software” in this and subsequent sentences. I actually often use Google Docs nowadays.
[2] Latexdiff is also pretty helpful in generating diffed versions.
[3] Actually, for collaboration, the Google Docs model is vastly superior as you don’t have to email back-n-forth. It also includes a bit of version control.

Denmark

I was in Denmark last week, teaching software carpentry. The students were very enthusiastic, but they had very different starting points, which made teaching harder.

For a course aimed at complete beginners to programming, I typically rely heavily on the Python Tutor created by Philip Guo, which is an excellent tool. My goal is then to get them to understand names, objects, and the flow of control.

I don’t use the term variable when discussing Python as I don’t think it’s a very good concept. C has variables, which work like little boxes you put values in. If you’re thinking of little boxes in Python, things get confusing. If you try to think of little boxes plus pointers (or references), it’s still not a very good map of what Python is actually doing.
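A minimal sketch of the difference, with names (`a`, `b`, `c`) chosen purely for illustration: assignment binds a second name to the same object rather than copying a value into a new box.

```python
# Two names bound to one list object: assignment does not copy anything.
a = [1, 2, 3]
b = a
b.append(4)

print(a)       # the change made through b is visible through a
print(a is b)  # both names refer to the very same object

# An explicit copy creates a second, independent object.
c = list(a)
c.append(5)

print(a)       # a is unaffected by the change to c
print(c is a)  # c is a different object
```

Stepping through a snippet like this in the Python Tutor shows the two names as arrows pointing at a single list, which is the picture the "little boxes" model fails to give.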

For more intermediate students (the kind who have already used one programming language), I typically still go through this closely. I find that many still have major faults in their mental model of how names and objects work. However, for these intermediate students, this can go much faster [1]. If it’s the first time they are even seeing the idea of writing code, then it naturally needs to be slow.

Last week, because the class was so mixed, it was probably too slow for some and too fast for others.

§

A bit of Danish weirdness:

[Photo: a sausage display at a local pub]

[1] I suppose if students knew Haskell quite well but no imperative programming, this may no longer apply, but teaching Python to Haskell programmers is not a situation I have been in.

The importance of unit testing & version control for scientific software

This post is inspired by the fact that I’m teaching software carpentry in Denmark this week, but I have had this conversation a few times, so I thought I should write it down.

Often the reaction to teaching things like version control or unit testing to scientists is of the sort: “aren’t these things more appropriate for professional software developers, who can put in the effort to learn them?” I strongly disagree.

In fact, I’ll argue that unit testing and version control are more important for science than for commercial software.

§

Let’s say you are running a web-based business. Unfortunately, your website’s code is a mess. Many of the features were implemented by someone who left a while back and none of your new hires really understands what that code does. Fortunately, however, the code works, the site is pleasing to the eye and customers are happily paying for your services. Even these old code bases can have their lives stretched for far longer than you’d expect. Life is not that bad.

Let’s say, on the other hand, you are running a computer-based science enterprise. Unfortunately, your code is a mess. Many of the features were implemented by someone who left a while back and none of your new hires really understands what that code does. Fortunately, the code produces pretty plots. Unfortunately, you cannot explain what the plots represent beyond a vague idea. You can adapt the code to a new dataset, but you are never really sure why it’s working like it is and sometimes the outputs are downright mysterious. Life is pretty bad. You need to start over.

§

The difference is that in many commercial settings, only the final output matters. If a website is pretty, it won’t much matter whether the CSS behind it is a mess. If the search engine gives the customers what they want, both customer and vendor are happy and nobody will say “I’ll buy, but first can we go over the methodological details?” There are solid reasons to make the code clean and well-tested (in terms of minimizing the negative impact of individual members leaving the team or avoiding increasing costs of maintenance & extension), but it’s not required for success.

In science, however, it is not enough to have a pretty output plot. You also need to be able to explain the details behind the plot and be certain that the plot was produced the way you think it was produced. If the code gives a mysterious result, then it’s not OK to just add a hacky line of code adding 10 to the result to make it work. Similarly, the ability to go back in time in your code is a nice thing in business, but can be essential in science because we value reproducibility.

§

This is why I think it’s very good that unit testing & version control are both part of the software carpentry core curriculum.

Motivation & Demotivation

(Cross-posted from the teaching software carpentry blog, this is the second part of my motivation stories [the first part was on why you should learn version control])

When I was in high school, I didn’t enjoy learning other languages that much. I’d pick up enough to get decent grades, but it was not something I really liked (I also had the advantage of being bilingual so that I never had to put any effort towards learning English). When I tell this to people who know me in real life they are sometimes surprised because, nowadays, not only do I speak three languages on a regular day, I can go up to five on a good day and I enjoy learning about them.

What changed? When I left high school and started traveling, I discovered that languages could be used to talk to people. Like most life discoveries, it is a basic fact of life, which I already knew, but I hadn’t realized the implications. By actually interacting with people in another tongue, I became more interested in knowing the languages and eventually started enjoying the details of the languages for their own sake. The desire to talk to people transferred to a desire to learn about their languages.

Epilogue: Nowadays, when I need to learn a whole new field (like a year ago, when I started working on metagenomics), instead of relying solely on reading papers &c, I will start attending seminars even when I don’t yet understand much of what goes on. I try to get into the discussions (naturally, at first, I just listen or ask questions, unable to really contribute). Eventually, I start looking up stuff that people said on my own and boom! I’m reading papers without any effort.

Why you should learn version control: a personal story

As part of software carpentry instructor training, we were asked to write a motivational story for learning one of the themes of the material. This is mine, which I posted on the software carpentry website:

This is my motivational story for using version control (I do not yet have in mind a story of a time when I was not motivated to learn, so I will post that one later).

A few years ago, I submitted a paper somewhere and it got accepted. In the meantime, a few months had passed and we were now asked to provide the camera-ready version.

In order to generate the higher quality version of the figures, I reran the script which did so, after changing just a few of the output parameters to get a high-resolution version. However, the resulting figure did not look like the one we had submitted! Not so different as to warrant a different conclusion, mind you, but you could see that there was a slight shift in the plots when you looked at them side-by-side.

First, I calmly reviewed the code to see if something obvious popped out, then I worriedly reran the whole pipeline to regenerate intermediate results, and finally I started to panic. What was wrong? Had I submitted a paper with a result which I did not know how to reproduce?

Because I keep all my code under version control, I rolled back the code to the version which had been available at the time of submission. Now, it regenerated the figure exactly as we had submitted it. Relief.

Using binary search (which git has built in), I was able to isolate the exact code change which caused the difference. It turned out to be a very minor change in the way that a certain computation is made, which was mathematically equivalent but not numerically equivalent (i.e., it would have been the same if computers had infinite precision, but because we round, we obtain different results). This meant that an almost arbitrary decision at one point of the algorithm was done differently and then the results shifted enough to be visible.
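To illustrate what "mathematically equivalent but not numerically equivalent" can look like, here is a small sketch. It is not the actual code from the paper; the numbers and the `threshold` are hypothetical, chosen only to show the effect. Floating-point addition is not associative, so regrouping a sum can change the last bits of the result, and a comparison that sits right at the boundary can then go the other way.

```python
# The same three numbers, summed with two different groupings.
# With infinite precision both expressions would be exactly equal.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)      # the two results differ in the last bits
print(abs(left - right))  # the difference is tiny, but not zero

# A hypothetical cut-off used to make a yes/no decision downstream.
threshold = 0.6
decision_left = left > threshold
decision_right = right > threshold

print(decision_left, decision_right)  # the two groupings disagree
```

Once one such decision flips early in a pipeline, everything computed after it can shift, which is how a harmless-looking refactoring ends up visible in a plot.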

Thus, because of version control, I was able both to (1) reproduce the figure at the necessary higher resolution and (2) understand why the results had changed. There was much rejoicing.

Screencast on Github Webflow

As part of the software carpentry instructor training programme, I did a little screencast on how to use the github webflow:

(post/discussion on software carpentry blog)

§

I did a little preparation beforehand (preparing the repository and trying it out), but the capture itself took less than a pomodoro (a pomodoro is 25 minutes).

On the one hand, I did not do a lot of editing. On the other hand, the practice I had from doing the video abstract for my Bioinformatics paper really helped here.

§

I did it all on Linux, using qx11grab + arecord to capture, then kdenlive to paste it together. In this case, I could even just have used qx11grab as it was all a single take.

Putting My Effort Where My Mouth Is

Yesterday, I argued that computer programming should be part of basic instruction.

Today, I’ll talk about a particular group of people who need to learn to programme: scientists, even lab scientists.

Personally, one of the great draws of moving away from academia into industry (the good industry places, at least) is to be able to use decent tools. Code sharing is a solved problem (the solution is version control). CVS was released in 1986. That’s almost 30 years ago. In academia, use of this type of tool is not (yet?) widespread.

§

In this context, I have (for a long time now) been putting my effort where my mouth is and created a course called Programming for Scientists, which I have taught a couple of times.

Last week, at EMBL, I helped teach a course on Python. This was done on a tight schedule and still we were overwhelmed with demand (we sold out in about 24 after a single email announcement).

Software Carpentry is another great project to teach scientists about programming and computer usage.