A Weird Python 3 Unicode Failure

The following code can fail on my system:

from os import listdir
for f in listdir('.'):
    print(f)
Why?

UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 13: surrogates not allowed
What?

I have a file with the name b'Latin1 file: \xe9'. This is a filename containing an "é" encoded in Latin-1 (where it is the byte value \xe9).
Python attempts to decode the name using the current locale encoding, which is UTF-8. Unfortunately, \xe9 is not valid UTF-8, so Python works around this by inserting a surrogate character. As a result, I get a variable f which can still be used to open the file.
However, I cannot print the value of this f, because converting it back to UTF-8 for printing triggers the error.
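
To make the mechanism concrete, here is a minimal sketch (mine, not from the original post) of what listdir() effectively does under a UTF-8 locale:

# Decode a Latin-1 byte with the surrogateescape handler, as listdir() does:
name = b'Latin1 file: \xe9'.decode('utf-8', errors='surrogateescape')
print(repr(name))     # 'Latin1 file: \udce9' -- the stray byte became a lone surrogate
name.encode('utf-8')  # raises UnicodeEncodeError: surrogates not allowed, just like print(name)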
I can understand what is happening, but it’s just a mess. [1]

§

Here is a complete example:

# Create a file whose name contains a Latin-1 encoded "é" (not valid UTF-8)
f = open('Latin1 file: é'.encode('latin1'), 'w')
f.write("My file")
f.close()

# Now list the directory; print() chokes on the bad name
from os import listdir
for f in listdir('.'):
    print(f)
On a modern Unix system (i.e., one that uses UTF-8 as its system encoding), this will fail.
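
If all you need is readable output, one possible workaround (my sketch, not part of the original post) is to recover the raw bytes with os.fsencode() and decode them with a replacement character, or to list the directory as bytes in the first place:

import os

# Recover the original bytes and decode with a replacement character, so print() cannot fail
for f in os.listdir('.'):
    print(os.fsencode(f).decode('utf-8', errors='replace'))

# Alternatively, pass bytes to listdir() and get raw byte filenames back
for f in os.listdir(b'.'):
    print(f)

Neither option is pretty, which is rather the point of the complaint.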

§

A good essay on the failure of the Python 3 transition is out there to be written.

[1] ls on the same directory generates a placeholder character, which is a decent fallback.

“Science’s Biggest Fail”

I completely agree with Scott Adams on this one (many posts tagged nutrition on this blog have echoed the same sentiment):

What is science's biggest fail of all time?

I nominate everything about diet and fitness.

Maybe science has the diet and fitness stuff mostly right by now. I hope so. But I thought the same thing twenty years ago and I was wrong.

[…]

Today I saw a link to an article in Mother Jones bemoaning the fact that the general public is out of step with the consensus of science on important issues. The implication is that science is right and the general public are idiots. But my take is different.

I think science has earned its lack of credibility with the public. If you kick me in the balls for 20-years, how do you expect me to close my eyes and trust you?

§

And I somewhat disagree with this response. It’s a common cop-out:

Who, exactly, does Adams think has been kicking him in the balls for 20 years?

Scientists themselves? Science teachers? Pop-science journalists? He downplays the roles of all these parties in his article[…]

The article says that the problem is pop-science journalists and the people who share their stories on Facebook & Twitter.

Sorry, but no. Those parties are somewhat at fault, but so are real, bona fide tenured scientists and the scientific community.

Here is another weak argument:

How indeed? In the scientific journal papers I read, I rarely (if ever) encounter a scientist who claims anything like “this topic is now closed.”

Of course, scientists rarely say outright that a topic is closed, but they do say things like "now that we've determined X, this opens new avenues of research", which treats X as settled.

§

The overhyping of nutritional claims by scientists is bad enough that Nature wrote an editorial naming and shaming a Harvard department chair for oversimplifying the research.

Outside of nutrition, look at this egregious paper from 2013, heavily quoted in the popular press: Evidence on the impact of sustained exposure to air pollution on life expectancy from China's Huai River policy by Yuyu Chen, Avraham Ebenstein, Michael Greenstone, and Hongbin Li. The abstract says: "the results indicate that life expectancies are about 5.5 y (95% CI: 0.8, 10.2) lower in the north owing to an increased incidence of cardiorespiratory mortality." The only hint of how weakly supported the claim is comes from the large width of the confidence interval, but read Andrew Gelman's takedown to fully understand how crappy it is.

If you want something medical, this is an older link about scientists misleading journalists.

2014 in Review

A bit past due, but I like to do these year-in-review posts, even if just for myself. A lot of the time it seems that I spend too much time on absurd tasks like formatting text or figuring out some silly web application or file format, so it's nice to see that it does come to something tangible in the end.

Papers

2014 was slowish, compared to 2013, but I have a few big things in the pipeline (one accepted, another in minor revisions, and the second edition of our book incorporating all errata and some other improvements is almost done).

Still, several projects I was involved with finally got their papers out in 2014:

  1. Metagenomic insights into the human gut resistome and the forces that shape it by Kristoffer Forslund, Shinichi Sunagawa, Luis Pedro Coelho, and Peer Bork in Bioessays (2014). DOI:10.1002/bies.201300143

We looked at the presence of antibiotic resistance potential and how it correlates with animal antibiotic usage across countries.

  2. Trypanosoma brucei histone H1 inhibits RNA polymerase I transcription and is important for parasite fitness in vivo by Pena AC, Pimentel MR, Manso H, Vaz-Drago R, Neves D, Aresta-Branco F, Ferreira FR, Guegan F, Coelho LP, Carmo-Fonseca M, Barbosa-Morais NL, Figueiredo LM in Mol Microbiol, 2014 Jun 19. DOI:10.1111/mmi.12677
  3. A community effort to assess and improve drug sensitivity prediction algorithms by James C Costello, Laura M Heiser, Elisabeth Georgii, Mehmet Gönen, Michael P Menden, Nicholas J Wang, Mukesh Bansal, Muhammad Ammad-ud-din, Petteri Hintsanen, Suleiman A Khan, John-Patrick Mpindi, Olli Kallioniemi, Antti Honkela, Tero Aittokallio, Krister Wennerberg, NCI DREAM Community, James J Collins, Dan Gallahan, Dinah Singer, Julio Saez-Rodriguez, Samuel Kaski, Joe W Gray & Gustavo Stolovitzky in Nature Biotechnology. DOI:10.1038/nbt.2877
  4. A community computational challenge to predict the activity of pairs of compounds by Mukesh Bansal, Jichen Yang, Charles Karan, Michael P Menden, James C Costello, Hao Tang, Guanghua Xiao, Yajuan Li, Jeffrey Allen, Rui Zhong, Beibei Chen, Minsoo Kim, Tao Wang, Laura M Heiser, Ronald Realubit, Michela Mattioli, Mariano J Alvarez, Yao Shen, NCI-DREAM Community, Daniel Gallahan, Dinah Singer, Julio Saez-Rodriguez, Yang Xie, Gustavo Stolovitzky & Andrea Califano in Nature Biotechnology. DOI:10.1038/nbt.3052

Blogposts

The most read blog post during 2014 was [Why Python is Better than Matlab for Scientific Software](https://metarabbit.wordpress.com/2013/10/18/why-python-is-better-than-matlab-for-scientific-software/).

For the posts written in 2014, the most read was this comment on modernity, which had a less-read follow-up.

The post that was hardest to write was the one about an academic dry spell.

Other

I did a bit of traveling, giving talks in Leuven, Paris, and San Sebastian, and teaching Software Carpentry in Denmark, Cyprus, and Jordan. San Sebastian was a fantastic place (I had never really gotten the appeal of the independently wealthy life until I visited San Sebastian).

Another interesting thing I did was a webcast on linear regression, including the funky kind, in Python. It was an interesting experience; maybe I'll do it again.

Most importantly

Normally, I try to leave my personal life out of this blog (and the public internets), but 2014 was also remarkable for the birth of our second daughter, Sarah.

Computers are better at assessing personality than people?

So claims a new study. At first, I thought this would be due to artificial measurements and scales. That is, if you ask a random person to rate their friends on a 1-10 “openness to experience” scale, they might not know what a 7 actually means once you compare across the whole of the population. However, computers still did slightly better at predicting things like “field of study”.

Given the number of researcher degrees of freedom (note that some of the results are presented for "compound variables" instead of measured responses), I think the safe conclusion is that computers are as bad as people at reading other humans.

The Ecosystem of Unix and the Difficulty of Teaching It

Plos One published an awful paper comparing Word vs LaTeX where the task was to copy down a piece of text. Because Word users did better than LaTeX users at this task, the authors conclude that Word is more efficient.

First of all, this fits perfectly with my experience: Word [1] is faster for single-page documents where I don't care about precise formatting, such as a letter. It says nothing about how Word performs on large documents which are edited over months (or years). The typical Word failure modes are "you paste some text here and your image placement is now screwed up seven pages down" or "how do I copy & paste between these two documents without messing up the formatting?" These do not happen so much with a single-page document.

Of course, the authors are not happy with the conclusion that Word is better for copying down a short piece of predefined text and instead generalize to “that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems.” This is a general failure mode of psychological research: here is a simple, well-defined experimental result in a very artificial setting. Now, let me completely overgeneralize to the real world. The authors of the paper actually say this in their defense: “We understand that people who are not familiar with the experimental methods of psychology (and usability testing) are surprised about our decision to use predefined texts.” That is to say, in our discipline, we are always sort of sloppy, but reviewers in the discipline do the same, so it’s fine.

§

Now, why waste time bashing a Plos One paper in usability research?

Because one interesting aspect of the discussion is that several people have pointed out that Word is better for collaboration because of the Track Changes feature. For many of us, this is laughable, because one of the large advantages of LaTeX is that you can use version control on the files. You can easily compare the text written today with a version from two months ago, it is easier to have multiple people working at the same time, &c.[2] In Word, using Track Changes is still "pass the baton" collaboration, whereby you email stuff around and say "now it's your turn to edit it" [3].

However, this is only valid if you know how to use version control. In that case, it is clear that using a text-based format is a good idea and that it makes collaboration easier. In the same way, I actually think that part of the trouble some of the test subjects in the paper had with LaTeX was simply that they did not use an editor with a spell-checker.

The underlying concept is that LaTeX works in an ecosystem of tools working together, which is a concept that we do not, in general, teach people. I have been involved with Software Carpentry, and even before that I was trying to teach people who are not trained in computing about these sorts of tools, but we do not do that great a job of teaching this concept of the ecosystem. It is abstract, and it is not immediately clear to students why it is useful.

Spending a few hours going through the basic Unix commands seems like a brain-dead activity when people cannot connect this to their other knowledge or pressing needs.

On the other hand, it is very frustrating when somebody comes to me with a problem they have been struggling with for days and, in a minute, I can give them a solution because it’s often “oh, you can grep in extended mode and pipe it to gawk” (or worse, before they finish the description, I’ll say “run dos2unix and it will fix it” or “the problem you are describing is the exact use case of this excellent Python package, so you don’t need to code it from scratch”). Then they ask “how could I learn that? Is there a book/course?” and I just don’t have an answer better than “do this for 10 years and you’ll slowly get it”.

It’s hard to teach the whole ecosystem at once, which means that it’s hard to teach the abstractions behind it. Or maybe, I just have not yet figured out how it would be possible.

§

Finally, let me just remark that LaTeX is a particularly crappy piece of software. It is so incredibly bad that it only survives because the alternatives manage to be even worse. It's even sadder when you realise that LaTeX is now over 30 years old, while Word is an imitation of even older technology. We still have not been able to come up with something that is clearly better.

§

This flawed paper probably had better altmetrics than anything I’ll ever write in science, again showing what a bad idea altmetrics are.

[1] feel free to read “Word or Word-like software” in this and subsequent sentences. I actually often use Google Docs nowadays.
[2] Latexdiff is also pretty helpful in generating diffed versions.
[3] Actually, for collaboration, the Google Docs model is vastly superior as you don’t have to email back-n-forth. It also includes a bit of version control.

New Year Links

1. Excellent Ken Regan article on chess:

László Mérő, in his 1990 book Ways of Thinking, called the number of class units from a typical beginning adult player to the human world champion the depth of a game.

Tic-tac-toe may have a depth of 1: if you assume a beginner knows to block an immediate threat of three-in-a-row but plays randomly otherwise, then you can score over 75% by taking a corner when you go first and grifting a few games when you go second. Another school-recess game, dots-and-boxes, is evidently deeper. […]

This gave chess a depth of 11 class units up to 2800, which was world champion Garry Kasparov's rating in 1990. If I recall correctly, checkers (8×8) and backgammon had depth 10 while bridge tied chess at 11, but Shogi scored 14 and yet was dwarfed by Japan's main head game, Go, at 25.

2. The Indian government blocked GitHub. Yep, the government there is stupid.