I do not mean to say the scientific paper of the future should be a PDF, I just mean that it will mostly likely be a PDF or some PDF-derived format. By future, I mean around 2040 (so, in 20-25 years).
I just read James Somers in the Atlantic, arguing that The Scientific Paper Is Obsolete (Here’s what’s next). In that article, he touts Mathematica notebooks as a model of what should be done and Jupyter as the current embodiment of this concept.
I will note that Mathematica came out in 1988 (a good 5 years before the PDF format) and has yet failed to take the world by storm (the article claims that “the program soon became as ubiquitous as Microsoft Word”, a claim which is really hard to reconcile with reality). Perhaps Mathematica was held back because it’s expensive and closed source (but so is Microsoft Word, and Word has taken the world by storm).
How long did it take to get to HTML papers?
For a very long time, the future of the scientific paper was going to be some smart version of HTML. We did eventually get to the point where most journals have decent HTML versions of their papers, but it’s mostly dumb HTML.
As far as I can tell, none of the ideas of having a semantically annotated paper panned out. About 10 years ago, the semantic web was going to revolutionize science. That didn’t happen and it’s even been a while since I heard someone arguing that that would be the future of the scientific paper.
Tools like Read Cube or Paperpile still parse the PDFs and try to infer what’s going on instead of relying on fancy semantic annotations.
What about future proofing the system?
About a week ago, I tweeted:
This is about a paper which is now in press. It’s embargoed, but I’ll post about it when it comes out in 2 weeks.
I have complained before about the lack of backwards compatibility in the Python ecosystem. I can open and print a PDF from 20 years ago (or a PostScript file from the early 1980s) without any issues, but I have trouble running a notebook from last year.
At this point, someone will say docker! and, yes, I can build a docker image (or virtual machine) with all my dependencies and freeze that, but who can commit to hosting/running these over a long period? What about the fact that even tech-savvy people struggle to keep all these things properly organized? I can barely get co-authors to move beyond the “let’s email Word files back and forth.”
With less technical co-authors, can you really imagine them downloading a docker container and properly mounting all the filesystems with OverlayFS to send me back edits? Sure, there are a bunch of cool startups with nicer interfaces, but will they be here in 2 years (let alone 20)?
Is it even a good idea to have the presentation of the results mixed with their computation?
I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.
There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?
Similarly, all the file paths and the 5 arguments you need to pass to pandas.read_table to read the data correctly: why should we care when we are just trying to get the gist of the results? One of our goals in NGLess is to try to separate some of this accidental complexity from the main processing pipeline, but this also limits what we can do with it (this is the tradeoff, it’s a domain specific tool; it’s hard to achieve the same with a general purpose tool like Jupyter/Python).
I do really like Jupyter for tutorials as the mix of code and text are a good fit. I will work to make sure I have something good for the students, but I don’t particularly enjoy working with the notebook interface, so I need to be convinced before I jump on the bandwagon more generally.
§
I actually do think that the future will eventually be some sort of smarter thing than simulated paper, but I also think that (1) it will take much longer than 20 years and (2) it probably won’t be Jupyter getting us there. It’s a neat tool for many things, but it’s not a PDF killer.