If you're going to do good science, release the computer code too

lutorm · on Feb 8, 2010

I am not surprised that people find errors in code written by researchers and grad students who have little training in software development and, perhaps more importantly, are doing so in a culture which values them writing papers, not good code. (See for example http://lanl.arxiv.org/abs/0903.3971 for a discussion of this situation in astronomy/astrophysics.)

I find it much more surprising that professionally developed software used for scientific research is also error ridden. And while it might be difficult to convince individual researchers to release their code, that's nothing compared to the difficulties of convincing Wolfram research to release the source code to Mathematica...

But I do think that research is somewhat undeservedly singled out for this, just because some academic software is open for inspection. Like the article mentions, it certainly seems like the financial software has caused a lot of badness. How about flight control software used by NASA that crashed the Mars orbiter? Who knows how many innocent lives have been lost due to software errors in military systems like UAVs and missiles. Maybe none, but we can't know because it's all secret. Shouldn't they be required to show their code, too?

smallblacksun · on Feb 8, 2010

But the military and NASA don't claim to be generating reproducible knowledge through the use of their code. In particular, the military doesn't WANT other people to be able to reproduce what their code does. Also, there is a difference between operational code (code that runs a physical object like a lander or a UAV) and analytical code. NASA makes some of their code available here: http://opensource.arc.nasa.gov

lutorm · on Feb 8, 2010

Cool, I didn't know about the NASA open source project.

You are right that knowledge production isn't the purpose of those other entities, of course. However, in my mind the purpose is less important than the outcome -- why is it more harmful to society if scientists produce a flawed scientific result than if the military kills innocents or the financial sector brings on a market crash because of flawed models? They all hurt society and could all benefit from more scrutiny. I admit the military case is a stretch, but certainly the financial sector seems like a relevant example.

DaniFong · on Feb 9, 2010

Mathematica probably has 3 orders of magnitude more code coverage than code underlying scientific papers. What's more, WRI encourages people to publish interactive Mathematica notebooks as a kind of literate programming that allows people to run code as well as read and view the results. I think it's an overall net positive.

cgranade · on Feb 9, 2010

That is an overall net positive, but I still think that, in order to be fully compatible with the ideals of academic freedom and reproducibility, papers should depend largely (ideally, only) on code that is fully open source. I recognize that this may not be compatible with the interests of companies like Wolfram and Math Works, but I think it is in the best long-term interests of research as a whole.

DaniFong · on Feb 9, 2010

I think so too. Patrick Collison and I and likely many others have ideas for an open source Mathematica competitor in idea stages and design, and in the back of our minds. But it's a herculean task. Maybe one day we will get to it.

cgranade · on Feb 11, 2010

It is a Herculean task, but that's part of the miracle and promise of open-source: you don't have to do it alone. To address the lack of a good MATLAB-style science oriented IDE, I started a project (http://github.com/cgranade/scicore) today to try and attract more intelligent coders than myself to help. As others have pointed out, SAGE (http://www.sagemath.org/) is a good effort at an open-source Mathematica competitor, even if I feel it lacks something in the UI department. Point being, it's possible to get a project going that becomes larger than yourself using an open model.

jackfoxy · on Feb 8, 2010

If science is to remain science, and not devolve into mysticism, data and computer models must be available to other researchers in order to repeat experiments and provide knowledgeable criticism. Calling anything "settled science" which is not openly available to all researchers is not scientific.

kurtosis · on Feb 8, 2010

I have no beef with open audits of published science that is used in decisions of economic consequence.

But I would only add that sometimes you learn a lot more from trying to reproduce a result without the code/schematics of the original experiment. If you implement it yourself and get a different answer, you should publish it and not bias yourself by paying too much attention to the original authors interpretation. As long as you can justify your methods you should be fine.

Also, I feel that it's a lot more fun to design an experiment knowing that it's possible than it is to merely copy someone else's published procedure. A month in the lab spares you a day in the library!

regularfry · on Feb 8, 2010

A sound idea.

While I can imagine any number of reasons people might post facto not wish to release code, if it were developed from the start with the intention of releasing it, I think we'd all benefit.

Inevitably, the cost of doing so would increase the cost of the research, but I believe it would be worth it.

anamax · on Feb 8, 2010

> Inevitably, the cost of doing so would increase the cost of the research, but I believe it would be worth it.

I'm not convinced that it would increase costs.

I'll bet that there's a lot of reinvented code in science. If every project released their code, new projects would start reusing code from current projects. In some cases, that sharing and reuse would reduce costs.

JunkDNA · on Feb 8, 2010

I have seen code reinvention in my career a number of times. In one instance, I was actually asked to code up a method where the code and method had been published in a scientific journal. When I asked why I should implement this on my own, instead of using code developed by the group who published the method, I was told, "Because you can't trust anyone else's code. It's better to write everything from scratch so you know it's right".

I don't personally have the hubris to think I can code up a method better than the people who invented it in the first place. That aside, it's just so wasteful.

So instead of spending time on novel work we were doing, I spent a month implementing a half-baked version of something other people had done.

btilly · on Feb 8, 2010

As silly as the explanation was, there is actually a good reason to re-implement. And that is that if nobody does, then any bug in the original code will survive to cause problems with nobody knows how many results before anyone catches the bug.

Reimplementing from scratch then comparing with the original gives an opportunity to find such bugs.

barrkel · on Feb 8, 2010

Yes, but such arguments apply at different levels of abstraction.

I doubt one would rewrite the OS, compiler or runtime libraries because they couldn't be trusted; though all these can also have bugs.

btilly · on Feb 8, 2010

One would probably not rewrite them. However people both can and do take their software and run it on a different operating system, compiled with a different compiler, linked with different run-time libraries, on a different type of hardware. And yes, I've seen bad software assumptions flushed out by doing so. (Don't use floating point for complex financial calculations please. OK??)

philwelch · on Feb 8, 2010

Not just bugs, but also ambiguities in specification, which helps us to refine our higher level specifications.

RyanMcGreal · on Feb 8, 2010

> I was told, "Because you can't trust anyone else's code. It's better to write everything from scratch so you know it's right".

I'm reminded of ESR's paraphrasing of Linus' Law regarding eyeball count and bug depth. Ten haphazard proprietary implementations are not more reliable than one battle-tested open implementation.

tensor · on Feb 8, 2010

Given that this is a discussion about science, I have to point out that this "law" you quote is merely a hypothesis.

Lewisham · on Feb 8, 2010

It's surprising how few Computer Science papers release code as well. I don't care if it's platform-specific and it requires ridiculous numbers of obscure libraries and only operates on proprietary data that you can't release. I don't care, I want the code to be open-source. I want to see what you did, and whether I believe that it does what you claim it does in the paper.

Where possible, I open-source everything I try to be published. There's only one project I haven't (a scraper for the WoW Armory), but even then I released the library I built for it.

There's no excuse to not do so. Unless you have something to hide.

lutorm · on Feb 8, 2010

There's no excuse to not do so. Unless you have something to hide.

Not true, for the same reason that commercial ventures don't like to release source code even if they don't have something to hide.

Having a capable computer code can be a substantial competitive advantage and make it possible to do studies no one else can. While this is less than desirable from the standpoint of science, it's perfectly understandable given the career pressures that individual scientists operate under.

j_baker · on Feb 8, 2010

This creates a conflict of interests though. Is the research legit or has it been "enhanced" to help a business venture the researcher has in the works?

lutorm · on Feb 8, 2010

Oh, for sure. But I wasn't even talking about any business ventures (those are rare in astrophysics...) but more about keeping your code under wraps to prevent others from benefiting from your hard work. Especially, when (as I said in another post), code development is not especially beneficial for your career.

Though it's hard to find a situation where people don't have a (short-term) incentive to make their work look good. One can hope it will catch up with them in the long run, but more likely by then they have a new job (and, in academics, tenure) that will never hear about their past shoddy work.

btilly · on Feb 8, 2010

The solution is to make peer reviewed code produced for a paper be considered equivalent to a paper in tenure decisions. And for all papers in peer reviewed journals that do computer analysis to be backed up by peer reviewed, published code.

That makes code development beneficial for your career, gives an incentive to not keep it under wraps, improves quality, and is likely to reduce the number of published incorrect results.

Of course that is a pipe dream at this point, but what's wrong with dreaming?

swolchok · on Feb 8, 2010

I haven't released source for either of the projects I've released so far in graduate school because they are attack projects that demonstrate security flaws. It is not so clear that there is "no excuse" not to release them.

Another reason not to release source code is that there might be obvious follow-on work and you want to publish that paper too, rather than help someone else scoop you by giving them your tools.

j_baker · on Feb 8, 2010

I could be wrong, but I believe universities share some of the blame. It seems as though most of them are more interested in turning the research into a profit than they're concerned with doing good science.

maurycy · on Feb 8, 2010

Finally. Finally a discussion about this.

timr · on Feb 8, 2010

Enough with the false melodrama, please. Aside from the fact that your comment is content-free and inane, scientists have been discussing this subject since computer simulation first became a part of science. A lot of scientists do share their code (I'm one of them, and I believe in sharing code). But there are good arguments on the other side. Among them:

1) Papers describe methods in enough detail to reproduce them. If they don't, there's a serious problem.

2) Independent lines of verification. If simulation code becomes a reference, it's inevitable that the same bugs/bad assumptions will contaminate an entire field. Independent re-implementation of the same algorithms is a strong hedge against this phenomenon (even if it means that there are more bugs overall).

3) Money. A lot of scientists fund their research in part through licensing of implementations of their algorithms. I don't like it, but until someone gets around to repealing Bayh-Dole (a real scientific travesty, IMO), this is going to continue to be a problem.

In short, what you really meant to say was that finally someone wrote a newspaper article about this subject. It's not a new discussion.

DaniFong · on Feb 8, 2010

Closed academic publishing is intellectually bankrupt, and is probably one of the greatest problems effecting research today. People don't share code, and put a paywall between themselves and the public. There are open journals, but they are rarely as prestigious, and so are not as valuable to those seeking tenure. These academics put tenure before fruitful scientific discussion.

michael_nielsen · on Feb 8, 2010

I agree in general, but wanted to add one small caveat that I think is interesting: PLoS Biology is (a) open access, with a Creative Commons license; and (b) has rapidly established a reputation as one of the top journals in biology.

lutorm · on Feb 9, 2010

So would you rather people publish in "low-impact" journals and then leave science completely because they can't get a permanent job?

"Intellectually bankrupt" is a pretty strong term to use for people who work for a small fraction of the amount of money normally talked about on this site.

I'm not saying there aren't issues, but blaming the individuals who are trying to make a living by doing science isn't going to help. The success rate of getting permanent jobs in science might be higher than that of startups, but the "payoff" is a small fraction.

DaniFong · on Feb 9, 2010

I have not left science completely: I've made my own job. It is possible but it is only made harder because of the closed system.

There are many of us who've left academia and still do science. We're generally maligned, and removed from the ability to even participate in a discussion due to a variety of academic access restrictions, and why?

What's more, day by day people are showing how to achieve scientific credibility and influence through their blogs and paper hosting services like ArXiv or, as Michael Nielsen points out, open journals like PLoS Biology. The majority of scientists still bow to tenure pressure, and frankly I don't understand why. There are other opportunities if you want to gain status, and one doesn't even have to gain traditional academic status if one wants to do real science. There are other options.

lutorm · on Feb 11, 2010

Which academic access restrictions are you talking about? I know people who have started independent "institutes" but the only reason you need to do so is to receive federal funding. It's true that if you brand yourself as an "independent researcher", people might be inclined to think you are a crackpot, but publishing real papers should take care of that.

I'm not sure blogs are a relevant source for scientific studies though. Not necessarily because I think peer review is the greatest system, but having your paper published in an actual journal (open journals are fine) at least means you managed to convince a few other people that it's worth looking at the paper.

Retric · on Feb 8, 2010

A respected journal needs to be able to pay for someone to review a paper before publication. I suspect that setting up a foundation to do this for free would be a great charity, but without some sort of backing you can't create high quality.

michael_nielsen · on Feb 8, 2010

In nearly all fields, referees of scientific papers are not paid. Referees of scientific books may be paid a small honorarium, but, compared with consulting, it's a pittance.

Background experience: I refereed somewhat over 100 papers and perhaps a dozen or so books during my career as a physicist. My work now overlaps with the scientific publishing industry more broadly.

Retric · on Feb 8, 2010

It is my understanding that referees are sometimes paid, but editors rarely work for free. (Using http://en.wikipedia.org/wiki/Peer_review version of editors.) It's a lot of work to find out who is best to referee a given work and to track down conflicts of interest etc.

michael_nielsen · on Feb 8, 2010

With editors the issue of payment depends on the journal. Some journals employ a staff of professional editors. Others recruit tenured scientists who do the editorial work for an honorarium that seems tiny, considering the amount of work involved. I don't know which model is more widely used - I can think offhand of many journals of both types, but don't recall ever having seen statistics.

lolcraft · on Feb 8, 2010

I'll add another argument for not releasing scientific code, specially of a matter so polarizing as climate change: programmers aren't scientists. We aren't familiar with neither the field's literature nor the science being used. For example, the famous comment from the CRU code said:

; Plots 24 yearly maps of calibrated (PCR-infilled or not) MXD reconstructions ; of growing season temperatures. Uses “corrected” MXD – but shouldn’t usually ; plot past 1960 because these will be artificially adjusted to look closer to ; the real temperatures.

Do you expect the average programmer to take a 2 year course in climate science to truly understand what MXD means instead of instantly running around screaming "wolf!"? If you did so, be disappointed (a careful reading of Google results for "CRU code" might be enlightening).

jgrahamc · on Feb 8, 2010

That seems like a bogus reason. Some people who had already decided that climate change was fraud decided to read comments in the code and interpret them to fit their worldview.

Other people read the CRU code and made intelligent comments about the code.

matthewmarkus · on Feb 9, 2010

Hear, hear! I did a whole series of posts analyzing a skeptic's claims:

http://matthewmarkus.posterous.com/raw-darwin-airport-temper... http://matthewmarkus.posterous.com/darwin-airport-temperatur... http://matthewmarkus.posterous.com/darwin-airport-temperatur...

I found at least one error/discrepancy in each side's output. Of course, neither side released any code. The disclosure of methods is a prerequisite for repeatable experiments, the cornerstone of science.

paulkirk · on Feb 8, 2010

Those are bad reasons on all counts:

1) Code is notoriosly difficult to describe in a natural language, so it's unlikely papers contain enough detail to replicate a complex program. 2) People can still reimplement the model even when it's available. In fact it's going to be more likely that non-scientists join the convesation (which will improve the abysmall public perception of science, especially climate science). 3) Making money is great for an inventor and/or startup founder. You can still sell implementations, but they can't be trusted until they have been run on a large number of independent computers. Until then, it's not science.

merraksh · on Feb 8, 2010

There are a few examples of how this can be done. One of them is Mathematical Programming Computation (MPC), a journal where articles submitted must be accompanied by the source code that was used to produce the results. The article is peer-reviewed, and the code submitted is tested by "technical editors" to verify that the results are correct. See http://mpc.zib.de

moron4hire · on Feb 8, 2010

Opening the source for research software is absolutely vital to the concept of reproduceability. However, this fact of the level of programming training for most scientists is a major issue. A lot of novice programmers tend to fall into a trap of "it runs without error, it must be right." Even expert programmers struggle with verifying that their results are correct; technically, program verification is a mathematical impossibility. So it's a daunting task to start with, reproducing results of software-based research.

This is only compounded by the fact that reading source code sucks. Source code is an end result of multiple processes that occur in feedback loops. With just the source code, you never see how the code got that way. It's like showing someone a maze with the start and end points marked but the middle of the map blocked out.

Different programmer's conceptions of what constitutes good code varies widely. One man's golden code is another's garbage. Just because the source code is available doesn't mean anyone is going to understand it or be able to work with it effectively.

Compounding this all is the fact that few people are going to want to read the source code. Analyzing source code is dull work, maybe the worst job a programmer can take while still doing programming. Most programmers are far happier to discard old code and start from scratch. This is often a bad idea and doesn't lead to a better product, but at least you don't want to kill yourself while you're doing it.

When it comes to reproducing algorithmic results, I would prefer having a description of the algorithm, a set of inputs, and a set of outputs. I would then write the actual code myself and see if I get the same results. This, I think, is much closer to the concept of reproducing lab results in the physical sciences. You wouldn't use the same exact particle accelerators if you were verifying the results from a paper on nuclear physics. I'm afraid having access to the raw source code will be used as a crutch where logic errors are missed from reusing portions of code without much thought about the consequences. Take, for instance, the subtle differences in implementations of the modulo operator across programming languages: http://en.wikipedia.org/wiki/Modulo_operator#Common_pitfalls

It would be great if scientific software were open. Unfortunately, it won't matter a lick if it is.

jgrahamc · on Feb 8, 2010

Yes, tell me about it: http://www.jgc.org/blog/2010/02/something-odd-in-crutem3-sta...

eshi · on Feb 8, 2010

I might be alone in this, but this seems like a symptom of the problems of IP laws.

artsrc · on Feb 8, 2010

One problem with IP laws is that to fully enforce them you need a police state.

I don't know precisely what you are thinking, but my view is that the IP framework should be: For a published work to be eligible for copyright, source code must be published. Something like a cross between github and the library of congress.

Publishing source code does not currently relinquish all rights. This would add greatly to our societies store of knowledge and would help prevent the IP theft in the code of published works.

eshi · on Feb 9, 2010

This is sort of what I was getting at. I agree that releasing source code shouldn't be a matter of giving up property rights. In fact, plenty of commercial systems and software do allow source code access. However, it always seems to be through messy licenses and cumbersome legal agreements to not divulge anything.

As it stands, companies seem more motivated to protect their IP rights than to produce tools that would keep science reliable. IMHO, companies view source code as the product of their investments and secrets worth protecting. The main fear seems to be that if these secrets are published competition could use them against them by receiving a boost in their own R&D efforts by deriving methods and processes from their own work.

This doesn't seem like just a software problem since I've heard wetware horror stories from biotech and agriculture folks.

It honestly makes me wonder if software should be something you can patent. At some level, it seems disturbingly similar to companies that patent colors, genes, or derived living organisms.

albertcardona · on Feb 8, 2010

The title contains the reason on why we created Fiji (http://pacific.mpi-cbg.de): so that instead of releasing a Matlab script without documentation on its many parameters and exact Matlab version used, as a print out (or nowadays, downloadable .m file as supplementary material), we could offer instead a ready-downloadable, version-controlled and fully working program.

A colleague of mine made similar remarks recently:

"... if you can’t see the code of a piece of ... software, then you cannot say what the software really does, and this is not scientific."