
Also, the data could have been collected by unsound methods in the first place, so running the same data through the same computer program won't really amount to reproducing a result.


So it appears there is a fundamental dishonesty in "publishable" science: it becomes an ego trip for some scientists who like to overstate their results. While a groundbreaking discovery is sure to be scrutinized down to the most minute detail by the community, an average "incremental science" paper will not be, and will probably be published even if it contains errors. There are many published papers with unintended but grave computational errors that render even their main findings invalid, and oftentimes they are not retracted. This attitude should change, with people admitting that sometimes mistakes are only human and that they do not diminish the contribution of the research.

Complete data sharing would be a huge leap for science: imagine all scientists digging into every experiment ever run for, say, cancer or HIV research and discovering hitherto unknown correlations, new interpretations of the data, etc. It would provide huge shortcuts, given how many experiments get essentially repeated over the years.


> it becomes an ego trip for some scientists who like to overstate their results

I think you're looking at this the wrong way - while some scientists might indeed be doing it for the "ego trip," the vast majority are just trying to avoid being swallowed whole by the vicious academic research environment. Reaching the position of tenured professor at a major research university is extremely challenging.

Now, I'm not suggesting that what they're doing is right. I'm just saying that one needs to dig further to find the root causes of these problems.


Complete data sharing is only a good idea if you wait a significant amount of time before doing so. Scientists are lazy animals, and it's much easier to data-mine than to run a new experiment; however, scientifically speaking, data-mining is literally worthless. When it becomes generally accepted, it DESTROYS the credibility of entire disciplines; see the nutrition-fad-of-the-week headlines, economics, and a host of others.

The only other practice that's almost as bad requires three separate errors: working with small sample sizes, not publishing all experiments, and accepting significant statistical noise (p > .01, I am talking to you).
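
To make that concrete, here is a rough Python sketch (every number in it is made up, it is only meant to show the shape of the problem): thousands of labs study an effect that is exactly zero, with small samples, and only the runs that clear p < .05 get "published".

    # All numbers here are hypothetical: the true effect is zero, samples are
    # small, and only "significant" runs get written up.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_labs, n_per_group = 10_000, 12                  # assumed: many labs, tiny samples

    published = []
    for _ in range(n_labs):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(0.0, 1.0, n_per_group)   # true effect is exactly zero
        if ttest_ind(treated, control).pvalue < 0.05: # loose threshold + publication filter
            published.append(abs(treated.mean() - control.mean()))

    print(f"'published': {len(published)} of {n_labs} pure-noise studies")
    print(f"mean published |effect|: {np.mean(published):.2f} standard deviations")

You should see roughly 5% of the pure-noise studies getting "published", with an average published "effect" on the order of a full standard deviation, even though the true effect is zero.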


It makes the field a lot noisier, yes, but how can empirical studies ever destroy a field? If only we had some data to talk about, or even some papers to talk about. For example, I'd rather be arguing about some papers I've read recently, but, alas, not only are they behind paywalls, they don't even have meaningful comment boards.


There are fields where well over 30% of published papers are contradicted by the next paper on the same subject. Getting into why this happens with empirical studies of large data sets is complicated, but it boils down to looking at enough things that signal becomes indistinguishable from noise. For a simpler example, assume this was actually done and they published their findings: http://xkcd.com/882/ Now assume that, other than the .05 significance threshold, their methods were impeccable: what information have you gained?

The actual probability that green jelly beans are linked to acne is still impossible to tell (http://en.wikipedia.org/wiki/Bayes%27_theorem). You might shift your expectations, but if you do, the shift is tiny, because there is so much noise.
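
To put rough numbers on how tiny that shift is (the priors here are purely hypothetical, they just let you run Bayes' theorem):

    # Assumed numbers: 1 in 1000 candidate links is real, a real link is
    # detected 80% of the time, and pure noise clears p < .05 5% of the time.
    prior = 0.001
    power = 0.80
    alpha = 0.05

    posterior = (power * prior) / (power * prior + alpha * (1 - prior))
    print(f"P(link is real | one p < .05 result) = {posterior:.3f}")   # about 0.016

Under those assumptions a single, perfectly run study at that threshold moves you from 0.1% to roughly 1.6%.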

Now fill a field with that junk and suddenly reading a paper provides very little information, which slows everything down. You can discuss such things, but it's about as meaningful as talking about who won the World Cup: http://xkcd.com/904/ Worse yet, people rarely publish negative results, which means even reading a well-done study is only meaningful if you can find some other logic to back it up. At that point it might be worth investigating, but the reason it's worth investigating is your prior expectations, which have next to nothing to do with the paper you just read; and even if you find some deep truth, the glory goes to the guy who was publishing noise.

PS: It gets worse. Because contradicting a study is worth publishing, and publishing is a numbers game, you have many people who simply reproduce research to pad their numbers and cut down on clutter. But if your tolerances are loose enough, say a .05 threshold, and you have enough random crap in the hopper, roughly one in every 400 completely random papers can survive two rounds of this, get a lot of attention, and only later be discredited.
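
The 1-in-400 figure is just the arithmetic of two independent .05-level flukes in a row:

    # Back-of-the-envelope: pure noise clears a .05 threshold once with
    # probability 0.05, and a pure-noise "replication" clears it again with 0.05.
    alpha = 0.05
    survives_both = alpha ** 2
    print(survives_both)             # 0.0025
    print(round(1 / survives_both))  # 400, i.e. about 1 in 400 random papers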

Edit: This is also why it takes a huge body of background reading and a deep understanding of statistics before you have the context to meaningfully discuss a recent paper with a scientist.


> It would provide huge shortcuts, given how many experiments get essentially repeated over the years.

Isn't that the whole point of insisting on reproducibility? Scientists are supposed to repeat experiments as many times as it takes to convince the rest of the scientific community that their results are valid. Reproducibility, not publication, is the final QA mechanism for science. Shortcuts are sometimes desirable (e.g. if people are dying right now), but they're the exception, not the norm.


I'm not talking about reproducibility as validation, but about the fact that multiple studies can be combined to discover new regularities. An example from neuroscience: hundreds of labs record from the brains of genetically similar mice doing very similar tasks under slightly different conditions, yet we have no way of looking at the raw recordings in a systematic way.



