Friday, March 12, 2010

The Case For Raw Data And Source Code

There's been a bit of chatter about raw data and source code in science on a few of the climate blogs lately. The issue has tended to be framed as "raw data" vs "anomalies". I think that's a fairly poor way to frame it, because anyone using anomalies can also use raw data, and vice versa.

Bottom line, I think it's extremely important for scientists to make all their raw data and computer source code available to the public. The rest of this post will present a few reasons why.

The Computer Is The Science
Computers increasingly play a central role in science. They don't just tabulate results and draw graphs. They are actually the tool used to perform measurements and experiments and to even define the physics used to conduct the experiments.

String Theory, for example, has very little relationship to the real world from an experimental point of view. Most of String Theory is code running on a computer. Climate models, too, are computer programs that define a physics and execute entirely on a computer.

I'm not trying to make an argument over whether or not doing science on a computer is good or bad. I'm merely pointing out that it's done, and it's rather common.

Point 1: Verifiability
A consequence of having your physics defined by code and your universe defined by data is that the experiments run on a computer cannot be verified without the code and data. In order to have verifiability, a fundamental goal of science, the code and data must be available. I think this is obvious, but I want to present an example of unverifiable science that was just revealed on this blog.

The Aqua satellite has an instrument called an AMSU that scans the Earth, detecting 15 different frequencies of microwave radiation. Each of these detectors is called a channel. It turns out that the hardware for channel 4 failed in late 2007, and since that time NASA, which owns the Aqua satellite, has been producing the data for channel 4 not from readings detected by the instrument, but from computer code and lookup data.

They actually have good reasons why this should work. They've performed tests using their code and lookup data, and the results match reality very well. The problem is that the code and lookup data used to create this synthetic channel 4 data are not available outside NASA's JPL lab. This means that not only has no one from outside JPL ever verified their claims that the procedure is sound, no one outside JPL can verify those claims. We (no matter who "we" are) have to take it on good faith that the data NASA creates for channel 4 is realistic. It cannot be demonstrated to be realistic.

Clearly, this is not science. To get science back on the track towards verifiability in the modern world, open access to all data and computer code is needed.

Point 2: Data Compatibility
I think this second point applies more to climate science in particular than science as a whole. It has to do with the concept of processing temperature data into anomalies. There are very good reasons why scientists convert temperature readings into anomalies. If done with care, anomalies can be used to meaningfully compare two different sets of data, for example.

Anomalies also have the nice feature of tracking changes in temperature in a way that's not dependent on the temperature that came just before it. For example, a July anomaly of, say, +0.2 doesn't mean that July was 0.2 degrees hotter than the previous June. It means it was 0.2 degrees hotter than a reference July, normally the average July over some baseline period. All by itself, this use of anomalies filters the seasonal cycle out of the data. We're not interested in learning if it gets hotter as we move into summer. We already know it usually does. We're interested in learning if the trend over the years is going up or down. And that's what anomalies tell us.
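To make that concrete, here's a minimal sketch of one common way to compute monthly anomalies. It assumes the baseline for each calendar month is simply the mean of that month over a chosen set of years. The temperatures, the baseline years, and the function names are all made up for illustration and don't come from any real data set.

```python
from statistics import mean

# Hypothetical monthly mean temperatures (deg C) for one station.
# These numbers are invented purely for illustration.
temps = {
    2007: {"Jun": 20.0, "Jul": 23.0},
    2008: {"Jun": 20.4, "Jul": 23.1},
    2009: {"Jun": 19.9, "Jul": 23.5},
}


def monthly_baseline(temps, years):
    """Average each calendar month over the chosen baseline years."""
    months = temps[years[0]].keys()
    return {m: mean(temps[y][m] for y in years) for m in months}


def anomalies(temps, baseline):
    """Subtract each month's baseline from each reading of that month."""
    return {
        year: {m: round(t - baseline[m], 2) for m, t in by_month.items()}
        for year, by_month in temps.items()
    }


# Assumption: the baseline period is all three years in the table.
baseline = monthly_baseline(temps, [2007, 2008, 2009])
anoms = anomalies(temps, baseline)

for year in sorted(anoms):
    print(year, anoms[year])
```

In this toy data the raw temperature jumps by several degrees from June to July every year, while the anomalies stay within a few tenths of a degree of zero. The seasonal swing is gone and only the year-to-year differences are left.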

But it's difficult to compare data in anomaly form to data that's not in anomaly form. And most climate data (CO2, water vapor, sunspots, cosmic rays, and so on) are not in anomaly form. So if we have a July that's cooler than the previous July but warmer than the June just before it, the July anomaly goes down, whereas the July temperature, and most other climate data that's correlated with temperature, goes up.
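Here's a small sketch of that mismatch, using invented numbers and an assumed July baseline of 23.2 degrees. Nothing here is real data. It only shows that the raw series and the anomaly series can move in opposite directions for the same month.

```python
# Invented readings (deg C); the baseline value is an assumption too.
june_2009 = 20.0
july_2008 = 23.5
july_2009 = 23.0
july_baseline = 23.2

# Raw view: how did the temperature move from June into July?
raw_change = july_2009 - june_2009

# Anomaly view: how did this July's anomaly move relative to last July's?
anomaly_2008 = july_2008 - july_baseline
anomaly_2009 = july_2009 - july_baseline
anomaly_change = anomaly_2009 - anomaly_2008

print("raw change is positive:", raw_change > 0)
print("anomaly change is negative:", anomaly_change < 0)
```

Anyone lining up the anomaly series against a raw series of some other variable would see the two disagree on direction here, even though both are describing the same July.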

Point 3: Errors
I'll make this last point quick, because this post is longer than I expected.

You simply cannot find errors in data that's been homogenized and processed until it no longer has its original shape. Smoothing, averaging, and so on wash away these errors, and the person using the data doesn't know they were ever there. This doesn't mean, btw, that such processes correct the errors. They simply hide them. To verify that the data is correct, both the original data and the computer code used to process it are required.
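As a rough illustration, here's a sketch with a made-up series containing one obviously bad reading (51.2 where 15.2 was presumably meant). The 5-point moving average and the simple median-based outlier check are my own choices for the example, not anyone's actual processing pipeline.

```python
from statistics import median


def moving_average(values, window):
    """Plain unweighted moving average over a sliding window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def flag_outliers(values, threshold):
    """Return indexes of values farther than `threshold` from the median."""
    med = median(values)
    return [i for i, v in enumerate(values) if abs(v - med) > threshold]


# Invented readings (deg C); index 3 is a transposed-digit typo.
raw = [15.1, 15.3, 15.2, 51.2, 15.4, 15.2, 15.3]
smoothed = moving_average(raw, 5)

print("flagged in raw:", flag_outliers(raw, 10))
print("flagged in smoothed:", flag_outliers(smoothed, 10))
print("max of smoothed:", max(smoothed))
```

The check catches the bad reading in the raw series and finds nothing in the smoothed one. The error hasn't gone away, though. It has been smeared across every smoothed value, each of which now sits several degrees too warm, with nothing left to point at.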

So these are some of the reasons I believe it's important to make all data and computer code available to the public. I want to stress again that making raw data and code available is not an argument against using the processed results on their own. There's no reason why you can't do both.
