Re: What is a "big job" in the industry?

I don't know about the industry, but my Master's project involves visualizing LC-MS (Liquid Column Mass Spectrometry) data. The raw data, once converted from the proprietary format, is a 130 megabyte XML file, containing 128 "scans", each of which contains roughly 200,000 pairs of (mass, intensity) pairs, called peaks (both floating point numbers). Relax, the pairs are not stored inside verbose XML tags, but the actual data is a big Base64 encoded, packed string.

I don't luckily have to use the raw data, as the proprietary conversion tool is able to produce centroided data (which, by the way, reduces the dataset size by four orders of magnitude). However, I used the raw data when I tested parsing the XML file -- a great way to see which part of your code scales up and which assumes that the data is not big.

I have to parse the XML file in two passes. On the first pass, the metadata is read in. On the second pass, byte offsets to the beginning of the actual Base64 encoded data blocks and their lengths are saved in memory. Later, one can read in individual scans from the file by simply seeking to the position and reading that many bytes. Initially, I had the parser (which, by the way, is XML::Twig) do both in one pass, but this was horribly slow for some reason. I suspect XML::Twig tried parsing the Base64 encoded data as XML, because with two passes, I can simply make it skip the data in both. Could be a user error.

Although it's not that big a file, parsing it takes half a minute, and reading in the peak data takes five or six minutes. I'm storing the peaks in piddles, so they don't take much more space in memory than on disk. Then, in the visualizing phase, it takes seven or eight minutes to thread over the 128 piddles and plot them. This is on Athlon XP 2600+ with half a gigabyte of memory. This is a bit of a pain, as the application area requires you to be able to zoom in to the data, and waiting eight minutes between zooms is not something the casual user is going to do, unless he's smoking weed.

I suspect I could squeeze that to four minutes by using C, but currently there's not much need for that, as the centroided data is good enough, and I'm already using PDL as much as I can (i.e. in tight, implicit loops).

--
print "Just Another Perl Adept\n";

Comment on Re: What is a "big job" in the industry?


Pathologically Eclectic Rubbish Lister
	PerlMonks