Macabre Monks,

While I sit here waiting for my script to run, I wonder: out there in the rest of the Perl industry, what's considered a large data processing job?

My case today: I have a 650Mb flat text file with 15-million lines of 1's and 0's. I have another flat text file with 600 lines, each line with 7 positions containing 1's and 0's. Each character and position of every line of the first file has to be checked against each character and position of the second file. That is, 15-million lines of text to parse into characters, then 15-million * 600 * 7 = 6.3-billion comparisons to make, and write matches (normally around 100-million) to about 40-thousand "match position list" files on disk.

There are no patterns to look for. I have been unable to find a better method than brute force character-wise comparison. It takes about 8 hours to run.

And I've sometimes wondered - is this outrageous, or just everyday at work for some?


Forget that fear of gravity,
Get a little savagery in your life.
Max Webster

Replies are listed 'Best First'.
Re: What is a "big job" in the industry?
by Joost (Canon) on May 30, 2007 at 19:46 UTC
    Each character and position of every line of the first file has to be checked against each character and position of the second file.
    I'm sceptical about that claim. I'm guessing you won't need to check the position if the characters don't match and vice versa. That can make a LOT of difference. Especially, since the 600 lines file should easily fit in memory for fast lookups.

    Anyway. The project I'm working on now has jobs that take from about a day to several weeks to run, maybe more. Not all of it is in production now, though, and for some sub-systems, we're mostly generating data on the fly. I guess we'll generate a couple of dozens of terabytes of data. I think it's a pretty "big job" :-)

    General tips:

    1. buy a machine with plenty of ram and load & parse all the source data you'll need more than once in memory. It's amazing how much you can do with 4Gb of RAM. Especially if you can also use C/C++ (or existing C-based CPAN libraries) for tight storage and inner loops.

    2. if that won't fit, but your algorithm is going to go more-or-less sequentially through the data, cache as much of the source data as will comfortably fit in memory.

Re: What is a "big job" in the industry?
by perrin (Chancellor) on May 30, 2007 at 20:10 UTC
    Take a look at the sliding window technique. It helped me with a job that required line-by-line tests. It basically works by batching the disk I/O.
Re: What is a "big job" in the industry?
by Jenda (Abbot) on May 31, 2007 at 00:02 UTC

    1s and 0s only? Any chance the lines are of the same size? Or actually even if they are not wouldn't it be better to store the data "directly" in binary instead of as a plain text file containin just 0s, 1s and newlines. Even if the lines were not the same length you might still store the data more efficiently. Storing the line length and using one bit instead of one byte for each 0/1. Not only you'll save lots of disk space and disk access, the comparisons will likewise be much quicker, with much less data to compare.

    my $s1 = '10110101001010100011101101001010'; my $s2 = '10110101001010100011101101001011'; my $b1 = pack('b*', $s1); my $b2 = pack('b*', $s2); use Benchmark qw(timethese); timethese( 10000, { 'strings' => \&comp_strings, 'packs' => \&comp_packs, } ); sub comp_strings { for (1..1000) { $s1 eq $s1 and $s1 eq $s2 } } sub comp_packs { for (1..1000) { $b1 eq $b1 and $b1 eq $b2 } } __END__ Benchmark: timing 10000 iterations of packs, strings... packs: 4 wallclock secs ( 4.07 usr + 0.00 sys = 4.07 CPU) @ 246 +0.02/s (n=10000) strings: 5 wallclock secs ( 4.71 usr + 0.00 sys = 4.71 CPU) @ 212 +4.50/s (n=10000)

        Well the question is whether punch_card_don needs to do that. I'm not clear on that. It seems to me he's just comparing things on known positions, not searching for them, but I may easily be wrong. I have no idea whether pack()ing the data (at whatever stage of the processing or generation) can be used or whether it would help or just complicate things. I was just trying to suggest something that punch_card_don might have omited, hoping it will be of some use.

        Besides ... even if he does need to search for the substrings it might still be beneficial to pack the data for storage. He might save up to 7/8 of the disk access. Which might help even if he has to unpack the data to process them.

        I C, I suspect. Even the brute-force method of shifting a bit at a time into a register, masking, & comparing would surely beat a regexp engine. Worst case, that's a shift, bitwise or, bitwise and, & a branch if not 0. I'm sure there are much better algorithms, too.
Re: What is a "big job" in the industry?
by BrowserUk (Pope) on May 30, 2007 at 19:57 UTC
      Obviously a job for a trie! :)
Re: What is a "big job" in the industry?
by princepawn (Parson) on May 30, 2007 at 19:52 UTC
    Each character and position of every line of the first file has to be checked against each character and position of the second file.
    What do you mean by "check" ? Most of my data processing work, when this involved, is done by dumping everything into an indexed MySQL or SQLite table and then running queries to do the dirty work.

    Some top tier databases have ETL (extract, transform, load) tools that would allow you to simply specify the format of the two files and have it automatically load them for you.

    Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality
Re: What is a "big job" in the industry?
by BrowserUk (Pope) on May 30, 2007 at 22:55 UTC
Re: What is a "big job" in the industry?
by vrk (Chaplain) on May 31, 2007 at 17:01 UTC

    I don't know about the industry, but my Master's project involves visualizing LC-MS (Liquid Column Mass Spectrometry) data. The raw data, once converted from the proprietary format, is a 130 megabyte XML file, containing 128 "scans", each of which contains roughly 200,000 pairs of (mass, intensity) pairs, called peaks (both floating point numbers). Relax, the pairs are not stored inside verbose XML tags, but the actual data is a big Base64 encoded, packed string.

    I don't luckily have to use the raw data, as the proprietary conversion tool is able to produce centroided data (which, by the way, reduces the dataset size by four orders of magnitude). However, I used the raw data when I tested parsing the XML file -- a great way to see which part of your code scales up and which assumes that the data is not big.

    I have to parse the XML file in two passes. On the first pass, the metadata is read in. On the second pass, byte offsets to the beginning of the actual Base64 encoded data blocks and their lengths are saved in memory. Later, one can read in individual scans from the file by simply seeking to the position and reading that many bytes. Initially, I had the parser (which, by the way, is XML::Twig) do both in one pass, but this was horribly slow for some reason. I suspect XML::Twig tried parsing the Base64 encoded data as XML, because with two passes, I can simply make it skip the data in both. Could be a user error.

    Although it's not that big a file, parsing it takes half a minute, and reading in the peak data takes five or six minutes. I'm storing the peaks in piddles, so they don't take much more space in memory than on disk. Then, in the visualizing phase, it takes seven or eight minutes to thread over the 128 piddles and plot them. This is on Athlon XP 2600+ with half a gigabyte of memory. This is a bit of a pain, as the application area requires you to be able to zoom in to the data, and waiting eight minutes between zooms is not something the casual user is going to do, unless he's smoking weed.

    I suspect I could squeeze that to four minutes by using C, but currently there's not much need for that, as the centroided data is good enough, and I'm already using PDL as much as I can (i.e. in tight, implicit loops).

    print "Just Another Perl Adept\n";

Re: What is a "big job" in the industry?
by Moriarty (Abbot) on May 31, 2007 at 01:01 UTC

    Where I work, we only have one job that takes around 8 hours (and I'm sure that could be improved on if I had the time or inclination to do so) and it involves converting 1.6 million records into printable documents, sorting, then distributing them to various locations around the country for printing. It only happens once every quarter, so it's not a real big deal.

    I'm sure if this was based in the U.S. it would be at least 10 times bigger.

Re: What is a "big job" in the industry?
by derby (Abbot) on May 30, 2007 at 21:52 UTC

    Big? Okay .. I'd say it was big. Can the program be improved? Can't they all? Is it outrageous? Nope, not in the least.

Re: What is a "big job" in the industry?
by eric256 (Parson) on May 31, 2007 at 16:05 UTC

    I have a nightly job that runs 6-8 hours. Processing only around 2gig of data so it could probably be improved, but the main bottle neck is getting that 2gig out of a server over the internet ;) Not an ideal world, but it could be worse.

    Eric Hodges
Re: What is a "big job" in the industry?
by talexb (Canon) on Jun 06, 2007 at 02:36 UTC

    Intriguing. If you're doing some serious pattern-matching or bit-crunching, it seems you should be able to get better performance (if that's your goal -- I assume it is) from some kind of optimization, like coding part of the problem in C.

    If it's a straight character for character comparison, Perl may be great, but I bet C would be faster. Is there any way you could tantalize us with a small sample?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: What is a "big job" in the industry?
by Anonymous Monk on Jun 04, 2007 at 14:34 UTC
    I'm not certain I fully understand the problem, but this would seem to cry out for using bitvectors and bitwise operations. Then you could just:
    foreach my $bigfile_bitvector (@bigfile_bitvectors) { foreach my $smallfile_bitvector (@smallfile_bitvectors) { my $diff_bitvector = $bigfile_bitvector ^ $smallfile_bitvector; write_diffs($diff_bitvector); } }
    You'll have to craft an appropriate write_diffs function to look for the 1s in the diff bitvector and write them in your necessary format.