What is a "big job" in the industry?


Pathologically Eclectic Rubbish Lister
	PerlMonks

What is a "big job" in the industry?

by punch_card_don (Curate)

on May 30, 2007 at 19:21 UTC ( [id://618277]=perlmeditation: print w/replies, xml )

Need Help??

Macabre Monks,

While I sit here waiting for my script to run, I wonder: out there in the rest of the Perl industry, what's considered a large data processing job?

My case today: I have a 650Mb flat text file with 15-million lines of 1's and 0's. I have another flat text file with 600 lines, each line with 7 positions containing 1's and 0's. Each character and position of every line of the first file has to be checked against each character and position of the second file. That is, 15-million lines of text to parse into characters, then 15-million * 600 * 7 = 6.3-billion comparisons to make, and write matches (normally around 100-million) to about 40-thousand "match position list" files on disk.

There are no patterns to look for. I have been unable to find a better method than brute force character-wise comparison. It takes about 8 hours to run.

And I've sometimes wondered - is this outrageous, or just everyday at work for some?

Thanks

Forget that fear of gravity,
Get a little savagery in your life.
Max Webster

Comment on What is a "big job" in the industry?

Replies are listed 'Best First'.

Re: What is a "big job" in the industry?
by Joost (Canon) on May 30, 2007 at 19:46 UTC

Each character and position of every line of the first file has to be checked against each character and position of the second file.

Anyway. The project I'm working on now has jobs that take from about a day to several weeks to run, maybe more. Not all of it is in production now, though, and for some sub-systems, we're mostly generating data on the fly. I guess we'll generate a couple of dozens of terabytes of data. I think it's a pretty "big job" :-)

General tips:

1. buy a machine with plenty of ram and load & parse all the source data you'll need more than once in memory. It's amazing how much you can do with 4Gb of RAM. Especially if you can also use C/C++ (or existing C-based CPAN libraries) for tight storage and inner loops.

2. if that won't fit, but your algorithm is going to go more-or-less sequentially through the data, cache as much of the source data as will comfortably fit in memory.

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Re: What is a "big job" in the industry?
by perrin (Chancellor) on May 30, 2007 at 20:10 UTC

sliding window technique

Re: What is a "big job" in the industry?
by Jenda (Abbot) on May 31, 2007 at 00:02 UTC

1s and 0s only? Any chance the lines are of the same size? Or actually even if they are not wouldn't it be better to store the data "directly" in binary instead of as a plain text file containin just 0s, 1s and newlines. Even if the lines were not the same length you might still store the data more efficiently. Storing the line length and using one bit instead of one byte for each 0/1. Not only you'll save lots of disk space and disk access, the comparisons will likewise be much quicker, with much less data to compare.

my $s1 = '10110101001010100011101101001010';
my $s2 = '10110101001010100011101101001011';

my $b1 = pack('b*', $s1);
my $b2 = pack('b*', $s2);

use Benchmark qw(timethese);

timethese( 10000, {
  'strings' => \&comp_strings,
  'packs' => \&comp_packs,
}
);

sub comp_strings {
    for (1..1000) {
        $s1 eq $s1 and $s1 eq $s2
    }
}

sub comp_packs {
    for (1..1000) {
        $b1 eq $b1 and $b1 eq $b2
    }
}
__END__
Benchmark: timing 10000 iterations of packs, strings...
    packs:  4 wallclock secs ( 4.07 usr +  0.00 sys =  4.07 CPU) @ 246
+0.02/s (n=10000)
  strings:  5 wallclock secs ( 4.71 usr +  0.00 sys =  4.71 CPU) @ 212
+4.50/s (n=10000)
[download]

Jenda
Support Denmark!
Defend the free world!

Re^2: What is a "big job" in the industry?

by BrowserUk (Patriarch) on May 31, 2007 at 00:20 UTC

Problem: How do you locate a 7-bit bitstring at any offset within a (average) 45-bit bitstring, if the 7-bits being saught can cross byte boundaries?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^3: What is a "big job" in the industry?

by Jenda (Abbot) on May 31, 2007 at 10:27 UTC

Well the question is whether punch_card_don needs to do that. I'm not clear on that. It seems to me he's just comparing things on known positions, not searching for them, but I may easily be wrong. I have no idea whether pack()ing the data (at whatever stage of the processing or generation) can be used or whether it would help or just complicate things. I was just trying to suggest something that punch_card_don might have omited, hoping it will be of some use.

Besides ... even if he does need to search for the substrings it might still be beneficial to pack the data for storage. He might save up to 7/8 of the disk access. Which might help even if he has to unpack the data to process them.

Jenda
Support Denmark!
Defend the free world!

Re^4: What is a "big job" in the industry?

by BrowserUk (Patriarch) on May 31, 2007 at 15:58 UTC

Re^3: What is a "big job" in the industry?

by Anonymous Monk on May 31, 2007 at 20:59 UTC

I C, I suspect. Even the brute-force method of shifting a bit at a time into a register, masking, & comparing would surely beat a regexp engine. Worst case, that's a shift, bitwise or, bitwise and, & a branch if not 0. I'm sure there are much better algorithms, too.

Re^4: What is a "big job" in the industry?

by BrowserUk (Patriarch) on May 31, 2007 at 23:59 UTC

Re: What is a "big job" in the industry?
by BrowserUk (Patriarch) on May 30, 2007 at 19:57 UTC

Can I have copies of the two files? I bet I can get that 8 hours down to less than 1 hour! :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: What is a "big job" in the industry?

by ysth (Canon) on May 31, 2007 at 05:31 UTC

Obviously a job for a trie! :)

Re: What is a "big job" in the industry?
by princepawn (Parson) on May 30, 2007 at 19:52 UTC

Each character and position of every line of the first file has to be checked against each character and position of the second file.

Some top tier databases have ETL (extract, transform, load) tools that would allow you to simply specify the format of the two files and have it automatically load them for you.

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Re: What is a "big job" in the industry?
by vrk (Chaplain) on May 31, 2007 at 17:01 UTC

I don't know about the industry, but my Master's project involves visualizing LC-MS (Liquid Column Mass Spectrometry) data. The raw data, once converted from the proprietary format, is a 130 megabyte XML file, containing 128 "scans", each of which contains roughly 200,000 pairs of (mass, intensity) pairs, called peaks (both floating point numbers). Relax, the pairs are not stored inside verbose XML tags, but the actual data is a big Base64 encoded, packed string.

I don't luckily have to use the raw data, as the proprietary conversion tool is able to produce centroided data (which, by the way, reduces the dataset size by four orders of magnitude). However, I used the raw data when I tested parsing the XML file -- a great way to see which part of your code scales up and which assumes that the data is not big.

I have to parse the XML file in two passes. On the first pass, the metadata is read in. On the second pass, byte offsets to the beginning of the actual Base64 encoded data blocks and their lengths are saved in memory. Later, one can read in individual scans from the file by simply seeking to the position and reading that many bytes. Initially, I had the parser (which, by the way, is XML::Twig) do both in one pass, but this was horribly slow for some reason. I suspect XML::Twig tried parsing the Base64 encoded data as XML, because with two passes, I can simply make it skip the data in both. Could be a user error.

Although it's not that big a file, parsing it takes half a minute, and reading in the peak data takes five or six minutes. I'm storing the peaks in piddles, so they don't take much more space in memory than on disk. Then, in the visualizing phase, it takes seven or eight minutes to thread over the 128 piddles and plot them. This is on Athlon XP 2600+ with half a gigabyte of memory. This is a bit of a pain, as the application area requires you to be able to zoom in to the data, and waiting eight minutes between zooms is not something the casual user is going to do, unless he's smoking weed.

I suspect I could squeeze that to four minutes by using C, but currently there's not much need for that, as the centroided data is good enough, and I'm already using PDL as much as I can (i.e. in tight, implicit loops).

--
print "Just Another Perl Adept\n";

Re: What is a "big job" in the industry?
by BrowserUk (Patriarch) on May 30, 2007 at 22:55 UTC

... flat text file with 600 lines, each line with 7 positions containing 1's and 0's.

Hmm. There are only 127 possible combinations of 7 x 1s & 0s?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re: What is a "big job" in the industry?
by Moriarty (Abbot) on May 31, 2007 at 01:01 UTC

Where I work, we only have one job that takes around 8 hours (and I'm sure that could be improved on if I had the time or inclination to do so) and it involves converting 1.6 million records into printable documents, sorting, then distributing them to various locations around the country for printing. It only happens once every quarter, so it's not a real big deal.

I'm sure if this was based in the U.S. it would be at least 10 times bigger.

Re: What is a "big job" in the industry?
by derby (Abbot) on May 30, 2007 at 21:52 UTC

Big? Okay .. I'd say it was big. Can the program be improved? Can't they all? Is it outrageous? Nope, not in the least.

-derby

Re: What is a "big job" in the industry?
by eric256 (Parson) on May 31, 2007 at 16:05 UTC

I have a nightly job that runs 6-8 hours. Processing only around 2gig of data so it could probably be improved, but the main bottle neck is getting that 2gig out of a server over the internet ;) Not an ideal world, but it could be worse.

___________
Eric Hodges

Re: What is a "big job" in the industry?
by talexb (Chancellor) on Jun 06, 2007 at 02:36 UTC

Intriguing. If you're doing some serious pattern-matching or bit-crunching, it seems you should be able to get better performance (if that's your goal -- I assume it is) from some kind of optimization, like coding part of the problem in C.

If it's a straight character for character comparison, Perl may be great, but I bet C would be faster. Is there any way you could tantalize us with a small sample?

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: What is a "big job" in the industry?
by Anonymous Monk on Jun 04, 2007 at 14:34 UTC

foreach my $bigfile_bitvector (@bigfile_bitvectors) {
  foreach my $smallfile_bitvector (@smallfile_bitvectors) {
    my $diff_bitvector = $bigfile_bitvector ^ $smallfile_bitvector;
    write_diffs($diff_bitvector);
  }
}
[download]

Back to Meditations

Log In^?

Domain Nodelet^?

www.com | www.net | www.org

Node Status^?

node history
Node Type: perlmeditation [id://618277]
Approved by Joost
help

Chatterbox^?

How do I use this? • Last hour • Other CB clients

Other Users^?

Others taking refuge in the Monastery: (3)

As of 2024-04-19 21:33 GMT

Sections^?

Information^?

Find Nodes^?

Leftovers^?

Today I Learned

Voting Booth^?

No recent polls found