Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Huge data file and looping best practices

by carillonator (Novice)
on Apr 26, 2009 at 16:05 UTC ( #760143=perlquestion: print w/ replies, xml ) Need Help??
carillonator has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I'm pretty new to Perl, but have experience with PHP. I have been asked to improve a Perl script written by someone else, which analyzes a set of data about patents. The data file has 8 million lines, which look like this:
patent #, char1, char2, char3, ... , char480 1234567,1,0,1,0,1,0, ... (480 characteristics) (x 8 million lines)
The script compares each binary characteristic of each patent with every other patent and counts the number of differences for each pair. My attempt at the improved code is below.

I see that the entire 6gb data file is brought into memory, so I'm looking for the best way to go one line at a time. The program will be run on an 8-core machine with 64G memory. Notice it takes arguments that limit execution to a certain range of iterations of the first loop, so I can run 7 different instances at the same time (one per core) on different parts of the data. Or, is there a smarter way to allocate resources? O'Reilly's Perl Best Practices book says to use while instead of for loops when processing files, but I would like to keep the ability to limit iterations with command line arguments.

Since it will take a VERY long time to run all of this program, the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated. Thanks in advance!!

#!/usr/bin/perl use strict; my(@patno1,@patno2,@record1,@record2); my $startat=@ARGV[0]; my $endat=@ARGV[1]; open(OUT, "<patents.csv")|| die("Could not open patents.csv file!\n"); + my @lines=<OUT>; close(OUT); #clear variance file if it exists open(OUT, ">variance.csv")|| die("Could not open file variance.csv!\n" +); close(OUT); map(chomp,@lines); # iterate over all patents for(my $i=$startat;$i<=$endat;$i++) { @record1=split(/\,/,$lines[$i]); $patno1=shift(@record1); # iterate through other lines to compare for(my $j=$i+1;$j<$#lines;$j++) { @record2=split(/\,/,$lines[$j]); $patno2=shift @record2; my $variance=0; # iterate through each characteristic for(my $k=0;$k<$#record1;$k++) { if($record1[$k]!=$record2[$k]) { $variance++; } } open(OUT, ">>variance.csv")|| die("Could not open file va +riance.csv!\n"); print OUT $patno1.",".$patno2.",".$variance."\n"; close(OUT); } }

Comment on Huge data file and looping best practices
Select or Download Code
Re: Huge data file and looping best practices
by talexb (Canon) on Apr 26, 2009 at 16:16 UTC

    Without even looking at your code, my first reaction upon hearing about 480 columns, 8,000,000 lines and a 6G file is "put it into a database and let the database worry about that".

    Once the data's in a database, you can group similar records together, look at just a subset, and so forth.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      @talex, will mySQL, for example, store the data more compactly, or allow for faster analysis? We're not concerned with subsets of data other than to break up the computation tasks into several chunks, across processors or computers. We really need all the data. Plus, we don't have the SQL skills to do the analysis that way.

        A database may not be the best solution here -- from reading the other posts, it could be that you're going to be more interested in 'clumping' each of the data points together, creating 'neighborhoods' of 'nearest neighbors'. My Systems Design professor Ed Jernigan did research along those lines.

        Perhaps a first cut would be some sort of encoding of each data point, then a 'clumping' based on that, with further analysis on the smaller 'clumps'.

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Huge data file and looping best practices
by przemo (Scribe) on Apr 26, 2009 at 16:32 UTC

    Looks like we have O(N^2*K) complexity (N=number of patents, K=number of characteristics) here... not good...

    If you really have to count difference for every pair of patents, try using Bit::Vector for every line (i.e. one characteristic = one bit), calculate the number of different positions they have with XOR operation and count set bits in the resulting vector. This should take less memory and should be also faster. (?)

    Although I have some doubts about the time needed to do so (you need to write to the output about (8*10^6)^2)/2 = 32 "teralines").

      thanks @przemo. I'll check out Bit::Vector. We think there are only about 400,000 unique characteristic sets among the 8 million patents, so we'll probably end up dealing with that data set instead, on account of the absurd amount of data this program would return.

        If you have 400,000 unique characteristic sets among the 8 million patients, now you're getting somewhere. If you could find a way to consistently stringify a given set the same way each time it comes up, you could turn that into a hash key, and as its value create a datastructure of patient names. Now you have a workable structure that could be split into manageable files based on the groupings by unique characteristics.

        ...just a thought, though I'm not sure how helpful it is.


        Dave

Re: Huge data file and looping best practices
by samtregar (Abbot) on Apr 26, 2009 at 16:56 UTC
    Tough problem! You could switch to reading the file as you go but that's likely to make your program much slower since you need to access every line so many times.

    I'd start by processing the input file into a more efficient representation - something that can be accessed using mmap() from C for example. If all of your characteristics are boolean (and they are in your example) you can represent them as 15 unsigned 32-bit integers. Add an integer for the patient # and you can represent a row in just 512 bits. You can write the pre-processing code in Perl using pack() or Bit::Vector.

    Then I'd write some Inline::C code to mmap() the data file and provide access to "rows". The code to compare one row to another should also be written in C. It's basically an XOR of the characteristics and a bit-count of the result, so not hard to write at all. I'd definitely look at whether a lookup table can speed things up - perhaps at the 8-bit or 16-bit level. Or you could look at caching comparisons.

    Finally, I'd use Parallel::ForkManager to make it 8-way parallel. Have each working processes take 1/8 of the patient space and write to its own output file. When you're done, cat all the output files together and you should be done.

    I'd be shocked if this didn't run 100x faster than the Perl code you've got now.

    -sam

Re: Huge data file and looping best practices
by roboticus (Canon) on Apr 26, 2009 at 17:01 UTC
    carillonator:

    It sounds like you might be attempting to solve the wrong problem. Why are you interested in the differences between each pair of patents? If it's to find "similar" patents to a specified one, it might be faster to compute the results as needed. I can't really imagine that you would care about the difference between a patent on 'improved grooming proceedure for cocker spaniels' and 'increasing yield of aspartame using industrial waste'.

    Once you identify why you need the results, you might find ways of getting what you need quickly without exhaustively building a database for every possible combination (6.4 x 10^13 entries ... sounds like a reasonably large database when you're done...). For example, if there's a threshold above which you don't care, you might be able to cluster your patents into groups such that membership in one group excludes it from all other groups, drastically reducing your search space. If you can do it well enough, you may be able to run queries against the data in real time.

    ...roboticus
Re: Huge data file and looping best practices
by BrowserUk (Pope) on Apr 26, 2009 at 18:26 UTC

    Try something like this:

    #! perl -sw use 5.010; use strict; my( $i, @patNos, @data ) = 0; while( <> ) { my @bits =split ','; $patNos[ $i ] = shift @bits; $data[ $i++ ] = pack 'b480', @bits; } open OUT, '>', 'variances.csv' or die $!; my @variances; for my $first ( 0 .. $#patNos ) { for my $second ( 0 .. $#patNos ) { next if $first == $second; say OUT "$patNos[ $first ], $patNos[ $second ], ", unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] ); } } close OUT;

    By packing the 0s & 1s on input, each record will compact to 60 bytes, making your 8 million records occupy about 1.2 GB.

    The second benefit of this is that you can calculate your variances for each pairing in a single statement, thereby eliminating the expensive 480 iteration inner loop and speeding things up by close to 3 orders of magnitude.

    It'll probably make the need to split the task across processes unnecessary. Though it would still be possible if needed.

    Once encoded as bitstrings, pack '%32b*', ( $data[ $first ] ^ $data[ $second ] ); will efficiently compare all 480 attributes and count the variance in one go.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Would it be worthwhile to pre-allocate the  @patNos and  @data arrays with  $#array = N_000_000; statements? (The exact number of lines/elements could be obtained with a  `wc -l` call.)

      Even though it would lead to 8,000,000 redundant entries giving the variance of each patent with itself (i.e., 0), might it be profitable to eliminate the  next if $first == $second; statement from the inner loop of the variance calculation section? Or, avoiding redundant entries, to break the inner loop into two consecutive loops with identical BLOCK contents:  for my $second (0 .. $first-1) { ... } and then  for my $second ($first+1 .. $#patNos) { ... } ?

      Or is all this just premature optimization?

      Update: Slight wording changes.

        Would it be worthwhile to pre-allocate the @patNos and @data arrays with $#array = N_000_000; statements?

        It wouldn't hurt, but would make little difference to the overall runtime given the scale of the processing.

        Even though it would lead to 8,000,000 redundant entries giving the variance of each patent with itself (i.e., 0), might it be profitable to eliminate the next if $first == $second; statement from the inner loop of the variance calculation section? Or, avoiding redundant entries, to break the inner loop into two consecutive loops with identical BLOCK contents: for my $second (0 .. $first-1) { ... } and then for my $second ($first+1 .. $#patNos) { ... } ?

        With time to give it a little more thought, there is no point in comparing both patent 1 with patent 2 & patent 2 with patent 1. With that insight, the inner loop should run from $first +1 .. $#patNos - 1. That reduces the variance calculations from 64e12 to 32e12 - 8e6 for a further halving of the runtime:

        #! perl -sw use 5.010; use strict; my( $i, @patNos, @data ) = 0; $#patNos = $#data = 8e6; while( <> ) { my @bits =split ','; $patNos[ $i ] = shift @bits; $data[ $i++ ] = pack 'b480', @bits; print "\r$.\t" unless $. % 1000; } say "\n", time; open OUT, '>', 'variances.csv' or die $!; my @variances; for my $first ( 0 .. $#patNos-1 ) { for my $second ( $first+1 .. $#patNos ) { say OUT "$patNos[ $first ], $patNos[ $second ], ", unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] ); } } close OUT;

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        And now for the ultimate speedup.

        Instead of building an array of 8 million, 60-byte bitstrings and performing 32e12 xors and bitcounts, you build a single 480 MB bitstring by concatenating all the packed data.

        Now, you can perform the XOR of records 0 through 7,999,999 with records 1 through 8,000,000 in one operation using substr to select the appropriate chunks of the big string. And then use substr with unpack to count the bits in each 60 byte chunk of the result.

        Then you repeat the process with substrs covering 0 .. 7,999,998 & 2 .. 8,000,000. Rinse repeat till done.

        8e6 XORs and 32e12 bitcounts and done.

        My best guestimate is that instead of the OP code requiring upwards of 10 days on 8 cores, this'll take less than half a day on one core.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      beautiful! Thank you everyone for the very thoughtful answers. I think I'm over my head here, but the first step is going to be to reduce the 8 million lines to what we're guessing is 400,000 unique sets of characteristics, then go from there. This is easy enough, but the 400,000 unique sets will then all have to be compared to each other.

      Then I'll work through these methods of converting the data to bit strings to simplify the comparisons.

      Thank you thank you!

        the first step is going to be to reduce the 8 million lines to what we're guessing is 400,000 unique sets of characteristics,

        How long wil it take you to do that reduction?

        Because with the additional efficiencies outlined in 760218 & 760226, and a little threading or forking, you could distribute this over your 8 processors and have the full cross product very quickly.

        Of course, that would be a huge amount of data to further process, but maybe you could apply some (cheap) selection criteria prior to output.

        Anyway, good luck!


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Huge data file and looping best practices
by igelkott (Curate) on Apr 26, 2009 at 21:09 UTC

    This happens to be a classic computational chemistry problem. When searching large chemical databases, compound characteristics are hashed into long bit strings to minimize the number of expensive graph comparisons. One common type of pairwise bitstring comparison is called a Tanimoto coefficient.

    Tanimoto = (A & B) / (A | B)
    Ie, the number of characteristics in common divided by the number of characteristics found in either. The bias is that positive information counts higher than negative. This is used to look for closest relatives, find a diverse subset and compare the diversity of collections (comparing the average of each object's closest relative within its collection).

    I've probably gone too far with all this Comp Chem stuff but it might be a good way to compare many patents. Could also look for Cosine or Dice coefficients used in other fields.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://760143]
Approved by talexb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2014-09-16 05:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (156 votes), past polls