I'm pretty new to Perl, but have experience with PHP. I have been asked to improve a Perl script written by someone else, which analyzes a set of data about patents. The data file has 8 million lines, which look like this:
patent #, char1, char2, char3, ... , char480
1234567,1,0,1,0,1,0, ... (480 characteristics)
(x 8 million lines)
The script compares each patent's 480 binary characteristics against those of every other patent and counts, for each pair, how many characteristics differ. My attempt at the improved code is below.
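For reference, here is a minimal sketch of the pairwise comparison I have in mind (the `variance` sub and its variable names are mine, not from the original script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many binary characteristics differ between two CSV rows.
# Each row is: patent number, then 480 comma-separated 0/1 flags.
sub variance {
    my ($row_a, $row_b) = @_;
    my ($pat_a, @chars_a) = split /,/, $row_a;
    my ($pat_b, @chars_b) = split /,/, $row_b;
    my $diff = 0;
    for my $i (0 .. $#chars_a) {
        $diff++ if $chars_a[$i] != $chars_b[$i];
    }
    return ($pat_a, $pat_b, $diff);
}
```

Called once per pair, this returns the two patent numbers plus their difference count, which is exactly what ends up in variance.csv.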
I see that the entire 6 GB data file is read into memory, so I'm looking for the best way to process it one line at a time. The program will run on an 8-core machine with 64 GB of memory. Note that it takes command-line arguments that limit execution to a certain range of iterations of the first loop, so I can run seven instances at the same time (one per core) on different parts of the data. Or is there a smarter way to allocate resources? O'Reilly's Perl Best Practices recommends while instead of for loops when processing files, but I would like to keep the ability to limit iterations with command-line arguments.
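To make the line-at-a-time idea concrete, here is a sketch of the kind of loop I mean: a `while` over the filehandle, with `$.` (Perl's current input line number) enforcing the slice passed on the command line. The helper name `for_each_row_in_slice` is my own invention:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream a file line-by-line (no slurping), visiting only rows
# $start..$end so parallel instances can each take a slice.
# $cb is called with (line_number, chomped_line) for each row in range.
sub for_each_row_in_slice {
    my ($path, $start, $end, $cb) = @_;
    open my $in, '<', $path or die "Could not open $path: $!\n";
    while (my $line = <$in>) {
        next if $. < $start;   # $. is the current input line number
        last if $. > $end;     # stop once we pass our slice
        chomp $line;
        $cb->($., $line);
    }
    close $in;
}
```

Each instance would then be started as, say, `./compare.pl 1 1142857`, with the two arguments feeding straight into the slice: `for_each_row_in_slice('patents.csv', $ARGV[0], $ARGV[1], sub { ... });`.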
Since a full run will take a very long time, even the slightest improvement could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated.
Thanks in advance!!
open(my $in, '<', 'patents.csv') or die "Could not open patents.csv: $!\n";

# clear variance file if it exists, and open it once for the whole run
open(my $out, '>', 'variance.csv') or die "Could not open variance.csv: $!\n";

# iterate over all patents
# iterate through other lines to compare
# iterate through each characteristic
print $out "$patno1,$patno2,$variance\n";

close $out;
close $in;
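One further idea I've been toying with (this is my own assumption about a possible speedup, not what the script currently does): pack each patent's 480 flags into a 60-byte bit string once, then count differing bits for a pair with a single string XOR followed by `unpack '%32b*'`, whose checksum mode sums the 1-bits:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pack a list of 0/1 flags into a bit string (480 flags -> 60 bytes).
sub pack_chars {
    return pack 'b*', join '', @_;
}

# Number of differing characteristics between two packed patents:
# XOR the bit strings, then count the 1-bits with unpack's %32b* checksum.
sub bit_variance {
    my ($bits_a, $bits_b) = @_;
    return unpack '%32b*', ($bits_a ^ $bits_b);
}
```

If this is sound, each line would be packed once up front, and the inner loop would drop from 480 element comparisons to one XOR over 60 bytes.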