http://www.perlmonks.org?node_id=727578

sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I need to make the following script more efficient and faster. It works fine with a few lines in the file, but with 6.0 million lines it takes forever.

I am trying to rename the IDs in the second column of file 2. file 1 contains the conversion table.

File.1, the conversion table, in tab-delimited format:

CGP_A0000000001 AAAA
CGP_A0000000002 AAAB
CGP_A0000000003 AAAC
CGP_A0000000004 AAAD
CGP_A0000000005 AAAE
CGP_A0000000006 AAAF
CGP_A0000000007 AAAG
CGP_A0000000008 AAAH
CGP_A0000000009 AAAI
CGP_A0000000010 AAAJ
CGP_A0000000011 AAAK
CGP_A0000000012 AAAL
CGP_A0000000013 AAAM
CGP_A0000000014 AAAN
CGP_A0000000015 AAAO
CGP_A0000000016 AAAP
CGP_A0000000017 AAAQ
CGP_A0000000018 AAAR
CGP_A0000000019 AAAS
CGP_A0000000020 AAAT
CGP_A0000000021 AAAU
CGP_A0000000022 AAAV
File.2 has three columns, and the 2nd needs to be replaced using the conversion table (File.1):

3998122 CGP_A0000000001 13
5877245 CGP_A0000000001 17
3637488 CGP_A0000000001 19
3998162 CGP_A0000000001 21
638421 CGP_A0000000001 23
2395226 CGP_A0000000001 25
3094278 CGP_A0000000001 27
2029460 CGP_A0000000001 29
1406937 CGP_A0000000001 31
2054853 CGP_A0000000001 35
4182290 CGP_A0000000001 37
3784069 CGP_A0000000002 13
6477860 CGP_A0000000002 17
394789 CGP_A0000000002 19
5095549 CGP_A0000000002 21
692543 CGP_A0000000002 23
5446227 CGP_A0000000002 25
1546807 CGP_A0000000002 27
1741167 CGP_A0000000002 29
1187972 CGP_A0000000002 31
1600142 CGP_A0000000002 33
1833098 CGP_A0000000002 35
1770403 CGP_A0000000003 1353
3254322 CGP_A0000000003 1355
6152600 CGP_A0000000003 1357
3195476 CGP_A0000000003 1361
3108815 CGP_A0000000003 1371
77684 CGP_A0000000003 1373
3269969 CGP_A0000000003 1375
3259137 CGP_A0000000003 1377
6502805 CGP_A0000000003 1379
5899118 CGP_A0000000003 1381
5417394 CGP_A0000000003 1383
806606 CGP_A0000000003 1385
1662014 CGP_A0000000003 1387
6490426 CGP_A0000000003 1389
6206360 CGP_A0000000003 1391
The RESULT file is the same as the second file, but the second column has the equivalent IDs from the 2nd column of file 1:

3998122 AAAA 13
5877245 AAAA 17
3637488 AAAA 19
3998162 AAAA 21
638421 AAAA 23
2395226 AAAA 25
#!/usr/bin/perl -w
use strict;
use warnings;

if ( @ARGV < 3 ) {
    print "usage: A message here\n";
    exit 0;
}

open(INPUT1, $ARGV[0])     || die "Cannot open file \"$ARGV[0]\"";              # Original IDs and four-letter codes
open(INPUT2, $ARGV[1])     || die "Cannot open file \"$ARGV[1]\"";              # Original IDs in the second column
open(RESULTS, ">$ARGV[2]") || die "Cannot open the Results file \"$ARGV[2]\"";  # Original IDs will change to four-letter code

my %origins;
while (<INPUT1>) {
    chomp;
    my @columns = split '\t';
    $origins{ $columns[0] } = $columns[1];
}
close(INPUT1);

while (<INPUT2>) {
    chomp;
    my ( $bioC, $contig_id, $pip ) = split( "\t", $_ );
    for my $oKey ( sort keys %origins ) {
        my $origin = $origins{$oKey};
        if ( $contig_id eq $oKey ) {
            print RESULTS "$bioC\t$origin\t$pip\n";
            #print "$bioC\t$origin\t$pip\n";
        }
    }
}
close(INPUT2);
close(RESULTS);

Replies are listed 'Best First'.
Re: Make my script faster and more efficient
by BrowserUk (Patriarch) on Dec 03, 2008 at 03:43 UTC

    Try it this way. Provided I haven't made any elementary mistakes, it should produce the same result and run in a fraction of the time:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    if ( @ARGV < 3 ) {
        print "usage: A message here\n";
        exit 0;
    }

    open(INPUT1, $ARGV[0])     || die "Cannot open file \"$ARGV[0]\"";              # Original IDs and four-letter codes
    open(INPUT2, $ARGV[1])     || die "Cannot open file \"$ARGV[1]\"";              # Original IDs in the second column
    open(RESULTS, ">$ARGV[2]") || die "Cannot open the Results file \"$ARGV[2]\"";  # Original IDs will change to four-letter code

    my %origins;
    while (<INPUT1>) {
        chomp;
        my @columns = split '\t';
        $origins{ $columns[0] } = $columns[1];
    }
    close(INPUT1);

    while ( <INPUT2> ) {
        chomp;
        my ( $bioC, $contig_id, $pip ) = split( "\t", $_ );
        print RESULTS "$bioC\t$origins{ $contig_id }\t$pip\n";
    }
    close(INPUT2);
    close(RESULTS);

    How small a fraction? If input file 1 contains 1000 lines, then < 1/1000th of the time
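    The scan-vs-lookup gap can be measured with the core Benchmark module. This is a synthetic micro-benchmark, not the OP's data: the table contents and sizes below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic 1000-entry conversion table standing in for File.1
my %origins;
$origins{ sprintf 'CGP_A%010d', $_ } = sprintf 'AA%02d', $_ % 100 for 1 .. 1000;
my $lookup = 'CGP_A0000000500';

cmpthese( -1, {
    scan => sub {                       # original approach: walk every key
        my $found;
        for my $k ( sort keys %origins ) {
            $found = $origins{$k} if $k eq $lookup;
        }
    },
    hash => sub {                       # direct hash access
        my $found = $origins{$lookup};
    },
} );
```

    On a 1000-key table the `hash` entry should come out orders of magnitude faster, which is the point BrowserUk is making.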


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Make my script faster and more efficient
by monarch (Priest) on Dec 03, 2008 at 03:35 UTC
    I can't help thinking your inner loop is unnecessary. You're scanning through your list of mappings until you find the one for the current line, but you're missing the beauty of hashes: direct access to the value you want.

    Consider this:

    while (<INPUT2>) {
        $_ =~ s/[\r\n]+\z//s;    # I don't use chomp, sorry
        my ( $bioC, $contig_id, $pip ) = split( "\t", $_ );
        my $origin = $origins{$contig_id};
        print RESULTS "$bioC\t$origin\t$pip\n";
    }

    Update: you could also DIE with an error message in the event that the file contained a code that wasn't in your mapping table:

    my $origin = $origins{$contig_id};
    if ( !defined( $origin ) ) {
        die( "No mapping for \"$contig_id\" found in origins table" );
    }
        I've gotten gun-shy of chomp as well. In a mixed Unix/Windows environment, or running Cygwin, it's pretty common for $/ to not be set by default to the same line ending that is used in (some of) the input file(s). Monarch's method is more tolerant of incorrect or inconsistent line endings than chomp.
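        A quick self-contained illustration of the difference (the sample strings are made up; on Unix, $/ defaults to "\n"):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# chomp removes only $/ ("\n" on Unix), so a CRLF line keeps its "\r".
# The regex from above strips any mix of trailing CR/LF characters.
for my $line ( "AAAA\n", "AAAB\r\n" ) {
    my $chomped = $line;
    chomp $chomped;                           # removes $/ only
    ( my $scrubbed = $line ) =~ s/[\r\n]+\z//;
    printf "chomp: %-6s regex: %s\n",
        $chomped  =~ /\r/ ? 'dirty' : 'clean',
        $scrubbed =~ /\r/ ? 'dirty' : 'clean';
}
```

        The CRLF line comes out "dirty" after chomp but "clean" after the regex, which is exactly the mixed-environment hazard described above.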
Re: Make my script faster and more efficient
by kyle (Abbot) on Dec 03, 2008 at 03:36 UTC

    You're iterating over the keys of a hash where it appears exists would work.

    while (<INPUT2>) {
        chomp;
        my ($bioC, $contig_id, $pip) = split /\t/;
        if ( exists $origins{ $contig_id } ) {
            print RESULTS "$bioC\t$origins{$contig_id}\t$pip\n";
        }
    }

    If, for some reason, you need to iterate over the keys on each line, you could save time by saving the result of sort keys in an array before looping over the lines.
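    For instance, hoisting the sort out of the per-line loop would look like this. The tiny %origins and @lines here are hypothetical stand-ins for the real conversion table and File.2:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the conversion table and File.2 lines
my %origins = ( CGP_A0000000001 => 'AAAA', CGP_A0000000002 => 'AAAB' );
my @lines   = ( "3998122\tCGP_A0000000001\t13",
                "3784069\tCGP_A0000000002\t13" );

my @sorted_keys = sort keys %origins;    # sorted once, not once per line

for my $line (@lines) {
    my ( $bioC, $contig_id, $pip ) = split /\t/, $line;
    for my $key (@sorted_keys) {
        print "$bioC\t$origins{$key}\t$pip\n" if $contig_id eq $key;
    }
}
```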

Re: Make my script faster and more efficient
by nagalenoj (Friar) on Dec 03, 2008 at 06:24 UTC
    Hi,

    Your problem should already be solved by a reply posted above (the code without the inner loop).

    Since you said you are going to work with huge data, it may be better to use a DBM hash instead of a plain hash. Storing more records in a hash takes more memory; with a DBM hash the data is stored in a database file, yet it is as easy to work with as an ordinary hash.

    For clarity, refer to the following sample code for working with DBM hashes:

    Example:

    #!/usr/bin/perl
    #===============================================================
    #        FILE: dbm_test.pl
    #       USAGE: ./dbm_test.pl
    # DESCRIPTION: To test DBM.
    #===============================================================
    use strict;
    use warnings;

    my %hash;

    # Open a dbm file and associate it with the hash
    dbmopen(%hash, "nagalenoj", 0666) || die "Can't open database nagalenoj!";

    $hash{'one'}   = "1";
    $hash{'two'}   = "2";
    $hash{'three'} = "3";

    # Retrieving values
    print $hash{'three'};

    dbmclose %hash;

    And I noted that you open all the files at the beginning. You can avoid that: open the second file only after you have finished processing the first. This will help both efficiency and coding style.

    Instead of using print and exit, you could use die.

      Particularly when working with large data sets, if the hash can be stored in memory, it really, really should be, for performance reasons.

      Secondly, if you have to use a DBM file, I highly recommend not using dbmopen, because of the following gotcha. You wrote a program that stores data using dbmopen. It has been running successfully, storing and accessing data, every night for months. Then one day it suddenly stops working, and nobody knows how to get at your data. How are you going to figure out that this happened because a sysadmin installed DB_File on your machine? And once you do figure it out, how are you going to fix your program?
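      One way around that gotcha is to name the DBM implementation explicitly with tie instead of letting dbmopen pick one. A minimal sketch using SDBM_File (which ships with Perl; DB_File would be tied the same way if you choose it deliberately); the file name origins is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use SDBM_File;                      # implementation pinned explicitly
use Fcntl qw(O_RDWR O_CREAT);

tie my %origins, 'SDBM_File', 'origins', O_RDWR | O_CREAT, 0666
    or die "Cannot tie origins: $!";

$origins{'CGP_A0000000001'} = 'AAAA';
print "$origins{'CGP_A0000000001'}\n";

untie %origins;
```

      Because the module is named in the source, installing a different DBM library later cannot silently change the on-disk format of existing data.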

Re: Make my script faster and more efficient
by ahmad (Hermit) on Dec 03, 2008 at 04:04 UTC

    I can see a continuous pattern in the data you provided.

    So my guess: why not generate the IDs directly, without reading from the first file?

    Something like this might work:

    #!/usr/bin/perl -w
    use strict;

    my $StartID = '';
    my $GenCODE = 'ZZZ';    # string auto-increment: 'ZZZ'++ gives 'AAAA', then 'AAAB', ...

    while (<DATA>) {
        chomp;
        my ( $C, $ID, $NO ) = split /\s+/;
        if ( $ID ne $StartID ) {
            $StartID = $ID;
            $GenCODE++;
        }
        print "$C $GenCODE $NO\n";
    }

    __DATA__
    3998122 CGP_A0000000001 13
    5877245 CGP_A0000000001 17
    3637488 CGP_A0000000001 19
    3998162 CGP_A0000000001 21
    638421 CGP_A0000000001 23
    2395226 CGP_A0000000001 25
    3094278 CGP_A0000000001 27
    2029460 CGP_A0000000001 29
    1406937 CGP_A0000000001 31
    2054853 CGP_A0000000001 35
    4182290 CGP_A0000000001 37
    3784069 CGP_A0000000002 13
    6477860 CGP_A0000000002 17
    394789 CGP_A0000000002 19
    5095549 CGP_A0000000002 21
    692543 CGP_A0000000002 23
    5446227 CGP_A0000000002 25
    1546807 CGP_A0000000002 27
    1741167 CGP_A0000000002 29
    1187972 CGP_A0000000002 31
    1600142 CGP_A0000000002 33
    1833098 CGP_A0000000002 35
    1770403 CGP_A0000000003 1353
    3254322 CGP_A0000000003 1355
    6152600 CGP_A0000000003 1357
    3195476 CGP_A0000000003 1361
    3108815 CGP_A0000000003 1371
    77684 CGP_A0000000003 1373
    3269969 CGP_A0000000003 1375
    3259137 CGP_A0000000003 1377
    6502805 CGP_A0000000003 1379
    5899118 CGP_A0000000003 1381
    5417394 CGP_A0000000003 1383
    806606 CGP_A0000000003 1385
    1662014 CGP_A0000000003 1387
    6490426 CGP_A0000000003 1389
    6206360 CGP_A0000000003 1391