http://www.perlmonks.org?node_id=727578

sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I need to make the following script more efficient and faster. It works fine with a few lines in the file, but with 6.0 million lines it takes forever.

I am trying to rename the IDs in the second column of file 2. file 1 contains the conversion table.

File.1, the conversion table, in tab-delimited format:

CGP_A0000000001 AAAA
CGP_A0000000002 AAAB
CGP_A0000000003 AAAC
CGP_A0000000004 AAAD
CGP_A0000000005 AAAE
CGP_A0000000006 AAAF
CGP_A0000000007 AAAG
CGP_A0000000008 AAAH
CGP_A0000000009 AAAI
CGP_A0000000010 AAAJ
CGP_A0000000011 AAAK
CGP_A0000000012 AAAL
CGP_A0000000013 AAAM
CGP_A0000000014 AAAN
CGP_A0000000015 AAAO
CGP_A0000000016 AAAP
CGP_A0000000017 AAAQ
CGP_A0000000018 AAAR
CGP_A0000000019 AAAS
CGP_A0000000020 AAAT
CGP_A0000000021 AAAU
CGP_A0000000022 AAAV
File.2 has three columns, and the 2nd needs to be replaced using the conversion table (File.1):

3998122 CGP_A0000000001 13
5877245 CGP_A0000000001 17
3637488 CGP_A0000000001 19
3998162 CGP_A0000000001 21
638421 CGP_A0000000001 23
2395226 CGP_A0000000001 25
3094278 CGP_A0000000001 27
2029460 CGP_A0000000001 29
1406937 CGP_A0000000001 31
2054853 CGP_A0000000001 35
4182290 CGP_A0000000001 37
3784069 CGP_A0000000002 13
6477860 CGP_A0000000002 17
394789 CGP_A0000000002 19
5095549 CGP_A0000000002 21
692543 CGP_A0000000002 23
5446227 CGP_A0000000002 25
1546807 CGP_A0000000002 27
1741167 CGP_A0000000002 29
1187972 CGP_A0000000002 31
1600142 CGP_A0000000002 33
1833098 CGP_A0000000002 35
1770403 CGP_A0000000003 1353
3254322 CGP_A0000000003 1355
6152600 CGP_A0000000003 1357
3195476 CGP_A0000000003 1361
3108815 CGP_A0000000003 1371
77684 CGP_A0000000003 1373
3269969 CGP_A0000000003 1375
3259137 CGP_A0000000003 1377
6502805 CGP_A0000000003 1379
5899118 CGP_A0000000003 1381
5417394 CGP_A0000000003 1383
806606 CGP_A0000000003 1385
1662014 CGP_A0000000003 1387
6490426 CGP_A0000000003 1389
6206360 CGP_A0000000003 1391
The RESULT file is the same as the second file, but the second column has the equivalent IDs from the 2nd column of file 1:

3998122 AAAA 13
5877245 AAAA 17
3637488 AAAA 19
3998162 AAAA 21
638421 AAAA 23
2395226 AAAA 25
#!/usr/bin/perl -w
use strict;
use warnings;

if ( @ARGV < 3 ) {
    print "usage: A message here\n";
    exit 0;
}

open(INPUT1, $ARGV[0])     || die "Cannot open file \"$ARGV[0]\"";              # Original IDs and four-letter codes
open(INPUT2, $ARGV[1])     || die "Cannot open file \"$ARGV[1]\"";              # Original IDs in the second column
open(RESULTS, ">$ARGV[2]") || die "Cannot open the Results file \"$ARGV[2]\"";  # Original IDs will change to four-letter code

my %origins;
while (<INPUT1>) {
    chomp;
    my @columns = split '\t';
    $origins{ $columns[0] } = $columns[1];
}
close(INPUT1);

while (<INPUT2>) {
    chomp;
    my ( $bioC, $contig_id, $pip ) = split( "\t", $_ );
    for my $oKey ( sort keys %origins ) {
        my $origin = $origins{$oKey};
        if ( $contig_id eq $oKey ) {
            print RESULTS "$bioC\t$origin\t$pip\n";
            #print "$bioC\t$origin\t$pip\n";
        }
    }
}
close(INPUT2);
close(RESULTS);

Replies are listed 'Best First'.
Re: Make my script faster and more efficient
by BrowserUk (Patriarch) on Dec 03, 2008 at 03:43 UTC

    Try it this way. Provided I haven't made any elementary mistakes, it should produce the same result and run in a fraction of the time:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    if ( @ARGV < 3 ) {
        print "usage: A message here\n";
        exit 0;
    }

    open(INPUT1, $ARGV[0])     || die "Cannot open file \"$ARGV[0]\"";              # Original IDs and four-letter codes
    open(INPUT2, $ARGV[1])     || die "Cannot open file \"$ARGV[1]\"";              # Original IDs in the second column
    open(RESULTS, ">$ARGV[2]") || die "Cannot open the Results file \"$ARGV[2]\"";  # Original IDs will change to four-letter code

    my %origins;
    while (<INPUT1>) {
        chomp;
        my @columns = split '\t';
        $origins{ $columns[0] } = $columns[1];
    }
    close(INPUT1);

    while ( <INPUT2> ) {
        chomp;
        my ( $bioC, $contig_id, $pip ) = split( "\t", $_ );
        print RESULTS "$bioC\t$origins{ $contig_id }\t$pip\n";
    }
    close(INPUT2);
    close(RESULTS);

    How small a fraction? If input file 1 contains 1000 lines, then < 1/1000th of the time
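    The scan-vs-lookup gap can be measured with the core Benchmark module. This is a synthetic micro-benchmark, not the OP's data: the table contents and sizes below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic 1000-entry conversion table standing in for File.1
my %origins;
$origins{ sprintf 'CGP_A%010d', $_ } = sprintf 'AA%02d', $_ % 100 for 1 .. 1000;
my $lookup = 'CGP_A0000000500';

cmpthese( -1, {
    scan => sub {                       # original approach: walk every key
        my $found;
        for my $k ( sort keys %origins ) {
            $found = $origins{$k} if $k eq $lookup;
        }
    },
    hash => sub {                       # direct hash access
        my $found = $origins{$lookup};
    },
} );
```

    On a 1000-key table the `hash` entry should come out orders of magnitude faster, which is the point BrowserUk is making.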


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Make my script faster and more efficient
by monarch (Priest) on Dec 03, 2008 at 03:35 UTC
    I can't help thinking your inner loop is unnecessary. You're scanning through your list of mappings until you find the one for the current line, but you're missing the beauty of hashes: direct access to the value you want.

    Consider this:

    while (<INPUT2>) {
        $_ =~ s/[\r\n]+\z//s;    # I don't use chomp, sorry
        my ( $bioC, $contig_id, $pip ) = split( "\t", $_ );
        my $origin = $origins{$contig_id};
        print RESULTS "$bioC\t$origin\t$pip\n";
    }

    Update: you could also DIE with an error message in the event that the file contained a code that wasn't in your mapping table:

    my $origin = $origins{$contig_id};
    if ( !defined( $origin ) ) {
        die( "No mapping for \"$contig_id\" found in origins table" );
    }
        I've gotten gun-shy of chomp as well. In a mixed Unix/Windows environment, or running Cygwin, it's pretty common for $/ to not be set by default to the same line ending that is used in (some of) the input file(s). Monarch's method is more tolerant of incorrect or inconsistent line endings than chomp.
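        A quick self-contained illustration of the difference (the sample strings are made up; on Unix, $/ defaults to "\n"):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# chomp removes only $/ ("\n" on Unix), so a CRLF line keeps its "\r".
# The regex from above strips any mix of trailing CR/LF characters.
for my $line ( "AAAA\n", "AAAB\r\n" ) {
    my $chomped = $line;
    chomp $chomped;                           # removes $/ only
    ( my $scrubbed = $line ) =~ s/[\r\n]+\z//;
    printf "chomp: %-6s regex: %s\n",
        $chomped  =~ /\r/ ? 'dirty' : 'clean',
        $scrubbed =~ /\r/ ? 'dirty' : 'clean';
}
```

        The CRLF line comes out "dirty" after chomp but "clean" after the regex, which is exactly the mixed-environment hazard described above.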
Re: Make my script faster and more efficient
by kyle (Abbot) on Dec 03, 2008 at 03:36 UTC

    You're iterating over the keys of a hash where it appears exists would work.

    while (<INPUT2>) {
        chomp;
        my ($bioC, $contig_id, $pip) = split /\t/;
        if ( exists $origins{ $contig_id } ) {
            print RESULTS "$bioC\t$origins{$contig_id}\t$pip\n";
        }
    }

    If, for some reason, you need to iterate over the keys on each line, you could save time by saving the result of sort keys in an array before looping over the lines.
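    For instance, hoisting the sort out of the per-line loop would look like this. The tiny %origins and @lines here are hypothetical stand-ins for the real conversion table and File.2:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the conversion table and File.2 lines
my %origins = ( CGP_A0000000001 => 'AAAA', CGP_A0000000002 => 'AAAB' );
my @lines   = ( "3998122\tCGP_A0000000001\t13",
                "3784069\tCGP_A0000000002\t13" );

my @sorted_keys = sort keys %origins;    # sorted once, not once per line

for my $line (@lines) {
    my ( $bioC, $contig_id, $pip ) = split /\t/, $line;
    for my $key (@sorted_keys) {
        print "$bioC\t$origins{$key}\t$pip\n" if $contig_id eq $key;
    }
}
```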

Re: Make my script faster and more efficient
by nagalenoj (Friar) on Dec 03, 2008 at 06:24 UTC
    Hi,

    Your problem should already be solved by a reply posted above (the code without the inner loop).

    Since you said you are going to work with huge data, it may be better to use a DBM hash instead of a plain hash. Storing more records in a hash takes more memory; with a DBM hash the data is stored in a database file, yet it is as easy to work with as an ordinary hash.

    For clarity, refer to the following sample code for working with DBM hashes:

    Example:

    #!/usr/bin/perl
    #===============================================================
    #        FILE: dbm_test.pl
    #       USAGE: ./dbm_test.pl
    # DESCRIPTION: To test DBM.
    #===============================================================
    use strict;
    use warnings;

    my %hash;

    # Open a dbm file and associate it with the hash
    dbmopen(%hash, "nagalenoj", 0666) || die "Can't open database nagalenoj!";

    $hash{'one'}   = "1";
    $hash{'two'}   = "2";
    $hash{'three'} = "3";

    # Retrieving values
    print $hash{'three'};

    dbmclose %hash;

    And I noted that you open all the files at the beginning. You can avoid that: open the second file only after you have finished processing the first. This will help both efficiency and coding style.

    Instead of using print and exit, you could use die.

      Particularly when working with large data sets, if the hash can be stored in memory, it really, really should be, for performance reasons.

      Secondly, if you have to use a DBM file, I highly recommend not using dbmopen, because of the following gotcha. You wrote a program that stores data using dbmopen. It has been running successfully, storing and accessing data, every night for months. Then one day it suddenly stops working, and nobody knows how to get at your data. How are you going to figure out that this happened because a sysadmin installed DB_File on your machine? And once you do figure it out, how are you going to fix your program?
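      One way around that gotcha is to name the DBM implementation explicitly with tie instead of letting dbmopen pick one. A minimal sketch using SDBM_File (which ships with Perl; DB_File would be tied the same way if you choose it deliberately); the file name origins is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use SDBM_File;                      # implementation pinned explicitly
use Fcntl qw(O_RDWR O_CREAT);

tie my %origins, 'SDBM_File', 'origins', O_RDWR | O_CREAT, 0666
    or die "Cannot tie origins: $!";

$origins{'CGP_A0000000001'} = 'AAAA';
print "$origins{'CGP_A0000000001'}\n";

untie %origins;
```

      Because the module is named in the source, installing a different DBM library later cannot silently change the on-disk format of existing data.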

Re: Make my script faster and more efficient
by ahmad (Hermit) on Dec 03, 2008 at 04:04 UTC

    I can see a continuous pattern in the data you provided.

    So my guess: why not generate the IDs directly, without reading from the first file?

    Something like this might work:

    #!/usr/bin/perl -w
    use strict;

    my $StartID = '';
    my $GenCODE = 'ZZZ';    # string auto-increment: 'ZZZ'++ gives 'AAAA', then 'AAAB', ...

    while (<DATA>) {
        chomp;
        my ( $C, $ID, $NO ) = split /\s+/;
        if ( $ID ne $StartID ) {
            $StartID = $ID;
            $GenCODE++;
        }
        print "$C $GenCODE $NO\n";
    }

    __DATA__
    3998122 CGP_A0000000001 13
    5877245 CGP_A0000000001 17
    3637488 CGP_A0000000001 19
    3998162 CGP_A0000000001 21
    638421 CGP_A0000000001 23
    2395226 CGP_A0000000001 25
    3094278 CGP_A0000000001 27
    2029460 CGP_A0000000001 29
    1406937 CGP_A0000000001 31
    2054853 CGP_A0000000001 35
    4182290 CGP_A0000000001 37
    3784069 CGP_A0000000002 13
    6477860 CGP_A0000000002 17
    394789 CGP_A0000000002 19
    5095549 CGP_A0000000002 21
    692543 CGP_A0000000002 23
    5446227 CGP_A0000000002 25
    1546807 CGP_A0000000002 27
    1741167 CGP_A0000000002 29
    1187972 CGP_A0000000002 31
    1600142 CGP_A0000000002 33
    1833098 CGP_A0000000002 35
    1770403 CGP_A0000000003 1353
    3254322 CGP_A0000000003 1355
    6152600 CGP_A0000000003 1357
    3195476 CGP_A0000000003 1361
    3108815 CGP_A0000000003 1371
    77684 CGP_A0000000003 1373
    3269969 CGP_A0000000003 1375
    3259137 CGP_A0000000003 1377
    6502805 CGP_A0000000003 1379
    5899118 CGP_A0000000003 1381
    5417394 CGP_A0000000003 1383
    806606 CGP_A0000000003 1385
    1662014 CGP_A0000000003 1387
    6490426 CGP_A0000000003 1389
    6206360 CGP_A0000000003 1391