Replacement of data in a column of a file using Hashes created from another file

by rohitmonk (Initiate)
on Oct 30, 2012 at 08:59 UTC ( [id://1001475]=perlquestion )

rohitmonk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, fellow monks. I am a beginner, and this is my first post seeking your wisdom.

I am trying to convert a file from one format to another. I have a list of identifiers, each of which is to be replaced by the identifier followed by its description, for example:

gi|315134697|dbj|AP012030.1|=gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...

gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome

gi|238859724|gb|CP001396.1|=gi|238859724|gb|CP001396.1| Escherichia coli BW2952, complete g...

gi|194400059|gb|EU855241.1|=gi|194400059|gb|EU855241.1| Shigella flexneri strain FBD047 23S...

gi|194400053|gb|EU855235.1|=gi|194400053|gb|EU855235.1| Shigella dysenteriae strain FBD056 ...

gi|169887498|gb|CP000948.1|=gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. D...

gi|85674274|dbj|AP009048.1|=gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W...

gi|48994873|gb|U00096.2|=gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1...

gi|81239530|gb|CP000034.1|=gi|81239530|gb|CP000034.1| Shigella dysenteriae Sd197, complete...

gi|5801828|gb|AF053967.1|AF053967=gi|5801828|gb|AF053967.1|AF053967 Escherichia coli strain ECOR ...

gi|5801827|gb|AF053966.1|AF053966=gi|5801827|gb|AF053966.1|AF053966 Escherichia coli rrlD operon,...

gi|406775301|gb|CP003297.1|=gi|406775301|gb|CP003297.1| Escherichia coli O104:H4 str. 2009E...

gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome

I need to replace the value before the '=' sign with the value after it, so I built a hash from this list using the split function:

my %hash;
for ($i = 0; $i <= $#arr0; $i++) {
    @arr1 = split(/\=/, $arr0[$i]);
    #print $#arr1;
    $hash{$arr1[0]} = $arr1[1];
}
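
A minimal sketch of the hash-building step, assuming each line holds one key, an '=' sign, and its replacement. The sub name build_lookup is hypothetical, and the limit of 2 passed to split is an addition not in the code above; it keeps any '=' inside a description as part of the value.

```perl
use strict;
use warnings;

# Build a lookup table from "key=value" lines.
# build_lookup is a hypothetical helper name, not part of the original post.
sub build_lookup {
    my @lines = @_;
    my %lookup;
    for my $line (@lines) {
        chomp $line;                               # drop the trailing newline
        next unless $line =~ /=/;                  # skip lines without a separator
        my ($key, $value) = split /=/, $line, 2;   # limit 2: split on the first '=' only
        $lookup{$key} = $value;
    }
    return %lookup;
}

my %hash = build_lookup(
    "gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome\n",
    "gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome\n",
);
print "$_ => $hash{$_}\n" for sort keys %hash;
```

Iterating with a foreach loop instead of a C-style index loop avoids the off-by-one risk and reads more naturally in Perl.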

Then I wanted to use this hash as a lookup table and replace every occurrence of a hash key with the corresponding hash value.

The file where I want to do the replacement looks like this:

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

So I attempted to write the final code, as follows:

#!/usr/bin/perl -w
($a,$b,$c) = @ARGV;
if (scalar @ARGV!=3) {
    print "Program Name [hash file] [outfmt file] [output file] \n";
    exit;
}
open FILE1, "$a" or die $!;
@arr0 = <FILE1>;
chomp(@arr0);
close(FILE1);
open FILE2, "$b" or die $!;
@arr2 = <FILE2>;
chomp(@arr2);
close(FILE2);
my %hash;
for($i=0;$i<=$#arr0;$i++) {
    @arr1 = split(/\=/,$arr0[$i]);
    $hash{$arr1[0]} = $arr1[1];
}
open(OUT, ">>$c");
for($j=0;$j<=$#arr2;$j++) {
    @arr3=split(/\t/,$arr2[$j]);
    foreach $k (keys %hash) {
        if ($arr3[1] eq $k) {
            $arr3[1] = $hash{$k};
        }
    }
    print OUT "$arr3[0]\t$arr3[1]\t$arr3[2]\t$arr3[3]\t$arr3[4]\t$arr3[5]\t$arr3[6]\t$arr3[7]\t$arr3[8]\t$arr3[9]\t$arr3[10]\t$arr3[11]\n";
}
close(OUT);

This works fine for a small file, but my files are more than 2 million lines each. I want to increase the speed of my program. Can you please share your wisdom on how to make it faster for larger files?

Regards

Re: Replacement of data in a column of a file using Hashes created from another file
by choroba (Cardinal) on Oct 30, 2012 at 09:22 UTC
    It is slow because of the nested loops: for each line of the data file, you loop over all the keys of the hash. The following code builds a regular expression that matches any of the keys, so it saves you the inner loop:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open my $EQ, '<', '1.txt' or die "1: $!";
    my %subst;
    while (<$EQ>) {
        chomp; # <- updated
        my ($search, $replace) = split /=/;
        $subst{$search} = $replace;
    }

    my $regex = join '|', map quotemeta, keys %subst;

    open my $LST, '<', '2.txt' or die "2: $!";
    while (<$LST>) {
        s/($regex)/$subst{$1}/;
        print;
    }
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
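
As an alternative sketch, not the code posted above: if the identifier always sits alone in the second tab-separated column (an assumption based on the split(/\t/, ...) in the original question), the hash can be indexed directly, with no regex and no inner loop. The sub name replace_column is hypothetical.

```perl
use strict;
use warnings;

# Replace column 2 of a tab-separated line with a single hash lookup.
# replace_column is a hypothetical helper; $subst is a reference to the
# key => replacement hash built from the "key=value" file.
sub replace_column {
    my ($line, $subst) = @_;
    chomp $line;
    my @cols = split /\t/, $line, -1;          # -1 keeps trailing empty fields
    $cols[1] = $subst->{ $cols[1] }
        if @cols > 1 && exists $subst->{ $cols[1] };
    return join "\t", @cols;                   # returned without a newline
}

my %subst = (
    'gi|260447279|gb|CP001637.1|' =>
        'gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome',
);
my $in = join "\t", '10_25_res.txt:Locus3034v1rpkm4.98',
    'gi|260447279|gb|CP001637.1|',
    qw(100.00 280 0 0 1 280 459101 459380 4e-140 506);

print replace_column("$in\n", \%subst), "\n";
```

Each line then costs one split, one hash lookup and one join, regardless of how many keys the hash holds. Lines whose second column is not a known key pass through unchanged.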

      Thank you for the reply. But I need to get an output file, and I am pretty new to this syntax.

      Also, the output that gets printed does not contain any replacements when I run it. Where does the code search for the hash key and replace it with the value?

        Not sure what you mean:

        Input:

        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

        Output:

        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        100.00 280 0 0 1 280 3402569 3402290 4e-140 506
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        99.64 280 1 0 1 280 227880 228159 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        99.64 280 1 0 1 280 2704973 2704694 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        99.64 280 1 0 1 280 4018745 4019024 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        99.64 280 1 0 1 280 4149866 4150145 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        99.64 280 1 0 1 280 4191268 4191547 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
        98.93 280 3 0 1 280 3924929 3925208 9e-136 491
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        100.00 280 0 0 1 280 459101 459380 4e-140 506
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        99.64 280 1 0 1 280 1156698 1156977 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        99.64 280 1 0 1 280 3643499 3643220 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        99.64 280 1 0 1 280 4302307 4302028 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        99.64 280 1 0 1 280 4343709 4343430 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        99.64 280 1 0 1 280 4474830 4474551 2e-138 500
        10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
        98.93 280 3 0 1 280 4568646 4568367 9e-136 491

        The replacement is done in the s/($regex)/$subst{$1}/ statement, and you can direct the output to a file by:

        ./program.pl > output_file.txt
        There does appear to be a stray carriage return in the output - the code may need a chomp somewhere...
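
To write to a named output file from within the script rather than through shell redirection, something along these lines might work. It is a sketch under assumptions: substitute_stream is a hypothetical name, the in-memory filehandles stand in for real files so the example runs on its own, and sorting the keys longest-first is an extra precaution not in the code above. The chomp on the lookup lines is what keeps a newline from ending up in the middle of each output record.

```perl
use strict;
use warnings;

# Read "key=value" pairs from $map_fh, then copy $in_fh to $out_fh,
# replacing the first occurrence of any key on each line.
# substitute_stream is a hypothetical helper name.
sub substitute_stream {
    my ($map_fh, $in_fh, $out_fh) = @_;
    my %subst;
    while (my $line = <$map_fh>) {
        chomp $line;                           # without this, values keep their "\n"
        my ($search, $replace) = split /=/, $line, 2;
        $subst{$search} = $replace if defined $replace;
    }
    # Longest keys first, so a key that is a prefix of another cannot win.
    my $regex = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %subst;
    while (my $line = <$in_fh>) {
        $line =~ s/($regex)/$subst{$1}/ if length $regex;
        print {$out_fh} $line;
    }
    return;
}

# In-memory stand-ins for the two input files and the output file.
my $map  = "gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome\n";
my $data = join("\t", '10_25_res.txt:Locus3034v1rpkm4.98',
    'gi|260447279|gb|CP001637.1|', qw(100.00 280 0 0 1 280 459101 459380 4e-140 506)) . "\n";
my $out  = '';

open my $map_fh, '<', \$map  or die "map: $!";
open my $in_fh,  '<', \$data or die "in: $!";
open my $out_fh, '>', \$out  or die "out: $!";   # real file: open my $out_fh, '>', 'output_file.txt'
substitute_stream($map_fh, $in_fh, $out_fh);
close $out_fh;
print $out;
```

With real files, only the three open calls change; the sub itself does not care where the handles point.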

      Thank you for the code, it was dope. Cool skills u got, wish to learn more. Cheers....
