Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Replacement of data in a column of a file using Hashes created from another file

by rohitmonk (Initiate)
on Oct 30, 2012 at 08:59 UTC ( #1001475=perlquestion: print w/ replies, xml ) Need Help??
rohitmonk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks. Being a beginner this is my first post seeking your wisdom.

I am trying to edit a file in one format to another one. I have a list of values which are to be replaced with the values and their description, such as

gi|315134697|dbj|AP012030.1|=gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...

gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome

gi|238859724|gb|CP001396.1|=gi|238859724|gb|CP001396.1| Escherichia coli BW2952, complete g...

gi|194400059|gb|EU855241.1|=gi|194400059|gb|EU855241.1| Shigella flexneri strain FBD047 23S...

gi|194400053|gb|EU855235.1|=gi|194400053|gb|EU855235.1| Shigella dysenteriae strain FBD056 ...

gi|169887498|gb|CP000948.1|=gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. D...

gi|85674274|dbj|AP009048.1|=gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W...

gi|48994873|gb|U00096.2|=gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1...

gi|81239530|gb|CP000034.1|=gi|81239530|gb|CP000034.1| Shigella dysenteriae Sd197, complete...

gi|5801828|gb|AF053967.1|AF053967=gi|5801828|gb|AF053967.1|AF053967 Escherichia coli strain ECOR ...

gi|5801827|gb|AF053966.1|AF053966=gi|5801827|gb|AF053966.1|AF053966 Escherichia coli rrlD operon,...

gi|406775301|gb|CP003297.1|=gi|406775301|gb|CP003297.1| Escherichia coli O104:H4 str. 2009E...

gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome

I need to replace the value preceding the '=' sign by the value succeeding it. So I made a hash of it using the split function.

my %hash; for($i=0;$i<=$#arr0;$i++) { @arr1 = split(/\=/,$arr0[$i]); #print $#arr1; $hash{$arr1[0]} = $arr1[1]; }

Then i wanted to use this hash as reference and replace every instance of the occurrence of the hash-key by the hash value.

The file where I want to do the replacement looks like this

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

So i attempted to write the final code, as follows -

#!/usr/bin/perl -w ($a,$b,$c) = @ARGV; if (scalar @ARGV!=3) { print "Program Name [hash file] [outfmt file] [output file] \n +"; exit; } open FILE1, "$a" or die $!; @arr0 = <FILE1>; chomp(@arr0); close(FILE1); open FILE2, "$b" or die $!; @arr2 = <FILE2>; chomp(@arr2); close(FILE2); my %hash; for($i=0;$i<=$#arr0;$i++) { @arr1 = split(/\=/,$arr0[$i]); $hash{$arr1[0]} = $arr1[1]; } open(OUT, ">>$c"); for($j=0;$j<=$#arr2;$j++) { @arr3=split(/\t/,$arr2[$j]); foreach $k (keys %hash) { if ($arr3[1] eq $k) { $arr3[1] = $hash{$k}; } } print OUT "$arr3[0]\t$arr3[1]\t$arr3[2]\t$arr3[3]\t$arr3[4]\t$arr3[5]\ +t$arr3[6]\t$arr3[7]\t$arr3[8]\t$arr3[9]\t$arr3[10]\t$arr3[11]\n"; } close(OUT);

This works fine for a small file, but my files are more than 2 million lines each. I want to increase the speed of my program. Can you please share your wisdom on how to make it faster for larger files?

Regards

Comment on Replacement of data in a column of a file using Hashes created from another file
Select or Download Code
Replies are listed 'Best First'.
Re: Replacement of data in a column of a file using Hashes created from another file
by choroba (Canon) on Oct 30, 2012 at 09:22 UTC
    The reason why it is slow is you use the nested loops (for each line, you loop over all the keys). The following code generates a regular expression that will match all the keys, so it saves you one loop:
    #!/usr/bin/perl use warnings; use strict; open my $EQ, '<', '1.txt' or die "1: $!"; my %subst; while (<$EQ>) { chomp; # <- updated my ($search, $replace) = split /=/; $subst{$search} = $replace; } my $regex = join '|', map quotemeta, keys %subst; open my $LST, '<', '2.txt' or die "2: $!"; while (<$LST>) { s/($regex)/$subst{$1}/; print; }
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thank you for the reply. But I need to get an output file, I am pretty new to this syntax.

      And this output which gets printed does not have any replacements in it when i run it. Where is it searching for the hash-key and replacing it with the value?

        Not sure what you mean:

        Input:

        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 +280 0 0 1 280 3402569 3402290 4e-140 506 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 2 +80 1 0 1 280 227880 228159 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 2 +80 1 0 1 280 2704973 2704694 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 2 +80 1 0 1 280 4018745 4019024 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 2 +80 1 0 1 280 4149866 4150145 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 2 +80 1 0 1 280 4191268 4191547 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 2 +80 3 0 1 280 3924929 3925208 9e-136 491 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 2 +80 0 0 1 280 459101 459380 4e-140 506 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 28 +0 1 0 1 280 1156698 1156977 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 28 +0 1 0 1 280 3643499 3643220 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 28 +0 1 0 1 280 4302307 4302028 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 28 +0 1 0 1 280 4343709 4343430 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 28 +0 1 0 1 280 4474830 4474551 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 28 +0 3 0 1 280 4568646 4568367 9e-136 491

        Output:

        10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 100.00 280 0 0 1 280 3402569 3402290 4e-140 506 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 99.64 280 1 0 1 280 227880 228159 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 99.64 280 1 0 1 280 2704973 2704694 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 99.64 280 1 0 1 280 4018745 4019024 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 99.64 280 1 0 1 280 4149866 4150145 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 99.64 280 1 0 1 280 4191268 4191547 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escheri +chia coli DH1 (ME8569) DNA,... 98.93 280 3 0 1 280 3924929 3925208 9e-136 491 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 100.00 280 0 0 1 280 459101 459380 4e-140 506 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 99.64 280 1 0 1 280 1156698 1156977 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 99.64 280 1 0 1 280 3643499 3643220 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 99.64 280 1 0 1 280 4302307 4302028 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 99.64 280 1 0 1 280 4343709 4343430 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 99.64 280 1 0 1 280 4474830 4474551 2e-138 500 10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escheric +hia coli DH1, complete genome 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

        The replacement is done in the s/($regex)/$subst{$1}/ statement, and you can direct the output to a file by:

        ./program.pl > output_file.txt
        There does appear to be a stray carriage return in the output - the code may needs a chomp somewhere...

      Thank you for the code, it was dope. Cool skills u got, wish to learn more. Cheers....

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1001475]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (9)
As of 2015-07-29 02:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (260 votes), past polls