PerlMonks  

pattern matching to separate data

by patric (Acolyte)
on May 03, 2009 at 13:06 UTC (#761574)

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am trying to separate the data from a file into two different files based on whether a record matches "GENEID" or "PROTID". Below is the input file.

input file:

    >data_1 GENEID_8 1_exons 87028 - 87375 348 bp, chain -
    ATGCCCAAATTAGTCAACATATTGATCACTACGGAGGAAATCTTGAAGAGTTCAAGGGGC
    TGTCCATTTTACTTGAAGAGCCTAAAGATCAAAAAGGGTGATAATAAATCTTTAGAAGAT
    ATGCTCATAATTGAATCTAACCTTACGATTTCTTCTACTTCTAATTGA
    >data_1 PROTID_8 1_exons 87028 - 87375 115 aa, chain -
    KLVNILITTEEILKSSRGIVLTVEQTSSIKRKFGWKKKKVKSAKKQKRESKPKKDGPK
    AAEAKGKYFHYDADGHWRRNCPFYLKSLKIKKGDNKSLEDMLIIESNLTISSTSN
    >data_2 GENEID_12 2_exons 121021 - 121590 486 bp, chain -
    ATGTGGCACAACCGCCTAGGCCACATGGGTGACAAGGGGCTGAGGGAGTTGAGCAGGAGA
    AGACACTTCTCAGTTAAGGGGACTCCACAGCAGAATGGGATGGCCGAGAGGATGAATAGA
    ACACTTTTGGAAAAAGGCTCGATGCATGAGGCTGTAGGCAGAGCTTCCAAAGGCATTCTG
    GGTTGA
    >data_2 PROTID_12 2_exons 121021 - 121590 161 aa, chain -
    LVHTDIYFMREKSEVFTKFKIWRAEVEKEQGRSVKCLRSDNGREYTSREFQDYCEECGIR
    RHFSVKGTPQQNGMAERMNRTLLEKGSMHEAVGRASKGILG

program written so far:

    #!/usr/bin/perl
    open(OUT1,">GENEID.out")or die "can not create new file";
    open(OUT2,">PROTID.out")or die "can not create new file";
    open(FILE,"input.txt")or die "can not open file";
    while ($line=<FILE>){
        $hit1= $line=~ /^(>data_\d+\s+GENEID_\d+.*\n.*)/s;
        print OUT1 "$hit1\n";
        $hit2= $line=~ /^(>data_\d+\s+PROTID_\d+.*\n.*)/s;
        print OUT2 "$hit2\n";
    }

desired output:

file GENEID.out:

    >data_1 GENEID_8 1_exons 87028 - 87375 348 bp, chain -
    ATGCCCAAATTAGTCAACATATTGATCACTACGGAGGAAATCTTGAAGAGTTCAAGGGGC
    TGTCCATTTTACTTGAAGAGCCTAAAGATCAAAAAGGGTGATAATAAATCTTTAGAAGAT
    ATGCTCATAATTGAATCTAACCTTACGATTTCTTCTACTTCTAATTGA
    >data_2 GENEID_12 2_exons 121021 - 121590 486 bp, chain -
    ATGTGGCACAACCGCCTAGGCCACATGGGTGACAAGGGGCTGAGGGAGTTGAGCAGGAGA
    AGACACTTCTCAGTTAAGGGGACTCCACAGCAGAATGGGATGGCCGAGAGGATGAATAGA
    ACACTTTTGGAAAAAGGCTCGATGCATGAGGCTGTAGGCAGAGCTTCCAAAGGCATTCTG
    GGTTGA

file PROTID.out:

    >data_1 PROTID_8 1_exons 87028 - 87375 115 aa, chain -
    KLVNILITTEEILKSSRGIVLTVEQTSSIKRKFGWKKKKVKSAKKQKRESKPKKDGPK
    AAEAKGKYFHYDADGHWRRNCPFYLKSLKIKKGDNKSLEDMLIIESNLTISSTSN
    >data_2 PROTID_12 2_exons 121021 - 121590 161 aa, chain -
    LVHTDIYFMREKSEVFTKFKIWRAEVEKEQGRSVKCLRSDNGREYTSREFQDYCEECGIR
    RHFSVKGTPQQNGMAERMNRTLLEKGSMHEAVGRASKGILG

My results contain only the headers (the lines starting with >) and not the sequence strings. Can anyone please point out which line of my code is going wrong? Thank you.

Replies are listed 'Best First'.
Re: pattern matching to separate data
by ELISHEVA (Prior) on May 03, 2009 at 13:29 UTC

    You need to set the record separator to '>', like this: $/='>'. That way each call to <FILE> will get a single record rather than just part of the record. By default $/ is set to the new line and so if you leave your code as is, you are only getting up to the end of each line. See perlvar for more information.
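    A minimal sketch of the record-separator approach, under a few assumptions: read_records() is a hypothetical helper (not from the thread), and a short in-memory sample stands in for input.txt. One detail worth knowing is that with $/ set to '>', the '>' arrives at the end of each record rather than at the start, so the sketch chomps it off and puts it back in front before matching.

```perl
use strict;
use warnings;

# read_records() is a hypothetical helper for illustration only.
# It reads '>'-delimited records from an open filehandle.
sub read_records {
    my ($fh) = @_;
    local $/ = '>';                 # each read now stops at the next '>'
    my @records;
    while ( my $rec = <$fh> ) {
        chomp $rec;                 # chomp strips the trailing '>' (the current $/)
        next unless length $rec;    # the very first read is only the leading '>'
        push @records, ">$rec";     # restore the '>' at the front of the record
    }
    return @records;
}

# Assumed sample data, read from an in-memory filehandle instead of input.txt.
my $sample = ">data_1 GENEID_8 x\nATGC\nTTGA\n>data_1 PROTID_8 x\nKLVN\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
my @records = read_records($in);
close $in;

# Each element is now a whole multi-line record, so one match per record suffices.
my @gene = grep { /^>data_\d+\s+GENEID_\d+/ } @records;
my @prot = grep { /^>data_\d+\s+PROTID_\d+/ } @records;

print "GENEID records:\n", @gene;
print "PROTID records:\n", @prot;
```

    Using local inside the helper keeps the changed $/ from leaking into other reads in the same program.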

    A useful debugging tip is to print out each $line with begin and end markers: print STDERR "##$line##\n". The source of the problem would have been immediately clear had you done that. You might also enjoy this link from someone else who forgot to do that. By pure coincidence, Unbelievably Obvious Debugging Tip just happened to be in today's random pick of Selected Best Nodes.
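    As a small illustration of the marker tip, here is a sketch that reads an assumed three-line sample from an in-memory filehandle with the default $/ and wraps every read in ## markers. The @seen array is only there so the marked strings can be inspected afterwards.

```perl
use strict;
use warnings;

# Assumed sample: one header line followed by two sequence lines.
my $sample = ">data_1 GENEID_8 x\nATGC\nTTGA\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";

my @seen;
while ( my $line = <$in> ) {
    push @seen, "##$line##";        # keep a copy of what each read returned
    print STDERR "##$line##\n";     # the markers show exactly where a read starts and ends
}
close $in;
```

    With the default separator the loop body runs once per line, so the header and its sequence lines show up as separate reads, which is the symptom described above.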

    Best, beth

      Thanks for your advice..that really worked :)
Re: pattern matching to separate data
by jwkrahn (Monsignor) on May 03, 2009 at 18:15 UTC

    This should do what you want:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open OUT1, '>', 'GENEID.out' or die "can not create 'GENEID.out' $!";
    open OUT2, '>', 'PROTID.out' or die "can not create 'PROTID.out' $!";
    open FILE, '<', 'input.txt' or die "can not open 'input.txt' $!";

    while ( my $line = <FILE> ) {
        select OUT1 if $line =~ /^>data_\d+\s+GENEID_\d+\b/;
        select OUT2 if $line =~ /^>data_\d+\s+PROTID_\d+\b/;
        print $line;
    }
Re: pattern matching to separate data
by graff (Chancellor) on May 03, 2009 at 23:42 UTC
    Previous replies have given you a working solution, but in case it helps to know how the OP code went wrong:
    while ($line=<FILE>){
        $hit1= $line=~ /^(>data_\d+\s+GENEID_\d+.*\n.*)/s;
        print OUT1 "$hit1\n";
        $hit2= $line=~ /^(>data_\d+\s+PROTID_\d+.*\n.*)/s;
        print OUT2 "$hit2\n";
    }
    The problems are:
    • The while loop is reading one line at a time, and printing to both output files on every iteration.

    • The input is structured as multi-line records, and the criterion for selecting the correct output file is present only on the first line of each record, so you would need to maintain a "state" variable (or use a variable for the output file handle, and assign it properly on reading the first line of each multi-line record) -- but your loop is pretending that every line contains the criterion for deciding which output to use.

    • You are using capturing parens in your regex match, but assigning the result to a scalar variable in scalar context, which means the value assigned is not the captured text but the match's success flag: 1 if the line matched, or the empty string (numerically 0) if it did not. Note the following difference between assigning the match return in scalar context ($c) versus list context (@m, or $m in parens):

      $str = "text with some pattern in it";
      $c = $str =~ / (some pattern) /;      # sets $c to the numeric value 1
      @m = $str =~ / (some pattern) /;      # assigns "some pattern" as sole element of @m
      ( $m ) = $str =~ / (some pattern) /;  # sets $m to "some pattern"
    The result of those points taken together is that both your output files had the same line count as your input file, and each of those lines is either "1" or empty (a failed match yields the false value, which prints as an empty string and is 0 numerically). (When you said your "results are giving only the headers...", I suspect that you were looking at data that was not created by the code you posted.)
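    To make the context difference concrete, here is a sketch that applies the OP's GENEID regex to four assumed sample lines, once in scalar context and once in list context. The arrays @scalar_results and @list_results are hypothetical names used only to hold the outcomes.

```perl
use strict;
use warnings;

# Assumed sample lines, as the line-at-a-time loop would see them.
my @lines = ( ">data_1 GENEID_8 x\n", "ATGC\n", ">data_1 PROTID_8 x\n", "KLVN\n" );

my ( @scalar_results, @list_results );
for my $line (@lines) {
    # Scalar context: the result is a success flag, not the captured text.
    my $flag = $line =~ /^(>data_\d+\s+GENEID_\d+.*\n.*)/s;
    # List context: the result is the list of captures (empty on failure).
    my ($text) = $line =~ /^(>data_\d+\s+GENEID_\d+.*\n.*)/s;
    push @scalar_results, $flag;
    push @list_results,   $text;
}

for my $i ( 0 .. $#lines ) {
    my $shown = defined $list_results[$i] ? $list_results[$i] : "(no match)\n";
    print "scalar context: [$scalar_results[$i]]  list context: $shown";
}
```

    In scalar context only the GENEID header yields 1; every other line yields a false value that prints as an empty string, which is numerically 0. In list context the header text itself is captured and the non-matching lines leave the variable undefined.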
Re: pattern matching to separate data
by snopal (Pilgrim) on May 03, 2009 at 13:43 UTC

    This looks like homework to me, so I'll make suggestions rather than code.

    1) Filehandles can be stored in variables too:

    my ($fh1, $fh2);
    open $fh1, ">GENEID.out" or ...
    open $fh2, ">PROTID.out" or ...

    2) Read every line - print every line. You are obviously not doing that.

    3) Set another filehandle variable, '$fh', to the appropriate filehandle whenever you detect a change of record type, and use that $fh to print to the right file.
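    The three suggestions above can be sketched roughly as follows. This is one possible reading, not the poster's own code: split_fasta() is a hypothetical helper, and in-memory filehandles stand in for input.txt, GENEID.out and PROTID.out so the sketch runs without touching the disk.

```perl
use strict;
use warnings;

# split_fasta() is a hypothetical helper: it copies every input line to
# whichever output handle the most recent header line selected.
sub split_fasta {
    my ( $in, $gene_fh, $prot_fh ) = @_;
    my $fh;    # the current output handle, i.e. the "state"
    while ( my $line = <$in> ) {
        if    ( $line =~ /^>data_\d+\s+GENEID_\d+\b/ ) { $fh = $gene_fh }
        elsif ( $line =~ /^>data_\d+\s+PROTID_\d+\b/ ) { $fh = $prot_fh }
        next unless defined $fh;    # ignore anything before the first header
        print {$fh} $line;          # header and sequence lines go to the same file
    }
    return;
}

# Assumed sample data and in-memory output buffers.
my $input = ">data_1 GENEID_8 x\nATGC\nTTGA\n>data_1 PROTID_8 x\nKLVN\n";
my ( $gene_out, $prot_out ) = ( '', '' );
open my $in,      '<', \$input    or die "cannot open input: $!";
open my $gene_fh, '>', \$gene_out or die "cannot open GENEID buffer: $!";
open my $prot_fh, '>', \$prot_out or die "cannot open PROTID buffer: $!";

split_fasta( $in, $gene_fh, $prot_fh );
close $_ for $in, $gene_fh, $prot_fh;

print "GENEID.out would contain:\n$gene_out";
print "PROTID.out would contain:\n$prot_out";
```

    For real files, the same helper could be handed handles opened on 'input.txt', 'GENEID.out' and 'PROTID.out' with the three-argument form of open.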

Re: pattern matching to separate data
by citromatik (Curate) on May 04, 2009 at 08:43 UTC

    Another alternative using Tie::File::AnyData::Bio::Fasta:

    use strict;
    use warnings;
    use Fcntl qw/:DEFAULT/;
    use Tie::File::AnyData::Bio::Fasta;

    tie my @arr, 'Tie::File::AnyData::Bio::Fasta', 'input.txt' or die $!;
    tie my @geneids, 'Tie::File::AnyData::Bio::Fasta', 'GENEID.out', mode => O_RDWR | O_CREAT or die $!;
    tie my @protids, 'Tie::File::AnyData::Bio::Fasta', 'PROTID.out', mode => O_RDWR | O_CREAT or die $!;

    @geneids = grep {/GENEID/} @arr;
    @protids = grep {/PROTID/} @arr;

    untie @arr;
    untie @geneids;
    untie @protids;

    citromatik
