Vertical Regex

by joomanji (Acolyte)
on Jun 03, 2009 at 21:05 UTC

joomanji has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in the FRG format and would like to extract certain information from it. My current script parses some of the information, but not accurately. It treats every "acc:" as a sequence ID. However, there is another "acc:" that represents the library name rather than the sequence ID. I would like to capture only the "acc:" that follows the "{FRG" bracket, plus the sequence under "seq:", but not the rest of the record. My current script grabs everything from "typ:" to the next "acc:". How can I match only the sequence from "seq:" up to "qlt:"?
{BAT bna:(Batch name) crt:12012334370 acc:110767247557 com: Generated By LibraryInterface process Start: 1147321117200 StartDate: 20060511 00:16:47 End: 1149189374000 EndDate: 20060601 15:16:14 Index: 1 of 1 Filename: Library_name.frg . } {DST act:A acc:1052000138220 mea:5000.0 std:1000.0 } {FRG act:A acc:1101077781160 typ:R src: . etm:0 seq: acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat tatctgacga . qlt: 6666666?::866::<>H\EEGE@B?AAJMRXXXXX\XSE?BB??MPROOPOSZHIKIIO____aaRQQM MPOPJOMOROQHKMPNM____Y___]_\]VVV\\V_VVXXX\Y______]Y__Y__V\VVVV______\V \X____MMV\ X\Y___TXX_]___\V\\\\______\\V]_\\\\\\S_\\V\]\X\\\]\\\S_________]V__\SS SZZS\\\]]RRQQSSGSRMHHGHHF\\PSSSSS\\\__]]\]\\\MSRQQNRRTZSS]\]VOPHCEJOOL \_\V\VSVHGGCMMP@>9977@EJAXVV\JSIOCAB97@AECHC>A@??99=<ABL>A98?EC>99@>A> <<<>ACC=J@C@@GRTKCCHE?>CFR]]\DD<8>>:7:@B;:B8888<<@EHE>97B77<<@GCLG<>99 :;9:88<776 . clr:17,855 } {FRG act:A acc:1101077781161 typ:R src: . etm:0 seq: gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc taatccaaattgtttaccctagacaatcaaactgccagtattaaac . 
qlt: 7966666666766:877:E>G==AAAEEEOOMMHQGEEDG@ACBIILJLKNRRRNNDDNNQMJXR]VIGI YY]_YaaMMRSS\VVVSPTWPRTV_Y__V \Xaaa\]_V\XaaPL]]_\__TVRXVW]XXXXWXV__TXPLLXW___VTTXXVaaaaaa]XPTV_XXXXR XVT_Y_______Y__aY_\TRRMSRRV\_]]]S\]V___V\X____aVXXXV\]aa_]XXXWPPPWWPLL aPX]]]___XXXPTLLXX__]]aaaaaaa]\Y__T\XRa]]a\VV\RVVXTT]RXXXVMSHMOW]XXXXX Xaaa]]XPLLLXXTT]H____V_______Y_____]aXTLPWT___XTPPWTXWXVMMVROT_]]T_TTT TV_TTTTTHICJHKNTRPRHHK\VTXXXRRRVMOIISRRWXXPXRRXXX]]]V\\V\VSR\V\VRRRH\X XXR]XLLTX]]]]]XXMNLNNR]_\SS]IIHEEEKGEHJ]]EEGGCABA>>A>>;::;=?@><7779B97 <;A:=<<79<7:<<>:8<8866677<99866999<888=9=;>AA: . clr:20,824 }
My code:
#!/usr/bin/perl
#FRG to FASTA
$/ = "acc:";
$| = 1;
while(<>){ #FRG file as input
    chomp;
    my ($titleline, $sequence) = split(/\n/,$_,2);
    next unless ($sequence && $titleline);
    my ($id) = $titleline =~ /^(\S+)/;
    $sequence =~ s/\n//g;
    print ">$titleline\n$sequence\n";
}
The ideal output should look like this:
>1101077781160
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
>1101077781161
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac

Replies are listed 'Best First'.
Re: Vertical Regex
by zwon (Abbot) on Jun 03, 2009 at 21:34 UTC

    I'm not sure how stable the file structure is. Maybe you really should write a parser, as JavaFan suggests, but the following example works on the given input:

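    use strict;
    use warnings;

    open my $fh, "<", "file.frg" or die "Cannot open file.frg: $!";
    while(<$fh>) {
        # true from a "{FRG" line through the matching "}" line
        if (/^\{FRG/../^\}/){
            if (/^acc:(\d+)/) {
                print ">$1\n";
            }
            elsif ( /^seq:/ ) {
                # print sequence lines until the terminating "." line
                while(<$fh>) {
                    last if /^\./;
                    print;
                }
            }
        }
    }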
      Thank you very much, zwon! Your script works just fine for me. I've made some changes to the code so that I can specify the input file, but I'm not sure whether this is the correct way to do it. Thank you for showing me how the regex matches just the text I want.
      my ($frg) = @ARGV;
      $frgfile = "$frg";
      open(frgfile) or die("Unable to open FRG file");
      while(<frgfile>) {
          if (/^{FRG/../^}/){
              if (/^acc:(\d+)/) {
                  print ">$1\n";
              }
              elsif ( /^seq:/ ) {
                  while(<frgfile>) {
                      last if /^\./;
                      print;
                  }
              }
          }
      }
        my ($frg) = @ARGV;
        $frgfile = "$frg";
        open(frgfile) or die("Unable to open FRG file");

        That's a very confusing form. It's good practice to always start your scripts with

        use strict; use warnings;
        This may help you avoid many problems. In this case you would see some warnings. I'd write this as follows:
        open my $fd, '<', $ARGV[0] or die $!;
        while (<$fd>) { ...
        Note how you can store a filehandle in a scalar variable.
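Putting that advice together, a complete version of the script might look like the sketch below. It assumes the field layout of the sample in the question (one field per line, "." closing the sequence). The sub name frg_to_fasta is made up for illustration and is not from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy the acc: id and the
# seq: lines of every {FRG ... } record from $in to $out as FASTA.
sub frg_to_fasta {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        # Flip-flop: true from a "{FRG" line through the closing "}" line.
        if ($line =~ /^\{FRG/ .. $line =~ /^\}/) {
            if ($line =~ /^acc:(\d+)/) {
                print {$out} ">$1\n";
            }
            elsif ($line =~ /^seq:/) {
                # Copy sequence lines until the terminating "." line.
                while (defined($line = <$in>)) {
                    last if $line =~ /^\./;
                    print {$out} $line;
                }
            }
        }
    }
    return;
}

# Only touch the filesystem when a file name was actually given.
if (@ARGV) {
    my $file = $ARGV[0];
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    frg_to_fasta($fh, \*STDOUT);
    close $fh or die "Cannot close '$file': $!";
}
```

Run it as `perl frg2fasta.pl Library_name.frg > out.fasta` (the script name is arbitrary). Keeping the loop in a sub that takes filehandles also makes it easy to try on a small in-memory string before pointing it at a large file.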
Re: Vertical Regex
by JavaFan (Canon) on Jun 03, 2009 at 21:19 UTC
    You are probably better off writing a parser that can parse the entire file, turn it into a structure (a list of hashes, for instance), and then process the structures.

    From the example I'd say writing a parser is fairly trivial, but without a spec of how such an FRG file can be formatted it's hard to say. But that isn't any different from trying to solve the problem with a handful of regexes.

      I agree that writing a parser for the entire file is more practical and could be used to extract other information. For this script, I would like to learn how to confine searches to part of the input text. From there, I can learn and apply it to build a better parser. I'm very interested to know how to parse a text file into a list of hashes. When the file gets very big, would that be a good way to speed up the search process?
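As a rough idea of what parsing into a list of hashes could look like, here is a minimal sketch. There is no FRG spec in this thread, so the rules are guessed from the sample alone: "{TYPE" opens a record, "}" closes it, "key:value" is a one-line field, and "key:" with nothing after the colon starts a multi-line field that ends at a line holding only ".". The names parse_frg and frg_fasta are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical parser: returns a reference to a list of hashes,
# one hash per {TYPE ... } record, under the assumptions stated above.
sub parse_frg {
    my ($fh) = @_;
    my (@records, $rec, $multi);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                  # drop newline and trailing blanks
        if (defined $multi) {
            if ($line eq '.') { undef $multi }            # field finished
            else              { $rec->{$multi} .= $line } # join the lines
        }
        elsif ($line =~ /^\{(\w+)/) {
            $rec = { type => $1 };           # start a new record
        }
        elsif ($line =~ /^\}/) {
            push @records, $rec if $rec;     # record complete
            undef $rec;
        }
        elsif ($rec && $line =~ /^(\w+):\s*(.*)$/) {
            my ($key, $value) = ($1, $2);
            if (length $value) { $rec->{$key} = $value }  # one-line field
            else               { $multi = $key; $rec->{$key} = '' }
        }
    }
    return \@records;
}

# Build FASTA text from the FRG records only (BAT, DST etc. are skipped).
sub frg_fasta {
    my ($records) = @_;
    return join '', map  { ">$_->{acc}\n$_->{seq}\n" }
                    grep { $_->{type} eq 'FRG' } @$records;
}

if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";
    print frg_fasta(parse_frg($fh));
    close $fh;
}
```

Note that this joins each sequence into a single line instead of keeping the 70-column wrapping of the input, and it holds every record in memory, which may not suit a very large file. If lookups by id are needed afterwards, a hash keyed on acc built from the returned list avoids rescanning the list for each search.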
Re: Vertical Regex
by johngg (Canon) on Jun 03, 2009 at 22:06 UTC

    Here is a slightly different approach to zwon's where I only read the file in one place and use state variables to keep track of where we are in the data file.

    use strict;
    use warnings;

    my $frgFile = q{spw768164.frg};
    open my $frgFH, q{<}, $frgFile
        or die qq{open: < $frgFile: $!\n};

    my $inFRG = 0;
    my $inSeq = 0;

    while( <$frgFH> ) {
        if( $inFRG ) {
            if( m|^\}| ) {
                $inFRG = 0;
            }
            elsif( $inSeq ) {
                if( m{^\.} ) {
                    $inSeq = 0;
                }
                else {
                    print;
                }
            }
            elsif( m{^acc:(\d+)} ) {
                print qq{>$1\n};
            }
            elsif( m{^seq:} ) {
                $inSeq = 1;
            }
            else {
                ;
            }
        }
        else {
            next unless m|^\{FRG|;
            $inFRG = 1;
        }
    }

    close $frgFH
        or die qq{close: < $frgFile: $!\n};

    The output.

    >1101077781160
    acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
    ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
    tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
    taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
    tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
    catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
    tatctgacga
    >1101077781161
    gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
    attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
    caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
    taatccaaattgtttaccctagacaatcaaactgccagtattaaac

    I hope this is of interest.

    Cheers,

    JohnGG

      Thanks, JohnGG, for demonstrating another method for pattern matching. I will look into this code in detail so that I can apply the knowledge to future scripts! Thanks!!
Re: Vertical Regex
by bichonfrise74 (Vicar) on Jun 03, 2009 at 22:40 UTC
    Something like this perhaps?
    #!/usr/bin/perl
    use strict;

    while (<DATA>) {
        if ( /^{FRG/ ... /^}/ ) {
            print ">$1\n" if /^acc:(\d+)/;
            if ( /^seq:/ ... /^\./ ) {
                next if ( /^seq:/ or /^\./ );
                print;
            }
        }
    }
      Thanks for the solution. It is a quick and easy way to parse out just the data I needed. It works as well! TQ!
Re: Vertical Regex
by tweetiepooh (Hermit) on Jun 04, 2009 at 08:40 UTC
    Also check out the BioPerl module in CPAN as that does mention FASTA and cluster files. Googling seems to find more references to converting FASTA to FRG formats.
