Vertical Regex

joomanji has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in a format called FRG, and I would like to extract certain information from it. My current script parses some of the information, but not accurately: it treats every "acc:" as a sequence ID. However, there is another "acc:" which represents the library name instead of the sequence ID. I would like to capture only the "acc:" that follows the "{FRG" bracket, and the sequence lines under "seq:", but not the rest of the record. My current script grabs everything from "typ:" up to the next "acc:". How can I match only the sequence, from "seq:" up to the "." line that precedes "qlt:"?

{BAT
bna:(Batch name)
crt:12012334370
acc:110767247557
com:
Generated By LibraryInterface process
Start: 1147321117200
StartDate: 20060511 00:16:47
End: 1149189374000
EndDate: 20060601 15:16:14
Index: 1 of 1
Filename: Library_name.frg
.
}
{DST
act:A
acc:1052000138220
mea:5000.0
std:1000.0
}
{FRG
act:A
acc:1101077781160
typ:R
src:
.
etm:0
seq:
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
.
qlt:
6666666?::866::<>H\EEGE@B?AAJMRXXXXX\XSE?BB??MPROOPOSZHIKIIO____aaRQQM
MPOPJOMOROQHKMPNM____Y___]_\]VVV\\V_VVXXX\Y______]Y__Y__V\VVVV______\V
\X____MMV\
X\Y___TXX_]___\V\\\\______\\V]_\\\\\\S_\\V\]\X\\\]\\\S_________]V__\SS
SZZS\\\]]RRQQSSGSRMHHGHHF\\PSSSSS\\\__]]\]\\\MSRQQNRRTZSS]\]VOPHCEJOOL
\_\V\VSVHGGCMMP@>9977@EJAXVV\JSIOCAB97@AECHC>A@??99=<ABL>A98?EC>99@>A>
<<<>ACC=J@C@@GRTKCCHE?>CFR]]\DD<8>>:7:@B;:B8888<<@EHE>97B77<<@GCLG<>99
:;9:88<776
.
clr:17,855
}
{FRG
act:A
acc:1101077781161
typ:R
src:
.
etm:0
seq:
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
.
qlt:
7966666666766:877:E>G==AAAEEEOOMMHQGEEDG@ACBIILJLKNRRRNNDDNNQMJXR]VIGI
YY]_YaaMMRSS\VVVSPTWPRTV_Y__V
\Xaaa\]_V\XaaPL]]_\__TVRXVW]XXXXWXV__TXPLLXW___VTTXXVaaaaaa]XPTV_XXXXR
XVT_Y_______Y__aY_\TRRMSRRV\_]]]S\]V___V\X____aVXXXV\]aa_]XXXWPPPWWPLL
aPX]]]___XXXPTLLXX__]]aaaaaaa]\Y__T\XRa]]a\VV\RVVXTT]RXXXVMSHMOW]XXXXX
Xaaa]]XPLLLXXTT]H____V_______Y_____]aXTLPWT___XTPPWTXWXVMMVROT_]]T_TTT
TV_TTTTTHICJHKNTRPRHHK\VTXXXRRRVMOIISRRWXXPXRRXXX]]]V\\V\VSR\V\VRRRH\X
XXR]XLLTX]]]]]XXMNLNNR]_\SS]IIHEEEKGEHJ]]EEGGCABA>>A>>;::;=?@><7779B97
<;A:=<<79<7:<<>:8<8866677<99866999<888=9=;>AA:
.
clr:20,824
}

My code:

#!/usr/bin/perl
#FRG to FASTA

$/ = "acc:";
$| = 1;

while(<>){ #FRG file as input
   chomp;
   my ($titleline, $sequence) = split(/\n/,$_,2);
   next unless ($sequence && $titleline);
   my ($id) = $titleline =~ /^(\S+)/;
   $sequence =~ s/\n//g; 
   print ">$titleline\n$sequence\n";
        }

The ideal output should look like this:

>1101077781160
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
>1101077781161
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac


Re: Vertical Regex by zwon (Abbot) on Jun 03, 2009 at 21:34 UTC
I'm not sure how stable the file structure is, so maybe you really should write a parser, as JavaFan suggests, but the following example works on the given input: `use strict; use warnings; open my $fh, "<", "file.frg"; while(<$fh>) { if (/^{FRG/../^}/){ if (/^acc:(\d+)/) { print ">$1\n"; } elsif ( /^seq:/ ) { while(<$fh>) { last if /^\./; print; } } } }`
Re^2: Vertical Regex by joomanji (Acolyte) on Jun 04, 2009 at 16:27 UTC
Thank you very much, zwon! Your script works just fine for me. I've made some changes to the code so that I can specify the input file, but I'm not sure whether this is the correct way of doing it. Thank you for showing me how the regex works to match only the text I want. `my ($frg) = @ARGV; $frgfile = "$frg"; open(frgfile) or die("Unable to open FRG file"); while(<frgfile>) { if (/^{FRG/../^}/){ if (/^acc:(\d+)/) { print ">$1\n"; } elsif ( /^seq:/ ) { while(<frgfile>) { last if /^\./; print; } } } }`
Re^3: Vertical Regex by zwon (Abbot) on Jun 04, 2009 at 17:18 UTC
`my ($frg) = @ARGV; $frgfile = "$frg"; open(frgfile) or die("Unable to open FRG file");` That's a very confusing form. It's good practice to always start your scripts with `use strict; use warnings;` as this may help you avoid many problems. In this case you would see some warnings. I'd write this as follows: `open my $fd, '<', $ARGV[0] or die $!; while (<$fd>) { ...` Note how you can store the filehandle in a scalar variable.
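For reference, the advice above can be put together into one script. This is only a sketch: the sub name `frg_to_fasta` is made up for illustration, and it assumes, as in the sample data, that every sequence block ends with a line starting with ".". It uses the same range-operator logic as the earlier reply, with lexical filehandles and the three-argument `open`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read FRG text from $in and
# write FASTA to $out. Both arguments are filehandles held in scalars.
sub frg_to_fasta {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        # True from a "{FRG" line through the closing "}" line.
        if ($line =~ /^\{FRG/ .. $line =~ /^\}/) {
            if ($line =~ /^acc:(\d+)/) {
                print {$out} ">$1\n";
            }
            elsif ($line =~ /^seq:/) {
                # Copy sequence lines until the terminating "." line.
                while (defined($line = <$in>)) {
                    last if $line =~ /^\./;
                    print {$out} $line;
                }
            }
        }
    }
    return;
}

# Run as a script only when a file name is given on the command line.
if (@ARGV) {
    my $path = $ARGV[0];
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    frg_to_fasta($fh, \*STDOUT);
    close $fh;
}
```

Passing the handles in as arguments keeps the parsing separate from the file opening, so the same sub can be fed a real file or an in-memory string opened with `open my $fh, '<', \$text`.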
Re: Vertical Regex by JavaFan (Canon) on Jun 03, 2009 at 21:19 UTC
You are probably better off writing a parser that can parse the entire file, turns it into a structure (a list of hashes, for instance), and then processes the structures. From the example I'd say writing a parser is fairly trivial, but without a spec of how such an FRG file can be formatted it's hard to say. But that isn't any different from trying to solve the problem with a handful of regexes.
Re^2: Vertical Regex by joomanji (Acolyte) on Jun 04, 2009 at 16:23 UTC
I agree with you that writing a parser to parse the entire file is more feasible, and that it can be used to extract other information. For this script, I would like to learn how to confine searches to part of the input text. From there, I should be able to learn and apply it to make a better parser. I'm very interested to know how to parse a text file into a list of hashes. When the file gets very big, would that be a good way to speed up the search process?
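A minimal sketch of the list-of-hashes idea, based only on the sample in the question since there is no format spec in this thread. The sub name `parse_frg` and the `type` key are invented for illustration. It assumes records are not nested, that every field looks like `key:value`, and that a field with an empty value (`com:`, `src:`, `seq:`, `qlt:`) continues on the following lines until a line containing only ".".

```perl
use strict;
use warnings;

# Hypothetical parser: returns one hash reference per "{TYPE ... }" record.
sub parse_frg {
    my ($in) = @_;
    my (@records, $rec);
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^\{(\w+)/) {
            $rec = { type => $1 };          # start of a new record
        }
        elsif (!$rec) {
            next;                           # ignore anything outside a record
        }
        elsif ($line =~ /^\}/) {
            push @records, $rec;            # end of the record
            undef $rec;
        }
        elsif ($line =~ /^(\w+):(.*)$/) {
            my ($key, $value) = ($1, $2);
            if ($value eq '') {
                # Assumed multi-line field: collect lines up to the lone ".".
                my @lines;
                while (defined(my $more = <$in>)) {
                    chomp $more;
                    last if $more eq '.';
                    push @lines, $more;
                }
                $value = join "\n", @lines;
            }
            $rec->{$key} = $value;
        }
    }
    return @records;
}

# Usage: print FASTA for the FRG records of the file named on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";
    my @frags = grep { $_->{type} eq 'FRG' } parse_frg($fh);
    close $fh;
    print ">$_->{acc}\n$_->{seq}\n" for @frags;
}
```

Note that this keeps the whole file in memory, so for a very big file it is not automatically faster than printing as you read. What the structure does give you is cheap repeated lookups, for example through an index such as `my %by_acc = map { $_->{acc} => $_ } @frags;`.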
Re: Vertical Regex by johngg (Canon) on Jun 03, 2009 at 22:06 UTC
Here is a slightly different approach to zwon's, where I read the file in only one place and use state variables to keep track of where we are in the data file. `use strict; use warnings; my $frgFile = q{spw768164.frg}; open my $frgFH, q{<}, $frgFile or die qq{open: < $frgFile: $!\n}; my $inFRG = 0; my $inSeq = 0; while( <$frgFH> ) { if( $inFRG ) { if( m{^\}} ) { $inFRG = 0; } elsif( $inSeq ) { if( m{^\.} ) { $inSeq = 0; } else { print; } } elsif( m{^acc:(\d+)} ) { print qq{>$1\n}; } elsif( m{^seq:} ) { $inSeq = 1; } else { ; } } else { next unless m{^\{FRG}; $inFRG = 1; } } close $frgFH or die qq{close: < $frgFile: $!\n};` The output:
>1101077781160
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
>1101077781161
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
I hope this is of interest. Cheers, JohnGG
Re^2: Vertical Regex by joomanji (Acolyte) on Jun 04, 2009 at 16:32 UTC
Thanks, JohnGG, for demonstrating another method of pattern matching. I will look into this code in detail so that I can apply the knowledge to future scripts! Thanks!
Re: Vertical Regex by bichonfrise74 (Vicar) on Jun 03, 2009 at 22:40 UTC
Something like this perhaps? `#!/usr/bin/perl use strict; while (<DATA>) { if ( /^{FRG/ ... /^}/ ) { print ">$1\n" if /^acc:(\d+)/; if ( /^seq:/ ... /^\./ ) { next if ( /^seq:/ or /^\./ ); print; } } }`
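The replies in this thread use both the two-dot and the three-dot form of the range operator. For the FRG data either works, because no line matches both the start and the end pattern. A small sketch with made-up input shows where they differ: `..` tests the end pattern on the same line that matched the start pattern, while `...` waits until the next line.

```perl
use strict;
use warnings;

# Made-up input in which the start and end marker are the same pattern.
my @lines = ('x', '#', 'a', '#', 'y');

my (@two, @three);
for (@lines) {
    # ".." checks the right operand on the line that turned the range on,
    # so each '#' opens and closes the range at once.
    push @two, $_ if /^#/ .. /^#/;

    # "..." does not check the right operand until the following line,
    # so the range runs from the first '#' to the second one.
    push @three, $_ if /^#/ ... /^#/;
}

print "two dots:   @two\n";      # prints "two dots:   # #"
print "three dots: @three\n";    # prints "three dots: # a #"
```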
Re^2: Vertical Regex by joomanji (Acolyte) on Jun 04, 2009 at 16:37 UTC
Thanks for the solution. It is a quick and easy way to pull out just the data that I need. It works as well! TQ!
Re: Vertical Regex by tweetiepooh (Hermit) on Jun 04, 2009 at 08:40 UTC
Also check out the BioPerl module on CPAN, as it does mention FASTA and cluster files. Googling seems to find more references to converting FASTA to FRG than the other way round.

