I have a file in the FRG format and would like to extract certain information from it. My current script parses some of it, but not accurately: it treats every "acc:" as a sequence ID. However, there are other "acc:" fields that hold the library name rather than a sequence ID. I would like to capture only the "acc:" that follows a "{FRG" opening bracket, plus the sequence lines under "seq:", and nothing from the rest of the record. Right now my script captures everything from "typ:" up to the next "acc:".
How can I match only the sequence between "seq:" and the "." line that precedes "qlt:"?
{BAT
bna:(Batch name)
crt:12012334370
acc:110767247557
com:
Generated By LibraryInterface process
Start: 1147321117200
StartDate: 20060511 00:16:47
End: 1149189374000
EndDate: 20060601 15:16:14
Index: 1 of 1
Filename: Library_name.frg
.
}
{DST
act:A
acc:1052000138220
mea:5000.0
std:1000.0
}
{FRG
act:A
acc:1101077781160
typ:R
src:
.
etm:0
seq:
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
.
qlt:
6666666?::866::<>H\EEGE@B?AAJMRXXXXX\XSE?BB??MPROOPOSZHIKIIO____aaRQQM
MPOPJOMOROQHKMPNM____Y___]_\]VVV\\V_VVXXX\Y______]Y__Y__V\VVVV______\V
\X____MMV\
X\Y___TXX_]___\V\\\\______\\V]_\\\\\\S_\\V\]\X\\\]\\\S_________]V__\SS
SZZS\\\]]RRQQSSGSRMHHGHHF\\PSSSSS\\\__]]\]\\\MSRQQNRRTZSS]\]VOPHCEJOOL
\_\V\VSVHGGCMMP@>9977@EJAXVV\JSIOCAB97@AECHC>A@??99=<ABL>A98?EC>99@>A>
<<<>ACC=J@C@@GRTKCCHE?>CFR]]\DD<8>>:7:@B;:B8888<<@EHE>97B77<<@GCLG<>99
:;9:88<776
.
clr:17,855
}
{FRG
act:A
acc:1101077781161
typ:R
src:
.
etm:0
seq:
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
.
qlt:
7966666666766:877:E>G==AAAEEEOOMMHQGEEDG@ACBIILJLKNRRRNNDDNNQMJXR]VIGI
YY]_YaaMMRSS\VVVSPTWPRTV_Y__V
\Xaaa\]_V\XaaPL]]_\__TVRXVW]XXXXWXV__TXPLLXW___VTTXXVaaaaaa]XPTV_XXXXR
XVT_Y_______Y__aY_\TRRMSRRV\_]]]S\]V___V\X____aVXXXV\]aa_]XXXWPPPWWPLL
aPX]]]___XXXPTLLXX__]]aaaaaaa]\Y__T\XRa]]a\VV\RVVXTT]RXXXVMSHMOW]XXXXX
Xaaa]]XPLLLXXTT]H____V_______Y_____]aXTLPWT___XTPPWTXWXVMMVROT_]]T_TTT
TV_TTTTTHICJHKNTRPRHHK\VTXXXRRRVMOIISRRWXXPXRRXXX]]]V\\V\VSR\V\VRRRH\X
XXR]XLLTX]]]]]XXMNLNNR]_\SS]IIHEEEKGEHJ]]EEGGCABA>>A>>;::;=?@><7779B97
<;A:=<<79<7:<<>:8<8866677<99866999<888=9=;>AA:
.
clr:20,824
}
My code:
#!/usr/bin/perl
#FRG to FASTA
$/ = "acc:";
$| = 1;
while(<>){ #FRG file as input
    chomp;
    my ($titleline, $sequence) = split(/\n/,$_,2);
    next unless ($sequence && $titleline);
    my ($id) = $titleline =~ /^(\S+)/;
    $sequence =~ s/\n//g;
    print ">$titleline\n$sequence\n";
}
The ideal output should look like this:
>1101077781160
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
>1101077781161
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
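For reference, here is a minimal sketch of one way this could be approached: read line by line and track the current block, rather than splitting on "acc:". It rests on assumptions drawn only from the sample above: a block opens with "{NAME" and closes with "}", and a multi-line field (com:, src:, seq:, qlt:) is a bare "name:" line whose body ends at a line containing only ".". The names frg_to_fasta, $in_frg and $field are made up for illustration.

```perl
#!/usr/bin/perl
# Sketch: FRG to FASTA by tracking which block and field we are in.
use strict;
use warnings;

# Takes the whole FRG text, returns the FASTA text.
sub frg_to_fasta {
    my ($text) = @_;
    my $fasta  = '';
    my $in_frg = 0;    # true while inside a {FRG ... } block
    my $field  = '';   # name of the multi-line field being read, '' if none
    my $id;            # acc: value of the current {FRG block

    for my $line (split /\r?\n/, $text) {
        if ($field ne '') {
            # Inside a multi-line field (assumed to end at a lone ".").
            if ($line eq '.') {
                $field = '';
            }
            elsif ($in_frg && $field eq 'seq') {
                $fasta .= "$line\n";          # keep only sequence lines
            }
        }
        elsif ($line =~ /^\{(\w+)/) {         # block opens: {BAT, {DST, {FRG
            $in_frg = ($1 eq 'FRG');
            $id     = undef;
        }
        elsif ($line =~ /^\}/) {              # block closes
            $in_frg = 0;
        }
        elsif ($line =~ /^(\w+):$/) {         # bare "name:" starts a multi-line field
            $field = $1;
            if ($in_frg && $field eq 'seq') {
                my $name = defined $id ? $id : 'unknown';
                $fasta .= ">$name\n";
            }
        }
        elsif ($in_frg && $line =~ /^acc:(\S+)/) {
            $id = $1;                         # acc: seen inside {FRG only
        }
    }
    return $fasta;
}

# When run directly with file names, slurp them and print the FASTA.
if (!caller && @ARGV) {
    local $/;                                 # slurp mode
    my $text = <>;
    print frg_to_fasta($text) if defined $text;
}
```

It would be run as: perl frg2fasta.pl Library_name.frg > Library_name.fasta (the script name is hypothetical). Because the "acc:" lines in {BAT and {DST are seen while $in_frg is false, they are never used as IDs, and the qlt: body is skipped because only the seq: field is copied to the output.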