Greetings Monks,
I am having trouble coming up with a Parse::RecDescent grammar to parse what looks like a very simple file. I have a grammar that works, but the resulting parser is excruciatingly slow: slow enough to be completely unusable, and slow enough to make me think it must be possible to do better.
So, without further ado, the grammar I am using is:
sequences : header sequence(s)
header : seq_count app_number
seq_count : "<160> NUMBER OF SEQ ID NOS:" /\d+/
app_number : "<140> CURRENT APPLICATION NUMBER:" /[\w\/,]+/
sequence : seq_id seq_length seq_type organism feat_token(s?) seq
seq_id : "<210> SEQ ID NO" /\d+/
seq_length : "<211> LENGTH:" /\d+/
seq_type : "<212> TYPE:" type
type : "DNA" | "PRT"
organism : "<213> ORGANISM:" /\w+ \w+/
feat_token : feature | name_key | location | other
feature : "<220> FEATURE:" /[\w\s]*/
name_key : "<221> NAME/KEY:" /\w+/
location : "<222> LOCATION:" /[\d\.\(\)]+/
other : "<223> OTHER INFORMATION:" /[^<]+/
seq : "<400> SEQUENCE:" /\d+/ /[\w\s]+/
And the actual data I am trying to get at:
<160> NUMBER OF SEQ ID NOS: 727
<140> CURRENT APPLICATION NUMBER: US/09/984,429
<210> SEQ ID NO 1
<211> LENGTH: 733
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 1
gggatccgga gcccaaatct tctgacaaaa ctcacacatg cccaccgtgc ccagcacctg 60
aattcgaggg tgcaccgtca gtcttcctct tccccccaaa acccaaggac accctcatga 120
tctcccggac tcctgaggtc acatgcgtgg tggtggacgt aagccacgaa gaccctgagg 180
tcaagttcaa ctggtacgtg gacggcgtgg aggtgcataa tgccaagaca aagccgcggg 240
aggagcagta caacagcacg taccgtgtgg tcagcgtcct caccgtcctg caccaggact 300
ggctgaatgg caaggagtac aagtgcaagg tctccaacaa agccctccca acccccatcg 360
agaaaaccat ctccaaagcc aaagggcagc cccgagaacc acaggtgtac accctgcccc 420
catcccggga tgagctgacc aagaaccagg tcagcctgac ctgcctggtc aaaggcttct 480
atccaagcga catcgccgtg gagtgggaga gcaatgggca gccggagaac aactacaaga 540
ccacgcctcc cgtgctggac tccgacggct ccttcttcct ctacagcaag ctcaccgtgg 600
acaagagcag gtggcagcag gggaacgtct tctcatgctc cgtgatgcat gaggctctgc 660
acaaccacta cacgcagaag agcctctccc tgtctccggg taaatgagtg cgacggccgc 720
gactctagag gat 733
<210> SEQ ID NO 2
<211> LENGTH: 5
<212> TYPE: PRT
<213> ORGANISM: Homo sapiens
<220> FEATURE:
<221> NAME/KEY: Site
<222> LOCATION: (3)
<223> OTHER INFORMATION: Xaa equals any of the twenty naturally ocurring L-amino acids
<400> SEQUENCE: 2
Trp Ser Xaa Trp Ser
1 5
<210> SEQ ID NO 3
<211> LENGTH: 86
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE:
<221> NAME/KEY: Primer_Bind
<223> OTHER INFORMATION: Synthetic sequence with 4 tandem copies of the GAS binding site
found in the IRF1 promoter (Rothman et al., Immunity 1:457-468
(1994)), 18 nucleotides complementary to the SV40 early promoter,
and a Xho I restriction site.
<400> SEQUENCE: 3
gcgcctcgag atttccccga aatctagatt tccccgaaat gatttccccg aaatgatttc 60
cccgaaatat ctgccatctc aattag 86
<210> SEQ ID NO 86
<211> LENGTH: 194
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE:
<223> OTHER INFORMATION: Amplimer
<400> SEQUENCE: 86
tgcttggtga aggaatagcc accccagaga aggagtatgg acttctatac acaatcattc 60
attcattcat tcattcattc attcattcat tcattcacta ctcatgcatg atctttgtcc 120
ttatcttcct ccactgtcac atgaataccc acccactgca cctacctgct tcctattcct 180
gagaacccag gctc 194
<210> SEQ ID NO 87
<211> LENGTH: 23
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 87
ggcaatggag gagttccggg aca 23
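For anyone who wants to reproduce the timing, here is a minimal sketch of a driver, assuming Parse::RecDescent is installed. The grammar is the one given above, unchanged. The sample text is only the header plus record 87 from the data, and the variable names are illustrative assumptions, not part of any real script. A real run would read a whole file into $text instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;
use Time::HiRes qw(time);

# The grammar from the post, unchanged. The single-quoted heredoc keeps
# the backslashes in the regexes intact.
my $grammar = <<'GRAMMAR';
sequences : header sequence(s)
header : seq_count app_number
seq_count : "<160> NUMBER OF SEQ ID NOS:" /\d+/
app_number : "<140> CURRENT APPLICATION NUMBER:" /[\w\/,]+/
sequence : seq_id seq_length seq_type organism feat_token(s?) seq
seq_id : "<210> SEQ ID NO" /\d+/
seq_length : "<211> LENGTH:" /\d+/
seq_type : "<212> TYPE:" type
type : "DNA" | "PRT"
organism : "<213> ORGANISM:" /\w+ \w+/
feat_token : feature | name_key | location | other
feature : "<220> FEATURE:" /[\w\s]*/
name_key : "<221> NAME/KEY:" /\w+/
location : "<222> LOCATION:" /[\d\.\(\)]+/
other : "<223> OTHER INFORMATION:" /[^<]+/
seq : "<400> SEQUENCE:" /\d+/ /[\w\s]+/
GRAMMAR

# Tiny stand-in for a real file: the header and one record from the post.
my $text = <<'DATA';
<160> NUMBER OF SEQ ID NOS: 727
<140> CURRENT APPLICATION NUMBER: US/09/984,429
<210> SEQ ID NO 87
<211> LENGTH: 23
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 87
ggcaatggag gagttccggg aca 23
DATA

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Call the start rule as a method; it returns undef if the parse fails.
my $start   = time;
my $result  = $parser->sequences($text);
my $elapsed = time - $start;

defined $result or die "Parse failed\n";

# With no explicit actions, the start rule returns the value of its last
# item, here the array reference built by sequence(s).
printf "Parsed %d sequence record(s) in %.3f s\n",
    scalar @$result, $elapsed;
```

Timing the same driver against inputs of increasing size should show whether the run time grows faster than the file size.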
I suspect I am being overly greedy somewhere in the rules and that this is what slows things down. Any suggestions on how this can be improved? (As you can probably tell, I am extremely new to P::RD.)
I don't need it to be blindingly fast, but currently each file (several megabytes) would take well over half an hour, and I have thousands of them!
Thanks in advance.