PerlMonks  

Re: split a file into records and process it

by rubasov (Friar)
on Mar 24, 2010 at 17:42 UTC (#830623)


in reply to split a file into records and process it

I've tried to write a little parser by adapting the code from an earlier post of mine (Re: The story of a strange line of code: pos($_) = pos($_);). As others have already noted, there are several unclear points in your format specification; however, the code below tries to be easy to adjust to your real needs.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# May appear an attribute more than once in a record?
# Sorry this is not DRY, the attribute names are duplicated
# here and in the parser regex.
my %is_single = (
    'exon'          => 1,
    'gene_id'       => 1,
    'product_id'    => 1,
    'sno_rna'       => 1,
    'query_subject' => 0,
    'gene_name'     => 1,
    'link'          => 1,
    'other'         => 0,
);

# You can use split, just match the null string
# before the real match in a look-ahead.
my @records = split /^(?=\d+$)/m, do { local $/; <DATA> };

# An array of hash of something, one item / record.
my @parsed_records;
#my %sno_records;

for (@records) {

    my %record;

    # You probably want to eliminate those ugly trailing spaces first
    # and then leave out the '\s*' parts just before '$'.
    my $re = qr{
        (?:  ^ (?<exon> \d+ ) \s* $ )
        |
        (?:  ^ GI:\s* (?<gene_id> \d+ ) \s* $ )
        |
        (?:  ^ NM_ (?<product_id> \d+\.\d ) \s* $ )
        |
        (?:  ^ snoRNA\s+ (?<sno_rna> .+ ) \s* $ )
        |
        (?s: ^ (?<query_subject> Query .*? Sbjct .*? ) \s* $ )
        |
        (?i: ^ (?<gene_name> Homo \s sapiens .* ) \s* $ )
        |
        (?:  ^ (?<link> http://.* ) \s* $ )
        |
        (?:  ^ (?<other> .+ ) \s* $ )
        # Order of branches matters, leave (?<other>) at the very end.
    }mx;

    while (m/$re/gc) {

        my ( $key ) = keys %+;
        my ( $val ) = values %+;

        # If a key can appear only once then simply store it.
        if ( $is_single{$key} ) {
            $record{$key} = $val;
        }
        # Else put it into an array.
        else {
            push @{ $record{$key} }, $val;
        }
    }

    # This @parsed_records is _not_ keyed by sno_rna, as it
    # seemed unnatural for me with the provided sample data.
    push @parsed_records, \%record;

    # But you can easily transform it to a data structure keyed by sno_rna
    # just uncomment the lines related to %sno_records.
    #push @{ $sno_records{ $record{sno_rna} } }, \%record;
    #delete $record{sno_rna};
}

print Dumper( \@parsed_records );
#print Dumper( \%sno_records );

__DATA__
3
GI:91982771
NM_001040105.1
snoRNA 10
Query  4     TGGAGTCAAT  13
             ||||||||||
Sbjct  4854  TGGAGTCAAT  4845
Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771
3
GI:154448895
NM_001100162.1
snoRNA 25, 26 and 27
Query  2    CCTGGAGTCGAGTG  15
            ||||||||||||||
Sbjct  146  CCTGGAGTCGAGTG  133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
31
4 different hits
GI:153945877
NM_002458.1
snoRNA 25, 26 and 27
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  6818  CTGGAGTCGAGTG  6806
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  8489  CTGGAGTCGAGTG  8477
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  10589  CTGGAGTCGAGTG  10577
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  12260  CTGGAGTCGAGTG  12248
Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877
4
GI:150418008
NM_206862.2
snoRNA 25, 26 and 27
Query  1     ACCTGGAGTCGAG  13
             |||||||||||||
Sbjct  4775  ACCTGGAGTCGAG  4763
Homo sapiens transforming, acidic coiled-coil containing protein 2 (TACC2), transcript variant 1, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=150418008
update: I've put the DATA section behind a readmore tag.
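
For anyone who wants to see the two ideas in isolation (the look-ahead split, and the regrouping by sno_rna that the commented-out lines hint at), here is a stripped-down sketch. It is not the parser from this post: the sample input is a made-up miniature of the data, the per-field regexes are simplified, and I'm assuming every record carries a snoRNA line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical miniature input: each record starts with a digits-only line.
my $text = <<'END';
3
GI:91982771
snoRNA 10
3
GI:154448895
snoRNA 25, 26 and 27
4
GI:150418008
snoRNA 25, 26 and 27
END

# Split in front of every digits-only line. The look-ahead consumes
# nothing, so the digits stay at the head of the record they introduce.
my @records = split /^(?=\d+$)/m, $text;

# Group the (simplified) records by their snoRNA value.
# Assumption: every chunk contains a snoRNA line.
my %sno_records;
for my $chunk (@records) {
    my %record;
    ( $record{exon} )    = $chunk =~ /^(\d+)$/m;
    ( $record{gene_id} ) = $chunk =~ /^GI:\s*(\d+)$/m;
    my ($sno_rna)        = $chunk =~ /^snoRNA\s+(.+)$/m;
    push @{ $sno_records{$sno_rna} }, \%record;
}

printf "%d records\n", scalar @records;
for my $sno_rna ( sort keys %sno_records ) {
    printf "%s: %s\n", $sno_rna,
        join ', ', map { $_->{gene_id} } @{ $sno_records{$sno_rna} };
}
```

With that sample input it should report 3 records and list two gene ids under "25, 26 and 27". Storing an array ref per key is deliberate, since the same snoRNA value can show up in more than one record.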

