Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

split a file into records and process it

by biohisham (Priest)
on Mar 24, 2010 at 14:12 UTC ( #830575=perlquestion: print w/ replies, xml ) Need Help??
biohisham has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I am really stuck trying to visualize the best data structure to use to get a generic format file processed for info, the text file was so messed up but I got it to look like what follows
#### The file has the following headers for each record### Exon # Gene id Nm_id snoRNA Key text Sequence Query, subject Gene name and weblink ##Start the records### 3 GI:91982771 NM_001040105.1 snoRNA 10 Query 4 TGGAGTCAAT 13 |||||||||| Sbjct 4854 TGGAGTCAAT 4845 Homo sapiens mucin 17, cell surface associated (MUC17), mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=9 +1982771 3 GI:154448895 NM_001100162.1 snoRNA 25, 26 and 27 Query 2 CCTGGAGTCGAGTG 15 |||||||||||||| Sbjct 146 CCTGGAGTCGAGTG 133 Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=15 +4448895 31 4 different hits GI:153945877 NM_002458.1 snoRNA 25, 26 and 27 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 6818 CTGGAGTCGAGTG 6806 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 8489 CTGGAGTCGAGTG 8477 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 10589 CTGGAGTCGAGTG 10577 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 12260 CTGGAGTCGAGTG 12248 Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=15 +3945877 4 GI:150418008 NM_206862.2 snoRNA 25, 26 and 27 Query 1 ACCTGGAGTCGAG 13 ||||||||||||| Sbjct 4775 ACCTGGAGTCGAG 4763 Homo sapiens transforming, acidic coiled-coil containing protein 2 (TA +CC2), transcript variant 1, mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=1 +50418008
each record starts with a number on a separate line and this number is not unique, and each records ends in a weblink too, I tried to use that atttribute as a record separator, but splitting around the number as a boundary is not possible through a regexp since "split /^\d+$/;" would leave a space where the number is and hence I can not use it any further, instead I tried a range matching.

I read this data from a file and processed it into the above form, however, I want to use each line that starts with a snoRNA as a hash key which corresponds to a record entity relative to the headers, each key can have more than one instance of the following associated with it..

  • Exon #
  • Gene id
  • Nm_id
  • text Sequence Query subject
  • Gene name and weblink
    • My approach to sovling this problem can be the reason I am stuck and hence I would implore you to assist me into breaking this into records and give me ideas on what I can do to enhance my approach

      Here is my code so far:

      #!/usr/local/bin/perl use strict; use warnings; open (FH,'<',"F:/Bioinformatics_NCBI/20MARCH_10/PERL Analysis/test.txt +") or die("$!\n"); open(FO, '>',"F:/Bioinformatics_NCBI/20MARCH_10/PERL Analysis/testOut. +txt") or die ("$!\n"); #TESTING my (@snoRNA, @geneID, @productID, @geneNames, @references,@queries,@su +bjects); while(<FH>){ chomp; if(/(?=^\d+$)/../(?=http:.*)\n/){ #range matching # s/\W+\n+!\W+//; next unless /(\w+ |\| | \n+)/x; #except for words | pi +pes | \n print FO $_, "\n" ; } if(/snoRNA(\s+|\d+)[\s|-|\d]/){ #snoRNA push @snoRNA, $_; } if(/^\d+$/){ #exon Numbers push @exonNumbers, $_; } if(/^GI:\d+[\.\d+]/){ #gene Names push @geneID , $_; } if(/^NM_\d+[\.\d+]/){ #gene product ID my $name = $_; $name =~ s/\s+$//; #substitute the trailing blanks.. push @productID, $name; } if(/homo sapiens[\w+\W+]/i){ #gene name, Need MultiLine suppor +t.. my $name = $_; push @geneNames, $name; } if(/http:.*/){ #web refs, need multiline support.. my $name = $_; push @references, $name; } if(/^Query(\s+)\d+\s+[agtc]/i){ #Prepare the query and s +ubject arrays my $queryName = $_; $queryName =~ s/$1//; push @queries, $queryName; } if(/^sbjct(\s+)\d+\s+[agtc]/i){ my $sbjctName = $_; $sbjctName =~ s/$1//; push @subjects, $sbjctName; } }

      UPDATE:Fixed some posting error...

      UPDATE: BrowserUK I am obliged and thanks to jethro and wfsp too, I was running in circles...



      Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.

Comment on split a file into records and process it
Select or Download Code
Re: split a file into records and process it
by BrowserUk (Pope) on Mar 24, 2010 at 14:31 UTC

    Is this part of your first code block actually a part of the data file, or just some annotation you've added for posting?

    #### The file has the following headers for each record### Exon # Gene id Nm_id snoRNA Key text Sequence Query, subject Gene name and weblink ##Start the records###

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      These are just annotations to give descriptions about the data in order to give a feel on the columnar nature that corresponds to the original data file table...



      Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: split a file into records and process it
by BrowserUk (Pope) on Mar 24, 2010 at 16:13 UTC

    What about the blank line in the first record after snoRNA that doesn't appear in any of the other records? Artifact of posting or a genuine possibility in all records?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Interesting observation, that happened inadvertently, not a possibility to account for, the presented data shall indeed reflect all the possibilities I am gonna come across processing this file ...



      Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.

        Then I'd suggest that rather than have a single loop reading lines, and then having conditionals to decide what to do with each type of line, you use a loop that terminates on eof and reads the individual lines of each record in line. This makes for a more robust parser with less confusing conditional code and line to line state.

        This doesn't do the final extraction of the required parts from individual liens of the records, which is easily added, but serves to demonstrate the technique:

        #! perl -slw use strict; use Data::Dump qw[ pp ]; my %records; until( eof( DATA ) ) { chomp( my $exon = <DATA> ); push @{ $records{ $exon } }, {}; my $seqs = 1; my $line = <DATA>; if( $line =~ m[(\d+) different hits] ) { $seqs = $1; chomp( $records{ $exon }[ -1 ]{ gene_id } = <DATA> ); } else { chomp( $records{ $exon }[ -1 ]{ gene_id } = $line ); } chomp( $records{ $exon }[ -1 ]{ Nm_id } = <DATA> ); chomp( $records{ $exon }[ -1 ]{ snoRNA_key } = <DATA> ); for( 1 .. $seqs ) { chomp( my $query = <DATA> ); scalar (<DATA>); chomp( my $sbjct = <DATA> ); push @{ $records{ $exon }[ -1 ]{ seqs } }, { $query => $sbjct +}; } chomp( $records{ $exon }[ -1 ]{ gene_name } = <DATA> ); chomp( $records{ $exon }[ -1 ]{ web_link } = <DATA> ); } pp \%records; __DATA__ 3 GI:91982771 NM_001040105.1 snoRNA 10 Query 4 TGGAGTCAAT 13 |||||||||| Sbjct 4854 TGGAGTCAAT 4845 Homo sapiens mucin 17, cell surface associated (MUC17), mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=9 +1982771 3 GI:154448895 NM_001100162.1 snoRNA 25, 26 and 27 Query 2 CCTGGAGTCGAGTG 15 |||||||||||||| Sbjct 146 CCTGGAGTCGAGTG 133 Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=15 +4448895 31 4 different hits GI:153945877 NM_002458.1 snoRNA 25, 26 and 27 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 6818 CTGGAGTCGAGTG 6806 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 8489 CTGGAGTCGAGTG 8477 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 10589 CTGGAGTCGAGTG 10577 Query 3 CTGGAGTCGAGTG 15 ||||||||||||| Sbjct 12260 CTGGAGTCGAGTG 12248 Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=15 +3945877 4 GI:150418008 NM_206862.2 snoRNA 25, 26 and 27 Query 1 ACCTGGAGTCGAG 13 ||||||||||||| Sbjct 4775 ACCTGGAGTCGAG 4763 Homo sapiens transforming, acidic coiled-coil containing protein 2 (TA +CC2), transcript variant 1, mRNA. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&do +pt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=1 +50418008

        Output:

        C:\test>junk55 { 3 => [ { Nm_id => "NM_001040105.1", gene_id => "GI:91982771", gene_name => "Homo sapiens mucin 17, cell surface associa +ted (MUC17), mRNA.", seqs => [ { "Query 4 TGGAGTCAAT 13" => "Sbjct + 4854 TGGAGTCAAT 4845", }, ], snoRNA_key => "snoRNA 10", web_link => "http://www.ncbi.nlm.nih.gov/sites/entrez?cm +d=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalig +n&blast_rank=97&list_uids=91982771", }, { Nm_id => "NM_001100162.1", gene_id => "GI:154448895", gene_name => "Homo sapiens exportin 7 (XPO7), transcript +variant 3, mRNA.", seqs => [ { "Query 2 CCTGGAGTCGAGTG 15" => "Sbj +ct 146 CCTGGAGTCGAGTG 133", }, ], snoRNA_key => "snoRNA 25, 26 and 27", web_link => "http://www.ncbi.nlm.nih.gov/sites/entrez?cm +d=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalig +n&blast_rank=2&list_uids=154448895", }, ], 4 => [ { Nm_id => "NM_206862.2", gene_id => "GI:150418008", gene_name => "Homo sapiens transforming, acidic coiled-co +il containing protein 2 (TACC2), transcript variant 1, mRNA.", seqs => [ { "Query 1 ACCTGGAGTCGAG 13" => "Sbj +ct 4775 ACCTGGAGTCGAG 4763", }, ], snoRNA_key => "snoRNA 25, 26 and 27", web_link => "http://www.ncbi.nlm.nih.gov/sites/entrez?cm +d=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalig +n&blast_rank=10&list_uids=150418008", }, ], 31 => [ { Nm_id => "NM_002458.1", gene_id => "GI:153945877", gene_name => "Homo sapiens mucin 5B, oligomeric mucus/gel +-forming (MUC5B), mRNA.", seqs => [ { "Query 3 CTGGAGTCGAGTG 15" => "Sbj +ct 6818 CTGGAGTCGAGTG 6806", }, { "Query 3 CTGGAGTCGAGTG 15" => "Sbj +ct 8489 CTGGAGTCGAGTG 8477", }, { "Query 3 CTGGAGTCGAGTG 15" => "Sb +jct 10589 CTGGAGTCGAGTG 10577", }, { "Query 3 CTGGAGTCGAGTG 15" => "Sb +jct 12260 CTGGAGTCGAGTG 12248", }, ], snoRNA_key => "snoRNA 25, 26 and 27", web_link => "http://www.ncbi.nlm.nih.gov/sites/entrez?cm +d=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalig +n&blast_rank=9&list_uids=153945877", }, ], }

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: split a file into records and process it
by jethro (Monsignor) on Mar 24, 2010 at 16:24 UTC
    When the parsing gets as complicated as in your case I tend to use a finite state machine

    If the parsing gets even more complicated and not line oriented any more Parse::RecDescent gets interesting, but I think you are still in fsm territory here.

Re: split a file into records and process it
by wfsp (Abbot) on Mar 24, 2010 at 16:29 UTC
    And after 31 is the line
    4 different hits
    That type of line only appears if there is more than one 'hit'? Always in that form? If so /^(\d+) different hits$/ would be useful?

    Normalising - don't you just love it :-)

    When I get to about 95% success I consider doing the rest by eye. How much of this stuff do you have?

      Oh, I have some of it, I was gonna get into normalization and polishing my code later... that note appears when there is more than one hit, but I can clean it up since it would be obvious that there's actually more than one hit, hence it is needless ...

      Here's my so-messed-up code which I would tend to after having figured with the wiser monks a way addressing my original query..



      Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: split a file into records and process it
by BrowserUk (Pope) on Mar 24, 2010 at 17:11 UTC

    A more complete parser now we've established the data format:

    (Updated: simplified the references, by using a local %record during parsing. Same output)

    (Update2: Removed need for duplicated Gene_id regex.)

    (Update3: Corrected dumb context assignment error introduced during simplification.)

    #! perl -slw use strict; use Data::Dump qw[ pp ]; my %records; until( eof( DATA ) ) { my %record; chomp( my $exon = <DATA> ); my $seqs = 1; my $line = <DATA>; if( $line =~ m[(\d+) different hits] ) { $seqs = $1; $line = <DATA>; } ( $record{ gene_id } ) = ( $line =~ m[GI:(\d+)] ); ( $record{ Nm_id } ) = ( <DATA> =~ m[(NM_\d[\d]+)] ); push @{ $record{ snoRNA_key } }, ( <DATA> =~ m[(\d+)]g ); for( 1 .. $seqs ) { my $query = [ split ' ', <DATA> ]; shift @$query; scalar (<DATA>); my $sbjct = [ split ' ', <DATA> ]; shift @$sbjct; push @{ $record{ seqs } }, { query => $query, sbjct => $sbjct +}; } chomp( $record{ gene_name } = <DATA> ); chomp( $record{ web_link } = <DATA> ); push @{ $records{ $exon } }, \%record; } pp \%records; __DATA__

    Output:


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: split a file into records and process it
by rubasov (Friar) on Mar 24, 2010 at 17:42 UTC
    I've tried to write a little parser by adapting the code from an earlier post of mine (Re: The story of a strange line of code: pos($_) = pos($_);). As others already noted there are several not-so-clear points in your format specification, however the code below tries to be easily adjustable to your real needs.
    #! /usr/bin/perl use strict; use warnings; use Data::Dumper; # May appear an attribute more than once in a record? # Sorry this is not DRY, the attribute names are duplicated # here and in the parser regex. my %is_single = ( 'exon' => 1, 'gene_id' => 1, 'product_id' => 1, 'sno_rna' => 1, 'query_subject' => 0, 'gene_name' => 1, 'link' => 1, 'other' => 0, ); # You can use split, just match the null string # before the real match in a look-ahead. my @records = split /^(?=\d+$)/m, do { local $/; <DATA> }; # An array of hash of something, one item / record. my @parsed_records; #my %sno_records; for (@records) { my %record; # You probably want to eliminate those ugly trailing spaces first # and then leave out the '\s*' parts just before '$'. my $re = qr{ (?: ^ (?<exon> \d+ ) \s* $ ) | (?: ^ GI:\s* (?<gene_id> \d+ ) \s* $ ) | (?: ^ NM_ (?<product_id> \d+\.\d ) \s* $ ) | (?: ^ snoRNA\s+ (?<sno_rna> .+ ) \s* $ ) | (?s: ^ (?<query_subject> Query .*? Sbjct .*? ) \s* $ ) | (?i: ^ (?<gene_name> Homo \s sapiens .* ) \s* $ ) | (?: ^ (?<link> http://.* ) \s* $ ) | (?: ^ (?<other> .+ ) \s* $ ) # Order of branches matters, leave (?<other>) at the very end. }mx; while (m/$re/gc) { my ( $key ) = keys %+; my ( $val ) = values %+; # If a key can appear only once then simply store it. if ( $is_single{$key} ) { $record{$key} = $val; } # Else put it into an array. else { push @{ $record{$key} }, $val; } } # This @parsed_records is _not_ keyed by sno_rna, as it # seemed unnatural for me with the provided sample data. push @parsed_records, \%record; # But you can easily transform it to a data structure keyed by sno_r +na # just uncomment the lines related to %sno_records. #push @{ $sno_records{ $record{sno_rna} } }, \%record; #delete $record{sno_rna}; } print Dumper( \@parsed_records ); #print Dumper( \%sno_records ); __DATA__
    update: I've put the DATA section behind a readmore tag.
Re: split a file into records and process it
by Anonymous Monk on Mar 26, 2010 at 13:17 UTC
    What generated this output? If it's something like blast or ssearch then try getting tabular output. This can be a lot easier to parse. Also, have you checked that bioperl doesn't already parse this format for you?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://830575]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (6)
As of 2014-08-01 02:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (256 votes), past polls