writing to arrays

by Superman (Acolyte)
on Dec 25, 2002 at 18:46 UTC ([id://222231]=perlquestion)

Superman has asked for the wisdom of the Perl Monks concerning the following question:

hi monks...

i have a plain text file that looks something like...
>DATA SET 1
HSAJDHSDHSADHDSALHDASLDHSALDH
HGDKJSHDSADHSALDHLHLDHASDLSAH
HKJAHCADHALIDHALSDHLSADHALHDA
>DATA SET 2
and what i want to get done is to take each set of data by reading the lines as a string and pushing them into separate arrays

so what i have is something like

if ($string =~ /^\>/){
    next unless $string =~/^\>/;
    push @array, $string;
}

does this look sensible? will this actually do what i want it to? i want my 1st array to contain data set 1 and so forth...

please help as i have a deadline for the new year

regards
S

Edit: Added <code> tags and some formatting. larsen

Replies are listed 'Best First'.
Re: writing to arrays
by pg (Canon) on Dec 25, 2002 at 18:57 UTC
    Try to base your code on the attached. A couple of things:
    1. Your data may arrive out of sequence: DATA SET 10 might come before DATA SET 9. So you have to determine the index before you can add the string to the array.
    2. Some data may also be missing. For example, there may be no DATA SET 7 between DATA SET 6 and DATA SET 8. This is another reason you must determine the index on the fly.
    3. Can the same DATA SET number appear more than once? For example, you see ">DATA SET 10" and later ">DATA SET 10" again. If so, you have to decide how to handle it based on your requirements. In my code, I just concatenate the latest data onto what is already there.
    4. In my code, I do not chomp away the newlines. If you want to remove them, uncomment that chomp.
    5. In my code, the array element at index zero is always left undef. If you want to use it, just subtract 1 from the number you read from the file.
    6. In your m//, the \ before > is not needed, although it does no harm.
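    use Data::Dumper;
    use strict;

    my @data;
    my $cur_index;

    open(DATA, "<", "data.txt") or die "Cannot open data.txt: $!";
    while (<DATA>) {
        #chomp;
        if (m/^>DATA SET (\d+)/) {
            $cur_index = $1;          # header line: remember which set we are in
        }
        else {
            $data[$cur_index] .= $_;  # data line: append to the current set
        }
    }
    close(DATA);

    print Dumper(@data);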
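To see how the points above about ordering, gaps, and repeated set numbers play out, here is a minimal sketch of the same index-from-header idea. The sample lines are made up for illustration and are held in memory rather than read from data.txt.

```perl
use strict;
use warnings;

# Hypothetical sample input: set 3 arrives before set 2, set 4 is absent,
# and set 2 appears twice.
my @lines = (
    ">DATA SET 1\n", "AAA\n",
    ">DATA SET 3\n", "CCC\n",
    ">DATA SET 2\n", "BBB\n",
    ">DATA SET 5\n", "EEE\n",
    ">DATA SET 2\n", "BBB2\n",
);

my @data;
my $cur_index;
for my $line (@lines) {
    if ($line =~ /^>DATA SET (\d+)/) {
        $cur_index = $1;              # the index comes from the header itself
    }
    elsif (defined $cur_index) {
        $data[$cur_index] .= $line;   # repeated sets are concatenated
    }
}

# Index 0 is never used; missing sets show up as undef slots.
for my $i (1 .. $#data) {
    my $text = defined $data[$i] ? $data[$i] : "(missing)\n";
    print "set $i: $text";
}
```

Because the index is taken from the header, the order of the sets in the file does not matter, and a set that never appears simply leaves its slot undefined.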
    (UPDATE: When the question was first posted, the data was not formatted as well as it is now, and my first solution assumed that the real data came right after "DATA SET n" on the same line, without a line break. Obviously that solution was wrong.

    Thanks to gjb, who sent me a message pointing out that my solution didn't make sense to him. I then checked the original question and realized that the format had changed once the tags were added.

    I really appreciate not just gjb's technical point but, more importantly, the way he handled it, which clearly shows his pleasant personality.)

      thanks 4 ur reply but i think that what u r suggesting is a bit more complex than what i may actually need. Each line in my file is a $string, and what i need to do is to start at ">" and read every string i find from then into an @array until i find another ">"
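What is described here (start a new array at each ">" line and collect lines until the next ">") could be sketched roughly as follows. This is only an illustration: the helper name group_by_header and the sample lines are made up, and the sets are stored in file order rather than by their set number.

```perl
use strict;
use warnings;

# Group lines into one array per ">" header, in the order the headers appear.
sub group_by_header {
    my @lines = @_;                # work on a copy so chomp is safe
    my @sets;                      # array of array references
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^>/) {
            push @sets, [];        # header: start a new, empty set
            next;
        }
        next unless @sets;         # ignore anything before the first header
        push @{ $sets[-1] }, $line;
    }
    return @sets;
}

my @sets = group_by_header(
    ">DATA SET 1\n", "AAA\n", "BBB\n",
    ">DATA SET 2\n", "CCC\n",
);
print scalar(@sets), " sets; first set has ", scalar(@{ $sets[0] }), " lines\n";
```

Here $sets[0] holds the lines of the first set in the file, $sets[1] the second, and so on.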
Re: writing to arrays
by Arien (Pilgrim) on Dec 25, 2002 at 20:45 UTC

    I would use an array of arrays like this:

    my @data;
    my $curr;

    while (<DATA>) {
        $curr = $1, next if /^>DATA\s+SET\s+(\d+)/;
        push @{$data[$curr]}, $_;
    }

    Use Data::Dumper if you are unsure about the structure of @data.

    — Arien
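As a small illustration of inspecting an array of arrays with Data::Dumper, here is a sketch with made-up values. Passing a reference (\@data) makes Dumper print one nested structure instead of one $VARn per element.

```perl
use strict;
use warnings;
use Data::Dumper;

my @data;
push @{ $data[1] }, "AAA", "BBB";   # autovivifies $data[1] as an array reference
push @{ $data[2] }, "CCC";

$Data::Dumper::Indent = 1;          # more compact indentation
print Dumper(\@data);               # index 0 was never filled, so it prints as undef
```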

Re: writing to arrays
by John M. Dlugosz (Monsignor) on Dec 25, 2002 at 20:22 UTC
    If I understand the question, you want a separate array for each DATA SET, and each line is one array entry in the proper set.

    my @array_set;
    my $array;

    while (<DATA>) {
        if (/^>DATA SET (\d+)/) {
            $array = $array_set[$1] ||= [];   # point at (or create) the array for this set
        }
        else {
            push @$array, $_;
        }
    }
    Something like that; I may have typos and such. How this works is that if a >DATA SET xxx line is seen, then $array is set to point to the proper array. Otherwise, a line is added to the current array.
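To make the "current array" idea concrete, here is a minimal sketch with hard-coded set numbers and made-up lines. It only demonstrates that pushing through the reference changes the array stored in @array_set.

```perl
use strict;
use warnings;

my @array_set;

# Assumption: set numbers are small non-negative integers used as indexes.
my $array = $array_set[2] ||= [];    # create the array for set 2 if needed
push @$array, "AAA", "BBB";          # writes through the reference

$array = $array_set[7] ||= [];       # switch the "current" array to set 7
push @$array, "CCC";

print "set 2 has ", scalar(@{ $array_set[2] }), " lines\n";
print "set 7 has ", scalar(@{ $array_set[7] }), " lines\n";
```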
Re: writing to arrays
by snafu (Chaplain) on Dec 25, 2002 at 21:20 UTC
    Using the flip-flop operator:

    #!/usr/bin/perl -w
    use strict;

    my @darray;
    my $set;
    my $eo;

    while ( <DATA> ) {
        chomp();
        $eo = ( />DATA SET/ .. />DATA SET/ );
        /DATA SET (\d+)/;
        if ( $1 ) { $set = $1; }
        if ( $eo =~ /E0/ ) {
            $darray[$set] .= $_."\n";
            next;
        }
        $darray[$set] .= $_."\n" if ( $set );
    }

    for ( my $c = 0 ; $c <= $#darray ; $c++ ) {
        print "array element: $c\n";
        print "$darray[$c]\n";
    }

    __DATA__
    >DATA SET 1
    HSAJDHSDHSADHDSALHDASLDHSALDH
    HGDKJSHDSADHSALDHLHLDHASDLSAH
    HKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 2
    HSAJDHSDHSADHDSALHDASLDHSALDH
    HGDKJSHDSADHSALDHLHLDHASDLSAH
    HKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 3
    HSAJDHSDHSADHDSALHDASLDHSALDH
    HGDKJSHDSADHSALDHLHLDHASDLSAH
    HKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 4
    HSAJDHSDHSADHDSALHDASLDHSALDH
    HGDKJSHDSADHSALDHLHLDHASDLSAH
    HKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 5
    HSAJDHSDHSADHDSALHDASLDHSALDH
    HGDKJSHDSADHSALDHLHLDHASDLSAH
    HKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 6
    HSAJDHSDHSADHDSALHDASLDHSALDH
    HGDKJSHDSADHSALDHLHLDHASDLSAH
    HKJAHCADHALIDHALSDHLSADHALHDA

    _ _ _ _ _ _ _ _ _ _
    - Jim
    Insert clever comment here...
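For anyone unfamiliar with the flip-flop, a smaller sketch may help. It is not the code above, just an illustration of the operator pulling out the lines of a single set. The sample lines are made up, and the three-dot form is used so that the right-hand test is not tried on the same line that switched the range on.

```perl
use strict;
use warnings;

# Hypothetical in-memory input so the sketch is self-contained.
my @lines = (
    ">DATA SET 1", "AAA",
    ">DATA SET 2", "BBB", "CCC",
    ">DATA SET 3", "DDD",
);

my @wanted;
for (@lines) {
    # True from the ">DATA SET 2" header through the next ">" header.
    my $pos = /^>DATA SET 2\b/ ... /^>/;
    next unless $pos;          # outside the range
    next if $pos == 1;         # the opening header itself
    next if $pos =~ /E0\z/;    # the closing header (sequence number ends in "E0")
    push @wanted, $_;
}

print "$_\n" for @wanted;
```

In scalar context the operator returns a sequence number while the range is active, and the last value in a range has "E0" appended, which is what the /E0/ test in the code above relies on as well.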

Re: writing to arrays
by Wonko the sane (Deacon) on Dec 26, 2002 at 13:56 UTC
    Another way to do this, keeping the array index equal to the number of the DATA SET:

    #!/usr/local/bin/perl -w
    use strict;
    use Data::Dumper;

    my @records;

    {
        local $/ = '>'; # record separator.
        while ( <DATA> ) {
            push( @{$records[$1]}, split(/(?:\n+|>)/) )
                if ( s/DATA SET ([0-9]+)\n+// );
        }
    }

    print Dumper( \@records );

    __DATA__
    >DATA SET 1
    1aHSAJDHSDHSADHDSALHDASLDHSALDH
    1bHGDKJSHDSADHSALDHLHLDHASDLSAH
    1cHKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 2
    2aHSAJDHSDHSADHDSALHDASLDHSALDH
    2bHGDKJSHDSADHSALDHLHLDHASDLSAH
    2cHKJAHCADHALIDHALSDHLSADHALHDA
    >DATA SET 3
    3aHSAJDHSDHSADHDSALHDASLDHSALDH
    3bHGDKJSHDSADHSALDHLHLDHASDLSAH
    3cHKJAHCADHALIDHALSDHLSADHALHDA

    Some extra juggling is done to clean output. This is what it looks like.

    :!./test.pl
    $VAR1 = [
              undef,
              [
                '1aHSAJDHSDHSADHDSALHDASLDHSALDH',
                '1bHGDKJSHDSADHSALDHLHLDHASDLSAH',
                '1cHKJAHCADHALIDHALSDHLSADHALHDA'
              ],
              [
                '2aHSAJDHSDHSADHDSALHDASLDHSALDH',
                '2bHGDKJSHDSADHSALDHLHLDHASDLSAH',
                '2cHKJAHCADHALIDHALSDHLSADHALHDA'
              ],
              [
                '3aHSAJDHSDHSADHDSALHDASLDHSALDH',
                '3bHGDKJSHDSADHSALDHLHLDHASDLSAH',
                '3cHKJAHCADHALIDHALSDHLSADHALHDA'
              ]
            ];

    Best Regards,
    Wonko
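The record-separator trick can also be tried without a file. The following is a rough sketch, not a replacement for the code above: it reads made-up data from an in-memory filehandle, sets $/ to ">" so each read returns one whole set, and uses chomp to strip the trailing ">".

```perl
use strict;
use warnings;

# Hypothetical data, read through an in-memory filehandle.
my $text = ">DATA SET 1\nAAA\nBBB\n>DATA SET 2\nCCC\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '>';                  # each read now ends at the next ">"
    while (my $chunk = <$fh>) {
        chomp $chunk;                # chomp removes a trailing $/, here ">"
        next unless $chunk =~ s/\ADATA SET (\d+)\n//;
        my $n = $1;                  # save the set number before doing anything else
        $records[$n] = [ split /\n/, $chunk ];
    }
}
close $fh;

for my $n (1 .. $#records) {
    print "set $n: @{ $records[$n] }\n";
}
```

The first read returns only the leading ">", which becomes an empty string after chomp and is skipped because the substitution fails.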

Re: writing to arrays
by tandemrepeat (Initiate) on Dec 26, 2002 at 16:39 UTC
    This looks like a multi-fasta file holding DNA or protein sequence data (with sequence ID after the >). I use one of the following two ways to get this info into an array before piping into blast or other sequence manipulations. The first is a loop from some hand-me-down code that works quite well (but any comments etc on optimization etc v. welcome...)
    open( FASTAFILE, $ARGV[0] );
    while (<FASTAFILE>) {
        if ( /^>/ && $seqflag == 1 ) {
            push ( @sequences, $fasta );
            $fasta = "";
            $fasta = $_;
        }
        elsif (/^>/) {
            $fasta = $_;
            $seqflag = 1;
        }
        else {
            $fasta .= $_;
        }
    }
    push ( @sequences, $fasta );
    #then iterate @sequences to run over BLAST
    The other (better?) way is the very nice Bioperl modules that have methods that specifically handle multifasta flat files. Also check out EMBOSS, a sequence analysis suite that interfaces with BioPerl...EMBOSS + BioPerl makes life sooo much easier... From the bioperl tutorial...
    # script 1: create the index
    use Bio::Index::Fasta; # using fasta file format
    $Index_File_Name = shift;
    $inx = Bio::Index::Fasta->new(
        -filename   => $Index_File_Name,
        -write_flag => 1);
    $inx->make_index(@ARGV);

    # script 2: retrieve some files
    use Bio::Index::Fasta;
    $Index_File_Name = shift;
    $inx = Bio::Index::Fasta->new($Index_File_Name);
    foreach $id (@ARGV) {
        $seq = $inx->fetch($id); # Returns Bio::Seq object
        # do something with the sequence
    }
    Hope this helps,

    tandemrepeat
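If the sequences need to be looked up by ID without BioPerl, a plain-Perl variation on the hand-rolled loop might look like the sketch below. The records and the names %seq_for and @order are made up for illustration, and it assumes the ID is the first word after ">".

```perl
use strict;
use warnings;

# Hypothetical records held in memory instead of a FASTA file.
my @lines = (
    ">seq1 first example\n", "ACGT\n", "TTGA\n",
    ">seq2 second example\n", "GGCC\n",
);

my (%seq_for, @order, $id);
for my $line (@lines) {
    chomp(my $text = $line);
    if ($text =~ /^>(\S+)/) {
        $id = $1;                    # ID is the first word after ">"
        push @order, $id;            # remember the original order
        $seq_for{$id} = '';
    }
    elsif (defined $id) {
        $seq_for{$id} .= $text;      # join wrapped sequence lines
    }
}

print "$_ => $seq_for{$_}\n" for @order;
```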
      T.R, thanks for your comments. Indeed the file that i am playing with is a FASTA file which will be put thru BLAST eventually to generate some output. Thanks a lot 4 ur help! No more answers for this question reqd monks...thx 2 every1 that replied!
Re: writing to arrays
by Anonymous Monk on Dec 26, 2002 at 07:31 UTC
    while (<FILE>){
    chomp;
    next if $_ =~/\>DATA SET/;
    push @array,$_;
    }

    Everyone seems to be making this more complicated than it
    needs to be.

    Superman,
    Make sure you qualify your data if possible.
    Checking string length may be one option.
    It's always good to check incoming data you may not have control of.

      All this code does is push the whole file into a single array, having removed every line that contains the text ">DATA SET". Your code does not even ensure that this is found at the start of the line.

      What you end up with is an array of everything munged together, all grouping information lost with no way to recover it.

      It's hard to believe that this will meet the OP's requirements.


      Examine what is said, not who speaks.

        Yep, I know. I did not realize he/she needed separate arrays
        for each dataset. Realized after posting. Please ignore
        previous post.

        sorry- hand officially slapped
