PerlMonks  


Thank you for clearing up the file layout question. I don't know whether the following approach will help you, but rather than an array, a hash table might be worth exploring. A hash (sometimes called an associative array) is arranged as key/value pairs. Here I initialise scalars to hold the sequence name and the sequence characters, and also the hash in which the sequences will be stored. Then I read the file line by line, using chomp to remove line endings, storing each ">......." line as the key and concatenating the lines that follow it to form the value. When I encounter the next ">......." line, or when I reach the end of the file, I call the addSequence() subroutine to add the key/value pair to the hash. Note that when the first ">......." line is read, addSequence() is called but nothing is added to the hash because the stored sequence title is still empty.

The output shows the sorted keys from the resultant hash, then I use the Data::Dumper module to show the actual hash in the form 'key' => 'value'; that part of the output is prone to wrapping because of the long, concatenated lines. If you would prefer to keep the sequence data lines separate in an array, that could easily be accomplished.

use 5.026;
use warnings;

use Data::Dumper;

open my $inFH, q{<}, \ <<__EOD__ or die $!;
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) 
GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTA
CCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGG
AGGTGAGATCCACACAGCTCAGACCAGCTGGATCTTGCTCAGTCTCTGTCAGAGGAAGATCCCTTGGAGG
AGGCCCCGCAGCGACATGGAGGGAGCTGCTTTGCTGAAAATCTTTGTCGTCTGCATCTGGAACCAAAATC
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACACTAAATTAGCAGGGAGTGTTATAAAAACTTTGGAGTGCAAGCTCACAGCT
GTCTTAATAAGAAGAGAAGGCTTCAATGGAACCTTTTGTGGTCCTGGTGCTGTGTCTCTCTTTTATGCTT
CTCTTTTCACTCTGGAGACAGAGCTGTAGGAGAAGGAAGCTCCCTCCTGGCCCCACTCCTCTTCCTATTA
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATCTTACCGGACAGTGCTGGATTTCCCAGCTTGACTCTAACACTGTCTGGTAAC
GATGTTCAAAGGTGACCCGC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAA
AAAAAAAGNNNNNNNNNNAAACTGAATTTCTATTAAGCATCTATTTATAGAAGAGAGTAAACACCCCGTG
AATAAAAGACAGAGAATTGTAGCAGCCCGAAGTCCCTTTTCTCTCCTCCCAAGCATTTGGCTCTGGTCCA
AATTCACATATCCTGCTCCGTAAAACAAAGTGCCTTGGTTAACCTAACGTTATTCCTTGAACAGTAGTTT
AGTGATCAACTAGTTTTTGTTGTTGTTGTTGTTTGAGACAGAGTCTCACTCTGTCGCCCAGGCTGGAGTG
CAGTGGCGAGATCTCAGCTCACTGCAACCTCTGCTGCCCAGGTTCAAGGGATTCTCCTGCCTCAGCCTCC
CAAGTAGCTGGTATTACAGGCACCTGCCACCGCGCCTGGCTAATTTTTTTTTTTTTTTTTTTTTGTATTT
__EOD__

my $seqTitle = q{};
my $accumulator = q{};
my %sequences = ();

while ( <$inFH> )
{
    chomp;
    if ( m{^>} )
    {
        addSequence();
    }
    else
    {
        $accumulator .= $_;
    }
}
addSequence();
close $inFH or die $!;

say for sort keys %sequences;
say q{-} x 50;
print Data::Dumper
    ->new( [ \ %sequences ], [ qw{ *sequences } ] )
    ->Sortkeys( 1 )
    ->Dumpxs();

sub addSequence
{
    $sequences{ $seqTitle } = $accumulator if $seqTitle;
    $seqTitle = $_;
    $accumulator = q{};
}

The output.

>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) 
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
--------------------------------------------------
%sequences = (
               '>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING' => 'AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAAAAAAAAAGNNNNNNNNNNAAACTGAATTTCTATTAAGCATCTATTTATAGAAGAGAGTAAACACCCCGTGAATAAAAGACAGAGAATTGTAGCAGCCCGAAGTCCCTTTTCTCTCCTCCCAAGCATTTGGCTCTGGTCCAAATTCACATATCCTGCTCCGTAAAACAAAGTGCCTTGGTTAACCTAACGTTATTCCTTGAACAGTAGTTTAGTGATCAACTAGTTTTTGTTGTTGTTGTTGTTTGAGACAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGAGATCTCAGCTCACTGCAACCTCTGCTGCCCAGGTTCAAGGGATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGTATTACAGGCACCTGCCACCGCGCCTGGCTAATTTTTTTTTTTTTTTTTTTTTGTATTT',
               '>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)' => 'ACATGTCAAAGAGACACACACTAAATTAGCAGGGAGTGTTATAAAAACTTTGGAGTGCAAGCTCACAGCTGTCTTAATAAGAAGAGAAGGCTTCAATGGAACCTTTTGTGGTCCTGGTGCTGTGTCTCTCTTTTATGCTTCTCTTTTCACTCTGGAGACAGAGCTGTAGGAGAAGGAAGCTCCCTCCTGGCCCCACTCCTCTTCCTATTA',
               '>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) ' => 'GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTACCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGGAGGTGAGATCCACACAGCTCAGACCAGCTGGATCTTGCTCAGTCTCTGTCAGAGGAAGATCCCTTGGAGGAGGCCCCGCAGCGACATGGAGGGAGCTGCTTTGCTGAAAATCTTTGTCGTCTGCATCTGGAACCAAAATC',
               '>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA' => 'CCGGGCCCCTGTGAGCATCTTACCGGACAGTGCTGGATTTCCCAGCTTGACTCTAACACTGTCTGGTAACGATGTTCAAAGGTGACCCGC'
             );
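
For the case where the data lines should stay separate, here is a minimal, untested-on-your-data sketch of a hash of arrays. It assumes the same ">......." layout as above and reuses two records from the sample data; the names %seqLines and $line are my own choices for illustration, not anything the original script defines. Each key maps to an array reference and each data line is pushed onto it rather than concatenated.

```perl
use 5.026;
use warnings;

# Sketch only: two records copied from the sample data above.
open my $inFH, q{<}, \ <<'__EOD__' or die $!;
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACACTAAATTAGCAGGGAGTGTTATAAAAACTTTGGAGTGCAAGCTCACAGCT
GTCTTAATAAGAAGAGAAGGCTTCAATGGAACCTTTTGTGGTCCTGGTGCTGTGTCTCTCTTTTATGCTT
CTCTTTTCACTCTGGAGACAGAGCTGTAGGAGAAGGAAGCTCCCTCCTGGCCCCACTCCTCTTCCTATTA
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATCTTACCGGACAGTGCTGGATTTCCCAGCTTGACTCTAACACTGTCTGGTAAC
GATGTTCAAAGGTGACCCGC
__EOD__

my %seqLines;    # hypothetical name: title => reference to array of lines
my $seqTitle;    # title of the record currently being read

while ( my $line = <$inFH> )
{
    chomp $line;
    if ( $line =~ m{^>} )
    {
        # A new title line starts a new, empty list of data lines.
        $seqTitle = $line;
        $seqLines{ $seqTitle } = [];
    }
    elsif ( defined $seqTitle )
    {
        # Keep each data line as its own array element.
        push @{ $seqLines{ $seqTitle } }, $line;
    }
}
close $inFH or die $!;

for my $title ( sort keys %seqLines )
{
    say $title, q{: }, scalar @{ $seqLines{ $title } }, q{ line(s)};
}
```

If the whole sequence is needed later, join( q{}, @{ $seqLines{ $title } } ) would give the same string that the concatenating version stores.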

This is all speculative. You may have a very good reason to keep all of the lines together as one element of an array but if you need to access individual sequences by name I think that a hash table is the way to go.
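
To illustrate what I mean by access by name, here is a rough sketch. It assumes a %sequences hash of the shape built by the script above, trimmed to one record from the sample data. Re-keying on the accession alone (the first token after the ">") is purely my assumption about what might be convenient; %byAccession and $wanted are hypothetical names.

```perl
use 5.026;
use warnings;

# Sketch only: one record in the same shape as the %sequences hash above.
my %sequences = (
    '>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA' =>
        'CCGGGCCCCTGTGAGCATCTTACCGGACAGTGCTGGATTTCCCAGCTTGACTCTAACACTGTCTGGTAAC'
        . 'GATGTTCAAAGGTGACCCGC',
);

# Hypothetical convenience: key on the accession instead of the full title.
my %byAccession;
for my $title ( keys %sequences )
{
    my ( $accession ) = $title =~ m{^>(\S+)};
    next unless defined $accession;
    $byAccession{ $accession } = $sequences{ $title };
}

# Look a sequence up directly by name; exists avoids autovivifying a key.
my $wanted = q{NR_029834.1};
if ( exists $byAccession{ $wanted } )
{
    say $wanted, q{ has }, length( $byAccession{ $wanted } ), q{ bases};
}
else
{
    say $wanted, q{ not found};
}
```

With an array you would have to scan every element to find a given title, whereas the hash lookup goes straight to the entry.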

Cheers,

JohnGG


In reply to Re: Assigning multiple lines into first element of array by johngg
in thread Assigning multiple lines into first element of array by shabird
