Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

How to improve my code? main concern:array as hash element

by xargon (Initiate)
on Nov 23, 2011 at 15:21 UTC ( #939684=perlquestion: print w/ replies, xml ) Need Help??
xargon has asked for the wisdom of the Perl Monks concerning the following question:

I have come to seek the wisdom of the Perl

I'm working on a little project of mine (i'm a biologist, as you will notice :P). I'm trying to read through a pre-formatted text file, locate specific words and replace them, then print the changed lines.

Here's what I came up with:

@gi=("Galpha-i1", "Galpha-i2", "Galpha-i3"); @gt=("Galpha-t1", "Galpha-t2", "Galpha-t3"); %gp = ( 'G11' => 'Galpha-11', 'G12' => 'Galpha-12', 'G13' => 'Galpha-13', 'G14' => 'Galpha-14', 'G15' => 'Galpha-15', 'G16' => 'Galpha-16', 'Gs' => 'Galpha-s', 'Gz' => 'Galpha-z', 'Golf' => 'Galpha-olf', 'Go' => 'Galpha-o', 'Gq' => 'Galpha-q', 'Gi' => "@gi", 'Gt'=> "@gt",) ; #the program is about g-proteins.the keys of the hash are the words i #want to find and the values the words i'd like to replace them with while (<>) { #while loop to go through the file and the pattern matching to get the #correct tab if ($_=~/^.*\t.*\t(.*)\t.*\t.*/mgi) { $i=$1; $f=$_; chomp $f; $i=~s/ //g; $j=$gp{$i}; @values= split (' ',$j); foreach $val (@values) { $k=$f; $k=~s/$i/$val/mg; print "$k\n"; } } }

This works perfectly if i'm not gravely mistaken, yet it seems rather unorthodox. I mean, i'm creating a table (eg @gi)which is read in the hash as a scalar and i split it again to get a new table, so i can get all the different outputs. If i use Gi=>@gi , without "", then i only get the gi[0] instead of the whole @gi table. Any ideas how to improve on this code?

UPDATE:

Thank you for your responses.

You are right of course about the data. As hbm guessed it looks like this:

RandomID Name Gi Tissue
RandomID Name Gt Tissue
RandomID Name Gs Tissue
(all separated by tabs to be able to use the file easily on excel)

I read all about references before writing the original code but as i was pretty inexperienced in using them, i did away with them.

++wsfp, what you suggested was actually my first try (trying to figure out if my variable was an array or scalar) but i had no idea how to do hat, hence the work-around :)

++hbm, sorry about the modifiers, old habits you see :P

I'll give it a run and let you know how it turns out. Thank you

Yet another UPDATE:
++Util the uniformity helped me better understand the whole code
++Marshall the interpretation is correct :) And the tips have proven very helpful indeed. Also i found the "next if" statement very interesting and i'll start implementing something like that more often in my progams

Thank you all

Comment on How to improve my code? main concern:array as hash element
Download Code
Re: How to improve my code? main concern:array as hash element
by choroba (Abbot) on Nov 23, 2011 at 15:40 UTC
    You can store the reference to an array in your hash:
    %gp = ( Gi => [qw[Galpha-i1 Galpha-i2 Galpha-i3]], ); print $gp{Gi}[1];
    See also perlref.
Re: How to improve my code? main concern:array as hash element
by vinian (Beadle) on Nov 23, 2011 at 15:41 UTC
    Gi => @gi is different from Gi => "@gi"

    both make @gi interpolate
    the first one is

    Gi => $gi[0], $gi[1] => $gi[2],
    however the second is
    Gi => "$gi[0] $gi[1] $gi[2]"

Re: How to improve my code? main concern:array as hash element
by wfsp (Abbot) on Nov 23, 2011 at 15:56 UTC
    You probably need
    my %gp = ( ... Gi => \@gi, # an array ref Gt => \@gt, );
    Hash values can only be scalars. An array reference is a scalar so we can use that.

    In your while loop you will first need to test if it is an array ref and, if so, dereference it. Something like

    ... $j = $gp{$i}; my @values; if (ref $j eq 'ARRAY'){ @values = @{$gp{$i}}; } else { @values = ($j); } for $val (@values){ ...
    (not tested)
    update: fixed typo
Re: How to improve my code? main concern:array as hash element
by wfsp (Abbot) on Nov 23, 2011 at 16:27 UTC
    Out of curiosity, what does your input look like? If you could give an example that shows what the string you're searching looks like, particularly those that contain Gi or Gt there could be other approaches to consider.

    update:
    This is now bugging me. :-) I can't see what the for loop is doing. I really need to see the data. Give me the data. :-)

Re: How to improve my code? main concern:array as hash element
by hbm (Hermit) on Nov 23, 2011 at 16:30 UTC

    If your code is very representative, it seems you can do away with your hash and arrays, and do something like this where you first assign $j:

    ($j=$i) =~ s/G(.+)/Galpha-$1/; my @values = ($j =~ /(Galpha-[it])/ ? map{$1.$_}1..3 : ($j) );

    Also, read up on modifiers in perlre. You have this:

    if ($_=~/^.*\t.*\t(.*)\t.*\t.*/mgi)

    All the modifiers are unnecessary:

    • m, because you are matching against a single line;
    • g, because you match the (one) beginning of the line;
    • i, because you are matching .*, i.e., in any case

     

    Update: Clearly, I don't have a clue what your data looks like, but I'm thinking something like this:

    use strict; use warnings; while (<DATA>){ chomp; next unless my $i = (split/\t/,$_)[2]; my $f = $_; $i =~ s/ //g; (my $j = $i) =~ s/G(.+)/Galpha-$1/; my @values = ($j =~ /(Galpha-[it])/ ? map{$1.$_}1..3 : ($j) ); for my $val (@values) { my $k = $f; $k =~ s/$i/$val/g; print "$k\n"; } } __DATA__ biologist xargon Gi question perl monks G11 answer? # prints biologist xargon Galpha-i1 question biologist xargon Galpha-i2 question biologist xargon Galpha-i3 question perl monks Galpha-11 answer?
Re: How to improve my code? main concern:array as hash element
by Util (Priest) on Nov 23, 2011 at 16:59 UTC

    ++wfsp. For uniformity, I would store *every* value of %gp as an arrayref, rather than need conditional logic on every use of a value.

    Here is my quick refactor, with ++hbm's test data (padded to 5 columns):

    use strict; use warnings; my %gp = ( G11 => [qw( Galpha-11 )], G12 => [qw( Galpha-12 )], G13 => [qw( Galpha-13 )], G14 => [qw( Galpha-14 )], G15 => [qw( Galpha-15 )], G16 => [qw( Galpha-16 )], Gs => [qw( Galpha-s )], Gz => [qw( Galpha-z )], Golf => [qw( Galpha-olf )], Go => [qw( Galpha-o )], Gq => [qw( Galpha-q )], Gi => [qw( Galpha-i1 Galpha-i2 Galpha-i3 )], Gt => [qw( Galpha-t1 Galpha-t2 Galpha-t3 )], ); my $re = qr{ \A ( [^\t]* \t [^\t]* ) \t ( [^\t]* ) \t ( [^\t]* \t .* ) \z }msx; LINE: while ( my $line = <DATA> ) { chomp $line; my ( $col_1_2, $col_3, $col_rest ) = ( $line =~ /$re/ ) or do { warn "Line $. did not match: '$line'"; next LINE; }; $col_3 =~ tr{ }{}d; my $aref = $gp{$col_3} or do { # No replacement needed? Is line still wanted? warn "Did not find '$col_3' in %gp"; print $line, "\n"; next LINE; }; for my $val ( @{ $aref } ) { print join( "\t", $col_1_2, $val, $col_rest ), "\n"; } } __DATA__ biologist xargon Gi question col5 perl monks G11 answer? col5

Re: How to improve my code? main concern:array as hash element
by aaron_baugher (Deacon) on Nov 23, 2011 at 23:05 UTC

    If a line in your input is:

    RandomID    Name    Gi    Tissue

    then what do you want your output to be? Is the "Gi" supposed to be replaced with one of the three items in your @gi array, or all three at once, or should the line be printed three times, once for each value? My guess based on your code is that it should become three lines.

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      yes, sorry about that.As you correctly said, i wanted to replace Gi with every item in the corresponding array and then print it once with each changed value (aka 3 lines)

Re: How to improve my code? main concern:array as hash element
by Marshall (Prior) on Nov 24, 2011 at 04:41 UTC
    The problem statement was vague. My interpretation is:

    Problem:
    I have a tab separated csv file. If the term in column 3 is in my translation table, instead of printing that line I want to print a new CSV line using each of the equivalent translation term(s). If a CSV line has less than 5 columns or the term cannot be translated - no processing is done and line is not printed (not sure about this). Below I have used | as the separator instead of \t so that the data is easier to see and work with...

    If my understanding of what you want is wrong, then please correct me and we'll go from there.

    The best data structure appears to be a HashOfArray (HoA). This eliminates the need for a special case of one term vs more than one term. ++Util Using a HoA in this translation sense is common and is a reasonable approach.

    I see no need for any kind of regex at all. The right tool here appears to be split, not regex. Check if the number of columns is enough and if so, then check if the term in column 3 can be translated. If both of these are true, then just print one line per translation term.

    If you want case insensitive comparisons, then convert the translation keys to all one case (upper or lower) and also case the column 3 term the same way.

    #!/usr/bin/perl -w use strict; my @gi=("Galpha-i1", "Galpha-i2", "Galpha-i3"); my @gt=("Galpha-t1", "Galpha-t2", "Galpha-t3"); my %gp = ( G11 => [qw( Galpha-11 )], G12 => [qw( Galpha-12 )], G13 => [qw( Galpha-13 )], G14 => [qw( Galpha-14 )], G15 => [qw( Galpha-15 )], G16 => [qw( Galpha-16 )], Gs => [qw( Galpha-s )], Gz => [qw( Galpha-z )], Golf => [qw( Galpha-olf )], Go => [qw( Galpha-o )], Gq => [qw( Galpha-q )], Gi => [@gi], Gt => [@gt], ); while (<DATA>) { chomp; my @columns = split(/\|/, $_); next if ( @columns <5 or !exists $gp{$columns[2]}); foreach my $replacement (@{$gp{$columns[2]}}) { print "$columns[0]|$columns[1]|$replacement|", join("|",@columns[3..@columns-1]),"\n"; } } =prints biologist|xargon|Galpha-i1|question|col5 biologist|xargon|Galpha-i2|question|col5 biologist|xargon|Galpha-i3|question|col5 bobby|jane|Galpha-11|somewthing|col5|col6 =cut __DATA__ biologist|xargon|Gi|question|col5 bobby|jane|G11|somewthing|col5|col6 perl|monks|G11|too_short
    As a note, using | instead of \t often works much better as a separator because you cannot tell the difference easily between a tab and a space when you look at the file in a normal text editor. And for example, my program editor is set to convert all tabs to spaces. There is no standard for how many spaces a tab should be and formatting gets messed up - so the net of this is that \t separated files are hard to work with.

    Update:

    Gq => [qw( Galpha-q )], Gi => [@gi],
    What this means: The square brackets allocate new anonymous memory for an array (a hunk of memory that has no programmatic predefined name). Each value of %gp is a reference to memory allocated in that way. What Gi => [@gi] does is: allocate new array memory and then copy @gi into it. The hash key, Gi points to that memory. The reference to that memory is a single value and that is why this works in a hash table.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://939684]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (5)
As of 2014-12-25 13:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (160 votes), past polls