Re: How to improve my code? main concern:array as hash element

The problem statement was vague. My interpretation is:

Problem:
I have a tab-separated CSV file. If the term in column 3 is in my translation table, then instead of printing that line I want to print one new CSV line for each equivalent translation term. If a line has fewer than 5 columns, or its term cannot be translated, the line is skipped and nothing is printed (I'm not sure this is what you want). Below I have used | as the separator instead of \t so that the data is easier to see and work with.

If my understanding of what you want is wrong, then please correct me and we'll go from there.

The best data structure appears to be a hash of arrays (HoA). This eliminates the need to special-case one term versus more than one term. ++Util. Using a HoA for this kind of translation is common and is a reasonable approach.

I see no need for any kind of regex matching at all. The right tool here appears to be split. First check that the line has enough columns, then check whether the term in column 3 can be translated. If both are true, print one line per translation term.

If you want case-insensitive comparisons, convert the translation keys to a single case (upper or lower) and convert the column 3 term the same way before the lookup.
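As a rough sketch of that idea (not part of the program below): the %table hash and the translate() helper here are hypothetical names with a cut-down table, used only to show lower-casing both the keys and the lookup term.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical miniature translation table, for illustration only.
my %table = (
    Gq => [qw( Galpha-q )],
    Gi => [qw( Galpha-i1 Galpha-i2 Galpha-i3 )],
);

# Build a second hash whose keys are all lower case.
# The values are the same array references as in %table.
my %lc_table = map { ( lc($_) => $table{$_} ) } keys %table;

# Look up a term regardless of its case.
# Returns the list of translations, or an empty list if the term is unknown.
sub translate {
    my ($term) = @_;
    my $list = $lc_table{ lc $term };
    return $list ? @$list : ();
}

print join(",", translate("GI")), "\n";   # Galpha-i1,Galpha-i2,Galpha-i3
print join(",", translate("gq")), "\n";   # Galpha-q
```

One thing to watch with this approach: if two keys differ only by case, one silently overwrites the other when the lower-cased hash is built.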

#!/usr/bin/perl -w
use strict;

my @gi=("Galpha-i1", "Galpha-i2", "Galpha-i3");
my @gt=("Galpha-t1", "Galpha-t2", "Galpha-t3");

my %gp = (
    G11  => [qw( Galpha-11  )],
    G12  => [qw( Galpha-12  )],
    G13  => [qw( Galpha-13  )],
    G14  => [qw( Galpha-14  )],
    G15  => [qw( Galpha-15  )],
    G16  => [qw( Galpha-16  )],
    Gs   => [qw( Galpha-s   )],
    Gz   => [qw( Galpha-z   )],
    Golf => [qw( Galpha-olf )],
    Go   => [qw( Galpha-o   )],
    Gq   => [qw( Galpha-q   )],
    Gi   => [@gi],
    Gt   => [@gt],
);

while (<DATA>)
{
    chomp;
    my @columns = split(/\|/, $_);

    # Skip lines that are too short or whose term has no translation.
    next if ( @columns < 5 or !exists $gp{$columns[2]} );

    # Print one output line per translation term.
    foreach my $replacement (@{$gp{$columns[2]}})
    {
        print "$columns[0]|$columns[1]|$replacement|",
              join("|", @columns[3..$#columns]), "\n";
    }
}
=prints
biologist|xargon|Galpha-i1|question|col5
biologist|xargon|Galpha-i2|question|col5
biologist|xargon|Galpha-i3|question|col5
bobby|jane|Galpha-11|somewthing|col5|col6
=cut

__DATA__
biologist|xargon|Gi|question|col5
bobby|jane|G11|somewthing|col5|col6
perl|monks|G11|too_short

As a note, | often works much better than \t as a separator, because you cannot easily tell a tab from a space when you look at the file in a normal text editor. My programming editor, for example, is set to convert all tabs to spaces. There is also no standard for how many spaces a tab represents, so the formatting gets messed up. The net result is that \t-separated files are hard to work with.

Update:

Gq   => [qw( Galpha-q   )],
Gi   => [@gi],

What this means: the square brackets allocate a new anonymous array (a hunk of memory that has no name of its own in the program) and return a reference to it. Each value of %gp is a reference to an array allocated in that way. What Gi => [@gi] does is allocate a new array and copy the contents of @gi into it. The value stored under the hash key Gi is a reference to that new array. A reference is a single scalar value, and that is why an array can be stored this way in a hash.
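A small sketch of the copy behaviour, for comparison with storing \@gi directly. The %copy and %alias hashes and the extra term "Galpha-i4" are made up for this illustration and are not part of the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @gi = ("Galpha-i1", "Galpha-i2", "Galpha-i3");

my %copy  = ( Gi => [@gi] );   # new anonymous array, contents copied from @gi
my %alias = ( Gi => \@gi );    # reference to @gi itself, no copy made

# Hypothetical extra term, added after both hashes were built.
push @gi, "Galpha-i4";

print scalar @{ $copy{Gi} },  "\n";   # 3: the copy is unaffected
print scalar @{ $alias{Gi} }, "\n";   # 4: the alias sees the change
print ref $copy{Gi}, "\n";            # ARRAY
```

Either form works for the translation table. The copy just means later changes to @gi cannot alter %gp by accident.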
