PerlMonks  

iterating hash keys?

by R56 (Acolyte)
on Sep 26, 2013 at 14:41 UTC (#1055846)
R56 has asked for the wisdom of the Perl Monks concerning the following question:

Hey all! I'm starting to learn Perl to help me on my job and I stumbled across a problem that I can't seem to solve.

First, I have a list of names that have an ID associated, like:

bananas 456

oranges 23

peaches 897236

kiwis 3726

(...)

Then, I have a list of those names that need to be replaced with the associated ID.

bananas,oranges

peaches,peaches,peaches

kiwis

oranges

kiwis,oranges,bananas,bananas

(...)

My first thought was to put the 'name-id' data in a hash, since every name only has one ID, and they are all unique values.

I would then push the 'names to be replaced' file into an array, and iterate through all the hash keys, while looping the array, matching each key and replacing it with its value.

Sounded like a good idea, but I can't seem to iterate and replace both things effectively.

Maybe I'm overcomplicating a simple problem, but can you guys point me in the right direction?

Thanks in advance!

-R

Re: using hashes
by mtmcc (Hermit) on Sep 26, 2013 at 14:51 UTC

      Something like this: (assuming @lines as the array that has the input)

      for my $line (@lines) {
          while (my ($find, $replace) = each %ids) {
              s/$find/$replace/g
          }
      }
        This should work, and is clear to read. While it is not optimally efficient, efficiency shouldn't be your concern at this stage. If this isn't working, you need to post more information about your actual script. Posting real input, expected output, and actual code (all wrapped in <code> tags) will greatly facilitate the debugging, as discussed in How do I post a question effectively?

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        For an effective solution to your problem, see BrowserUK's comment below. As to why the code you've shown doesn't work, it's probably because you're storing each line of your file/array in $line, but doing your substitution against $_. Try this: $line =~ s/$find/$replace/g.
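A minimal sketch of that fix, with the substitution bound to $line rather than $_. The sample data here is made up for illustration and stands in for the real input files:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files.
my %ids   = (bananas => 456, oranges => 23, peaches => 897236, kiwis => 3726);
my @lines = ('bananas,oranges', 'peaches,peaches,peaches', 'kiwis');

for my $line (@lines) {
    # each runs through every pair and resets itself once exhausted,
    # so the inner loop starts fresh for every line.
    while (my ($find, $replace) = each %ids) {
        # Bind the substitution to $line; a bare s/// would act on $_.
        $line =~ s/$find/$replace/g;
    }
}

print "$_\n" for @lines;
```

This still scans the whole hash for every line, so it only illustrates the binding fix. It also assumes no name is a substring of another name.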

Re: using hashes
by BrowserUk (Pope) on Sep 26, 2013 at 14:56 UTC
    and iterate through all the hash keys

    Don't ever iterate hash keys! (Well, hardly ever :)

    The major purpose of hashes is that you can lookup the value associated with any key directly, avoiding iteration.

    For your purpose, the major part of the code should be something like:

    while( <$names_to_be_replaced_file> ) {     ## read each line
        s[\b([a-z]+)\b][ $name_id{ $1 } ]ge;    ## find words, look them up and replace them with the id
        print;                                  ## send the modified lines to stdout
    }

    Simple and very efficient.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks for the help Browser, but apparently I'm way behind in Perl knowledge yet, as I don't really get the code... That's how I know I'm overcomplicating something that is really simple :|

      Is the $1 var pointing to the value of the hash?

        Is the $1 var pointing to the value of the hash?

        $1 captures the words in the string one at a time. Then $hash{ $1 } looks that word up in the hash and returns the associated value (id). The ge causes the ids to be substituted for every word in the line.

        Perhaps this will clarify things?

        %hash = ( brown=>1, fox=>2, quick=>3, the=>4 );;
        $line = 'the quick brown fox';;
        $line =~ s[\b([a-z]+)\b][ $hash{ $1 } ]ge;;
        print $line;;
        4 3 1 2

            s[\b([a-z]+)\b][ $name_id{ $1 } ]ge;

        The 's' at the beginning says to find a pattern and replace it. The 'g' at the end says to repeat this process as many times as possible. The 'e' at the end says that the replacement part should be evaluated as code, not treated as literal text.

        In the first part, the pattern, the \b matches a "word boundary," the boundary between word characters and non-word characters like your commas. [a-z]+ means a string of 1 or more consecutive lowercase letters. The parentheses around that capture whatever is matched within them and save it in the special variable $1.

        In the replacement part, $1 contains the matched word, so this becomes a simple lookup for that word as a key in the %name_id hash, replacing it with the value corresponding to that key. As mentioned before, because of the 'g', this entire process is repeated for each match found in the line.

        Aaron B.
        Available for small or large Perl jobs; see my home node.

        To make things even more complicated, I recommend replacing $hash{ $1 } with

        $hash{ $1 } // $1

        which means that if $1 is not found in your hash, the word is replaced with itself, i.e. left unchanged.
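A small sketch of that fallback, reusing the %hash from the earlier demo with a line that contains one word ('red') missing from the hash. Note that the // (defined-or) operator needs Perl 5.10 or later:

```perl
use strict;
use warnings;

my %hash = ( brown => 1, fox => 2, quick => 3, the => 4 );
my $line = 'the quick red fox';

# 'red' is not a key, so $hash{'red'} is undef and // falls back to $1.
$line =~ s[\b([a-z]+)\b][ $hash{ $1 } // $1 ]ge;

print "$line\n";    # prints "4 3 red 2"
```

Without the fallback, the missing word would be replaced by an empty string and, under warnings, Perl would complain about an uninitialized value.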

Re: using hashes
by kennethk (Monsignor) on Sep 26, 2013 at 15:06 UTC

    First, what mtmcc said.

    Second, a quote from the illustrious prophet: Doing linear scans over an associative array is like trying to club someone to death with a loaded Uzi. -- TimToady

    You should put your keys into a hash, yes, but then just iterate over your array. The array values are exactly what you need to access the hash values. So it might look like:

    my %id = (bananas => 456,
              oranges => 23,
              peaches => 897236,
              kiwis   => 3726,
              );
    my @replaces = ('kiwis','oranges','bananas','bananas');

    for my $i (0 .. $#replaces) {
        $replaces[$i] = $id{$replaces[$i]};
    }
    If I were going to actually write this, I'd take advantage of the fact that the loop iterator for Foreach Loops is an lvalue for the array element ($_ = $id{$_} for @replaces;), but that might be a little too magical for your taste given your familiarity with the language.
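For reference, the aliasing form mentioned above can be tried as a short standalone script. This is only a sketch using the same sample data:

```perl
use strict;
use warnings;

my %id = (bananas => 456, oranges => 23, peaches => 897236, kiwis => 3726);
my @replaces = ('kiwis', 'oranges', 'bananas', 'bananas');

# $_ is an alias for each element, so assigning to it changes @replaces itself.
$_ = $id{$_} for @replaces;

print join(',', @replaces), "\n";    # prints "3726,23,456,456"
```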


      Thanks Kenneth, I understand your code, but I may have more than one name in the same line, such as:

      bananas,peaches,kiwis

      peaches,peaches

      pineapple

      (...)

      So we couldn't use that kind of cycling on the array positions, right? Or am I missing something?

        ... am I missing something?

        You're missing what BrowserUk said here, an approach that processes an entire line at a time.

        The next part to think about is what happens if you encounter a 'word' in a line that doesn't exist in your translation hash, e.g., the line
            "peaches,peaches,foobar,kiwis\n"
        (hints: exists, next, maybe  // (defined-or) or  ?: (ternary/conditional operator) – see perlop for the latter two).
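One possible way to combine those hints, sketched under the assumption that each line is a plain comma-separated list of names with no quoting or embedded commas. The sample lines are made up; unknown names are kept unchanged:

```perl
use strict;
use warnings;

my %id = (bananas => 456, oranges => 23, peaches => 897236, kiwis => 3726);

# Hypothetical input lines; 'foobar' and 'pineapple' have no ID.
my @lines = ("bananas,peaches,kiwis\n", "peaches,peaches,foobar,kiwis\n", "pineapple\n");

for my $line (@lines) {
    chomp $line;
    # Look each name up directly; keep the name itself when it has no ID.
    my @ids = map { exists $id{$_} ? $id{$_} : $_ } split /,/, $line;
    $line = join ',', @ids;
}

print "$_\n" for @lines;
```

This splits each line into its names and does one hash lookup per name, so the array-style approach still works when a line holds several names.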

Re: iterating hash keys?
by kcott (Abbot) on Sep 26, 2013 at 18:59 UTC

    G'day R56,

    Welcome to the monastery.

    Firstly, a word about your data. The term list has a special meaning in Perl: see "perldata: List value constructors". I've taken what you've described as lists to be records in files. Given you wrote "... the 'names to be replaced' file ...", that seems correct for the second list; although, until I had read that far, I initially thought you might have been talking about a list of lists (which is something different — see perllol).

    Anyway, this means you (probably) have a CSV (comma-separated values) file which is best read using a module like Text::CSV. The reason for this is that there are all sorts of gotchas with CSV files which have already been coded for in these modules. As an example, consider two records: "apples, red,cherries" and "apples, red cherries". If you had an ID for "apples, red", how would you handle the replacement in those two records?

    So, I'd suggest you check whether your data really is as simple as the examples you've posted; and consider the chances of it staying that way in the future. You may need to revisit whatever solution you choose based on those findings. The solution I provide below assumes nothing more complex than what you currently show.

    Here's my take on a solution. I create a hash mapping names to IDs (same as you). Next, I use the keys of that hash to create a regex with an alternation (e.g. bananas|oranges|...) such that only the names with IDs will be matched. Finally, the replacements are made and the new data is output.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    my $in_file_name_id        = 'pm_1055846_name_id_data.txt';
    my $in_file_name_replace   = 'pm_1055846_name_replace_data.txt';
    my $out_file_name_replaced = 'pm_1055846_name_replaced_out.txt';

    open my $in_id_fh, '<', $in_file_name_id;
    my %id_for = map { split } <$in_id_fh>;
    close $in_id_fh;

    my $re = '\b(' . join('|', keys %id_for) . ')\b';

    open my $in_replace_fh,   '<', $in_file_name_replace;
    open my $out_replaced_fh, '>', $out_file_name_replaced;

    while (<$in_replace_fh>) {
        s/$re/$id_for{$1}/g;
        print $out_replaced_fh $_;
    }

    Here are the files. Notice I added "pineapples", which didn't have an ID, and so wasn't replaced.

    $ cat pm_1055846_name_id_data.txt
    bananas 456
    oranges 23
    peaches 897236
    kiwis 3726

    $ cat pm_1055846_name_replace_data.txt
    bananas,oranges
    peaches,peaches,peaches
    kiwis
    oranges
    kiwis,oranges,bananas,bananas
    bananas,oranges,pineapples,peaches,kiwis

    $ cat pm_1055846_name_replaced_out.txt
    456,23
    897236,897236,897236
    3726
    23
    3726,23,456,456
    456,23,pineapples,897236,3726

    -- Ken

      Hey Ken, good to be here :) Thank you for the patience to write all that. I don't know yet if the data will be this simple at all times, but it's always better to cover all the options if it doesn't sacrifice speed. Will definitely try out your code to see if I can improve this!

      Well, comparing to what I had, your code is faster than the speed of light!

      Is there a simple way for the s// to also include names with hyphens in the middle?

        "Well, comparing to what I had, your code is faster than the speed of light!"

        That's a good start. :-)

        "Is there a simple way for the s// to also include names with hyphens in the middle?"

        The short answer is: yes. The longer answer depends on details. I found a reference you made to input data with hyphens in "Re^8: using hashes"; however, you provided no indication of the output you wanted (except that 20-10,25 was the wrong output when bana-na,banana was the input).

        The following is based on the code I provided earlier. Given these input files:

        $ cat pm_1055846_name_id_data.txt
        bananas 456
        oranges 23
        peaches 897236
        kiwis 3726
        banana 25
        bana 20
        bana-na 15
        na 10

        $ cat pm_1055846_name_replace_data.txt
        bananas,oranges
        peaches,peaches,peaches
        kiwis
        oranges
        kiwis,oranges,bananas,bananas
        bananas,oranges,pineapples,peaches,kiwis
        bana-na,banana
        ba-na-na,bana-bana,bana-nana

        If you want output like this:

        $ cat pm_1055846_name_replaced_out.txt
        456,23
        897236,897236,897236
        3726
        23
        3726,23,456,456
        456,23,pineapples,897236,3726
        15,25
        ba-10-10,20-20,20-nana

        Change

        my $re = '\b(' . join('|', keys %id_for) . ')\b';

        to

        my $re = '\b(' . join('|', sort { $b cmp $a } keys %id_for) . ')\b';

        If you want output like this:

        $ cat pm_1055846_name_replaced_out.txt
        456,23
        897236,897236,897236
        3726
        23
        3726,23,456,456
        456,23,pineapples,897236,3726
        15,25
        ba-na-na,bana-bana,bana-nana

        Change

        my $re = '\b(' . join('|', keys %id_for) . ')\b';

        to

        my $re = '(^|,)(' . join('|', sort { $b cmp $a } keys %id_for) . ')(?=,|$)';

        and

        s/$re/$id_for{$1}/g;

        to

        s/$re/$1$id_for{$2}/g;

        If you want something different to these, and are unable to work it out for yourself, provide details as outlined in the "How do I post a question effectively?" guidelines.

        It would also be useful to advise what version of Perl you're using: I wrote those changes for v5.8; a more efficient version could have been written for a later version. As a hint for doing this yourself, see (?<=pattern) \K under Look-Around Assertions in "perlre: Extended Patterns" — \K was introduced in v5.10.0 (see "perl5100delta: Regular expressions" for this, and other, regex enhancements).
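As a rough sketch of where that hint leads (an assumption about the intended form, not necessarily the version Ken had in mind): with \K, the leading ^ or comma no longer has to be captured and written back, so the replacement stays $id_for{$1}. This needs Perl 5.10 or later, and the sample data below is made up from the files shown above:

```perl
use strict;
use warnings;

my %id_for = (bananas => 456, banana => 25, bana => 20, 'bana-na' => 15, na => 10);

# \K drops everything matched so far (the ^ or comma) from the text to be replaced.
my $re = '(?:^|,)\K(' . join('|', sort { $b cmp $a } keys %id_for) . ')(?=,|$)';

my @lines = ('bana-na,banana', 'ba-na-na,bana-bana,bana-nana', 'bananas,na');

for (@lines) {
    s/$re/$id_for{$1}/g;
}

print "$_\n" for @lines;
```

Only whole comma-delimited fields are replaced, so the second line comes out unchanged.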

        -- Ken
