http://www.perlmonks.org?node_id=1055882


in reply to iterating hash keys?

G'day R56,

Welcome to the monastery.

Firstly, a word about your data. The term list has a special meaning in Perl: see "perldata: List value constructors". I've taken what you've described as lists to be records in files. Given you wrote "... the 'names to be replaced' file ...", that seems correct for the second list; although, until I had read that far, I initially thought you might have been talking about a list of lists (which is something different — see perllol).

Anyway, this means you (probably) have a CSV (comma-separated values) file, which is best read using a module like Text::CSV. The reason for this is that there are all sorts of gotchas with CSV files which have already been coded for in these modules. As an example, consider two records: "apples, red,cherries" and "apples, red cherries". If you had an ID for "apples, red", how would you handle the replacement in those two records?
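To make that gotcha concrete, here is a minimal sketch using only core Perl. The ID 77 and the naive_replace() helper are invented for illustration; they are not part of your data or of any module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical mapping: the ID 77 is made up for this illustration.
my %id_for = ('apples, red' => 77);

# Naive approach: substitute every known name wherever it appears,
# with no idea of where one CSV field ends and the next begins.
sub naive_replace {
    my ($record) = @_;
    # \Q...\E escapes any regex metacharacters in the name.
    $record =~ s/\Q$_\E/$id_for{$_}/g for keys %id_for;
    return $record;
}

print naive_replace($_), "\n"
    for 'apples, red,cherries', 'apples, red cherries';
# prints:
# 77,cherries
# 77 cherries
```

Both records are rewritten, although in the second one "apples" and " red cherries" may well have been two separate fields. A CSV parser that understands quoting is one way to tell those cases apart.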

So, I'd suggest you check whether your data really is as simple as the examples you've posted; and consider the chances of it staying that way in the future. You may need to revisit whatever solution you choose based on those findings. The solution I provide below assumes nothing more complex than what you currently show.

Here's my take on a solution. I create a hash mapping names to IDs (same as you). Next, I use the keys of that hash to create a regex with an alternation (e.g. bananas|oranges|...) such that only the names with IDs will be matched. Finally, the replacements are made and the new data is output.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $in_file_name_id = 'pm_1055846_name_id_data.txt';
my $in_file_name_replace = 'pm_1055846_name_replace_data.txt';
my $out_file_name_replaced = 'pm_1055846_name_replaced_out.txt';

open my $in_id_fh, '<', $in_file_name_id;
my %id_for = map { split } <$in_id_fh>;
close $in_id_fh;

my $re = '\b(' . join('|', keys %id_for) . ')\b';

open my $in_replace_fh, '<', $in_file_name_replace;
open my $out_replaced_fh, '>', $out_file_name_replaced;

while (<$in_replace_fh>) {
    s/$re/$id_for{$1}/g;
    print $out_replaced_fh $_;
}

Here are the files. Notice I added "pineapples", which didn't have an ID, and so wasn't replaced.

$ cat pm_1055846_name_id_data.txt
bananas 456
oranges 23
peaches 897236
kiwis 3726

$ cat pm_1055846_name_replace_data.txt
bananas,oranges
peaches,peaches,peaches
kiwis
oranges
kiwis,oranges,bananas,bananas
bananas,oranges,pineapples,peaches,kiwis

$ cat pm_1055846_name_replaced_out.txt
456,23
897236,897236,897236
3726
23
3726,23,456,456
456,23,pineapples,897236,3726

-- Ken

Re^2: iterating hash keys?
by R56 (Sexton) on Sep 27, 2013 at 14:09 UTC

    Well, comparing to what I had, your code is faster than the speed of light!

    Is there a simple way for the s// to also include names with hyphens in the middle?

      "Well, comparing to what I had, your code is faster than the speed of light!"

      That's a good start. :-)

      "Is there a simple way for the s// to also include names with hyphens in the middle?"

      The short answer is: yes. The longer answer depends on details. I found a reference you made to input data with hyphens in "Re^8: using hashes"; however, you provided no indication of the output you wanted (except that 20-10,25 was the wrong output when bana-na,banana was the input).

      The following is based on the code I provided earlier. Given these input files:

      $ cat pm_1055846_name_id_data.txt
      bananas 456
      oranges 23
      peaches 897236
      kiwis 3726
      banana 25
      bana 20
      bana-na 15
      na 10

      $ cat pm_1055846_name_replace_data.txt
      bananas,oranges
      peaches,peaches,peaches
      kiwis
      oranges
      kiwis,oranges,bananas,bananas
      bananas,oranges,pineapples,peaches,kiwis
      bana-na,banana
      ba-na-na,bana-bana,bana-nana

      If you want output like this:

      $ cat pm_1055846_name_replaced_out.txt
      456,23
      897236,897236,897236
      3726
      23
      3726,23,456,456
      456,23,pineapples,897236,3726
      15,25
      ba-10-10,20-20,20-nana

      Change

      my $re = '\b(' . join('|', keys %id_for) . ')\b';

      to

      my $re = '\b(' . join('|', sort { $b cmp $a } keys %id_for) . ')\b';
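Why the sort matters: Perl tries the alternatives in an alternation from left to right, and hash keys come back in no particular order. The sketch below is only an illustration; the replace_with() helper and the hand-picked "unlucky" ordering are mine, chosen to reproduce the 20-10,25 result you mentioned. A descending string sort puts each longer name ahead of any shorter name that is a prefix of it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %id_for = (banana => 25, bana => 20, 'bana-na' => 15, na => 10);

# Hypothetical helper: build the regex from the names in the order given,
# then apply the same substitution as in the original script.
sub replace_with {
    my ($record, @names) = @_;
    my $re = '\b(' . join('|', @names) . ')\b';
    (my $out = $record) =~ s/$re/$id_for{$1}/g;
    return $out;
}

# One order the hash might happen to return: "bana" precedes "bana-na".
my @unlucky = qw(bana na bana-na banana);

# Descending sort: na, banana, bana-na, bana.
my @sorted = sort { $b cmp $a } keys %id_for;

print replace_with('bana-na,banana', @unlucky), "\n";   # prints 20-10,25
print replace_with('bana-na,banana', @sorted),  "\n";   # prints 15,25
```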

      If you want output like this:

      $ cat pm_1055846_name_replaced_out.txt
      456,23
      897236,897236,897236
      3726
      23
      3726,23,456,456
      456,23,pineapples,897236,3726
      15,25
      ba-na-na,bana-bana,bana-nana

      Change

      my $re = '\b(' . join('|', keys %id_for) . ')\b';

      to

      my $re = '(^|,)(' . join('|', sort { $b cmp $a } keys %id_for) . ')(?=,|$)';

      and

      s/$re/$id_for{$1}/g;

      to

      s/$re/$1$id_for{$2}/g;

      If you want something different to these, and are unable to work it out for yourself, provide details as outlined in the "How do I post a question effectively?" guidelines.

      It would also be useful to advise what version of Perl you're using: I wrote those changes for v5.8; a more efficient version could have been written for a later version. As a hint for doing this yourself, see (?<=pattern) \K under Look-Around Assertions in "perlre: Extended Patterns" — \K was introduced in v5.10.0 (see "perl5100delta: Regular expressions" for this, and other, regex enhancements).
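For what it's worth, here is one way the \K variant might look for the second (exact match only) case. Treat it as a sketch rather than the definitive version: the in-memory %id_for and the replace_exact() helper are stand-ins for the file handling in the full script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;   # \K needs Perl v5.10.0 or later; this also enables say()

# Stand-in for the hash read from the name/ID file.
my %id_for = (
    bananas => 456, oranges => 23, peaches => 897236, kiwis => 3726,
    banana  => 25,  bana    => 20, 'bana-na' => 15,   na    => 10,
);

my $names = join '|', sort { $b cmp $a } keys %id_for;

# (?:^|,)\K  requires start-of-line or a comma, but keeps it out of the match,
#            so there is nothing to put back in the replacement.
# (?=,|$)    requires a comma or end-of-line to follow, without consuming it.
my $re = '(?:^|,)\K(' . $names . ')(?=,|$)';

# Hypothetical helper wrapping the substitution for one record.
sub replace_exact {
    my ($record) = @_;
    (my $out = $record) =~ s/$re/$id_for{$1}/g;
    return $out;
}

say replace_exact($_)
    for 'bana-na,banana', 'ba-na-na,bana-bana,bana-nana';
# prints:
# 15,25
# ba-na-na,bana-bana,bana-nana
```

Because \K discards the leading comma from the matched text, only one capture group is needed and the replacement is just $id_for{$1}.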

      -- Ken

        It's the second case: total recognition of the exact pattern, or just let it go.

        I'm using 5.16, but will take a look into those changes though.

        Once again, many thanks for your help Ken :)

Re^2: iterating hash keys?
by R56 (Sexton) on Sep 27, 2013 at 12:11 UTC
    Hey Ken, good to be here :) Thank you for the patience to write all that. I don't know yet if the data will be this simple at all times, but it's always better to cover all the options if it doesn't sacrifice speed. Will definitely try out your code to see if I can improve this!