http://www.perlmonks.org?node_id=398245

bwelch has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

When parsing a document and searching for key terms, which of these is better for speed of execution and memory usage?

Option 1: Checking a hash of about 100,000 entries:

foreach $word ( @termList ) {
    if ( exists( $hash{ $word } ) ) {
        # Record occurrence of term
    }
}

Option 2: 100,000 global substitutions

foreach $word ( keys %termHash ) {
    $document =~ s/$word/$termHash{ $word }/g;
}
thanks much,
bw
updated: Correction to option 2. Thanks to Roy Johnson

Replies are listed 'Best First'.
Re: Hash versus substitution efficiency?
by dragonchild (Archbishop) on Oct 11, 2004 at 19:03 UTC
    Your two options do different things. One searches a given document and the other modifies the given document.

    For your first option, that's one of the better ways to do it.

    For your second option, I would do

    while ( my ($word, $replacement) = each %{$termHash} ) {
        $document =~ s/$word/$replacement/g;
    }
    That uses much less memory: for builds the whole list of keys in memory before looping, while each iterates through the hash one key/value pair at a time without ever building that list.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Hash versus substitution efficiency?
by pg (Canon) on Oct 11, 2004 at 19:30 UTC

    As others have pointed out, it is not clear what you are trying to compare.

    My answer is: use Benchmark. It is a very powerful tool for comparing performance. Take a look at the module's documentation, and if you search you will find lots of examples of how others use it. With it, you should be able to find the answer yourself.
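
    For instance, a minimal sketch along those lines (the sample data here is made up purely for illustration):

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    # Hypothetical sample data, just to make both options runnable
    my %termHash = ( ham => 'meat', beef => 'meat' );
    my @termList = qw( ham beef bread );
    my $document = 'ham and beef on bread';

    cmpthese( -3, {    # run each variant for at least 3 CPU seconds
        hash_lookup => sub {
            my $count = 0;
            for my $word (@termList) {
                $count++ if exists $termHash{$word};
            }
        },
        substitution => sub {
            my $doc = $document;
            while ( my ($word, $replacement) = each %termHash ) {
                $doc =~ s/$word/$replacement/g;
            }
        },
    } );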

Re: Hash versus substitution efficiency?
by Roy Johnson (Monsignor) on Oct 11, 2004 at 19:08 UTC
    What are you doing, that these two options are in any way comparable to each other, functionally?

    Caution: Contents may have been coded under pressure.
      In option one, the code searches for each term and every variation of it individually. Option two converts all synonyms of the terms to their respective root terms before gathering word frequencies. In short, would it be more efficient to check against 100,000 hash keys as in option one, or to use option two's global substitutions on the documents so as to reduce the hash to only 10,000 keys?

      Basically, I have a large set of documents and want to categorize them and be able to explore and search based on terms. Terms would include root terms and their synonyms. For example, "meat" as a root term could mean "ham", "beef", or "chicken". This could be stored in a hash so synonyms all lead to the root term:

      $termHash{ "ham" } = "meat"; $termHash{ "beef" } = "meat";
      In this way, different terms that mean or relate to basically the same thing will be counted all as occurrences of the base term. When searching for "ham", one might see documents that contain "beef" or "meat" as well as "ham".

      The numbers are a bit overwhelming:

      • 26 million documents
      • 10,000 base terms
      • 100,000 synonyms
      This will run on a cluster where each node works independently on a set of the documents. It's looking like the term hash will be built separately and stored as either a persistent hash or an XML file.
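
      If a persistent hash is the route taken, the core Storable module would be one way to share it with the nodes; a minimal sketch (the file name here is made up):

      use strict;
      use warnings;
      use Storable qw( store retrieve );

      # Build the synonym-to-root map once...
      my %termHash = ( ham => 'meat', beef => 'meat' );
      store( \%termHash, 'termhash.stor' );

      # ...then each cluster node loads its own copy
      my $terms = retrieve('termhash.stor');
      print $terms->{ham}, "\n";    # prints "meat"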
        If I understand correctly, you want to build a searchable index of your documents, with 10,000 distinct indexable terms, and 100,000 synonyms for those terms. The global substitution seems like a bad way to go, and it's not clear to me how you're going to only do 10,000 subs. It seems like you should have 100,000 keys and 10,000 values in your %termhash.

        I think what you'd want to do is, given a synonym hash, make a reverse mapping that keys the unique values to a regex that alternates among all the keys.

        my %synhash = ( meat => 'meat', ham => 'meat', beef => 'meat' );
        my %revsynhash;
        # Collect the synonyms as an array
        while ( my ($k, $v) = each %synhash ) {
            push @{ $revsynhash{$v} }, $k;
        }
        # Turn the arrays into alternative-lists, longest first
        while ( my $k = each %revsynhash ) {
            $revsynhash{$k} = join '|',
                sort { length $b <=> length $a } @{ $revsynhash{$k} };
        }
        # Now you have
        # %revsynhash = ( meat => 'beef|meat|ham' )
        So you run your indexing scheme using the 10,000 items in %revsynhash, the keys being the indexed term, while the values are the regexes to count as hits for those terms. When someone wants to look up a term, you use %synhash to translate it, and hit your index for what it translated to.
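
        As a rough sketch of the counting side (assuming %revsynhash built as above and the text in a hypothetical $document):

        my %hits;
        while ( my ($term, $alternation) = each %revsynhash ) {
            # Count every synonym match as a hit on the base term;
            # \b keeps 'ham' from matching inside 'hamster'
            $hits{$term} += () = $document =~ /\b(?:$alternation)\b/g;
        }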

        Caution: Contents may have been coded under pressure.
Re: Hash versus substitution efficiency?
by TedPride (Priest) on Oct 11, 2004 at 19:40 UTC
    Option 1, especially if you can eliminate some words (a simple ranking of words by frequency will show you which ones to eliminate, such as "a", "an", "the", etc.). Both methods are going to use a significant amount of memory and processor time to construct the hash and array, however, so this sort of thing should only be done as a batch process rather than as a page-by-page run of your script. Some more details on what exactly you're trying to do and why might be helpful.
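
    A rough sketch of that frequency ranking (a $document variable holding the text is assumed):

    my %freq;
    $freq{ lc $1 }++ while $document =~ /([A-Za-z']+)/g;

    # Most frequent first; stop-word candidates like "a", "an",
    # and "the" will cluster at the top of this list
    for my $word ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
        print "$word\t$freq{$word}\n";
    }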
Re: Hash versus substitution efficiency?
by tilly (Archbishop) on Oct 11, 2004 at 21:28 UTC
    A random tip. You are not declaring your loop variables. That practically guarantees that you don't use strict.pm.

    I strongly suggest adding

    use strict;
    to the top of your programs, and then inserting my liberally as in:
    foreach my $word ( @termList ) { ... }
    Yes, this requires work. But it is worth it to get a significant fraction of your typos automatically caught for you.
      "You are not declaring your loop variables. That practically guarantees that you don't use strict.pm."

      How do you know that he didn't declare his loop variables? Also, what made you believe that he didn't use strict? His code could be something like this:

      use strict;
      use warnings;

      my $i;
      my @a = (1, 2, 3);
      for $i (@a) {
          print $i;
      }

      The real point here is that it is always better to give your variables the smallest possible scope, so it is nicer to code like this:

      use strict;
      use warnings;

      my @a = (1, 2, 3);
      for my $i (@a) {
          print $i;
      }
        You're right that I don't actually know. But in past instances where I've given this tip, several people have told me privately that they hadn't been using strict, and none have told me that I was wrong and that they actually were. Therefore I think it very likely that he isn't using strict.
      Thanks, tilly, that's good advice! I do use strict and warnings in all my code, but since I was just posting code fragments I didn't think to include them.

      Whenever I receive any Perl code, before doing anything else I run perltidy on it and add "use strict" and "use warnings". It's easily worth the effort.

        Ideally you're using an editor that does the indenting for you, and your fingers just type my appropriately by habit.