Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re^4: matching the words

by BrowserUk (Pope)
on Feb 16, 2013 at 22:17 UTC ( #1019082=note: print w/ replies, xml ) Need Help??


in reply to Re^3: matching the words
in thread matching the words

Rather than loading the real words as an array, I'd load them as a single whitespace delimited string.

It is a couple of hundred times faster to invoke the regex engine, once, to search for one word in a string containing hundreds or thousands of words; than to invoke it hundreds or thousands of times to match against one word at a time:

#! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $words = do{ local( @ARGV, $/ ) = 'words.txt'; <> }; our @words = split ' ', $words; our $P //= 0.01; our @toLookFor = map { rand() > $P ? () : do { my $w = $_; my $p = int( rand length()-1 ); $w =~ s[.{$p}\K.][.]; $w; }; } @words; printf "Looking for %d terms amongst %d words\n", scalar @toLookFor, scalar @words; cmpthese 1, { a => q[ for my $re ( @toLookFor ) { m[^$re$] #and print "a:$re :: $_" for @words; } ], b => q[ $words =~ m[\b($_)\b] #and print "b:$_ :: $1" for @toLookFor; ], } __END__ C:\test>junk42 Looking for 1846 terms amongst 178691 words s/iter a b a 85.2 -- -95% b 3.94 2065% -- C:\test>junk42 -P=0.02 Looking for 3564 terms amongst 178691 words s/iter a b a 166 -- -95% b 7.83 2022% --

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
cmpthese


Comment on Re^4: matching the words
Download Code
Re^5: matching the words
by Kenosis (Priest) on Feb 17, 2013 at 00:30 UTC

    This makes much sense. Will add an adapted version to my original response. Greatly appreciate this!

      Of course, if numbers of words and/or terms is very large, and performance is critical, and you are prepared to throw memory at the problem, there is another, yet faster way. (How much faster is difficult to express in terms that most of us would understand).

      The basic principle is an old favorite of mine: don't search; index. In this case that means building a hash that contains every possible variation of each word in the wordlist. For 'fred', that means adding '_red', 'f_ed', 'fr_d' & 'fre_' as keys to the hash with 'fred' as the values. Then, each of the wildcarded terms can be looked up directly. The result:

      C:\test>junk42 -P=0.02 Looking for 3651 terms amongst 178691 words s/iter a b + c a 167 -- -95% + -100% b 8.20 1931% -- + -100% c 1.00e-015 16662499999999995902% 820299999999997696% + --

      Which makes the process 8 quadrillion times faster than the regex against a string; and (an unbelievable) 166 quadrillion times faster than the regex engine over an array.

      If my math is right, that means it can do in 1 second; what the regex engine would take 5 billion years to do!

      (And I've checked and checked, and it really does appear to be true?)

      The benchmark:

      #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $words = do{ local( @ARGV, $/ ) = 'words.txt'; <> }; close $ARGV; our @words = split ' ', $words; our %words; @words{ do{ my $word = $_; map{ my $copy; substr( $copy = $word, $_, 1, '.' ); $copy; } 0 .. length() -1 }} = ($_) x ( length() -1 ) for @words; our $N //= 3; our $P //= 0.01; our @toLookFor = map { rand() > $P ? () : do { my $w = $_; my $p = int( rand length()-1 ); $w =~ s[.{$p}\K.][.]; $w; }; } @words; printf "Looking for %d terms amongst %d words\n", scalar @toLookFor, scalar @words; cmpthese $N, { a => q[ for my $re ( @toLookFor ) { m[^$re$] #and print "a:$re :: $_" for @words; } ], b => q[ $words =~ m[\b($_)\b] #and print "b:$_ :: $1" for @toLookFor; ], c => q[ # print $words{ $_ } or 1 for @toLookFor; ], } __END__ C:\test>junk42 -P=0.02 Looking for 3651 terms amongst 178691 words s/iter a b + c a 167 -- -95% + -100% b 8.20 1931% -- + -100% c 1.00e-015 16662499999999995902% 820299999999997696% + --

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1019082]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (15)
As of 2014-08-27 19:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (252 votes), past polls