PerlMonks  

Re: Efficient matching with accompanying data

by BrowserUk (Pope)
on Jul 11, 2013 at 02:32 UTC


in reply to Efficient matching with accompanying data

Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives?

Ostensibly all you need is:

my %lexicon = map{ <word> => <supplementary data> } $csv->parse;
    # words to match against, and supplementary data to go with matches

foreach <tweet> {
    foreach <word_in_tweet> {
        if( exists $lexicon{ <word_in_tweet> } ) {
            save $lexicon{ <word_in_tweet> } TO tweet.result_data;
        }
    }
}
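As a minimal runnable sketch of that pseudocode (the lexicon entries and tweets here are hypothetical stand-ins for the real CSV data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lexicon: word => supplementary data (stand-in for the CSV parse).
my %lexicon = (
    happy => 'positive',
    sad   => 'negative',
    great => 'positive',
);

my @tweets = (
    'what a great day',
    'feeling sad about the weather',
);

for my $tweet ( @tweets ) {
    my @hits;
    for my $word ( split ' ', lc $tweet ) {
        push @hits, $lexicon{ $word } if exists $lexicon{ $word };
    }
    print "$tweet => @hits\n";
}
```

Each lookup is a single `exists` test, so the cost per tweet is linear in its word count regardless of lexicon size.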

A hash will outperform the regex engine's trie hands down in terms of speed.

A quick test shows that running the 216,000 words in an HTML version of The Origin of Species against my 179,000-word dictionary using a hash takes 0.17 seconds.

However, using the regex engine's built-in trie (to hold the 179,000-word dictionary), it is processing ~10 lines per second, which means it should be finished after ~38 minutes (Update: it took 34.5 minutes):

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Time::HiRes qw[ time ];

chomp( my @words = do{ local @ARGV = 'c:/test/words.txt'; <> } );

my %lexicon;
$lexicon{ $_ } = 'supplementary data' for @words;

my $re = '(' . join( '|', sort{ length( $a ) <=> length( $b ) } @words ) . ')';
#print $re; exit;

open my $infile, '<', $ARGV[ 0 ] or die $!;

my $start1 = time;
seek $infile, 0, 0;
my( $words, $found1 ) = ( 0, 0 );
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    for my $word ( split ) {
        ++$words;
        ++$found1 if exists $lexicon{ $word };
    }
}
my $end1 = time;
printf "Finding $found1 words (of $words) took %f seconds using a hash\n",
    $end1 - $start1;

my $start2 = time;
seek $infile, 0, 0;
$. = 1;
my $found2 = 0;
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    tr[A-Z][a-z];
    ++$found2 while m[$re]g;
}
my $end2 = time;
printf "Finding $found2 words took %f seconds using a trie(via regex engine)\n",
    $end2 - $start2;

__END__
C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\1043602.pl 2009-h.htm
Finding 203474 words (of 216808) took 0.173504 seconds using a hash
Finding 203474 words took 2072.099258 seconds using a trie(via regex engine)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: Efficient matching with accompanying data
by Anonymous Monk on Jul 11, 2013 at 13:11 UTC
    Given that timing, I suspect the trie optimization was not used. If the regexp is too long, perl gives up on the trie optimization and resorts to standard alternation. I ran into this a while ago and split the patterns into a set of regexps that were small enough for the optimization, then ran that set one after the other. The regexp debug output will tell you whether you are actually getting the trie.
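A sketch of that chunking approach (the word list and chunk size here are illustrative, not the figures from the original test):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = qw( alpha beta gamma delta epsilon zeta eta theta );
my $chunk_size = 3;   # small enough that each alternation stays under the trie buffer limit

# Build one compiled regex per chunk of the word list.
my @regexes;
while ( my @chunk = splice @words, 0, $chunk_size ) {
    my $alt = join '|', map quotemeta, @chunk;
    push @regexes, qr/\b(?:$alt)\b/;
}

# Run the set of small regexes one after the other over the text.
my $text  = 'beta decay and theta waves';
my $found = 0;
for my $re ( @regexes ) {
    $found++ while $text =~ /$re/g;
}
print "matched $found words\n";   # beta and theta
```

Each chunk's alternation is small enough for the engine to build a trie for it, at the cost of scanning the text once per chunk.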
      > Given that timing I suspect the trie optimization was not used.

      Sure, see also ${^RE_TRIE_MAXBUF} in perlvar

      ${^RE_TRIE_MAXBUF}
          Controls how certain regex optimisations are applied and how
          much memory they utilize. This value by default is 65536 which
          corresponds to a 512kB temporary cache. Set this to a higher
          value to trade memory for speed when matching large
          alternations. Set it to a lower value if you want the
          optimisations to be as conservative of memory as possible but
          still occur, and set it to a negative value to prevent the
          optimisation and conserve the most memory. Under normal
          situations this variable should be of no interest to you.

      Disabling the buffer is also a means of demonstrating the effect of the missing optimisation.

      It's also much faster to join all strings to be inspected on the LHS and to run the regex just once.
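The join-once idea might look like this (illustrative strings; the point is that the match loop restarts the engine once rather than once per line):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ( 'red fish', 'blue fish', 'one red hat' );
my $re    = qr/\b(?:red|blue)\b/;

# Join all input into a single string and run the global match loop once,
# instead of re-entering the engine for every line.
my $text  = join "\n", @lines;
my $found = 0;
$found++ while $text =~ /$re/g;
print "$found\n";   # red, blue, red -> 3
```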

      But if only word-boundaries are needed, I suppose hash-lookups can still be faster, of course depending on "density" and "distance" of the searched patterns.
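For the density question, the core Benchmark module gives a quick way to compare the two approaches on one's own data (the dictionary and input below are synthetic and purely illustrative; real results depend on the corpus):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical dictionary and input text; sizes are illustrative only.
my @dict = map "word$_", 1 .. 10_000;
my %lex  = map { $_ => 1 } @dict;
my $alt  = join '|', @dict;
my $re   = qr/\b(?:$alt)\b/;
my $text = join ' ', map "word$_", 1 .. 200;

# Compare per-word hash lookups against one big alternation regex.
cmpthese( -1, {
    hash  => sub {
        my $n = 0;
        for ( split ' ', $text ) { $n++ if exists $lex{$_} }
    },
    regex => sub {
        my $n = 0;
        $n++ while $text =~ /$re/g;
    },
} );
```

Both subs find the same 200 words; only the relative rates differ with pattern density and dictionary size.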

      Cheers Rolf

      ( addicted to the Perl Programming Language)

      Hm. I just tried setting ${^RE_TRIE_MAXBUF} to every power of 2 from 2**18 through 2**32 and it didn't make a blind bit of difference.

...
our $M //= 16;
${^RE_TRIE_MAXBUF} = 2**$M;
...
__END__
C:\test>for /l %i in (18,1,32) do @\perl5.18\bin\perl.exe -s 1043602.pl -M=%i C:\docs\OriginOfSpecies(Darwin)\2009-h\2009-h.htm
Finding 203474 words (of 216808) took 0.175862 seconds using a hash
201     Finding 203474 words (of 216808) took 0.174275 seconds using a hash
249     Finding 203474 words (of 216808) took 0.173872 seconds using a hash
174     Finding 203474 words (of 216808) took 0.174906 seconds using a hash
201     Finding 203474 words (of 216808) took 0.175149 seconds using a hash
205     Finding 203474 words (of 216808) took 0.172946 seconds using a hash
183     Finding 203474 words (of 216808) took 0.176250 seconds using a hash
196     Finding 203474 words (of 216808) took 0.175480 seconds using a hash
201     Finding 203474 words (of 216808) took 0.174416 seconds using a hash
196     Finding 203474 words (of 216808) took 0.179129 seconds using a hash
210     Finding 203474 words (of 216808) took 0.174478 seconds using a hash
197     Finding 203474 words (of 216808) took 0.174674 seconds using a hash
183     Finding 203474 words (of 216808) took 0.177501 seconds using a hash
225     Finding 203474 words (of 216808) took 0.174395 seconds using a hash
197     Finding 203474 words (of 216808) took 0.178181 seconds using a hash
221     C:\test>

      Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 lines/s.

      Personally, I've yet to see a situation where the so-called trie optimisation benefited my code. (See also Re: pattern match, speed problem)


        I have no time to look into your code, but your benchmarks always talk about "... seconds using *a hash*"

        Cheers Rolf

        ( addicted to the Perl Programming Language)

Re^2: Efficient matching with accompanying data
by Endless (Beadle) on Jul 11, 2013 at 23:17 UTC
    Thanks for your suggestion! The hash is, indeed, enormously more effective than either Trie or Regex turned out to be for me.
