PerlMonks  

Efficient matching with accompanying data

by Endless (Beadle)
on Jul 11, 2013 at 00:23 UTC
Endless has asked for the wisdom of the Perl Monks concerning the following question:

Hello friends,

I am converting a lexical processor I wrote in Java to Perl; text processing is supposed to be one of Perl's strengths, and I'm using this as an opportunity to learn the language. However, although my initial implementation produces the right output, it runs around 70 times slower than my Java implementation, which used a home-made trie. According to Devel::NYTProf, the bottleneck is in _walk_tree of Tree::Trie, which brings me to my question: what is a highly time-efficient way to match words and/or phrases against a target sentence, where a match also returns/allows access to supplementary data on the matched item?

Here is the algorithm I need to implement efficiently:

my $lexicon = $csv->parse; # words to match against, and supplementary data to go with matches
foreach <tweet> {
    foreach <word_in_tweet> {
        if ($lexicon includes <word_in_tweet>) {
            save match.supplementary_data TO tweet.result_data;
        }
    }
}

Supplementary data includes topics and sentiment values corresponding to each word/phrase in my dictionary. In the end, I need to know all the topics that match in each tweet.

Important caveat: The dictionary may include multi-word entries, so these need to be matched as well and preferred over shorter matches.

The Question

What might be the best Perl structure to fulfill my needs for:
  1. Matching each and every word and multi-word phrase from my dictionary?
  2. Retrieving supplementary dictionary data for each matched word/phrase?
  3. Fulfilling these points with high efficiency?

Is there a more efficient tree implementation? Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives? Can you think of something I'm missing?

Thank you very much for your help!

Update:

For my project, the best results were in line with BrowserUk's suggestion: hashes were vastly superior, although multi-word matches were a little trickier to get than they would have been with a regex. Switching from Tree::Trie to a hash improved my speed by a factor of nearly 800.
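For reference, a minimal sketch of the hash-plus-regex approach the update describes, with longest-first sorting so multi-word entries beat their single-word prefixes. The lexicon contents and field names here are illustrative, not from the original post:

```perl
use strict;
use warnings;

# Hypothetical lexicon: phrases (possibly multi-word) => supplementary data.
my %lexicon = (
    'climate change' => { topic => 'environment', sentiment => -1 },
    'climate'        => { topic => 'weather',     sentiment =>  0 },
    'good'           => { topic => 'opinion',     sentiment =>  1 },
);

# Longest phrases first, so multi-word entries win over their prefixes.
my @phrases = sort { length($b) <=> length($a) } keys %lexicon;
my $pattern = join '|', map { quotemeta } @phrases;

my $tweet = 'climate change is real and that is good';
my @topics;
while ( $tweet =~ /\b($pattern)\b/g ) {
    # $1 is the matched phrase; look up its supplementary data in the hash.
    push @topics, $lexicon{$1}{topic};
}
print "@topics\n";   # environment opinion
```

Because alternation in Perl regexes is leftmost-first rather than longest-match, the sort by descending length is what makes "climate change" win over "climate".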

Re: Efficient matching with accompanying data
by LanX (Canon) on Jul 11, 2013 at 00:46 UTC
    Perl versions > 5.9.2 have a trie optimization within the regex engine.

    That is, /(aaa|aab|aca)/ is internally optimized to /(a(a(a|b)|ca))/.

    So if you organize your $lexicon in a way where the supplementary dictionary data are listed after the target words and delimited with something like "\0", you can search quite efficiently:

    $patterns = join "|", @patterns;
    @matches = ( $lexicon =~ /\0\0($patterns)\0([^\0]+)/g );

    (untested)

    I successfully wrote a module parsing DB-dumps very efficiently like this.

    Unfortunately the rights belong to my last employer, so you need to reinvent the wheel...:(
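The untested two-liner above can be fleshed out into a runnable sketch. The record layout (\0\0<word>\0<data>) and the sample dictionary are assumptions about how such a dump might be organised, not the module in question:

```perl
use strict;
use warnings;

# Build a flat "dump" string: each record is \0\0<word>\0<data>.
my %dict = ( day => 'noun:time', night => 'noun:dark' );
my $lexicon = join '', map { "\0\0$_\0$dict{$_}" } sort keys %dict;

my @patterns = ( 'night', 'day' );
my $patterns = join '|', @patterns;

# Each successful match captures the word and its supplementary data,
# so the /g list-context match yields key/value pairs.
my %matches = ( $lexicon =~ /\0\0($patterns)\0([^\0]+)/g );
print "$_ => $matches{$_}\n" for sort keys %matches;
```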

    UPDATE

    After rereading your post I have the impression that your lexicon is static while the "tweets" always change.

    In this case you have to swap the logic: just once, produce a long regex out of the phrases in your lexicon, then match it against all tweets.

    Take care to sort the phrases by length, because the first matching alternative wins. This way you don't need to embed the dictionary data; just do a hash lookup with the matching word groups.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      proof of concept

      DB<137> %lexicon = ( "day" => 1, "night" => 2, "knight" => 3 )
      => ("day", 1, "night", 2, "knight", 3)

      DB<138> $pattern = join "|", sort { length($b) <=> length($a) } keys %lexicon
      => "knight|night|day"

      DB<139> $tweet = "today I will knight a guy I met last night"
      => "today I will knight a guy I met last night"

      DB<140> @matches = ( $tweet =~ /($pattern)/g )
      => ("day", "knight", "night")

      DB<141> @lexicon{@matches}
      => (1, 3, 2)

      If you need word boundaries, try map { "\\b$_\\b" } between join and sort:

      DB<146> $pattern = join "|", map { "\\b$_\\b" } sort { length($b) <=> length($a) } keys %lexicon
      => "\\bknight\\b|\\bnight\\b|\\bday\\b"

      DB<147> @matches = ( $tweet =~ /($pattern)/g )
      => ("knight", "night")


Re: Efficient matching with accompanying data (tree)
by Anonymous Monk on Jul 11, 2013 at 00:53 UTC

    I don't understand why you're using a tree

    Consider
    my @dictionary = qw/ fa la barf /;
    my $in_dictionary = join '|', map { quotemeta } @dictionary;
    for my $string ( @strings ) {
        while ( $string =~ m{($in_dictionary)}g ) {
            print "word=$1\n";
        }
    }
Re: Efficient matching with accompanying data
by BrowserUk (Pope) on Jul 11, 2013 at 02:32 UTC
    Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives?

    Ostensibly all you need is:

    my %lexicon = map { <word> => <supplementary data> } $csv->parse;
    # words to match against, and supplementary data to go with matches
    foreach <tweet> {
        foreach <word_in_tweet> {
            if ( exists $lexicon{ <word_in_tweet> } ) {
                save $lexicon{ <word_in_tweet> } TO tweet.result_data;
            }
        }
    }

    A hash will outperform the regex engine's trie hands down in terms of speed.

    A quick test shows that running the 216,000 words in an html version of The Origin of Species against my 179,000 word dictionary using a hash takes 0.17 seconds.

    However, using the regex engine's built-in trie (to hold the 179,000 word dictionary), it is processing ~10 lines per second, which means it should be finished after ~38 minutes (Update: it took 34.5 minutes):

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use Time::HiRes qw[ time ];

    chomp( my @words = do{ local @ARGV = 'c:/test/words.txt'; <> } );
    my %lexicon;
    $lexicon{ $_ } = 'supplementary data' for @words;

    my $re = '(' . join( '|', sort{ length( $a ) <=> length( $b ) } @words ) . ')';
    #print $re; exit;

    open my $infile, '<', $ARGV[ 0 ] or die $!;

    my $start1 = time;
    seek $infile, 0, 0;
    my( $words, $found1 ) = ( 0, 0 );
    while( <$infile> ) {
        printf "\r$.\t";
        tr[a-zA-Z][ ]cs;
        for my $word ( split ) {
            ++$words;
            ++$found1 if exists $lexicon{ $word };
        }
    }
    my $end1 = time;
    printf "Finding $found1 words (of $words) took %f seconds using a hash\n",
        $end1 - $start1;

    my $start2 = time;
    seek $infile, 0, 0;
    $. = 1;
    my $found2 = 0;
    while( <$infile> ) {
        printf "\r$.\t";
        tr[a-zA-Z][ ]cs;
        tr[A-Z][a-z];
        ++$found2 while m[$re]g;
    }
    my $end2 = time;
    printf "Finding $found2 words took %f seconds using a trie(via regex engine)\n",
        $end2 - $start2;

    __END__
    C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\1043602.pl 2009-h.htm
    Finding 203474 words (of 216808) took 0.173504 seconds using a hash
    Finding 203474 words took 2072.099258 seconds using a trie(via regex engine)

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Given that timing, I suspect the trie optimization was not used. If the regexp is too long, perl gives up on the trie optimization and resorts to standard alternation. I ran into this a while ago and split the patterns into a set of regexps that were small enough to get the optimization, then ran that set one after the other. The regexp debug output will tell you whether you are actually getting the tries.
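A sketch of that splitting strategy, assuming a hypothetical word list; the chunk size is a guess that would need checking against the regexp debug output (e.g. perl -Mre=debug) to confirm each compiled regex actually gets a TRIE node:

```perl
use strict;
use warnings;

# Hypothetical word list; real lexicons may be far larger.
my @words = map { "word$_" } 1 .. 10_000;

# Split the alternation into chunks small enough for the trie
# optimisation; 2_000 per regex is an illustrative guess.
my $chunk_size = 2_000;
my @regexes;
while ( my @chunk = splice @words, 0, $chunk_size ) {
    my $alt = join '|', map { quotemeta } @chunk;
    push @regexes, qr/\b($alt)\b/;
}

# Run the set of smaller regexes one after the other.
my $text = 'word42 and word9999 appear here';
my @found;
for my $re (@regexes) {
    push @found, $text =~ /$re/g;
}
print "@found\n";   # word42 word9999
```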
        > Given that timing I suspect the trie optimization was not used.

        Sure, see also ${^RE_TRIE_MAXBUF} in perlvar

        ${^RE_TRIE_MAXBUF}
                Controls how certain regex optimisations are applied and
                how much memory they utilize. This value by default is
                65536 which corresponds to a 512kB temporary cache. Set
                this to a higher value to trade memory for speed when
                matching large alternations. Set it to a lower value if
                you want the optimisations to be as conservative of
                memory as possible but still occur, and set it to a
                negative value to prevent the optimisation and conserve
                the most memory. Under normal situations this variable
                should be of no interest to you.

        Disabling the buffer is also a means of showing the effect of a missing optimization.

        It's also much faster to join all strings to be inspected on the LHS and to run the regex just once.
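A tiny illustration of that point: join the inputs once and scan the combined string with a single /g match, instead of one match call per input. The separator choice assumes "\n" cannot occur inside an input string:

```perl
use strict;
use warnings;

my @tweets  = ( 'a fine day', 'a dark night' );
my $pattern = 'night|day';

# One big match over all inputs instead of one match per input.
my $all     = join "\n", @tweets;
my @matches = ( $all =~ /\b($pattern)\b/g );
print "@matches\n";   # day night
```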

        But if only word-boundaries are needed, I suppose hash-lookups can still be faster, of course depending on "density" and "distance" of the searched patterns.


        Hm. I just tried setting ${^RE_TRIE_MAXBUF} to every power of 2 from 18 through 32 and it didn't make a blind bit of difference.

        ...
        our $M //= 16;
        ${^RE_TRIE_MAXBUF} = 2**$M;
        ...
        __END__
        C:\test>for /l %i in (18,1,32) do @\perl5.18\bin\perl.exe -s 1043602.pl -M=%i C:\docs\OriginOfSpecies(Darwin)\2009-h\2009-h.htm
        Finding 203474 words (of 216808) took 0.175862 seconds using a hash
        201     Finding 203474 words (of 216808) took 0.174275 seconds using a hash
        249     Finding 203474 words (of 216808) took 0.173872 seconds using a hash
        174     Finding 203474 words (of 216808) took 0.174906 seconds using a hash
        201     Finding 203474 words (of 216808) took 0.175149 seconds using a hash
        205     Finding 203474 words (of 216808) took 0.172946 seconds using a hash
        183     Finding 203474 words (of 216808) took 0.176250 seconds using a hash
        196     Finding 203474 words (of 216808) took 0.175480 seconds using a hash
        201     Finding 203474 words (of 216808) took 0.174416 seconds using a hash
        196     Finding 203474 words (of 216808) took 0.179129 seconds using a hash
        210     Finding 203474 words (of 216808) took 0.174478 seconds using a hash
        197     Finding 203474 words (of 216808) took 0.174674 seconds using a hash
        183     Finding 203474 words (of 216808) took 0.177501 seconds using a hash
        225     Finding 203474 words (of 216808) took 0.174395 seconds using a hash
        197     Finding 203474 words (of 216808) took 0.178181 seconds using a hash
        221
        C:\test>

        Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 lines/s.

        Personally, I've yet to see a situation where the so-called trie optimisation benefited my code. (See also Re: pattern match, speed problem)


      Thanks for your suggestion! The hash is, indeed, enormously more effective than either Trie or Regex turned out to be for me.

Node Type: perlquestion [id://1043602]
Approved by ww
Front-paged by davido