Re: Efficient matching with accompanying data

by LanX (Canon)
on Jul 11, 2013 at 00:46 UTC (#1043604)


in reply to Efficient matching with accompanying data

Perl versions newer than 5.9.2 have a trie optimization within the regex engine.

That is, /(aaa|aab|aca)/ is internally optimized to something like (a(a(a|b)|ca)).

So if you organize your $lexicon so that the supplementary dictionary data follows each target word, delimited with something like "\0", you can search quite efficiently:

    $patterns = join "|", @patterns;
    @matches  = ( $lexicon =~ /\0\0($patterns)\0([^\0]+)/g );

(untested)
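
To make the snippet above concrete, here is a minimal sketch of one possible record layout. The sample words, the data strings, and the "\0\0word\0data" layout are assumptions chosen to fit the regex, not the layout of the module mentioned below; it also assumes the data never contains "\0".

```perl
use strict;
use warnings;

# Hypothetical sample data: target word => supplementary dictionary data.
my %entries = (
    day    => 'noun: 24 hours',
    night  => 'noun: dark hours',
    knight => 'noun: armoured rider',
);

# Serialise as "\0\0word\0data" records in one long string.
my $lexicon = join '', map { "\0\0$_\0$entries{$_}" } sort keys %entries;

# Words to look up; quotemeta keeps regex metacharacters literal.
my @patterns = qw(night day);
my $patterns = join '|', map { quotemeta($_) } @patterns;

# With /g in list context, every hit contributes a (word, data) pair.
my %found = $lexicon =~ /\0\0($patterns)\0([^\0]+)/g;

print "$_ => $found{$_}\n" for sort keys %found;
```

Because each word is anchored between "\0\0" and "\0", looking up "night" does not hit the "knight" record.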

I successfully wrote a module parsing DB-dumps very efficiently like this.

Unfortunately the rights belong to my last employer, so you need to reinvent the wheel...:(

UPDATE

after rereading your post I have the impression that it's your lexicon which is static while the "tweets" always change.

In this case you have to swap the logic: produce one long regex out of the phrases in your lexicon, just once, and match it against all tweets.

Take care to sort the phrases by length, longest first, because the first alternative that matches wins. This way you don't need to embed the dictionary data; just do a hash lookup with the matching word groups.
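
A small sketch of why the ordering matters when one phrase is a prefix of another. The phrases and the tweet are made-up examples:

```perl
use strict;
use warnings;

my @phrases = ('new', 'new york');   # hypothetical lexicon phrases
my $tweet   = 'I love new york';

# Alternation in the order given: the shorter phrase comes first.
my $unsorted = join '|', map { quotemeta($_) } @phrases;

# Alternation with the longest phrase first.
my $sorted = join '|', map { quotemeta($_) }
             sort { length($b) <=> length($a) } @phrases;

my ($short) = $tweet =~ /($unsorted)/;
my ($long)  = $tweet =~ /($sorted)/;

print "unsorted: $short\n";   # unsorted: new
print "sorted:   $long\n";    # sorted:   new york
```

Both alternatives can match at the same position, so the one listed first decides what gets captured.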

Cheers Rolf

( addicted to the Perl Programming Language)


Re^2: Efficient matching with accompanying data
by LanX (Canon) on Jul 11, 2013 at 01:43 UTC
    proof of concept

    DB<137> %lexicon=("day"=>1,"night"=>2,"knight"=>3)
     => ("day", 1, "night", 2, "knight", 3)

    DB<138> $pattern = join "|",sort {length($b)<=>length( $a) } keys %lexicon
     => "knight|night|day"

    DB<139> $tweet= "today I will knight a guy I met last night"
     => "today I will knight a guy I met last night"

    DB<140> @matches = ( $tweet =~ /($pattern)/g )
     => ("day", "knight", "night")

    DB<141> @lexicon{@matches}
     => (1, 3, 2)

    if you need word boundaries try map { "\\b$_\\b" } between join and sort

    DB<146> $pattern = join "|", map { "\\b$_\\b" } sort {length($b)<=>length( $a) } keys %lexicon
     => "\\bknight\\b|\\bnight\\b|\\bday\\b"

    DB<147> @matches = ( $tweet =~ /($pattern)/g )
     => ("knight", "night")
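
The debugger session above, sketched as a standalone script that compiles the regex once and reuses it for several tweets. The second tweet, the use of qr// and quotemeta, and putting \b around the whole group instead of around each word are my additions for illustration; for plain words the boundaries behave the same way.

```perl
use strict;
use warnings;

my %lexicon = (day => 1, night => 2, knight => 3);

# Build and compile the alternation once, longest phrases first.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %lexicon;
my $re = qr/\b($alternation)\b/;

# Hypothetical stream of tweets.
my @tweets = (
    'today I will knight a guy I met last night',
    'a long day and a short night',
);

my @results;
for my $tweet (@tweets) {
    my @words = $tweet =~ /$re/g;           # all lexicon words in this tweet
    push @results, [ @lexicon{@words} ];    # hash-slice lookup of their data
    print "$tweet => @lexicon{@words}\n";
}
```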

    Cheers Rolf

    ( addicted to the Perl Programming Language)
