Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re^6: Efficient matching with accompanying data

by LanX (Saint)
on Jul 12, 2013 at 01:40 UTC ( [id://1043853]=note: print w/replies, xml ) Need Help??


in reply to Re^5: Efficient matching with accompanying data
in thread Efficient matching with accompanying data

Unfortunately I've spend some time I don't have into this! :(

As it seems that trie optimization brakes for more than 15000 alternative patterns.

up to this number the performance is competitive:

481Finding 883 words (of 5038) took 0.039814 seconds using a hash 482Finding 784 words took 0.027246 seconds using a trie(via regex engi +ne)

please note that I somewhere when experimenting I introduced a bug which skipped 100 matches...

couldn't find the whole of Darwin's text and took chapter2, and as a dictionary I took (and cleaned) the en-US.dic in mozillas path.

Please note some changes I did:

1. I sorted descending by size not ascending, to fulfill the OP's requirements of multiword matches.

2. You didn't use word delimiter, such that the regex has to start matching in the middle of the word. Thats why I put spaces around the parens.

3. I don't know why you lowercased the input, my dict is case-sensitive, as a side note perl has lc

4. I don't see the point of looping over m//g if you can get all matches in one attempt.

5. I precompiled the regex to avoid recompilations.

> I'd love to see a demonstration of the trie optimisation actually doing something.

OK, I'm posting the code now as it is as a starting point for everyone who wants to experiment with it: Disabling the buffer (negative value) will show the effect of missing trie.

have fun!

use strict; use Data::Dump qw[ pp ]; use Time::HiRes qw[ time ]; chomp( my @words = do{ local @ARGV = 'en-US.dic'; <> } ); ${^RE_TRIE_MAXBUF}=2**16; $|=1; my %lexicon; my $limit=10000; for (@words) { s/\/.*$//; # next if length($_)<3; last unless $limit--; $lexicon{ $_ } = 'suplementary data'; } #print join "\t", grep {length() <3 } keys %lexicon ;exit; my $re = ' (' . join( '|', sort{ length( $b ) <=> length( $a ) } keys +%lexicon ) . ') '; my $cre = qr/$re/; #print $re; exit; open my $infile, '<', $ARGV[ 0 ] or die $!; my @matches1; my $start1 = time; seek $infile, 0, 0; my( $words, $found1 ) = ( 0, 0 ); while( <$infile> ) { printf "\r$.\t"; tr[a-zA-Z][ ]cs; # tr[A-Z][a-z]; for my $word ( split ) { ++$words; if (exists $lexicon{ $word }) { $found1++; push @matches1,$word; } } } my $end1 = time; printf "Finding $found1 words (of $words) took %f seconds using a hash +\n", $end1 - $start1; my $start2 = time; seek $infile, 0, 0; $. = 1; my $found2 = 0; my $text=""; while( <$infile> ) { printf "\r$.\t"; tr[a-zA-Z][ ]cs; tr[A-Z][a-z]; # ++$found2 while m[$cre]g; $text.=$_." "; } my @matches2 = $text =~ /$cre/g; $found2=scalar @matches2; my $end2 = time; printf "Finding $found2 words took %f seconds using a trie(via regex e +ngine)\n", $end2 - $start2; my %matches; @matches{@matches1}=(); print scalar keys %matches; delete @matches{@matches2}; print "missing matches:\n"; pp \%matches; #print $text;

Cheers Rolf

( addicted to the Perl Programming Language)

PS: The requirements of the OP weren't clear for me, if only whole words are of interest I recommend sticking with the hash method.

Replies are listed 'Best First'.
Re^7: Efficient matching with accompanying data
by BrowserUk (Patriarch) on Jul 12, 2013 at 01:55 UTC
    it seems that trie optimization brakes for more than 10000 alternatives.

    Seems the goalposts have changed slightly since it was first implemented, it was between 12,000 and 13,000 back then .

    For most of my stuff, that limit is way to low to consider.

    It'd be nice if you'd include the results of your benchmark in your post.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      actually it breaks around 15000 for me, I think it depends on the total length of the regex-string.

      15000 alternatives:

      lanx@nc10-ubuntu:~/B/PL/PM/buk_trie$ perl buk_trie.pl chapter2.txt 481Finding 1027 words (of 5038) took 0.041860 seconds using a hash 482Finding 923 words took 0.025128 seconds using a trie(via regex engi +ne)

      15500 alternatives

      lanx@nc10-ubuntu:~/B/PL/PM/buk_trie$ perl buk_trie.pl chapter2.txt 481Finding 1078 words (of 5038) took 0.040339 seconds using a hash 482Finding 958 words took 10.042522 seconds using a trie(via regex eng +ine)

      changing buffer-size didn't help.

      This is perl, v5.10.0 built for i486-linux-gnu-thread-multi

      > For most of my stuff, that limit is way to low to consider.

      well as already stated by AnoMonk, you can split the set into different regexes with < 10000 alternatives.

      Regexes can do things which hashes can't...

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        Thanks for posting the stats.

        Regexes can do things which hashes can't...

        Such as .. :)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1043853]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (5)
As of 2024-04-19 02:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found