PerlMonks
Re^3: Efficient matching with accompanying data

by BrowserUk (Patriarch)
on Jul 11, 2013 at 15:28 UTC ( [id://1043760] )


in reply to Re^2: Efficient matching with accompanying data
in thread Efficient matching with accompanying data

Hm. I just tried setting ${^RE_TRIE_MAXBUF} to every power of 2 from 2**18 through 2**32, and it didn't make a blind bit of difference.

...
our $M //= 16;
${^RE_TRIE_MAXBUF} = 2**$M;
...
__END__
C:\test>for /l %i in (18,1,32) do @\perl5.18\bin\perl.exe -s 1043602.pl -M=%i C:\docs\OriginOfSpecies(Darwin)\2009-h\2009-h.htm
Finding 203474 words (of 216808) took 0.175862 seconds using a hash 201
Finding 203474 words (of 216808) took 0.174275 seconds using a hash 249
Finding 203474 words (of 216808) took 0.173872 seconds using a hash 174
Finding 203474 words (of 216808) took 0.174906 seconds using a hash 201
Finding 203474 words (of 216808) took 0.175149 seconds using a hash 205
Finding 203474 words (of 216808) took 0.172946 seconds using a hash 183
Finding 203474 words (of 216808) took 0.176250 seconds using a hash 196
Finding 203474 words (of 216808) took 0.175480 seconds using a hash 201
Finding 203474 words (of 216808) took 0.174416 seconds using a hash 196
Finding 203474 words (of 216808) took 0.179129 seconds using a hash 210
Finding 203474 words (of 216808) took 0.174478 seconds using a hash 197
Finding 203474 words (of 216808) took 0.174674 seconds using a hash 183
Finding 203474 words (of 216808) took 0.177501 seconds using a hash 225
Finding 203474 words (of 216808) took 0.174395 seconds using a hash 197
Finding 203474 words (of 216808) took 0.178181 seconds using a hash 221
C:\test>

Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 lines/s.

Personally, I've yet to see a situation where the so-called trie optimisation benefited my code. (See also Re: pattern match, speed problem)


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^4: Efficient matching with accompanying data
by LanX (Saint) on Jul 11, 2013 at 15:59 UTC
    I have no time to look into your code, but your benchmarks always talk about "... seconds using *a hash*"

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      your benchmarks always talk about "... seconds using *a hash*"

      Did you read the line:

      Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 lines/s.

      The benchmark code is the same as I posted above. It first tests the hash method (and prints its timing); then it runs the BigOR regex method and would print its timing too, except that the regex engine takes 34.5 minutes to complete the test that the hash does in 0.17 seconds. I couldn't be bothered to wait the ~8 hours it would take to complete all 14 runs, so I monitored the tests, and when the running line count from the regex test showed that it was still processing at ~1 line per second, I aborted that run (via the task manager) after ~20 seconds.

      I was also monitoring the memory usage of the processes, in anticipation that if the trie optimisation was being skipped because it would require more memory than the preset limit, then once that limit had been raised high enough for the trie to be built, there would be a very obvious jump in memory usage. No such jump ever occurred. All 14 instances of the program showed an identical 77MB max memory usage.

      If 2^16 equates to 512MiB, then 2^32 must equate to 2^32*8192 -> 32Tib? (which I obviously do not have), but somewhere in between 2^16 & 2^32, there should have been some indication that the trie was being built, and there was not. (I suspect that it also has some hard upper limit to the number of alternations it will try to handle.)

      I'd love to see a demonstration of the trie optimisation actually doing something.



        Unfortunately I've spent some time I don't have on this! :(

        It seems that the trie optimization breaks for more than 15000 alternative patterns.

        Up to that number, the performance is competitive:

        Finding 883 words (of 5038) took 0.039814 seconds using a hash
        Finding 784 words took 0.027246 seconds using a trie (via regex engine)

        Please note that somewhere while experimenting I introduced a bug which skipped 100 matches...

        I couldn't find the whole of Darwin's text, so I took chapter 2; as a dictionary I took (and cleaned) the en-US.dic from Mozilla's path.

        Please note some changes I did:

        1. I sorted descending by size, not ascending, to fulfil the OP's requirement of multi-word matches.

        2. You didn't use word delimiters, so the regex has to start matching in the middle of a word. That's why I put spaces around the parens.

        3. I don't know why you lowercased the input; my dict is case-sensitive. As a side note, Perl has lc.

        4. I don't see the point of looping over m//g if you can get all matches in one attempt.

        5. I precompiled the regex to avoid recompilations.

        > I'd love to see a demonstration of the trie optimisation actually doing something.

        OK, I'm posting the code now as it is, as a starting point for everyone who wants to experiment with it. Disabling the buffer (setting a negative value) will show the effect of the missing trie.
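        Since the posted code itself isn't reproduced in this excerpt, here is a minimal, self-contained sketch of the approach described in points 1-5 above, with a toy word list and text standing in for the cleaned en-US.dic and the Darwin chapter (all data and counts below are illustrative placeholders, not the thread's actual code or numbers):

```perl
use strict;
use warnings;
use Time::HiRes 'time';

# Toy stand-ins for the dictionary and the text; the real inputs
# (a cleaned en-US.dic and chapter 2 of Darwin) are not reproduced here.
my @words = ( 'natural selection', 'selection', 'variation', 'species', 'natural' );
my $text  = ' the origin of species by means of natural selection shows variation ';

# Change 1: sort descending by length so multi-word entries match first.
@words = sort { length $b <=> length $a } @words;

# Change 5: precompile the alternation once.
# Change 2: spaces around the parens act as crude word delimiters.
my $alt = join '|', map quotemeta, @words;
my $re  = qr/ ($alt) /;

# A positive ${^RE_TRIE_MAXBUF} permits the trie optimisation;
# set a negative value instead to disable it for comparison.
${^RE_TRIE_MAXBUF} = 2**16;

# Hash method: exact single-word lookups only.
my %dict;
@dict{@words} = ();
my $t0      = time;
my @by_hash = grep { exists $dict{$_} } split ' ', $text;
printf "Finding %d words took %f seconds using a hash\n",
    scalar @by_hash, time - $t0;

# Change 4: gather all regex matches in one list-context pass, no m//g loop.
$t0 = time;
my @by_trie = $text =~ /$re/g;
printf "Finding %d words took %f seconds using a trie (via regex engine)\n",
    scalar @by_trie, time - $t0;
```

        Note that only the regex pass can find the multi-word entry 'natural selection'; the hash pass sees the split-out single words. A realistically sized word list is needed before the trie (or its absence) shows up in the timings.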

        have fun!


        PS: The OP's requirements weren't clear to me; if only whole words are of interest, I recommend sticking with the hash method.

        If 2^16 equates to 512MiB, then 2^32 must equate to 2^32*8192 -> 32Tib? (which I obviously do not have),

        Nitpicking: according to the manual fragment quoted above, 2^16 equates to 512 KB, so 2^32 should equate to 32 GB.
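        The scaling can be checked in a couple of lines (assuming, per the quoted figure, that the default of 2**16 corresponds to ~512 KB, i.e. 8 bytes per unit):

```perl
# If 2**16 buffer units correspond to 512 KB, each unit costs 8 bytes,
# so 2**32 units cost 2**32 * 8 bytes = 32 GB -- not 32 TiB.
my $bytes_per_unit = 512 * 1024 / 2**16;        # 8 bytes
my $total_bytes    = 2**32 * $bytes_per_unit;   # 34_359_738_368 bytes
printf "2**32 units ~ %d GB\n", $total_bytes / 2**30;
```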
