Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re^2: Efficient matching with accompanying data

by Anonymous Monk
on Jul 11, 2013 at 13:11 UTC ( #1043719=note: print w/ replies, xml ) Need Help??


in reply to Re: Efficient matching with accompanying data
in thread Efficient matching with accompanying data

Given that timing I suspect the trie optimization was not used. If the regexp is too long, perl gives up on the trie optimization and resorts to standard alternation. I ran into this a while ago and split the patterns into a set of regexps that where small enough to do the optimization. I then ran that set one after the other. The regexp debug output will tell you whether you are actually getting the tries.


Comment on Re^2: Efficient matching with accompanying data
Re^3: Efficient matching with accompanying data (${^RE_TRIE_MAXBUF})
by LanX (Canon) on Jul 11, 2013 at 13:39 UTC
    > Given that timing I suspect the trie optimization was not used.

    Sure, see also ${^RE_TRIE_MAXBUF} in perlvar

    ${^RE_TRIE_MAXBUF} Controls how certain regex optimisations are applied an +d how much memory they utilize. This value by default is 6553 +6 which corresponds to a 512kB temporary cache. Set this to a h +igher value to trade memory for speed when matching large alternations. Set it to a lower value if you want the optimisations to be as conservative of memory as possib +le but still occur, and set it to a negative value to prevent +the optimisation and conserve the most memory. Under norma +l situations this variable should be of no interest to yo +u.

    Disabling the buffer is also a mean to show the effect of a missing optimization.

    It's also much faster to join all strings to be inspected on the LHS and to run the regex just once.

    But if only word-boundaries are needed, I suppose hash-lookups can still be faster, of course depending on "density" and "distance" of the searched patterns.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re^3: Efficient matching with accompanying data
by BrowserUk (Pope) on Jul 11, 2013 at 15:28 UTC

    Hm. I just tried setting ${^RE_TRIE_MAXBUF} to every power of 2 from 18 through 32 and it didn't make a blind bit of difference.

    ... our $M //= 16; ${^RE_TRIE_MAXBUF} = 2**$M; ... __END__ C:\test>for /l %i in (18,1,32) do @\perl5.18\bin\perl.exe -s 1043602.pl -M=%i C:\docs\OriginOfSpecie +s(Darwin)\2009-h\2009-h.htm Finding 203474 words (of 216808) took 0.175862 seconds using a hash 201 Finding 203474 words (of 216808) took 0.174275 seconds using a + hash 249 Finding 203474 words (of 216808) took 0.173872 seconds using a + hash 174 Finding 203474 words (of 216808) took 0.174906 seconds using a + hash 201 Finding 203474 words (of 216808) took 0.175149 seconds using a + hash 205 Finding 203474 words (of 216808) took 0.172946 seconds using a + hash 183 Finding 203474 words (of 216808) took 0.176250 seconds using a + hash 196 Finding 203474 words (of 216808) took 0.175480 seconds using a + hash 201 Finding 203474 words (of 216808) took 0.174416 seconds using a + hash 196 Finding 203474 words (of 216808) took 0.179129 seconds using a + hash 210 Finding 203474 words (of 216808) took 0.174478 seconds using a + hash 197 Finding 203474 words (of 216808) took 0.174674 seconds using a + hash 183 Finding 203474 words (of 216808) took 0.177501 seconds using a + hash 225 Finding 203474 words (of 216808) took 0.174395 seconds using a + hash 197 Finding 203474 words (of 216808) took 0.178181 seconds using a + hash 221 C:\test>

    Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 line/s.

    Personally, I've yet to see a situation where the so-called trie optimisation benefited my code. (See also Re: pattern match, speed problem)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I have no time to look into your code but your benchmarks always talk about "... seconds using *a hash*"

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        your benchmarks always talk about "... seconds using *a hash*"

        Did you read the line:

        Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 line/s.

        The benchmark code is the same as I posted above. It first tests the hash method (and prints out the timing); then the BigOR regex method, and would then print out its timing, except that as it takes the regex engine 34.5 minutes to complete the test (that the hash does in 0.17 seconds), I couldn't be bothered to wait for the 8 hours it would take to complete all 14 runs, so I monitored the tests and when the running line count from the regex test showed that it was still running at ~1 lines per second, I aborted that run (via the task manager) after ~20 seconds.

        I was also monitoring the memory usage of the processes in anticipation that if the trie optimisation was being skipped because it would require more memory than preset limit, once that preset limit had been raised high enough that the trie was built, there would be a very obvious jump in memory usage No such jump ever occurred. All 14 instances of the program showed an identical 77MB max memory usage.

        If 2^16 equates to 512MiB, then 2^32 must equate to 2^32*8192 -> 32Tib? (which I obviously do not have), but somewhere in between 2^16 & 2^32, there should have been some indication that the trie was being built, and there was not. (I suspect that it also has some hard upper limit to the number of alternations it will try to handle.)

        I'd love to see a demonstration of the trie optimisation actually doing something.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1043719]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2014-09-18 23:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (126 votes), past polls