Re^4: How to use "less than" and "greater than" inside a regex for a $variable number

by Polyglot (Pilgrim)
on Oct 04, 2012 at 19:42 UTC (#997305)

in reply to Re^3: How to use "less than" and "greater than" inside a regex for a $variable number
in thread How to use "less than" and "greater than" inside a regex for a $variable number

I've implemented this approach, as it seems fairly close to the sort of solution I was looking for. Unfortunately, it is still rather slow. I started the process 2.5 days ago (it has now been running for over 60 hours) and it is about halfway through the material. At this rate it will take about 5 days of 100% CPU on one of the four cores of my Dell PowerEdge server. That's a little disappointing. My ugly approach, which may be slightly less thorough, finished after about three days, so it took roughly 40% less time.

Given the complexity of the regex, I suppose I cannot blame Perl or the program itself; that's just the way it is. But without the attempt to narrow the search to numbers that fall between their respective forerunners and postrunners, the whole search completes in less than five minutes.

Anyway, at least I have learned something and I much appreciate your patience in demonstrating this method for me. I may still be able to use this as a final check over a long weekend or something, or perhaps I can limit the amount of material to be checked at a time (~130 books total). Thank you!



Re^5: How to use "less than" and "greater than" inside a regex for a $variable number
by AnomalousMonk (Chancellor) on Oct 06, 2012 at 10:41 UTC

    Polyglot: I don't know if the following will be of any use to you, but I was curious to play with some different approaches to what I conceive to be your problem. You may as well have the results. All these work (for some definition of 'work').

    The first new approach is a variation on something I've already posted: two different replacement strings for the sequential versus non-sequential page number cases. In the case of sequential page numbers, the replacement string is the empty string, which may be something the regex engine can effectively 'optimize away' at run time.
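To make the first idea concrete, here is a minimal sketch. It is not AnomalousMonk's actual code, which is not reproduced in this thread extract. The marker format `[p. N]`, the sub name `flag_out_of_sequence`, the flag text, and the definition of "sequential" as "exactly one more than the previous marker" are all assumptions made for illustration. The pattern is a zero-width lookahead, so the substitution never touches the marker itself; the `/e` replacement returns the empty string for a sequential number and the flag otherwise.

```perl
use strict;
use warnings;

# Hypothetical marker format: "[p. 12]". Returns a copy of $text in which every
# out-of-sequence marker is prefixed with $flag. "Sequential" is assumed to
# mean "previous page number plus one".
sub flag_out_of_sequence {
    my ($text, $flag) = @_;
    $flag = '<<OUT OF SEQUENCE>>' unless defined $flag;
    my $prev;    # number of the previous marker, undef before the first one

    # Zero-width match just before each marker; the lookahead captures the number.
    $text =~ s{ (?= \[p\. \s* (\d+) \] ) }{
        my $n   = $1;
        my $seq = !defined($prev) || $n == $prev + 1;
        $prev   = $n;
        # Two replacement strings: empty for sequential, the flag otherwise.
        $seq ? '' : $flag;
    }xmsge;

    return $text;
}

my $sample = "[p. 1] alpha [p. 2] beta [p. 7] gamma [p. 8] delta";
print flag_out_of_sequence($sample), "\n";
```

Note that the replacement code still runs for every marker, sequential or not; whether an empty replacement is actually cheaper is something only a benchmark on the real data could show.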

    The second new approach is to try to avoid the replacement clause of the substitution altogether in the case of sequential page numbers. This approach uses some of the newer, more exotic regex constructs introduced with Perl 5.10. The problem with these is that their newness means they may not be as efficiently recognized and optimized by the regex compiler, and hence may be slower overall. I have done no benchmarking whatsoever.
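One way the second idea could look is sketched below. Again, this is an illustration under assumptions, not the code from the original post: the `[p. N]` marker format, the sub name `flag_without_replacing_sequential`, and "sequential means previous plus one" are all hypothetical. The 5.10 constructs used are `\K` and `(*FAIL)`. A code condition decides whether the marker is sequential; if it is, `(*FAIL)` makes the match fail, so the replacement clause is never reached for that marker.

```perl
use 5.010;
use strict;
use warnings;

our $prev;    # package variable: number of the previous marker

# Hypothetical marker format: "[p. 12]". Appends $flag after every marker whose
# number is not the previous number plus one. Sequential markers do not match
# at all, so no replacement is performed for them.
sub flag_without_replacing_sequential {
    my ($text, $flag) = @_;
    $flag = '<<OUT OF SEQUENCE>>' unless defined $flag;
    local $prev;    # undef until the first marker has been seen

    $text =~ s{
        \[p\. \s* (\d+) \]    # the marker; group 1 is the page number
        \K                    # keep the marker itself out of the replaced text
        (?(?{
            my $n  = $^N;     # text of the most recently closed group
            my $ok = !defined($prev) || $n == $prev + 1;
            $prev  = $n;
            $ok;
        })(*FAIL))            # sequential: force failure, nothing is replaced
    }{$flag}xmsg;

    return $text;
}

my $sample = "[p. 1] alpha [p. 2] beta [p. 7] gamma [p. 8] delta";
say flag_without_replacing_sequential($sample);
```

The condition has a side effect (it updates `$prev`), which is only safe here because the pattern gives the engine no way to reach the code block twice for the same marker. A package variable with `local` is used rather than a lexical because lexicals captured inside regex code blocks were unreliable in Perls older than 5.18. As in the original post, no claim is made about speed.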

