
Re: Question about regex performance (too small)

by tye (Sage)
on Jun 18, 2014 at 02:19 UTC ( #1090238=note )

in reply to Question about regex performance

The results mostly show that you are concentrating on the wrong things.

Even "69971/s" vs. "130236/s", although conceptually "twice as fast", is unlikely to have more than a trivial impact on the time required to run a useful script. Operations that run that quickly are just hard to make add up to much of the run time of a useful script (in particular because Benchmark "subtracts overhead", which pushes such measurements even further from practical reality).

So, you've done quite a bit of work to determine that, if you write a script that does something useful but still mostly consists of trimming whitespace from tons of strings of about that length, then the method you choose won't have much impact.

If you re-run the benchmark with much, much longer strings, then you'll likely get results that have more practical meaning (as the "overhead" is relatively less significant so the misguided "subtraction" process is unlikely to drastically change the results).
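A hedged sketch of such a re-run (the string contents and sizes here are my own invented choices, not from the original post):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical long input: ~180,000 chars of words separated by runs of spaces.
my $long = ( 'word' . ' ' x 5 ) x 20_000;

cmpthese( -1, {
    # Both squeeze runs of spaces down to a single space; at this
    # length the per-call overhead is a tiny fraction of the work.
    tr_squeeze => sub { my $s = $long; $s =~ tr/ //s;     },
    re_squeeze => sub { my $s = $long; $s =~ s/ {2,}/ /g; },
} );
```

With strings this long, the reported rates reflect the cost of the actual scanning rather than call overhead, so differences between methods become meaningful.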

This makes sense, as the usual way you notice that a regex is excessively inefficient is by running it against particularly long strings, not by running it against a huge number of very short strings. That is partly because, with short strings, the overhead tends to give the inefficiency a much smaller impact on the total run time. But it is also because one of the biggest ways to make a regex very inefficient is "bad backtracking", which scales linearly with the number of strings but can scale O(n**2) or much worse with the string length.
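One well-known illustration of backtracking cost growing with string length (my own example, not from the thread) is matching /\s+$/ against a string with a long interior run of spaces but no trailing whitespace: the engine restarts the attempt from each position in the run before the match finally fails, so doubling the run length can roughly quadruple the time.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Sizes are arbitrary illustrations; the point is the growth trend,
# not the absolute timings.
for my $n ( 2_000, 4_000, 8_000 ) {
    my $s  = 'x' . ( ' ' x $n ) . 'x';    # no whitespace at the very end
    my $t0 = time;
    $s =~ /\s+$/;                         # fails, after much backtracking
    printf "n=%5d  %.4fs\n", $n, time - $t0;
}
```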

But if you aren't going to be running these regexes against really long strings, then none of that actually matters.

If you are actually trying to make some script run faster, then, instead of picking random small operations and trying dozens of variations hoping to find "the fastest", you should profile the code to see what (if anything) is actually taking up a significant part of the run time. And/or look at the larger structure, where optimizations actually have a chance of having significant impact on the total run time.
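For Perl, one common way to do that profiling is with the CPAN module Devel::NYTProf (the script name below is a hypothetical stand-in):

```shell
# Run the real script under the profiler, then generate a report
# showing which statements and subs actually consume the run time.
perl -d:NYTProf myscript.pl
nytprofhtml            # writes an HTML report into ./nytprof/
```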

If you are just hoping to find "the best way" with results being a significant part of your grading process, then I also think you are paying too much attention to the wrong things. A construct that is 5% faster to write or understand or get correct is likely to have a much bigger impact than whether the computer can run it 20% faster, IME, especially since such constructs so rarely add up to more than a small part of the total run time.

Also, lots of times there is no one "fastest" approach. For example, it is pretty common for method X to be faster than method Y when a string has tons of extra whitespace while method Y is faster when the string has no extra whitespace.

So, if I were bored, I'd have tried a variation that does nothing when nothing needs to be done, like: s/ {2,}/ /g. I'm sure you can come up with at least half a dozen variations on that to add to your quest. :)
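A few such variations on the "do nothing when there's nothing to do" idea might look like this (my own sketches, all squeezing interior runs of spaces to a single space):

```perl
use strict;
use warnings;

my @variants = (
    sub { $_[0] =~ s/ {2,}/ /g },    # only fires on runs of 2+ spaces
    sub { $_[0] =~ s/  +/ /g },      # same idea, older idiom
    sub { $_[0] =~ tr/ //s },        # tr with the squeeze flag
    sub { $_[0] =~ s/ +/ /g },       # also fires on every single space
);

for my $v (@variants) {
    my $s = "a   b  c";
    $v->($s);                        # @_ aliasing lets the sub edit $s
    print "$s\n";                    # each prints "a b c"
}
```

The first three never touch a string that contains no doubled spaces, which is exactly the case where the fourth still pays for a substitution at every single space.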

- tye        

Replies are listed 'Best First'.
Re^2: Question about regex performance (too small)
by ted.byers (Monk) on Jun 18, 2014 at 14:08 UTC


    You are absolutely right. The fact is that if I were working on improving the speed of production code, I would begin with profiling. In THAT activity, there is a balance of costs between how quickly code can be developed versus how fast the code can be. I have seen cases where the fastest code is extremely hard for junior or even intermediate coders to understand. Sometimes it is necessary to implement that anyway, because the cost in lost time due to slower algorithms is much greater. But sometimes the simpler, slower code is desired because it can be implemented quickly by your least experienced staff.

    Alas, you missed the point of the exercise. My objective is to understand these regular expressions better. I rarely develop them, and when I do, I must have the manual for regular expressions open in order to figure out how to develop one that meets my needs. I do not need such assistance when I deal with solving systems of linear equations, numeric quadrature, or statistical analysis. I am, in a sense, pushing myself out of my comfort zone. In this context, the benchmark scripts I show are mere devices to provide one way of evaluating the merits of the different algorithms I found or developed. And, while I do intend to modify them to work with longer strings, what I was especially hoping for is some insight into the reasons for the difference in performance, and into how to combine the regular expressions that trim leading and trailing whitespace with those that eliminate redundant whitespace characters within strings (or whether such a combination even makes sense).
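    For the combination question, two hedged sketches (my own, not from either poster): a two-pass version that keeps each pattern simple, and a split/join version that does the trimming and squeezing in one conceptual step.

```perl
use strict;
use warnings;

# Two passes: trim the ends, then squeeze interior runs.
sub tidy_two_pass {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
    $s =~ s/\s{2,}/ /g;      # collapse interior runs to one space
    return $s;
}

# One step: split on whitespace and rejoin.  split ' ' (the special
# single-space pattern) also discards leading whitespace, so this
# trims and squeezes at once, at the cost of building a list.
sub tidy_split_join {
    my ($s) = @_;
    return join ' ', split ' ', $s;
}

print tidy_two_pass("  foo   bar  "),   "\n";    # "foo bar"
print tidy_split_join("  foo   bar  "), "\n";    # "foo bar"
```

    Which is faster likely depends on the input mix, for exactly the reason tye gives above: the split/join version pays its cost on every string, while the substitution version can be cheap on strings that are already clean.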


