Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
good chemistry is complicated,
and a little bit messy-LW
 
PerlMonks

sexeger

by japhy (Canon)
 | Log in | Create a new user | The Monastery Gates | Super Search | 
 | Seekers of Perl Wisdom | Meditations | PerlMonks Discussion | 
 | Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | Tutorials | 
 | Poetry | Recent Threads | Newest Nodes | Donate | What's New | 

on Sep 21, 2000 at 01:56 UTC ( #33410=perlmeditation: print w/ replies, xml ) Need Help??

as modified from my Perl5-Porters post


If we're able to look at the specific nodes of a regular expression, then we should be able to create a regex that works from the BACK of a string to the FRONT. What do I mean?
($last_numbers) = $string =~ /(\d+)(?!\D*\d)/;
That regex finds the last occurrence of digits in $string. Now, it's not all that bad, but it could be made a great deal simpler by:
  1. reversing the input string
  2. reversing the nodes of the regex (optimize it, too)
  3. reversing the match
"How in the world is that simpler?" you ask. "Just watch!" I reply.
$last_numbers = scalar reverse( # reverse the match reverse($string) =~ # reverse the input string /(\d+)/ # reverse the regex );
Now, the regex (\d+)(?!\D*\d) can also be written as (\d+)(?=\D*$). If digits aren't followed by another digit, then they are followed by only non-digits, if anything. In the process of reversal, the look-ahead can be eliminated, since the regex would be \D*(\d+) which has a redundant leading "node", \D*. The regex's meaning is not changed by its removal.

We already have re.pm which can pick a regex apart during compilation and during the actual matching stage. This has the ability to pick apart individual nodes of a regular expression.
Note: the term "node" refers to a combination of regex atoms and quantifiers which can match a given substring backwards or forwards. In the regex \d+[abc]def(foo|bar), the nodes are:
  • \d+
  • [abc]
  • d
  • e
  • f
  • f
  • o *
  • o *
  • b
  • a
  • r
* technically, the two o nodes can be combined here

This regex, reversed, would be (rab|oof)fed[abc]\d+
For the mean time, there is no way to automatically reverse a regex. You have to do it by hand. But the benefits of it are amazing. Maybe they need to be seen to be believed. If so, here is an example.

You have a simple programming language, called KC ("kinda crummy"). It has a VERY simple grammar:
  • comments are matched by <[^>]*>
  • strings are matched by "[^"\\]*(?:\\.[^"\\]*)*"
  • everything else is matched by [^<>"]+
The string definition is a double-quoted strings with escapes allowed in it, like "foo \"bar\" \\ blat". All programs written in KC can be matched by the following regex:
$CODE =~ m{ \A (?: [^<>"]+ # non-string, non-comment | "[^"\\]*(?:\\.[^"\\]*)*" # string | <[^>]*> # comment )* \z }x;
We can write a regex to match an entire file (shown above). Oh, before I go any further, here's the sample program:
<this is a sample program> int x = 10; <what a silly grammar> str y = "cool \" beans"; if (len(y) GREATER_THAN x) { <can't use gt and lt symbols... heehee> <empty comment coming up ! print y, " is longer than ", x, " characters"; <> <and more stuff to come soon> chop(y,x); print "I sliced 'y' down to ", x, " characters for you"; }
Let's say we wanted to extract the LAST comment from this program as well as be sure it was coded properly. Here's my regex. Note: I used the "cut" regex operator (?>...) which matches without allowing a backtrack, to speed up the non-string, non-comment part.
($last_comment) = $CODE =~ m{ \A (?: <[^>]*> | "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )* (<[^>]*>) (?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )* \z }x;
I'll also implement a reversed version. The reversed terms for the grammar are:
  • comments are matched by >[^>]*<
  • strings are matched by "(?:[^"\\]*.\\)*[^"\\]*"
  • everything else is matched by [^<>"]+
$last = scalar reverse(reverse($CODE) =~ m{ \A (?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )* (>[^>]*<) (?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) | >[^>]*< )* \z }x);
If you look long enough (it shouldn't take too long) you'll see that EACH node of the terms has been reversed properly. Now, I ran a benchmark (available on my web site -- follow the links for "Sex, Eger!") and it showed that the reverse method was actually faster than the forward version, EVEN with having to reverse the input string AND the value of $1. When I implemented a cached version, where the reverse of $CODE was already stored, it was even faster (obviously). Here are the wonderful results:

Running for at least 15 seconds
Method Hz # iters
forwards1072.42/s16097
backwards1448.33/s21725
back+cache1613.53/s24574


This is, to say the least, very good. The backwards version runs %33 faster, and the cached version runs %50 faster!

If we already know the KC code is well-formed (that is, it will match the entire-code regex), then we can eliminate a lot of the code in the regexes. First, to match the last comment, forwards:
($last_comment) = $CODE =~ m{ (<[^>]*>) (?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )* \z }x;
And now backwards:
$last = scalar reverse(reverse($CODE) =~ m{ \A (?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )* (>[^>]*<) }x);
The benchmark gave me the following data:

Running for at least 15 seconds
Method Hz # iters
forwards1438.02/s21786
backwards2888.13/s43322
back+cache3481.73/s52226


Another tremendous difference. The backwards version is %100 faster, and the cached version is %150 faster!

What does this mean? Perhaps you should look into rethinking how you match your strings. If you're looking for the last of something, or for something towards the end of a string, and the string is potentially large, thinking about regex reversal. It could save you a lot of time.

This text will be updated and modified at my web site. I'll put headings in, and all that fancy stuff. :)

$_="goto+F.print+chop;\n=yhpaj";F1:eval
Jeff "japhy" Pinyan
japhy@pobox.com
http://www.pobox.com/~japhy/

Comment on sexeger
Select or Download Code
RE: sexeger
by ncw (Friar) on Sep 21, 2000 at 02:58 UTC
    An excellent Aha! I shall put that trick into my big bag of regular expression tips. (I opened it the other day and it looked just like my 2 year old was hammering on the keyboard ;-)
[reply]
RE: sexeger
by japhy (Canon) on Sep 21, 2000 at 05:34 UTC
    There is a minor error in the "KC" program. There's a ! in the place of a >. I've fixed it, and modified the regexes slightly to be better. Check out my web site for updates.

    $_="goto+F.print+chop;\n=yhpaj";F1:eval
[reply]
RE: sexeger
by extremely (Priest) on Sep 21, 2000 at 06:45 UTC

    I can't wait to get home and play with this, in the meantime please accept the ++ and this note ;)

    sexeger? we can't allow that. Too pseudo-dirty for such clear thinking. And is that the plural? Should the singular form be exeger? What we need is a better name, like "revex". Come on this is a language with Linguistics at the heart of it =)

    On a more serious note this should be a hack in the "re" code itself. Implement it with a "r" flag. Just have it go thru the string from back to front for me and I'll rewrite the regex myself. Having all the tools available from the regular engine just eating the string backwards would make me happy.

    If it can be made to do the whole she-bang that japhy promotes here, should it also auto flip on the flag in the case of  $a =~ m/letters$/; I know that a great deal of end terminated optimizing has been done. Could that be simplified away with this and the regular forward optimizing that is done on  $a=~ m/^letters/; style regexes?

    This is making my head spin, back to CGI =)

    --
    $you = new YOU;
    honk() if $you->love(perl)

[reply]
[d/l]
[select]

Back to Meditations


Login:
Password
remember me
What's my password?
Create A New User

Node Status
node history
Node Type: perlmeditation [id://33410]
Approved by holli
Front-paged by holli
help
Community Ads
Chatterbox
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users
Others imbibing at the Monastery: (8)
GrandFather
wfsp
FunkyMonk
atcroft
herveus
Spooty
Eyck
gnosti
As of 2009-11-21 09:10 GMT
Sections
The Monastery Gates
Seekers of Perl Wisdom
Meditations
PerlMonks Discussion
Categorized Q&A
Tutorials
Obfuscated Code
Perl Poetry
Cool Uses for Perl
Perl News
Information
PerlMonks FAQ
Guide to the Monastery
What's New at PerlMonks
Voting/Experience System
Tutorials
Reviews
Library
Perl FAQs
Other Info Sources
Find Nodes
Nodes You Wrote
Super Search
List Nodes By Users
Newest Nodes
Recently Active Threads
Selected Best Nodes
Best Nodes
Worst Nodes
Saints in our Book
Leftovers
The St. Larry Wall Shrine
Offering Plate
Awards
Craft
Snippets Section
Code Catacombs
Quests
Editor Requests
Buy PerlMonks Gear
PerlMonks Merchandise
Planet Perl
Perlsphere
Use Perl
Perl.com
Perl 5 Wiki
Perl Jobs
Perl Mongers
Perl Directory
Perl documentation
CPAN
Random Node
Voting Booth

Future historians will find that the material characteristic of the current era is...

Aluminium
Plastic
Oil
Water
Carbon dioxide
Copper
Iron
Silicon
Salt
Uranium
Hydrogen
Other

Results (729 votes), past polls