PerlMonks  

Apocalypse 5 and regexes

by c-era (Curate)
on Jun 05, 2002 at 13:30 UTC (#171822=perlmeditation)

I was anxious to start reading Apocalypse 5, but by the time I reached page 3 I was wondering how much things will change. As we know, things will change a lot.

Personally I use regexes regularly (and not the simple ones). My first thought was converting all of my old code (it may be time to look for a new job ;-). Then I read how you can assign variables in the regexes. I especially liked the %hash:=[(...)...(...)] construct for filling hashes. If this is close to the same speed as doing a loop with split, I'll be very happy. I also think that having /x on by default is a good choice and will help make things easier to read. The option to have a non-assigning group is also welcome.
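For comparison, here is a minimal Perl 5 sketch of the two ways of filling a hash mentioned above: a loop with split, and a single global match whose captures are assigned straight into the hash. The sample data and variable names are assumptions made up for illustration; nothing here is Perl 6 syntax.

```perl
use strict;
use warnings;

# Hypothetical sample data: one "key: value" pair per line.
my $text = "name: c-era\nlang: perl\nversion: 5.8\n";

# Approach 1: loop over the lines and split each one at the first colon.
my %by_split;
for my $line (split /\n/, $text) {
    my ($key, $value) = split /:\s*/, $line, 2;
    $by_split{$key} = $value;
}

# Approach 2: a global match in list context returns every capture as a flat
# (key, value, key, value, ...) list, which can be assigned to a hash.
# /x permits the whitespace in the pattern; /m makes ^ and $ work per line.
my %by_regex = $text =~ / ^ (\w+) : \s* (.*) $ /xmg;

print "$_ => $by_regex{$_}\n" for sort keys %by_regex;
```

Both hashes end up with the same three pairs; which approach is faster depends on the data and is best measured with the Benchmark module rather than guessed.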

I'm undecided on how well the grammar rules will work, and if they will solve more problems than they create. I'm sure that Damian's Exegesis 5 will shed some more light and give us even more to talk about.

In all, I'm looking forward to playing with the new regexes, but dreading the day I move my code to perl6 and will have to change and test (again) all of those regexes.

What are some of your reactions to the change?

Re: Apocalypse 5 and regexes
by Juerd (Abbot) on Jun 05, 2002 at 14:01 UTC

    What are some of your reactions to the change?

    It's time to get rid of traditional regexes, as they are too limiting. Change is needed, and I think Larry Wall did a very good job with this redesign. When I first read the recent Apocalypse, I hated the <s and >s, because they looked too HTML-like. After reading more, it became clear that it is doable.

    <before:> and <after:> puzzle me. I had a feeling they had to be the other way around, but even though this is just a change of perspective, that is even worse. Somehow I never had this problem with the punctuation-laden (?=) and (?<=).
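    As a reminder of what the punctuation-laden Perl 5 forms do, here is a small sketch. The sample string is made up; the correspondence noted in the comments (lookahead becoming a "before" assertion and lookbehind an "after" assertion) follows the Apocalypse's naming, where the name describes where the match position sits relative to the asserted text.

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $line = "cost: 42USD, weight: 7kg";

# Lookahead (?=...): match digits only if "USD" comes next. "USD" is checked
# but not consumed. The match position is *before* the asserted text.
my ($amount) = $line =~ /(\d+)(?=USD)/;

# Lookbehind (?<=...): match digits only if "weight: " sits immediately to
# their left. The match position is *after* the asserted text.
my ($weight) = $line =~ /(?<=weight: )(\d+)/;

print "amount=$amount weight=$weight\n";
```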

    The colon overloading I do not like. Colon will already be very important in Perl 6 as the adverbial colon and probably some other meanings, but in regexes it is used for too many things. :, ::, ::: and :::: should be words, in my opinion. I dislike m/:i foo/. It's confusing, and dictating m:i/foo/ would be better imho. Especially m:w/foo/ and m/:w foo/ not being equivalent is weird.

    Being able to capture into variables simplifies a lot. I like that feature, and will probably use it quite a lot. You say you're concerned about speed, but my guess is that it will be faster than a loop, because more can be handled in the core. Perl 6 itself is expected to be faster than 5 too, so I don't think speed will be a problem.

    Best thing is the introduction of closures and boolean expression assertions. Currently, they are available, but the feature is experimental, confusing and inconsistent.

    I like the new regexes.

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

Re: Apocalypse 5 and regexes
by VSarkiss (Monsignor) on Jun 05, 2002 at 14:52 UTC

    This was the one that made me a Perl 6 convert! Up through A3, I was "unsure". With A4, I thought, "This could be OK." A5 made me stand up and yell, "Hallelujah!" (After which I had to convince the other people in the office to let me up off the ground....) I haven't gotten all the way through it even now, but I'm very impressed so far.

    I use regexes a lot, and frankly, I hate the existing syntax. Simply recognizing that regexes are programs with an awful grammar ("They're programs, not strings") is a big step forward. Regularizing the interpolation rules and bracketing constructs is a great simplification. And I love the changes in the semantics of what constitutes a "line", the use of ^ . \Z and so on. It seems like every time I need to use /s or /m, first I have to go read perldoc perlre again to remember which is which.
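    For anyone else who has to look it up every time, a short Perl 5 sketch of which modifier is which. The sample text is an assumption for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /s changes ".": without it, "." refuses to match a newline.
my $dot_plain = ($text =~ /first.*second/)  ? 1 : 0;
my $dot_s     = ($text =~ /first.*second/s) ? 1 : 0;

# /m changes "^" and "$": without it, "^" matches only at the very start of
# the string; with it, "^" also matches after each embedded newline.
my @starts_plain = $text =~ /^(\w+)/g;
my @starts_m     = $text =~ /^(\w+)/mg;

print "dot: plain=$dot_plain s=$dot_s\n";
print "starts: plain=@starts_plain m=@starts_m\n";
```

    A common mnemonic: /s treats the string as a "single line" (so "." crosses newlines), /m treats it as "multiple lines" (so the anchors work per line).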

    Kudos to Larry for the guts and insight to replace an old crufty structure with this beautiful edifice. I have a little bit of precious free time coming up, and this convinced me to spend it helping out in the effort, doing whatever is needed.

Re: Apocalypse 5 and regexes
by ariels (Curate) on Jun 05, 2002 at 15:58 UTC
    I use regular expressions a lot, and pretty much love the syntax. Except when I hate it, of course...

    But I also loved Apocalypse 5 (much more than I did, say, #4). Many moons ago, I wrote a paper titled ``Structured Pattern Matching for Forth'' for Forth Dimensions, and Apoc5's named rules are very similar. (If you must know, the Forth version was much more "atomic", backwards too was it and).

    What really struck me upon careful reading was that TheDamian's Parse::RecDescent appears to have been assimilated and eliminated! This is excellent news for anyone who loved the syntax but hated the performance -- Apoc5 ILREs (increasingly-less-regular-expressions) appear totally to subsume the functionality of P::RD.

    Which also explains <this syntax>, at least for things like <cut>.

Re: Apocalypse 5 and regexes
by chromatic (Archbishop) on Jun 05, 2002 at 16:19 UTC
    Why will you have to convert your code? This is not a flippant question -- I fully expect that Perl 5 will exist alongside Perl 6 for at least a couple of years.
      I have to modify my code quite frequently. When I start using perl 6, I'm not going to want to program in perl 5&6. I have a hard enough time switching between perl, java, and c# ($,;,#,@,/*,// missing when they're needed, and there when they're not). Usually I'll change one object, or a sub. When this comes around, I'll have to convert an entire program at once, unless they have some funky way of having perl 5&6 code in the same program (although that would probably be bad form).

        If I read it right, there'll be a :p5 modifier to say "interpret this regex according to p5 rules". Quote Larry:

        So:
            m:w/ foo\ bar \h* \: (baz)*/
        really means (expressed in Perl 5 form):
            m:p5/\s*foo bar[\040\t\p{Zs}]*:\s*(baz)*/
        I know which of those forms I'd rather use.

        In general, though, I agree with you that it'd be better to convert an entire program at once instead of piecemeal. Unless you lack time and resources....

        Update
        Larry explicitly says later in the document:

        Finally, there's the :p5 modifier, which causes the rest of the regex (or group) to be parsed as a Perl 5 regular expression, including any interpolated strings. (But it still doesn't enable Perl 5's trailing modifiers.)
        Should've read more before I typed....

Re: Apocalypse 5 and regexes
by FoxtrotUniform (Prior) on Jun 05, 2002 at 17:24 UTC
      What are some of your reactions to the change?

    It occurs to me that context-switching between Perl and vi regex dialects is going to be even more difficult. Not a particularly grave consequence, of course.

    --
    The hell with paco, vote for Erudil!
    /msg me if you downvote this node, please.
    :wq

      I believe that it would actually be easier, as one is less likely to confuse two very different languages than two obscurely different ones.
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print
Re: Apocalypse 5 and regexes
by kvale (Monsignor) on Jun 05, 2002 at 18:30 UTC
    All the apocalypsi up to this point have been incremental, but this one changes everything in the regex world. In fact the system that Larry proposes is so far beyond what computer science calls regular expressions that it should be given a new name.

    I would propose grammar expressions, or gramexes, because this system will allow one to parse full recursive grammars easily. I'd guess Larry chose angle brackets because that is what is often used in specifying nonterminals in a BNF grammar.

    From the implementation point of view, such a system will require the full intermixing of perl and gramex code at the Parrot bytecode level, which will probably slow down matching relative to the perl5 regex engine.

    From a user point of view, I think the difference between temp and let and local and my will be pretty confusing for most users.

    Previously, grammars had been parsed by perl code for the parsing and regexes for the lexing. Larry seems to be betting that a specialized subsystem for grammars will be more readable and perhaps faster than the old way.

    Times will certainly be interesting for perl6 gramex implementors.
      'gramexes' just doesn't quite have the ring of 'reglexes' and 'regyaxen' proposed in this node. Something like 'yacclexen' might be more correct but sounds even worse than gramexes. Anyone else got a good nickname for Perl6's non-regular expressions?
      "Interesting" is one way to put it, yep. :)

      Still, it shouldn't be that bad. Most of what A5 calls for that's new is needed for Parrot's parser anyway, so there's a serious overlap there, which means we can do it all just once and save some time and effort.

      Perl 6's regexes may be slow, but I'm not sure that'll really happen. One thing that will help is the JIT, which will reduce some of the time eaten up by bandwidth issues. The single biggest time sink with making these Parrot opcodes is that everything involved with parrot opcodes is 32 bit, while perl 5's regex bytecode engine is all 8-bit. (And yes, the perl 5 regex engine is a bytecode engine too.) That means we're moving four times as much data across the processor bus. OTOH, it's naturally aligned, which helps, and if things stay in L1 cache it's not a problem anyway. It's the initial cache load, and the times when things spill out of L1 cache, that really hurt.

      There's a fair amount we can do about that, and I'm more than happy to put in as much cheating as we need to for speed reasons. (At the moment, the single biggest reason that perl 5's regex engine is faster than a comparable one in Parrot is that perl 5's regex engine does a lot of optimization of the search, which'll make a huge difference)

      I also think "gramex" is a bit clunky. But grammars mean parsing. I assume that "parsex" would be a bit too "volatile" and open to unrelated criticism.

      The shorter "parex" might be about right; there might be wars about the "correct" pronunciation ("Parsing :: parex." "No! Parrot :: parex!"). So what?

      Eventually, of course, we'll all want to shoot the person who puns about the "parex par excellence!" for the thousandth time...

      "Gramex" sounds... weird. To me, at least. And (to steal a phrase off the Wall) it's bad Huffman coding.

      Why not "grex" (and, presumably, many "grexen")? It's better, after all, than what you use to grep on UN*X.

Re: Apocalypse 5 and regexes
by erikharrison (Deacon) on Jun 05, 2002 at 19:56 UTC

    Alright, I've been thinking hard and here is my praise.

    Perl 5 cannot parse with just its regexes. Really what we do is tokenize. For simple formats and such, the two are very nearly synonymous. But Perl often needs to do so much more. And all this power was TOO concise and just about never self documenting. It's easy to write buggy regexes because the syntax itself provides nothing to allow you to build regexes out of little parts. And finally, regexen are a cargo cultish magic. Those who can master them don't have an easy way of providing handy little tools to their lessers, and instead we have broken code descending from broken code - Matt Wright and CGI parsing as an example.

    Potentially, Apoc 5 solves all of this. I love Apoc 5. My analysis follows.

    Okay, a lot of this is a nice series of syntactic changes. Trailing modifiers moving to the front prevents action-at-a-distance problems and increases clarity without changing much. /x being on at all times is a lovely cultural change, but not earth shattering. Named captures increase clarity, and making them aliases to the numbered vars is the best way to handle them. Regexes being first class objects is really useful, but is there only if you need it. Embedded code is handled in a lovely way, and making the closures into anonymous methods means that parser writers have clean access to the power of regex objects while keeping the syntax clean. This also allows me to properly debug my regexes incrementally by good old print statement embedding. All of this is nice and well handled, but really not what I'm excited about.

    It's all about grammars, baby. Rule declarations and angle brackets solve a problem so deep that it's hard to even see that it's there. One, they give true parsing power to the regex engine. They allow me to build up regexes incrementally from component rules in a way that is both easier and self documenting. The syntax looks like the BNF grammars that parser writers are already used to.
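    The closest Perl 5 gets to building a pattern from component rules is interpolating qr// objects. A minimal sketch follows; the piece names ($octet, $ipv4, $port, $endpoint) are illustrative assumptions, and this shows composition only, not the recursive rules or grammar inheritance that the Apocalypse describes.

```perl
use strict;
use warnings;

# Small named pieces, each compiled once with qr//.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/;
my $ipv4  = qr/$octet(?:\.$octet){3}/;
my $port  = qr/\d{1,5}/;

# A larger pattern assembled from the pieces; /x lets it be spaced out.
my $endpoint = qr/ \A ($ipv4) : ($port) \z /x;

my @report;
for my $candidate ("192.168.0.1:8080", "256.1.1.1:80") {
    if (my ($host, $p) = $candidate =~ $endpoint) {
        push @report, "$candidate -> host=$host port=$p";
    }
    else {
        push @report, "$candidate -> no match";
    }
}
print "$_\n" for @report;
```

    Each interpolated qr// object keeps its own modifiers, so the pieces can be tested on their own before being combined.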

    My favorite aspect of all this is that we can standardize rules into modules and pass these in clear ways to the Perl community. This provides a great deal of clarity and allows regex masters to share their skill with others while preventing cargo cult practices. And grammars now allow for the best of object oriented design, not only through opaque "rules", but also through grammar inheritance. Simply adding or modifying rules can make extending existing regex-based tools much easier. People who have to maintain code are gonna love this.

    All in all, I am very fond of the new system. I'm sure that there are nits to be picked and bugs for the community to hunt down (much like how currying syntax changed post Exegesis because of some really smart people on the Perl 6 language list) - and I hope that they do. But the ideas are incredible and in general the execution is brilliant.

    Cheers,
    Erik

    Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett

Re: Apocalypse 5 and regexes (modifier reform)
by grinder (Bishop) on Jun 06, 2002 at 09:43 UTC

    There's one thing that bugs me in the page on modifier reform, and that's the syntax for specifying when to match.

    and I quote:

    A modifier that starts with a number causes the pattern to match that many times. It may only be used outside the regex. It may not be bundled, because ordinals are distinguished from cardinals. That is, how it treats those multiple matches depends on the next character. If you say

    s:3x /foo/bar/

    then it changes the first 3 instances. But if you say

    s:3rd /foo/bar/

    it changes only the 3rd instance. You can say

    s:1st /foo/bar/
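    For comparison, Perl 5 has no such modifiers, so the same effects need a counter and the /e modifier. This is a rough sketch of how the two quoted forms could be approximated, not a statement of how Perl 6 will implement them; the sample string and variable names are assumptions.

```perl
use strict;
use warnings;

my $input = "foo foo foo foo foo";

# Rough equivalent of s:3x/foo/bar/ : replace only the first three matches.
# /e evaluates the replacement as code, so a counter can decide per match
# whether to substitute "bar" or put the original text ($1) back.
my $count = 0;
(my $first_three = $input) =~ s/(foo)/ ++$count <= 3 ? "bar" : $1 /ge;

# Rough equivalent of s:3rd/foo/bar/ : replace only the third match.
$count = 0;
(my $third_only = $input) =~ s/(foo)/ ++$count == 3 ? "bar" : $1 /ge;

print "$first_three\n";
print "$third_only\n";
```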

    I love the concept, but I'm not too keen on the use of 1st, 2nd, 3rd, 4th... I think it's just a bit too clever for its own good. Wearing my programmer's hat, I would rather see a more generic method. Here, 1, 2 and 3 are exceptions. It's not until 4 that the pattern repeats. Wearing my linguist's hat, at least for the only other language I feel qualified to speak for, French, the pattern is simpler: 1er, 2eme, 3eme... (accents and gender notwithstanding). So by some unit of measure, the chosen syntax is more difficult than it could be.

    Knowing the shorthand for ordinals is not one of the simpler things foreign students learn about in English, thus non-English speakers are at more of a disadvantage than for other keywords such as die, split, rename... at least you can find those in a dictionary.

    A more linguistically neutral approach would be to use n: 1n, 2n, 3n. It looks more mathematical and ties in with "to the nth degree", thereby retaining some notion of ordinality.

    Ok, I admit this is a minor nit, but I did find it quite jarring when I read the Apocalypse.


    print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
      This is a good point.

      Of course, I suspect we're not going to be snobs about the suffixes, and can simply suggest that ...th is the standard "skipping suffix" for anyone who can't handle all four. That is:

      s:1th/pattern/string/; s:2th/pattern/string/; s:3th/pattern/string/; s:4th/pattern/string/; # etc.

      would be perfectly valid.

      Alternatively, the rule might be that any suffix other than ...x means repetition:

      s:1um/pattern/string/; s:2ieme/pattern/string/; s:3itter/pattern/string/; s:4issimo/pattern/string/; # etc.
      ;-)
Re: Apocalypse 5 and regexes
by Abigail-II (Bishop) on Jun 06, 2002 at 14:41 UTC
    In this apocalypse, I read:
    In real life, tokens are more recognizable if they are separated by whitespace.

    The culture is biased in the wrong direction. Whitespace around tokens should be the norm, not the exception. It should be acceptable to use whitespace to separate tokens that could be confused.

    and then I remembered that in the previous apocalypse, Larry decided to make Perl whitespace sensitive and disallowed using whitespace between an aggregate and its index. %hash {key} doesn't index the hash in perl6.

    Not even Python has such idiotic whitespace rules.

    I've been programming for 20 years. I cannot recall ever using a language that disallowed whitespace between an aggregate and its index. I've been programming Perl for a mere 7 years, less than half of my programming life. I ain't going to break a good habit of twenty years.

    But then, perl6 is still vaporware and its progress is slow. I'll keep using perl5 and if its development ceases, I'll learn a new language.

    Abigail

      Given that the penultimate style rule in perlstyle is 'Be consistent', it does seem odd that Larry would remove whitespace sensitivity from one part of the language but add it to another.
      The open way that development is being done around Perl 6 is completely new in the history of programming languages. Considering how much money has been invested in its development, I'm rather impressed with its progress.
      ()-()
       \"/
        `                                                     
      
      I can't imagine why I would want to put a space between an aggregate and its index (personally, YMMV). That makes it appear as if there are two entities there, when in fact it is one. I can imagine why I would want whitespace in a regex though (and the proposed RE system looks really great). I'm also under the impression that disallowing a space between %hash and {key} allows %hash {block} to perform a block action. The big mistake here is not a whitespace issue, though-- or an inconsistency WRT whitespace. It's a symbol issue.

      Apparently Larry has a {} fetish. There is not a single reason why %hash[key] is less acceptable than %hash{key} in Perl6, yet it looks as though the latter will be the way to go. If you kick out %hash{key} in favor of %hash[key], you immediately open up the possibility that %hash [key] can be allowed (unless I've missed something that indicated [key] on its own to mean something the way {key} on its own means something).
        {I}{like}{to}{put}{a}{space}{between}{an}{aggregate}{and}{its}{index}{because}{if}{you}{do}{not}{things}{get}{very}{hard}{to}{read}{Remember}{that}{Larry}{used}{to}{use}{English}{as}{an}{inspiration}{when}{he}{designed}{Perl}{Too}{bad}{he}{has}{chosen}{German}{as}{the}{way}{for}{Perl6}{I}{fail}{to}{see}{why}{you}{think}{it}{is}{just}{one}{entity}{Unlike}{Perl1}{hashes}{are}{first}{class}{citizens}{A}{hash}{is}{something}{but}{a}{hash}{element}{is}{something}{else}{Just}{like}{an}{object}{is}{different}{from}{a}{method}{and}{a}{method}{is}{different}{than}{its}{arguments}{Abigail}
        There is not a single reason why %hash[key] is less acceptable than %hash{key} in Perl6

        You're perfectly correct. There's not a single reason; there are (at least) five:

        1. We're trying to retain at least some backwards compatibility in the syntax for variable accesses,

        2. Not distinguishing statically between hash and array look-ups severely limits opportunities for fundamental optimizations,

        3. Not distinguishing statically between hash and array look-ups reduces our ability to do compile-time error detection,

        4. Even without error checking, programmers are much less likely to make mistakes if there is three characters' difference between a hash look-up and an array look-up, rather than just a single sigil (that's why we inflect both noun and verb when making plurals in English),

        5. Unless we distinguish hash and array look-ups, there's no way to create objects that can be used as both, according to need (e.g. we can't fix caller to allow named return elements, and we can't allow regexes to simultaneously return their numbered captures as an array and their named captures as a hash).
        $fullname = $fullname{$id}; $email = $email {$id};


        We are using here a powerful strategy of synthesis: wishful thinking. -- The Wizard Book

        Actually this has been hashed (I say, I say, that's a joke there, boy) out pretty thoroughly on the language list. The braces are necessary because otherwise anonymous hash constructors seem to have no syntactic justification, and we need the brackets for anonymous array constructors. Plus there are strong historical reasons at work.

        As for Larry's braces fetish, this is absolutely not true. Having designed a few mini languages myself, I can tell you that there isn't anything else to use. Braces already mean code, parens mean args, brackets mean array ref, and angles are ugly, so filehandles got the diamond and hashes got the braces because in early Perl it was pretty clear what they meant. The only reason Larry seems to have a braces fetish is that a lot of the Apocalypses have dealt with blocks - closures, special CAPITAL blocks, switch statements, embedded code.

        Cheers,
        Erik

        Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett

Node Type: perlmeditation [id://171822]
Approved by Aristotle
Front-paged by Petruchio