PerlMonks  

Demarcate Regexes with Unicode

by toro (Beadle)
on Sep 16, 2011 at 06:08 UTC ( #926323=perlmeditation )

Let's call this a mini-meditation.

One lamented aspect of Perl is that regexes get really hard to read. That's why we end up writing =~ m#http://\w+[.]\w{2,3}/[^ ]+# with #'s instead of /'s.

But since # does have a meaning in Perl, and in fact occurs frequently, why not use a sigil §, gnaborretni ⸘, or the tetragram for ease 𝌜? These are all possible with the pragma use utf8;. Then that regex looks like:

=~ m𝌜http://\w+[.]\w{2,3}/[^ ]+𝌜

I find this a bit easier to read. Maybe you will too.
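
For anyone who wants to try this, here is a minimal sketch using the section sign as the delimiter. It assumes the script file is saved as UTF-8; the sample text and variable names are made up for illustration, and only the regex itself comes from the post.

```perl
use strict;
use warnings;
use utf8;    # tells perl that this source file is UTF-8 encoded

# Hypothetical sample text; only the regex comes from the post above.
my $text = 'see http://example.com/some/page for details';

# The section sign is the delimiter, so neither / nor # needs escaping.
my ($url) = $text =~ m§(http://\w+[.]\w{2,3}/[^ ]+)§;

print defined $url ? "matched: $url\n" : "no match\n";
```

Without use utf8; perl reads the delimiter as separate bytes rather than as one character, so the pragma is the part that matters here.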

Re: Demarcate Regexes with Unicode
by Anonymous Monk on Sep 16, 2011 at 07:07 UTC

    One lamented aspect of Perl is that regexes get really hard to read. ....

    If you learn to regex, they're really easy to read :)

    why not use a sigil ...

    Because it's not on the standard keyboard!

    Also, it's a delimiter, not a sigil

    Which is why I prefer to use @ for quoting like s@@@ </sarcasm>

    But seriously, this is exactly why I prefer \ or

    $ perl -MO=Deparse -e " s\\\g "
    s///g;
    -e syntax OK
    $ perl -MO=Deparse -e " sg "
    s///g;
    -e syntax OK
    or even </sarcasm>

    But seriously, between balanced delimiters like

    perl -MO=Deparse -e " s {}//g "
    perl -MO=Deparse -e " s {}\\g "
    perl -MO=Deparse -e " s {}vvg "
    perl -MO=Deparse -e " s {}()g "
    perl -MO=Deparse -e " s {}[]g "
    perl -MO=Deparse -e " s {}<>g "
    perl -MO=Deparse -e " s<><>g "

    I stick to keyboard characters

    s///x
    s===x
    s,,,x
    s!!!x
    s~~~x
    s>>>x
    s}}}x
    and the special case s'''x

    The x means magic
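
A rough sketch of what some of the keyboard delimiters listed above do in practice, including the s''' special case and the /x flag. The variable names and strings are hypothetical, chosen only to show the difference in interpolation.

```perl
use strict;
use warnings;

my $who      = 'world';
my $template = 'hello NAME';

# Ordinary delimiters: the replacement interpolates like a double-quoted string.
(my $slashes = $template) =~ s/NAME/$who/;
(my $commas  = $template) =~ s,NAME,$who,;

# Single-quote delimiters: no interpolation, so "$who" is inserted literally.
(my $quotes = $template) =~ s'NAME'$who';

# /x: whitespace and comments inside the pattern are ignored.
(my $spaced = $template) =~ s/ N A M E   # still only matches the four letters
                             /$who/x;

print "$_\n" for $slashes, $commas, $quotes, $spaced;
```

So the "magic" of x is mostly that the pattern may be spread out and commented; it does not change what the delimiters themselves do.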

      it's a delimiter, not a sigil

      Well, it's a section sign, but lawyers sometimes call it Sigil. (I know that namespace is already occupied in this circle.)

      Because it's not on the standard keyboard!

      If you're on a Mac it's quite easy to make, and if you're on Ubuntu it's pretty easy to make. On Windows too, I still remember

      Alt + Num0141
      from typing Spanish on a US keyboard.

      Anyway, I admit this approach is not for everybody. I like your suggestions (but I don't have a key for ).

Re: Demarcate Regexes with Unicode
by moritz (Cardinal) on Sep 16, 2011 at 07:55 UTC

    On one of my machines, two of the characters you proposed aren't displayed correctly, because there's no font installed that contains them.

    As a maintainer of code like that I would be unhappy to be faced with characters that I don't know how to produce with the keyboard.

    In my humble opinion, the real problem with regex readability is that people tend to not reuse regexes, so everything is pieced together from the primitives.

    I find

    use Regexp::Common qw /URI/;
    if ($string =~ /$RE{URI}{HTTP}/) { ... }

    more readable than any of the alternatives you have offered, and there are no "weird" characters involved.
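
The same reuse idea can be sketched with core Perl only, by naming the pieces with qr//. This is not how Regexp::Common is implemented; the names $host, $path and $http_url are hypothetical, and the patterns are deliberately simplistic rather than a full URI grammar.

```perl
use strict;
use warnings;

# Hypothetical building blocks; simplistic on purpose.
my $host     = qr/\w+(?:[.]\w+)*[.]\w{2,3}/;
my $path     = qr{/[^ ]*};
my $http_url = qr{http://$host$path};

for my $string ('go to http://www.example.org/index.html now', 'no link here') {
    if ($string =~ /($http_url)/) {
        print "found $1\n";
    }
    else {
        print "nothing in '$string'\n";
    }
}
```

Once the pieces have names, the match site reads almost like prose, whatever delimiter is used.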

      I hadn't heard of Regexp::Common. Awesome! You've just saved me a lot of time moritz, thank you.
Re: Demarcate Regexes with Unicode
by hardburn (Abbot) on Sep 16, 2011 at 13:10 UTC

    My radical idea for regex readability is to integrate tablets into development. Touchscreens can be a convenient way to manipulate a Finite State Machine (which is what regexen are). Once saved, it would create a compiled form of the NFA that can be accessed using some API call (similar to Regexp::Common).

    The main difficulty is that Perl's regexen go way beyond just FSMs. How would you represent capturing on the tablet, for instance? I'm also not sure if Perl5's regex engine can be easily given a serialized input.


    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      needs more cloud
        Aw, dammit... Coke, keyboard.... :-)
Re: Demarcate Regexes with Unicode
by JavaFan (Canon) on Sep 16, 2011 at 17:55 UTC
    What you find easier to read, I see more or less as:
              +---+
              |01D|
              |31C|
              +---+
    

    I have my own set of layout rules. For most rules there may be exceptions in which I break them. Two of them, I never break:

    • No line shall exceed 80 characters in length.
    • No non-ASCII character shall appear in the source code.
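
Both rules are easy to check mechanically. A minimal sketch, assuming each line has already been read and decoded into a Perl string; check_line is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of complaints for one line of source.
sub check_line {
    my ($line, $lineno) = @_;
    my @problems;
    push @problems, "line $lineno: longer than 80 characters"
        if length($line) > 80;
    push @problems, "line $lineno: non-ASCII character"
        if $line =~ /[^\x00-\x7F]/;
    return @problems;
}

my @source = (
    'my $ok = 1;',
    "my \$sign = '\x{A7}';",    # section sign, written as an escape here
    'x' x 81,
);

my $n = 0;
print "$_\n" for map { check_line($_, ++$n) } @source;
```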
      You wouldn't use the bullet operator to form a list in a long comment? Say there was a key for it in your system (e.g. you're using a Mac).
        Eh, no. An asterisk will do fine, although I usually will use a dash or numbers in such a case. (I've a Macbook, I haven't seen a key for a "bullet operator").
Re: Demarcate Regexes with Unicode
by DrHyde (Prior) on Sep 20, 2011 at 10:41 UTC
    NEVER use non-ASCII characters in your source code, not even in quoted text. Why? Several reasons:
    1. any given machine may not be configured to understand your character set;
    2. any given machine may not have an appropriate font;
    3. any given editor may not know how to handle that character set;
    4. for some characters, users may not be able to see the differences easily (this is no doubt a function of familiarity)

    If you need to spit out non-ASCII characters, then they should live in a language-specific resource file. This even applies to code that is only for your own consumption where the bizarro-characters are for your own language, to protect you from the pain of editors that don't know your character set on other people's machines, or on mobile devices, or ...

    Any use of non-ASCII characters in code is a bug, and any support for non-ASCII characters in code is also a bug because it encourages the writing of buggy code.

Re: Demarcate Regexes with Unicode
by misterwhipple (Monk) on Sep 22, 2011 at 17:27 UTC
    I like using {}, for two reasons:
    • The delimiters act balanced, so they might as well look like it.
    • It makes the two parts of an s/// very distinct, visually.
    m{rancho}
    s{cucamonga}{dressing}
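
A small sketch of the brace style in use, with a made-up sample string. It also shows that braces inside the pattern, such as a quantifier, nest within the delimiter braces without any escaping.

```perl
use strict;
use warnings;

my $text = 'rancho cucamonga 12345';

# The quantifier's braces nest inside the delimiter braces without escaping.
my ($digits) = $text =~ m{(\d{2,3})};

# Pattern and replacement each get their own pair; whitespace between is allowed.
(my $salad = $text) =~ s{cucamonga} {dressing};

print "$digits\n$salad\n";
```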
    

    --
    Any sufficiently interesting Perl project will depend upon at least one module that doesn't run on Windows.

Re: Demarcate Regexes with Unicode
by tweetiepooh (Friar) on Sep 26, 2011 at 13:41 UTC
    Hmm! Don't think I'll have these on a Solaris console. Even if they were displayable I don't know how to access from a SUN keyboard or from a console session on PC via serial cable. Not everyone has graphical interfaces.

Node Type: perlmeditation [id://926323]
Approved by Corion
Front-paged by Corion