http://www.perlmonks.org?node_id=183830

Yes, I'm still writing my book. And there's going to be a cookbook of sorts at the end, a compendium of useful regexes for all sorts of occasions. (I'm afraid I will be including some tag-parsing ones, but I'll make it clear that HTML and XML should be parsed by modules, etc.)

So I ask you, my fellow amonkicans, to help me. What regexes have you found yourselves using? Not simple dinky ones, but perhaps regexes that got you out of a bind, or were quite sneaky at what they did, or you find yourself using a lot. I'd much appreciate your input, and the proper acknowledgements will be made in my book. Thank you.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Regular Expressions: Call for Examples
by jryan (Vicar) on Jul 21, 2002 at 19:08 UTC

    Well, I went trudging around and I found these:

    This first one was from a client who accidentally did something like: s/\n//g; to a few hundred text files. The files happened to be lists:

    1) foo and his friend bar
    2) Stuff and more (stuff)
    3) More (and30) more (and) more
    4) garbage (8)
    5) some other things
    6) (7lalala)
    which turned into:
    1) foo and his friend bar2) Stuff and more (stuff)3) More (and30) more (and) more4) garbage (8)5) some other things6) (7lalala)
    Anyways, it was my job to fix them. I was having a difficult time correctly parsing, but ended up solving the problem using your sexeger technique.
    $text = reverse $copy;
    $text =~ s/ (?<= \) ) (\d+) (?= [^()]* \) ) /$1\n/gx;
    $text = reverse $text;
    A few weeks later, for fun, I was able to solve the problem using a forward regex:
    my $bal = # this is from perlre
      qr/ \( (?: (?> [^()]+ ) | (??{$bal}) )* \) /x;
    $text =~ s/ ( (?: (??{$bal}) [^(\d]* )* ) (\d+) (?= \) ) /$1\n$2/xg;

    Which is exponentially uglier, and proves just how useful sexeger really is.
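    As a worked example, here is the reversed ("sexeger") substitution from above applied to the damaged one-line list quoted in this post. It is only a sketch: the variable $fixed and the print at the end are additions for illustration, and the data is the sample from the post rather than the client's real files.

```perl
use strict;
use warnings;

# The damaged list from the post: the newlines before 2), 3), ... are gone.
my $copy = '1) foo and his friend bar2) Stuff and more (stuff)3) More (and30) more (and) more4) garbage (8)5) some other things6) (7lalala)';

# Reverse the string, so a label such as "2)" reads ")2" and a plain
# fixed-width look-behind for ")" can anchor on it.
my $text = reverse $copy;

# In the reversed text a label is a run of digits right after ")" whose
# next parenthesis is also ")"; digits inside "(...)" are followed by "(".
$text =~ s/ (?<= \) ) (\d+) (?= [^()]* \) ) /$1\n/gx;

# Reverse back: every inserted "\n" now sits in front of its label.
my $fixed = reverse $text;
print "$fixed\n";
```

    The label 1) is left alone because nothing precedes it in the original, so the look-ahead finds no closing parenthesis after it in the reversed string.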

    Another possibly useful example comes from an exchange on IRC, where someone needed a crude form of escaping that involved stripping all backslashes between brackets. This solved his problem:

    $text =~ s/ (?<= \[ ) ([^\]]*) (?= \] ) / strip_slash($1) /gex;
    sub strip_slash { $_=pop; s/\\//g; $_; }
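    A self-contained run of the same substitution, as a sketch. The sample string is made up, and strip_slash is rewritten here to work on a copy of its argument instead of assigning to the global $_, which is a stylistic change and not the original poster's code.

```perl
use strict;
use warnings;

# Return a copy of the argument with every backslash removed.
sub strip_slash {
    my ($inside) = @_;
    $inside =~ s/\\//g;
    return $inside;
}

# Hypothetical input: backslashes both inside and outside brackets.
my $text = 'x\y [a\b\c] z\w [d\e]';

# Only the text between "[" and "]" is handed to strip_slash; the
# look-behind and look-ahead keep the brackets themselves out of the match.
$text =~ s/ (?<= \[ ) ([^\]]*) (?= \] ) / strip_slash($1) /gex;

print "$text\n";
```

    The backslashes outside the brackets survive, since the pattern can only start matching directly after an opening bracket.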

    Finally, there's a bunch of stuff at the end of Parsing with Perl 6 you might find useful...

      Since I'm not yet the regex master I aspire to be, I can't authoritatively state that this solution is better, but it seems to work.

      If you're working with ordered item labels you can make your assertion more specific:

      $n = 1; s/((??{$n+1})\))(?{$n++})/\n$1/g;

      The first iteration matches "2)" and replaces it with "\n2)", the second "3)", and so on.

      conv

      Update: I should know better than to post when I'm tired. Someone just pointed out to me that it would be much neater to do:

      $n = 2; ++$n while s/$n(?=\))/\n$n/

      Thanks, Aristotle, you're right. The while loop substitution isn't equivalent because it will make replacements in any order (at any position in the string) while the original substitution I posted will not.

        Actually, they are not interchangeable: the latter loses the "ordered items" assumption. Observe what they do with:

        2) bar 3) asfgh 7) lorem 6) ipsum 1) foo 5) baz 4) blah

        I tried fixing that using \G, but didn't come up with anything useful in 5 minutes and gave up since it would have been a lot more complicated than your first regex which I believe is just perfect.
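        To make the difference concrete, this sketch runs both variants on the shuffled string from this reply. The variable names $ordered, $looped and $m are introduced here for illustration; $n is a package variable, as in the original snippet.

```perl
use strict;
use warnings;

my $input = '2) bar 3) asfgh 7) lorem 6) ipsum 1) foo 5) baz 4) blah';

# Variant 1: one left-to-right pass. (??{ ... }) builds the next expected
# label at match time, and (?{ ... }) bumps the counter after each match.
our $n = 1;
(my $ordered = $input) =~ s/((??{ $n + 1 })\))(?{ $n++ })/\n$1/g;

# Variant 2: the while loop. Every pass searches the whole string again,
# so a label is found wherever it sits.
my $m = 2;
my $looped = $input;
++$m while $looped =~ s/$m(?=\))/\n$m/;

print "single pass:$ordered\n\n";
print "while loop:$looped\n";
```

        On this input the single pass breaks only before 2), 3) and 4), because after 3) it can only move forward and the next label it accepts is 4). The loop breaks before every label from 2) to 7), whatever their order.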

        japhy: I like the scenario presented here. This is a regex (series) I'd propose you pick up; it's simple in premise and not far from something one might actually have to do one day, and even a novice can follow the subtle differences between the approaches. A perfect teaching example, if you ask me.

        Makeshifts last the longest.

(jeffa) Re: Regular Expressions: Call for Examples
by jeffa (Bishop) on Jul 21, 2002 at 22:09 UTC
    Here are 3 regexes I used recently for Node Link Checker. The problem is to turn PM link settings into their respective HTML links. First, the lookup table:
    my %TAG = (
        ftp     => 'ftp://',
        http    => 'http://',
        https   => 'https://',
        kobe    => 'http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?filetype=+distribution+name+or+description&j&case=clike&search=',
        kobes   => 'http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?filetype=+distribution+name+or+description&j&case=clike&search=',
        cpan    => 'http://search.cpan.org/search?mode=module&query=',
        isbn    => 'http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=',
        google  => 'http://www.google.com/search?q=',
        lucky   => 'http://www.google.com/search?btnI=I&q=',
        jargon  => 'http://www.science.uva.nl/cng/search/htsearch.CGI?restrict=%2F%7Emes%2F&jargon%2Fwords=',
        id      => '/index.pl?node_id=',
        pad     => '/index.pl?node_id=108949&user=',
        DEFAULT => '/index.pl?node=',
    );
    Next, the regexes:
    # takes care of [tag://target|alt]
    $chunk =~ s/\[(\w+):\/\/(.*?)\|([^\]]+)\]/<a href="$TAG{$1}$2">$3<\/a>/g;
    # takes care of [tag://target]
    $chunk =~ s/\[(\w+):\/\/([^\]]+)\]/<a href="$TAG{$1}$2">$2<\/a>/g;
    # takes care of [target]
    $chunk =~ s/\[([^\]]+)\]/<a href="$TAG{DEFAULT}$1">$1<\/a>/g;
    The 3 regexes have to be executed in that order. Maybe they could be combined into one regex, but this worked for me. :)
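    For anyone who wants to try them, here is a small runnable sketch of the three substitutions. The lookup table is cut down to the entries the sample needs, and the sample $chunk text (including example.com and the node name) is made up for illustration.

```perl
use strict;
use warnings;

# Cut-down lookup table: only the entries used by the sample below.
my %TAG = (
    http    => 'http://',
    cpan    => 'http://search.cpan.org/search?mode=module&query=',
    DEFAULT => '/index.pl?node=',
);

# Hypothetical text with one link of each form.
my $chunk = 'See [cpan://CGI|the CGI docs], [http://example.com] and [Tutorials].';

# Most specific form first: [tag://target|alt]
$chunk =~ s/\[(\w+):\/\/(.*?)\|([^\]]+)\]/<a href="$TAG{$1}$2">$3<\/a>/g;
# then [tag://target]
$chunk =~ s/\[(\w+):\/\/([^\]]+)\]/<a href="$TAG{$1}$2">$2<\/a>/g;
# and last the catch-all [target]
$chunk =~ s/\[([^\]]+)\]/<a href="$TAG{DEFAULT}$1">$1<\/a>/g;

print "$chunk\n";
```

    The order matters because each later pattern also matches the earlier forms: run the catch-all first and it would turn every bracketed link into a DEFAULT node link.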

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      I wrote this mostly to annoy jeffa because of:

      they could be combined into one regex
      But, he insisted that I post this here :)
      $chunk =~ s!
          \[
          (?(?= [^:]*://)
              (?: (\w+):// ([^|\]]+) (?: \| ([^\[]+) )? )
            | (\w+)
          )
          \]
      !
          $_ = "<a href=\"".((defined$1)?(qq($TAG{$1}$2">).((defined$3)?$3:$2)):
          qq($TAG{DEFAULT}$4">$4))."</a>"
      !gex;

      I truly think that jeffa's approach is better than the above. It's much smarter to break a 3-case problem into 3 steps than to use 1 gigantic regex. Just look at the "substitution" section; it's hideous. (I'd normally have used a sub to handle the "substitution" section, but then it wouldn't be "one regex" :)

      .
Re: Regular Expressions: Call for Examples
by Abigail-II (Bishop) on Jul 22, 2002 at 12:05 UTC
    Well, I like the regex that's constructed in http://perl.plover.com/NPC/NPC-3SAT.html. It's the most interesting thing that I've ever done with Perl.

    Then there's the (in)famous prime checker:

    perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'
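    For readers who have not met it before, this sketch wraps the same test in a function (is_prime is a name chosen here, not part of the one-liner). The number is turned into a string of that many 1s; the first alternative rejects 0 and 1, and the second rejects any string that is a group of two or more 1s repeated at least twice, that is, any composite.

```perl
use strict;
use warnings;

# True if $n is prime, using the unary-string regex from the one-liner.
sub is_prime {
    my ($n) = @_;
    my $ones = '1' x $n;                      # e.g. 5 becomes "11111"
    return $ones !~ /^1?$|^(11+?)\1+$/;
}

my @primes = grep { is_prime($_) } 0 .. 30;
print "@primes\n";    # prints: 2 3 5 7 11 13 17 19 23 29
```

    The backtracking makes this very slow for large numbers, so it is a curiosity rather than a practical primality test.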
    And then there's my URL matcher. A bit outdated, as it only matches HTTP, FTP, News, NNTP, telnet, gopher, WAIS, mailto, file, prospero, LDAP, z39.50, CID, MID, VEMMI, IMAP and NFS URLs. Many other URL schemes have seen the light in the last 5 years. One of these days, I'll update the regex....

    Here it is, just remove the newlines....

    Abigail

    (?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\. )*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+) ){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F \d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{ 2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{ 2}))|[;:@&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(? :%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a- fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|- )*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(? :\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+! *'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'() ,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?: (?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;/?:&=])+@(?:(?:( ?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[ a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3})))|(?:[a-zA-Z]( ?:[a-zA-Z\d]|[_.+-])*)|\*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[ a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d ])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:[a-zA-Z](?:[a-zA-Z \d]|[_.+-])*)(?:/(?:\d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+ !*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'() ,]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a -zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d] )?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))/?)|(?:gopher://(?:(?: (?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?: (?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+ ))?)(?:/(?:[a-zA-Z\d$\-_.+!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))(?:(?:(?:[ a-zA-Z\d$\-_.+!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))*)(?:%09(?:(?:(?:[a-zA -Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*)(?:%09(?:(?:[a-zA-Z\d$ 
\-_.+!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))*))?)?)?)?)|(?:wais://(?:(?:(?: (?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?: [a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))? )/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)(?:(?:/(?:(?:[a-zA -Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|( ?:%[a-fA-F\d]{2}))*))|\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d] {2}))|[;:@&=])*))?)|(?:mailto:(?:(?:[a-zA-Z\d$\-_.+!*'(),;/?:@&=]|(?:% [a-fA-F\d]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d] |-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?: (?:\d+)(?:\.(?:\d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'() ,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|( ?:%[a-fA-F\d]{2}))|[?:@&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-zA-Z \d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-) *[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:(?:(?:(? :[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a- zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:(?:;(?:(?:(?:[ a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&])*)=(?:(?:(?:[a-zA-Z\d $\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&])*)))*)|(?:ldap://(?:(?:(?:(?: (?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?: [a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))? ))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]) )|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%2 0)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F \d]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(? 
:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID |oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa]) ?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*)(?:( ?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:(?:( ?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|o id)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?( ?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*))(?:(?:(?: %0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Z\d]|%( ?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?: \.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a -zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*))*(?:(?:(?:%0[Aa])?(?:%2 0)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:\?(?:(?:(?:(?:[a-zA-Z\d$\-_.+ !*'(),]|(?:%[a-fA-F\d]{2}))+)(?:,(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-f A-F\d]{2}))+))*)?)(?:\?(?:base|one|sub)(?:\?(?:((?:[a-zA-Z\d$\-_.+!*'( ),;/?:@&=]|(?:%[a-fA-F\d]{2}))+)))?)?)?)|(?:(?:z39\.50[rs])://(?:(?:(? :(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(? 
:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+)) ?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:\+(?:(?: [a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*(?:\?(?:(?:[a-zA-Z\d$\-_ .+!*'(),]|(?:%[a-fA-F\d]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Z\d$\-_.+!*'(), ]|(?:%[a-fA-F\d]{2}))+))?(?:;rs=(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA -F\d]{2}))+)(?:\+(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*) ?))|(?:cid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:@&= ])*))|(?:mid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:@ &=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:@&=] )*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z \d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\ .(?:\d+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a -fA-F\d]{2}))|[/?:@&=])*)(?:(?:;(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a -fA-F\d]{2}))|[/?:@&])*)=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d ]{2}))|[/?:@&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+ !*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:\*|(?:( ?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+))))?)|(?:(?:;[ Aa][Uu][Tt][Hh]=(?:\*|(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2 }))|[&=~])+)))(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[ &=~])+))?))@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d]) ?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?: \d+)){3}))(?::(?:\d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?: %[a-fA-F\d]{2}))|[&=~:@/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][Tt]| [Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2})) |[&=~:@/])+)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[ &=~:@/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1- 9]\d*)))?)|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~ :@/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9]\d* 
)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]\d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo][Nn ]=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~:@/])+)))?)) )?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA- Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?: \.(?:\d+)){3}))(?::(?:\d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Z\d\$\-_.!~*' (),])|(?:%[a-fA-F\d]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\-_.!~*'(), ])|(?:%[a-fA-F\d]{2})|[:@&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA-Z\d \$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\ -_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*))*)?))|(?:(?:(?:(?:(?:[a-zA- Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z\d \$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*))*)?)))
      I need to create an ASCII art regex. That is, a regex that works (but perhaps doesn't have a good purpose) that, when viewed as an X-by-Y grid of characters, makes a cute picture. I swear I see something in your monstrous regex. Perhaps it's my frayed ends of sanity.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Regular Expressions: Call for Examples
by hossman (Prior) on Jul 21, 2002 at 18:31 UTC
    this is from 180953 .. it pulls out IMDB movie titles from the genre file...

    next unless m{
        ^\> \s+            # starts with "> "
        (.*?               # main part of title ($1 = title)
          \((\d+)(/.*?)?\) # year inside parens, might be (1999/I)
                           # ($2 = year, $3 = crap)
          (\s+\(.*?\))?    # ($4 = crap .. movies might be tv/vids/games)
        )\t+Sci\-Fi$       # must end in Sci-Fi
    }x;
    I use it as an example when people ask me what /x is for, and when they ask what *? means.

    (It could also be a good example of non-capturing parens, if it were changed to actually use them. I sometimes prefer to capture and ignore unless needed ... one man's crap is another man's treasure.)
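    A sketch of what the non-capturing variant could look like. The three sample lines are hypothetical, since the real layout of the genre file is not shown here; the pattern is the one above with the two throwaway groups changed to (?: ... ), so only the title and the year are captured.

```perl
use strict;
use warnings;

# Hypothetical lines in the shape the pattern expects.
my @lines = (
    ">  Matrix, The (1999)\tSci-Fi",
    ">  Some Film (1999/I) (V)\t\tSci-Fi",
    ">  Casablanca (1942)\tDrama",
);

my @found;
for (@lines) {
    next unless m{
        ^> \s+                 # starts with "> "
        (                      # capture 1: whole title, year included
          .*?                  # main part of title
          \( (\d+)             # capture 2: year inside parens
             (?: /.*? )?       # optional "/I" style suffix, not captured
          \)
          (?: \s+ \(.*?\) )?   # optional trailing "(V)" or similar, not captured
        )
        \t+ Sci-Fi $           # must end in Sci-Fi
    }x;
    push @found, [ $1, $2 ];
}

print "$_->[1]: $_->[0]\n" for @found;
```

    With the groups that were "crap" no longer numbered, the year moves from being one capture among four to simply the second of two.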

Re: Regular Expressions: Call for Examples
by mojotoad (Monsignor) on Jul 21, 2002 at 23:24 UTC
    I'm pretty sure that japhy is aware of Regexp::Common by Damian Conway, but fellow monks may not be. It's a nice trove of commonly desired, but not necessarily simple, tricks of the regexen trade.

    Matt

      Can I add a plea to those who do have treasured regexen they're willing to share:
      • download Regexp::Common,
      • see how easy it is to integrate your regexes into the module (using the pattern subroutine),
      • do so,
      • then send me a patch!
      I very much want to update the module with new and useful patterns, but I don't have time to reinvent them. With your help and contributions, Regexp::Common could become a significant community resource.

      BTW, I'm also looking for someone to take on the maintenance and extension of the module. Japhy volunteered previously, but has more than enough on his regex plate at the moment.

        Followup: Abigail has volunteered to take over Regexp::Common.

        Be afraid, be very afraid!

Re: Regular Expressions: Call for Examples
by VSarkiss (Monsignor) on Jul 21, 2002 at 23:53 UTC

    Well, it's not mine, it's by Abigail-II, and it may be sneakier than you intended, but the regular expression that made my jaw drop recently was the n-queen problem solver in Backtracking through the regex world. I'm not sure of its pedagogic value for beginners, but it's certainly a mind-expander.

Re: Regular Expressions: Call for Examples
by stefp (Vicar) on Jul 21, 2002 at 20:56 UTC
    Can I throw a problem at you that I never took the time to think out and that may add some salt to your book? In perl5, it is impossible to get at a capture in an embedded regular expression:

    my $qr = qr|whatever(somecapture_re)whatever|;
    my ($captured) = m/$qr/; # does not work

    A few weeks ago, I looked at your draft (forgot the URL). It looks promising. My problem is: find an API to stash away that information and get it back. Probably, it will not be very pretty, but there is no way to "modularize" regexen in perl5. By that I mean building interesting regexen from simpler ones.

    I once looked at your draft (forgot the URL), it seemed pretty interesting.

    -- stefp -- check out TeXmacs wiki

      my $string = "whateversomecapture_rewhatever";
      my $qr = qr|whatever(somecapture_re)whatever|;
      my ($captured) = $string =~ m/$qr/;
      print $captured, " ", $1;
      prints:
      somecapture_re somecapture_re
Re: Regular Expressions: Call for Examples
by Notromda (Pilgrim) on Jul 22, 2002 at 00:52 UTC
    I needed to get some values out of a logfile, for some realtime reporting of spam blocked by our mail server.

    Here's the regex:
    if (/bouncer postfix\S+ reject: RCPT from (\S+) (530|554|450) (\S+): (.*) from=<(.*?)> to=<(.*?)>/) {

    Here's what it was decoding:

    Jul 3 11:19:00 bouncer postfix/smtpd[14071]: reject: RCPT from unknown[123.123.123.12]: 530 <qwertyy@domain.tld>: Recipient address rejected: Cannot find your hostname, [123.123.123.12]. Ask your system manager to fix your reverse domain name registration. If you are sending spam, go away. ; from=<aaaaaaaaaaaaaaaaaaaaaaaaaa@aaaa.aaa-aaaaa.com> to=<qwertyy@domain.tld>

    For monks not familiar with regex, here's a brief runthrough.

    First it looks for "bouncer postfix" and then some non-whitespace stuff, " reject: RCPT from ", more non-whitespace (keep track of it), " ", one of (530, 554, 450) (keep track of it), " ", more non-whitespace (keep track of it), ": ", anything (keep track of it), " from=<", anything non-greedy (keep track of it), "> to=<", anything non-greedy (keep track of it), ">"

    In other words, from the example above, $1, $2 etc. contain "unknown[123.123.123.12]:", "530", "<qwertyy@domain.tld>", the error message, "aaaaaaaaaaaaaaaaaaaaaaaaaa@aaaa.aaa-aaaaa.com", "qwertyy@domain.tld"
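    Here is the same match as a small standalone script, run against the sample line above. The variable names ($client, $code and so on) are labels chosen for this sketch; the regex itself is unchanged.

```perl
use strict;
use warnings;

# The sample log line, in single quotes so that "@" stays literal.
my $line = 'Jul 3 11:19:00 bouncer postfix/smtpd[14071]: reject: RCPT from '
         . 'unknown[123.123.123.12]: 530 <qwertyy@domain.tld>: Recipient address '
         . 'rejected: Cannot find your hostname, [123.123.123.12]. Ask your system '
         . 'manager to fix your reverse domain name registration. If you are '
         . 'sending spam, go away. ; from=<aaaaaaaaaaaaaaaaaaaaaaaaaa@aaaa.aaa-aaaaa.com> '
         . 'to=<qwertyy@domain.tld>';

# In list context the match returns the six captures in order.
my ($client, $code, $rcpt, $reason, $from, $to) = $line =~
    /bouncer postfix\S+ reject: RCPT from (\S+) (530|554|450) (\S+): (.*) from=<(.*?)> to=<(.*?)>/
    or die "line did not match\n";

print "client: $client\n";
print "code:   $code\n";
print "rcpt:   $rcpt\n";
print "from:   $from\n";
print "to:     $to\n";
```

    Note that the greedy (.*) for the error message runs to the last " from=<" on the line, so the message keeps its trailing " ;".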

    I'm not a very good teacher, but this might be a good real-world example of something a regex shines in. I'll let the book author explain it better. :)

Re: Regular Expressions: Call for Examples
by dws (Chancellor) on Jul 22, 2002 at 03:40 UTC
    Not simple dinky ones, but perhaps regexes that got you out of a bind, or were quite sneaky at what they did, ...

    A couple of times recently I've used a "nested" regexp to pull off a bit of tricky substitution. The "outer" regexp serves as a filter, and the "inner" regex, fired via /e, does a more targeted substitution (or no substitution at all).

    In the code below, the challenge was to turn words like "cowsCanFly" into "cows-can-fly".

    $phrase = "NoMatch cowsCanFly sheepAreVeryCool NoMatch";
    $phrase =~ s{
        \b ( [a-z]+ (?:[A-Z][a-z]+)+ ) \b
    }{
        my $word = $1;
        $word =~ s/([A-Z])/"-" . lc($1)/eg;
        $word;
    }gex;
    print $phrase, "\n";
    When I first posted this fragment (in this node), there was some concern that the regex engine wasn't reentrant, and that I'd just gotten lucky. Perhaps, though I've done this a few times with 5.6.0 or later, and haven't run into any problems.

    japhy, since you're now a regex UberLord, perhaps you can vet this approach for reentrancy issues.

      There's no re-entry here. The regex engine has exited once the regex portion of the s/// ends. Once the right-hand side of the substitution is done, the regex engine starts again; it is not paused, though, it has stopped. Compare: "japhy" =~ m{.(?{ "perlmonk" =~ /./ }).}; Watch it explode.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        That segfaults for me on 5.005_03, and 5.6.0, but not on 5.6.1 or 5.8.0. It seems to be fixed between 5.7.0 and 5.7.1. The former segfaults, the latter doesn't.

        Abigail

Re: Regular Expressions: Call for Examples
by Cody Pendant (Prior) on Jul 22, 2002 at 06:45 UTC
    I would nominate two regexes that merlyn was responsible for, one being the answer to "I want to replace spaces with underscores, but only where they're found between brackets" on a newsgroup posting somewhere.

    It's a relatively simple one, but it opened my eyes, as a beginner, to a whole world of nested and executing regexes.

    I'm reproducing it here from memory so merlyn will forgive me if it's not quite how he did it:

    $str= 'no change <these spaces need replacing> not these <these do>';
    $str =~ s{(<[^>]*?>)}
             { my $x=$1; $x=~s/ /_/g; $x; }egx;
    print $str

    And the other one I can't remember at all, but I remember it involved Old MacDonald, and a regex that double-executed, and therefore ended "/eieio". Has to be included.
    --

    ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

      Do you mean this one?

      $Old_MacDonald = q#print #;
      $had_a_farm = (q-q:Just another Perl hacker,:-);
      s/^/q[Sing it, boys and girls...],$Old_MacDonald.$had_a_farm/eieio;

      As far as I can tell, the first (currently) recorded appearance of this can be found on this announcement regarding what I presume is the first edition of a certain book in most of our collections.

      (Note: I've written as three lines because the two-line version wrapped oddly.)

      --f

Re: Regular Expressions: Call for Examples
by I0 (Priest) on Jul 22, 2002 at 05:01 UTC
    remove nested <table>...</table> elements
    $_ = join'',<>;
    ($re=$_)=~ s#((<table[^>]*>)|(</table>)|<!--.*?-->|.)#${['(','']}[!$2]\Q$1\E${[')','']}[!$3]#sgi;
    $re=join"|",map quotemeta,eval{/$re/};
    die $@ if $@=~/unmatched/i;
    s/$re//g;
    print;
Re: Regular Expressions: Call for Examples
by smackdab (Pilgrim) on Jul 22, 2002 at 03:18 UTC
    You could explain saving and restoring REs to a file...I had some help from the monks to get me going...
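    One simple way to do that, as a sketch and not necessarily what the monks suggested at the time: a qr// object stringifies to a pattern that carries its own flags, so it can be written out as text and compiled again later. The temporary file and the date pattern are made up for the example, and the one-pattern-per-line format assumes the pattern contains no literal newline (an /x pattern spread over several lines would need a different format).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A compiled regex with a flag; the flag is kept in the stringified form.
my $re = qr/(\d{4})-(\d\d)-(\d\d)/i;

# Save: write the stringified pattern, one per line.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "$re\n";
close $fh or die "close: $!";

# Restore: read the text back and compile it again with qr//.
open my $in, '<', $filename or die "open: $!";
chomp(my $saved = <$in>);
close $in;
my $restored = qr/$saved/;

my @parts = '2002-07-22' =~ $restored;
print "@parts\n";
```

    Patterns read from a file this way are interpolated at run time, so Perl refuses any (?{ ... }) code blocks in them unless use re 'eval' is in effect, which is a sensible default for data that comes from outside the program.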
Re: Regular Expressions: Call for Examples
by PodMaster (Abbot) on Jul 23, 2002 at 21:16 UTC
    I didn't write it, but it's pretty damn interesting: strip HTML tags
    sub untag {
        local $_ = $_[0] || $_;
        # ALGORITHM:
        #   find < ,
        #       comment <!-- ... -->,
        #       or comment <? ... ?> ,
        #       or one of the start tags which require correspond
        #           end tag plus all to end tag
        #       or if \s or ="
        #           then skip to next "
        #           else [^>]
        #   >
        s{
            <                    # open tag
            (?:                  # open group (A)
              (!--) |            # comment (1) or
              (\?) |             # another comment (2) or
              (?i:               # open group (B) for /i
                ( TITLE  |       # one of start tags
                  SCRIPT |       # for which
                  APPLET |       # must be skipped
                  OBJECT |       # all content
                  STYLE          # to correspond
                )                # end tag (3)
              ) |                # close group (B), or
              ([!/A-Za-z])       # one of these chars, remember in (4)
            )                    # close group (A)
            (?(4)                # if previous case is (4)
              (?:                # open group (C)
                (?!              # and next is not : (D)
                  [\s=]          # \s or "="
                  ["`']          # with open quotes
                )                # close (D)
                [^>] |           # and not close tag or
                [\s=]            # \s or "=" with
                `[^`]*` |        # something in quotes ` or
                [\s=]            # \s or "=" with
                '[^']*' |        # something in quotes ' or
                [\s=]            # \s or "=" with
                "[^"]*"          # something in quotes "
              )*                 # repeat (C) 0 or more times
            |                    # else (if previous case is not (4))
              .*?                # minimum of any chars
            )                    # end if previous char is (4)
            (?(1)                # if comment (1)
              (?<=--)            # wait for "--"
            )                    # end if comment (1)
            (?(2)                # if another comment (2)
              (?<=\?)            # wait for "?"
            )                    # end if another comment (2)
            (?(3)                # if one of tags-containers (3)
              </                 # wait for end
              (?i:\3)            # of this tag
              (?:\s[^>]*)?       # skip junk to ">"
            )                    # end if (3)
            >                    # tag closed
        }{}gsx;                  # STRIP THIS TAG
        return $_ ? $_ : "";
    }

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Regular Expressions: Call for Examples
by stefp (Vicar) on Jul 27, 2002 at 17:49 UTC