PerlMonks  

Re: Re: Regular Expression Builder

by Anonymous Monk
on Aug 30, 2002 at 17:05 UTC ( [id://194182]=note )


in reply to Re: Regular Expression Builder
in thread Regular Expression Builder

But I don't think this will scale very well... (and it probably has subtle problems anyway)

One quibble is that because \d is a subset of \w then a string such as "abc123def" will get \w{9} in your version. Here's a slightly improved version (for some definition of improved):

    my $string = " \aabc123def!*#\n";
    $string =~ s{
          ([[:digit:]]+)
        | ([[:alpha:]]+)
        | ([[:punct:]]+)
        | ([[:space:]]+)
        | ([[:cntrl:]]+)
        | (.)
    }{
          defined($1) ? '[[:digit:]]{' . length($1) . '}'
        : defined($2) ? '[[:alpha:]]{' . length($2) . '}'
        : defined($3) ? '[[:punct:]]{' . length($3) . '}'
        : defined($4) ? '[[:space:]]{' . length($4) . '}'
        : defined($5) ? '[[:cntrl:]]{' . length($5) . '}'
        :               "\Q$+\E"    # anything else?
    }gex;
    print $string;

But it still has problems (for example, \n is in both :space: and :cntrl: so "\n\a" produces [[:space:]]{1}[[:cntrl:]]{1}, but "\a\n" produces [[:cntrl:]]{2}).
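To make the ordering problem concrete, here is a minimal sketch that wraps the substitution above in a sub and runs it on both inputs. The name build_pattern is hypothetical (it is not from the post), and quotemeta($6) stands in for "\Q$+\E", which should be equivalent here since the catch-all is the sixth group.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the post.
sub build_pattern {
    my ($string) = @_;
    $string =~ s{
          ([[:digit:]]+)
        | ([[:alpha:]]+)
        | ([[:punct:]]+)
        | ([[:space:]]+)
        | ([[:cntrl:]]+)
        | (.)
    }{
          defined($1) ? '[[:digit:]]{' . length($1) . '}'
        : defined($2) ? '[[:alpha:]]{' . length($2) . '}'
        : defined($3) ? '[[:punct:]]{' . length($3) . '}'
        : defined($4) ? '[[:space:]]{' . length($4) . '}'
        : defined($5) ? '[[:cntrl:]]{' . length($5) . '}'
        :               quotemeta($6)
    }gex;
    return $string;
}

# "\n" is tried against [[:space:]] first, so it never reaches [[:cntrl:]] here.
print build_pattern("\n\a"), "\n";    # [[:space:]]{1}[[:cntrl:]]{1}

# "\a" is not whitespace, so [[:cntrl:]]+ starts the run and swallows "\n" too.
print build_pattern("\a\n"), "\n";    # [[:cntrl:]]{2}
```

The result depends on which alternative gets first try at each position, so reordering the alternatives only moves the asymmetry around rather than removing it.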

Replies are listed 'Best First'.
Re: Re: Re: Regular Expression Builder
by demerphq (Chancellor) on Aug 30, 2002 at 17:18 UTC
    One quibble is that because \d is a subset of \w then a string such as "abc123def" will get \w{9} in your version.

    Yup. But personally I consider that a feature, not a bug. :-) After all, ldkjdlkjf2098kklls probably isn't [[:alpha:]]+\d+[[:alpha:]]+

    But we are both in agreement that there isn't a good way to do this, although as we both have shown there are a variety of bad ways to do it... BTW, is the . really necessary? I don't think it is, as the s/// will just skip the char if it doesn't match.

    Oh, and I considered using something like what you post here, but I figured that since I tend not to use the POSIX char classes much, others probably wouldn't either.

    :-)

    Yves / DeMerphq
    ---
    Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

      But we are both in agreement that there isn't a good way to do this

      Agreed. Anything that tries to generalize beyond "\Q$string\E" requires a variety of assumptions.
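For reference, the assumption-free baseline mentioned above looks something like this. It is only a sketch: the sample string and the variable name $exact are made up for illustration.

```perl
use strict;
use warnings;

my $string = "abc123def!*#";

# \Q...\E escapes every metacharacter, so the pattern matches this
# exact text and nothing else. \A and \z anchor it to the whole string.
my $exact = qr/\A\Q$string\E\z/;

print "matches itself\n"              if     $string         =~ $exact;
print "does not match a lookalike\n"  unless "abc456def!*#" =~ $exact;
```

Anything looser than this has to guess which parts of the sample are incidental (the particular digits, say) and which are essential.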

      BTW, is the . really necessary? I don't think it is, as the s/// will just skip the char if it doesn't match.

      Just trying to be careful :-) In case I neglected something with those classes, and that something also required escaping, then just leaving it in the string wouldn't result in a valid re. (I don't tend to use POSIX char classes either, so I wasn't sure exactly how inclusive I was being).
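One way to check how inclusive those five classes are is to count the characters that none of them match. A quick sketch, limited to the ASCII range on plain byte strings; for code points above 127 the answer depends on the Perl version, locale, and string encoding, so the catch-all may still earn its keep there.

```perl
use strict;
use warnings;

# Collect every ASCII code point that none of the five classes matches.
my @uncovered = grep {
    chr($_) !~ /[[:digit:][:alpha:][:punct:][:space:][:cntrl:]]/
} 0 .. 127;

printf "%d of 128 ASCII characters reach the catch-all\n", scalar @uncovered;
```

For ASCII input this prints 0, so the (.) branch is purely defensive there.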

      After all ldkjdlkjf2098kklls probably isnt [[:alpha:]]+\d+[[:alpha:]]+
      I did play around once with a small script that did these things, when I was bored, and no, it probably isn't. But what I did was take several strings and try to derive a common expression out of them: first looking for similarities, like a sequence of numbers in the middle, or whitespace at the end, or whatever, and then building sub-regexes from the parts. I think starting with splitting on non-words and such gave so-so results for things like email addresses.

      Of course, I never got any really usable results, but it was a fun exercise. :) What I wanted to say was that to decide which it should be, you need a decent sample of several strings that should all match. Then it is sometimes possible to get something to build upon. Maybe. :)
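The multi-sample idea could be sketched roughly as follows. This is not the script described above, just one guess at the approach: runs() and common_pattern() are hypothetical helpers, and the sketch assumes every sample breaks into the same sequence of character classes, giving up otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into [class, length] runs.
sub runs {
    my ($s) = @_;
    my @runs;
    while ($s =~ /\G(?:(\d+)|([[:alpha:]]+)|(\s+)|(.))/gs) {
        my ($class, $text) =
              defined($1) ? ('\d',          $1)
            : defined($2) ? ('[[:alpha:]]', $2)
            : defined($3) ? ('\s',          $3)
            :               (quotemeta($4), $4);
        push @runs, [ $class, length $text ];
    }
    return @runs;
}

# Hypothetical helper: one pattern covering all samples, or nothing
# if the samples do not share the same sequence of classes.
sub common_pattern {
    my ($first, @others) = map { [ runs($_) ] } @_;
    for my $other (@others) {
        return unless @$other == @$first;
    }
    my @out;
    for my $i (0 .. $#$first) {
        my $class = $first->[$i][0];
        my $min   = my $max = $first->[$i][1];
        for my $other (@others) {
            return unless $other->[$i][0] eq $class;
            my $len = $other->[$i][1];
            $min = $len if $len < $min;
            $max = $len if $len > $max;
        }
        # Use {n} when every sample agrees on the length, else {min,max}.
        push @out, $min == $max
            ? sprintf('%s{%d}',    $class, $min)
            : sprintf('%s{%d,%d}', $class, $min, $max);
    }
    return join '', @out;
}

my $re = common_pattern('abc 123', 'de 4567', 'fghi 89');
print "$re\n";    # [[:alpha:]]{2,4}\s{1}\d{2,4}
```

With only one sample every quantifier collapses to an exact count, which is the point made above: the ranges only become meaningful once there are several strings to compare.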


      You have moved into a dark place.
      It is pitch black. You are likely to be eaten by a grue.
