Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
Problems? Is your data what you think it is?
 
PerlMonks  

Re^2: regex at word boundary

by davido (Archbishop)
on Dec 08, 2005 at 04:27 UTC ( #515124=note: print w/ replies, xml ) Need Help??


in reply to Re: regex at word boundary
in thread regex at word boundary

You're right about the (??{code}) construct facilitating the use of regular expressions in detecting palindromes. Behold:

use strict; use warnings; my $string = "God dog"; my $re = qr/^ (.+) (??{ ( length $^N > 1 and lc $^N eq reverse lc $^N ) ? '' : '1\b2' }) $/ix; print "$string is", ($string =~ $re) ? "" : "n't", " a palindrome.\n";

One trick here; since (??{code}) must spawn a regular subexpression that will then be evaluated for truth, the code block I used plays a little trick. If the boolean test of reversability succeeds, the (??{code}) expression returns an empty string (which, in other words, adds no additional criteria to be matched). If the reversability test fails, the code returns a regular subexpression that can never succeed; a search for '1\b2' (in other words, the literal number one, followed by the literal number two, but with a word boundry sandwiched between; an impossibility). That way the outcome of the reversability test can force a failure to match for the entire regular expression.

Also, the use of $^N is a convenience described in perlvar and perlre. It contains the most recent parenthetical capture. That way you don't need to count capturing parens, not that it's an issue in this case.

By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not. Why? Wikipedia says, "A palindrome is a word, phrase, number or any other sequence of units (like a strand of DNA) which has the property of reading the same in either direction..." It also says that the position of spaces can usually be adjusted as necessary. My regexp doesn't accomodate that possibility; it treats space like any other character. But why not; it's just a proof of concept. ;)


Dave


Comment on Re^2: regex at word boundary
Select or Download Code
Re^3: regex at word boundary
by QM (Vicar) on Dec 08, 2005 at 04:50 UTC
    Very good.

    '1\b2'
    I would point out the existence of (?!), which I'm told always fails, though I'm not sure I understand it.

    Update 1:

    By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not.
    Yes. The OP was looking for multi-word palindromes, perhaps more along the lines of the "interesting to humans" variety, which seems to be what started the thread in the first place.

    Update 2: After further examination, your idea could be adapted for intervening whitespace (or indeed any noise characters) if the regex engine was re-entrant (is that with or without the hyphen?). Something like:

    (??{ local $N; ($N = $^N) =~ s/\w+//g; (lc $N eq reverse lc $N) ? '' : (?!) })
    which might be further streamlined to
    (??{ local $N; ($N = $^N) =~ s/\w+//g; (?!) if (lc $N ne reverse lc $N) })
    I think length $^N > 1 is superfluous, as length $^N is sufficient as a test, and (.+) would always be positive anyway (or is there a zero length character that would match?)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      (?!) explained:

      From perlre: "A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar"." So in other words, (?!) is a negative lookahead assertion that "must not match nothing" (an impossibility).

      YAPE::Regex::Explain describes it simply like this:

      perl -MYAPE::Regex::Explain -e "printYAPE::Regex::Explain->new( qr/(?! +)/ )->explain();" The regular expression: (?-imsx:(?!)) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- (?!) fail ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------

      Dave

        "must not match nothing"
        I understand that at some level, but my brain is yelling "double negative!!". It might be better to say:
        (?!) is the negative look-ahead assertion for the null string. Since the null string matches anything and everything (including nothing), (?!) always fails.
        [I was tempted to find a symbolic representation for the null string, but I thougt that would only confuse the issue.]

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      Regarding the length $^N > 1 test, I added that because while refining this RE I discovered that it was capable of considering a single-character string to be palindromic (new word?), since "a" in reverse is still "a".

      Update: Now that I think about it, the length $^N > 1 test is better written as ..+ earlier in the RE, like this:

      qr/(..+) (??{ ( lc $^N eq reverse lc $^N ) ? '' : '(?!)' }) /ix

      ...thus forcing at least two characters to be tested in the first place, rather than hand-coding a test for two or more characters via 'length'. Duh!


      Dave

        It must be the late hour, because I missed that (.+) would give trivial palindromes of length 1.

        I prefer (.{2,}) over (..+), as those dots tend to fade into the background.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      As you point out, the regexp engine is not re-entrant. But tr/// isn't a regular expression, and that can be used to strip away whitespace.

      use strict; use warnings; use vars qw/ $N /; my $string = "God dog"; my $re = qr/^ (..+) (??{ local $N = lc $^N; $N =~ tr!\t \f!!d; ( $N eq reverse $N ) ? '' : '(?!)' }) $/ix; print "$string is", ( $string =~ $re ) ? "" : "n't" , " a palindrome\n";

      Dave

        Does this find all non-trivial palindrome phrases on each input line?

        Does it fail on some interesting input?

        #!/your/perl/here # find all palindrome phrases on each line # only phrases of two or more alpha # a phrase starts and ends on a word boundary # # nested palindrome phrases will also be found # (including single words) use strict; use warnings; our $N; my $re = qr/( # start $1 \b # left word edge ([a-z].*[a-z]) # at least 2 alpha \b # right word edge (??{ # start code local $N = lc $^N; # save capture $N =~ tr!a-zA-Z!!dc; # remove non-alpha # fail if not pal '(?!)' if (lc($N) ne reverse lc($N)); }) # end code ) # end $1 /ix; while (<DATA>) { my $found; while ( /$re/g ) { print "line $.:\n" unless $found; pos = pos() - length($1); # find nested pals print "(",pos,") \"$1\"\n"; $found = 1; # \n between groups pos = pos() + 1; # advance one } print "\n" if $found; } exit; __DATA__ god dog alpha beta gamma stop pots wonka wonka wonka bookkeeper raisinhead what is a bookkeeper pop repeekkoob tomorrow? boob tube A man, a plan, a canal, Panama! kook peep aha aba abba abbba aabbbaa abbba abba aba
        produces
        __OUTPUT__ line 1: (0) "god dog" line 2: (17) "stop pots" line 4: (10) "bookkeeper pop repeekkoob" (21) "pop" line 5: (0) "boob" (10) "A man, a plan, a canal, Panama" (42) "kook" (47) "peep" (52) "aha" line 6: (0) "aba abba abbba aabbbaa abbba abba aba" (4) "abba abbba aabbbaa abbba abba" (10) "abbba aabbbaa abbba" (17) "aabbbaa" (25) "abbba" (31) "abba" (36) "aba"
        BTW, if '(?!)' is changed to (?!) (no single quotes), I get a "panic: top_env" from the compiler. Is this a known bug?

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://515124]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (6)
As of 2014-04-24 10:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (565 votes), past polls