Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

regex at word boundary

by mikeraz (Friar)
on Dec 07, 2005 at 17:42 UTC ( #514932=perlquestion: print w/replies, xml ) Need Help??

mikeraz has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a bit to look for palindromes. To search for those with embedded whitespace, like "God dog" I started with this regex:

/(\b([a-zA-Z]).*$2\b)/i;
But it matches any line. For instance:
   Being able to script this in Perl gives a lot more flexibility, and
   is arguably easier to use and extend.
I'm blind as to where the $2 match is occuring in those two lines. There is no word ending in b on the first line. There is no word ending in i on the second.

What is the problem with my perception of the regex?

update: OP used <pre> instead of <code> tags. This confused the issue for the first few replies. Question was really and completely addressed by japhy - I was asking for the smack upside the head to clear out temporary blindness.

Be Appropriate && Follow Your Curiosity

Replies are listed 'Best First'.
Re: regex at word boundary
by japhy (Canon) on Dec 07, 2005 at 17:54 UTC
    Backreferences in a regex are \NUMBER, not $NUMBER.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

    2005-12-12 Retitled by jdporter: s/boundry/boundary/
    Original title: 'regex at word boundry'

      Ow! Ow, ow, ow.

      Thank you. Off to get more coffee.

      Be Appropriate && Follow Your Curiosity
Re: regex at word boundary
by davido (Cardinal) on Dec 07, 2005 at 18:01 UTC

    japhy correctly mentioned that $2 isn't a backreference, \2 is. But there's another problem. If you wish for such a RE to match "God dog", that one won't be selective enough to exclude non-palindromes. This is because it only fails to match if the string you're comparing doesn't have two of the same letter. For example, "Pete" would match, because it has two "e's". But "God" wouldn't match, because it doesn't have two of any particular letter in it.


    Dave

Re: regex at word boundary
by ww (Archbishop) on Dec 07, 2005 at 18:09 UTC
    mikeraz:
    Welcome. Briefly, re your actual question...

     \b marks a word_boundary, not the letter "b" -- except inside a character_class, where it's the metachar for backspace
     /i is a modifier which tells the regex to search in case_INsensitive mode.

    The Monastery does some magic with regard to rendering, so read Perl Monks Approved HTML tags. For example, use &#91; to write an open_square_bracket here; a literal bracket, except inside <code>... </code> or <c>... </c> tags operates in the Monastery's specialized parsing as a link token.

    In short, you need to read how to post here, starting with Welcome to the Monastery and How do I post a question effectively?-- but even before that, you should probably read perlretut and similar regex introductions (including Tutorials#regexdata) as your question suggests you've missed some very basic info.

Re: regex at word boundary
by mikeraz (Friar) on Dec 07, 2005 at 22:06 UTC

    I reviewed the link that swkronenfeld provided. The code samples, or at least as many as I waded through, did not find palindromes. The found mirror strings. I prefer pi is a palindrome. aaabbcbbaaa is not a palindrome.

    After writing to steele1381 and QM about intent and all that I decided I'd better step up and show the code.

    I believe this word end searching should be faster than a pure regex version but have not tested it yet.

    Here's the trivial program I wrote to take a text file and locate all the palindromes. Note: it's trivial and should be easy to break. In particular the punct filter is incomplete, HTML will break it . . .

    #!/usr/bin/perl -w use strict; # find palindromes in text file my ($le, @lines, @F, $test, $pal, %pals, $start_char, $i, $word); # cross line boundries but not paragraph boundries $le = $/; $/ = "\n\n"; @lines = <>; $/ = $le; foreach (@lines) { # strip punct - no posix on my system s/[)(\?".,\/]//g; s/-/ /g; chomp; (@F) = split; while (int @F) { # select array slices where last letter of last word in # slice equals first letter of first word $start_char = lc substr $F[0], 0, 1; foreach $i (1 .. $#F) { if( (lc substr $F[$i], -1) eq $start_char) { # test for palindrome $test = lc join "", @F[0..$i]; if($test eq reverse $test) { $pal = join " ", @F[0..$i]; $pals{$pal}++; } } } # grab single word palindromes $word = shift @F; if(length $word > 2 && lc $word eq lc reverse $word) { $pals{$word}++; } } } foreach $pal (sort keys %pals) { print "$pals{$pal}\t$pal\n"; }
    Be Appropriate && Follow Your Curiosity
Re: regex at word boundary
by QM (Parson) on Dec 07, 2005 at 19:49 UTC
    As others have mentioned, you aren't going to find palindromes with that. Single word palindromes are problematic without some code magic. You really need regex recursion, which is what Regexp::Common::lingua does.

    There are ways to do it with embedded regex ??{code}, without being as fancy as in R::C::l, but I'd go with R::C::l. If you really want to do it yourself, it's easy enough to Google for these, so I'll leave that up to you.

    As far as I know, there's not an easy way to skip over whitespace, without explicitly stripping it out first (but maybe I'm just not being creative enough today).

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

    2005-12-12 Retitled by jdporter: s/boundry/boundary/
    Original title: 'regex at word boundry'

      You're right about the (??{code}) construct facilitating the use of regular expressions in detecting palindromes. Behold:

      use strict; use warnings; my $string = "God dog"; my $re = qr/^ (.+) (??{ ( length $^N > 1 and lc $^N eq reverse lc $^N ) ? '' : '1\b2' }) $/ix; print "$string is", ($string =~ $re) ? "" : "n't", " a palindrome.\n";

      One trick here; since (??{code}) must spawn a regular subexpression that will then be evaluated for truth, the code block I used plays a little trick. If the boolean test of reversability succeeds, the (??{code}) expression returns an empty string (which, in other words, adds no additional criteria to be matched). If the reversability test fails, the code returns a regular subexpression that can never succeed; a search for '1\b2' (in other words, the literal number one, followed by the literal number two, but with a word boundry sandwiched between; an impossibility). That way the outcome of the reversability test can force a failure to match for the entire regular expression.

      Also, the use of $^N is a convenience described in perlvar and perlre. It contains the most recent parenthetical capture. That way you don't need to count capturing parens, not that it's an issue in this case.

      By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not. Why? Wikipedia says, "A palindrome is a word, phrase, number or any other sequence of units (like a strand of DNA) which has the property of reading the same in either direction..." It also says that the position of spaces can usually be adjusted as necessary. My regexp doesn't accomodate that possibility; it treats space like any other character. But why not; it's just a proof of concept. ;)


      Dave

        Very good.

        '1\b2'
        I would point out the existence of (?!), which I'm told always fails, though I'm not sure I understand it.

        Update 1:

        By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not.
        Yes. The OP was looking for multi-word palindromes, perhaps more along the lines of the "interesting to humans" variety, which seems to be what started the thread in the first place.

        Update 2: After further examination, your idea could be adapted for intervening whitespace (or indeed any noise characters) if the regex engine was re-entrant (is that with or without the hyphen?). Something like:

        (??{ local $N; ($N = $^N) =~ s/\w+//g; (lc $N eq reverse lc $N) ? '' : (?!) })
        which might be further streamlined to
        (??{ local $N; ($N = $^N) =~ s/\w+//g; (?!) if (lc $N ne reverse lc $N) })
        I think length $^N > 1 is superfluous, as length $^N is sufficient as a test, and (.+) would always be positive anyway (or is there a zero length character that would match?)

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      You're right, the (corrected) code won't find palindromes. It wasn't intended to. It was the first step in finding a palindrome candidate. After the candidate is found I can strip out spaces and check to see if the word group does qualify as a palindrome.

      Ultimately I don't care about finding palindromes. It is just a little puzzle to apply some time to. Some people do soduko puzzles, some crosswords, I look for little problems that I don't know the answer to and explore the possibilities.

      Be Appropriate && Follow Your Curiosity
        You're right, the (corrected) code won't find palindromes. It wasn't intended to. It was the first step in finding a palindrome candidate.
        OK, fair enough. So what did you want it to do? Did you just want to check for repeated characters at the proper end of word boundaries?

        [Insert "Lightbulb Over Head" graphic here ]

        You just want a filter to grab candidates, then strip whitespace, then test for palindrome-ness.

        I think you'd do us a favor by coming up with an appropriate regex with ??{code}, something near:

        /(\b (([a-z]) # at least one char ?{local @suffix; push @suffix, $^N}) # build up suffix (([a-z) ?{local @suffix; push @suffix, $^N} | \S+)+ [a-z]? # possible odd char (([a-z]) (?{if ($^N eq $suffix[-1]) {pop @suffix}}) | \S+)+) (([a-z]) (?{if ($^N eq $suffix[-1]) {pop @suffix}}) \b) /x
        Though I don't think that works, and it's untested. (Hey, I have to leave something up to you.) Besides, there's a lot left out of the perlre with ?{code} and ??{code}, so please come back and tell us what it should say :)

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      I came up with this recursive-regex code. It's a bit simpler than most of the others posted here. It finds the leftmost longest palindrome in each line.
      my $palin_re; $palin_re = qr/(([a-z]) # First letter [^a-z]* # any irrelevant chars (?:(??{$palin_re})|[a-z]?) # The interior [^a-z]* # any other irrelevant chars \2) # the last letter matches the first /xi; while (<DATA>) { if (/\b$palin_re\b/) { print "Matched $1\n"; } else { print "No palindrome found in '$_'\n"; } } __DATA__ oo madam in eden, I'm Adam, prince of eternia is god a dog? nested testest detsen nested i prefer pi ip referp nothing to see here, folks.

      Caution: Contents may have been coded under pressure.
Re: regex at word boundary
by QM (Parson) on Dec 10, 2005 at 15:48 UTC
    Is the code at Re^5: regex at word boundary useful? I thought you might compare it with your filter version, and see what falls out. Also, it could be made to search across newlines, but at considerable slowdown if not done carefully.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      The code at that Re^5: regex at word boundary fails on overlapping palindromes, not that I'm aware of any that exist in the wild. Such that adding

      nested testest detsen nested
      i prefer pi ip referp
      
      yeilds:
      
      line 7:
      (0) "nested testest detsen nested"
      (7) "testest detsen nested"
      (15) "detsen nested"
      (22) "nested"
      
      line 8:
      (0) "i prefer pi ip referp"
      (2) "prefer pi ip referp"
      (9) "pi ip referp"
      (12) "ip referp"
      (15) "referp"
      
      I also tested it on a handy text file of 79,569 lines and it ran much slower than the code I listed above, modified to just test on each line, not each paragraph.
      sunorccws04 ~$ time ./mr_pal.pl trf > mr.out
      
      real    1m2.161s
      user    1m1.210s
      sys     0m0.280s
      
      sunorccws04 ~$ time ./qm_pal.pl trf > qm.out
      
      real    2m53.492s
      user    2m49.070s
      sys     0m1.690s
      
      trf is the output of a tcpdump session. Other data sets are sure to produce differing comparative speeds.

      Be Appropriate && Follow Your Curiosity
        Are you comparing apples to apples? Does the other code find overlapping palindromes?

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: regex at word boundary
by swkronenfeld (Hermit) on Dec 07, 2005 at 18:00 UTC
    Edit: removing comments except for link due to <pre> <code> issue.

    Have a look here Try this in Regexp palindrome (found using the super search) for a discussion on this topic.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://514932]
Approved by dsb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (3)
As of 2022-09-25 08:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer my indexes to start at:




    Results (116 votes). Check out past polls.

    Notices?