PerlMonks  

Prog. Perl 3rd Ed. Regex Question

by makar (Novice)
on Feb 13, 2004 at 04:55 UTC ( [id://328729]=perlquestion )

makar has asked for the wisdom of the Perl Monks concerning the following question:

I am sitting here reading through Chapter 5 of Programming Perl, and a bit on p. 203 is confusing me. Maybe.

$_ = "Paris in THE THE THE THE spring";
1 while s/\b(\w+) \1\b/$1/gi;
The regex refers back to one from earlier in the chapter, a duplicate-word killer. After playing around with test patterns and test regex searches, I am not sure I really get what's going on above (perhaps it is just a lot to hit a weekend programmer like myself with at once ^_^).

Here is what I think is going on. The while loop has a 1 as its body because all the interesting stuff happens in the logical test. \b(\w+) \1\b matches a word boundary followed by one or more word characters followed by a space and whatever matched the (\w+). The match of (\w+) replaces the whole match. I think I kinda get it.
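For anyone following along, the loop can be unrolled to watch each pass. This is only a minimal sketch: the variable names $text, @passes and $count are made up for illustration and are not from the book. It relies on s///g returning the number of substitutions it made (false when it made none).

```perl
use strict;
use warnings;

my $text = "Paris in THE THE THE THE spring";
my @passes;   # one entry per pass that changed something

# s///g returns the number of substitutions made, or false when there were
# none, so the loop ends on the first pass that finds no doubled word.
while (my $count = ($text =~ s/\b(\w+) \1\b/$1/gi)) {
    push @passes, "$count substitution(s): $text";
}

print "$_\n" for @passes;
print "final: $text\n";
```

A single /g pass is not enough here because each match consumes both copies of the word, so four copies only collapse to two on the first pass. That is why the book wraps the substitution in "1 while".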

But were you to tell me to make a regex for that match, I'd have done this:

/(\b\w+)\1\b/
Unfortunately I'm not really good enough to divine the difference between the two. Also, I don't see why the book's substitution doesn't eat up the preceding space, turning "Paris THE THE" into "ParisTHE". I hope I have a real conceptual problem here, and not just a brain fart, because I'll be feeling really sheepish if it is indeed a brain fart. Thanks in advance!

Replies are listed 'Best First'.
Re: Prog. Perl 3rd Ed. Regex Question
by NetWallah (Canon) on Feb 13, 2004 at 05:16 UTC
    You almost got the explanation right.

    What the book's RE is matching is:
    a word boundary, followed by a word of one or more characters,
    followed by a space,
    followed by the same word,
    followed by a word boundary.

    Your RE is missing the SPACE after the word found, and it tries to capture the word boundary '\b' inside the parens, which is meaningless.
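The difference between the two patterns shows up when both are run against the same strings. The following is a rough sketch, not anything from the book: the names $book, $question and $fixed, and the test string "Paris THETHE spring", are invented for the comparison.

```perl
use strict;
use warnings;

my $book     = qr/\b(\w+) \1\b/i;   # the book's pattern: word, space, same word
my $question = qr/(\b\w+)\1\b/i;    # the pattern from the question: no space

for my $s ("Paris THE THE spring", "Paris THETHE spring") {
    for my $pair (["book", $book], ["question", $question]) {
        my ($name, $re) = @$pair;
        my $hit = $s =~ $re ? "matched '$&'" : "no match";
        print "$name pattern on '$s': $hit\n";
    }
}

# The match begins at the \b after "Paris ", so that space is never part of
# the matched text and the substitution cannot remove it.
(my $fixed = "Paris THE THE spring") =~ s/$book/$1/g;
print "$fixed\n";
```

Without the space, the question's pattern only finds a word glued directly to a copy of itself, as in "THETHE", and finds nothing in "THE THE".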

    "When you are faced with a dilemma, might as well make dilemmanade. "
      Thank you for the help. I am, however, still not sure why the \b inside the parens is meaningless.
        In the case of this particular regex, the parens are used specifically (and only) for capturing.

        \b is a zero-width assertion. It doesn't match a space, or a character of some sort. It matches at a position where a word character sits next to a nonword character (or next to the start or end of the string).

        Therefore, there isn't anything in particular to capture when talking about \b. So it's probably not really accurate to say that \b inside parens is meaningless, because it has just as much meaning as it would have outside the parens.

        But that's the crux of it: whether \b sits inside or outside the parens is unimportant in this case, because its function is the same either way. Relocating it inside the parens is pointless.
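A quick way to check this is to capture with \b on either side of the opening paren and compare. This is a small sketch with an invented test string and invented variable names ($outside, $inside):

```perl
use strict;
use warnings;

my $s = "say hello hello world";

my ($outside) = $s =~ /\b(\w+) \1\b/;   # \b before the capture group
my ($inside)  = $s =~ /(\b\w+) \1\b/;   # \b inside the capture group

# \b contributes no characters, so both captures hold the same text.
print "outside: '$outside', inside: '$inside'\n";
```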

        Update: By the way, I ++ed the original question because even though it was pretty basic, it started out with "I'm sitting here reading the camel book." (paraphrasing)

        Kudos to those who do a little homework themselves and come to SoPW for clarification rather than a spoonfeeding. Very good question, in that context.


        Dave

Re: Prog. Perl 3rd Ed. Regex Question
by japhy (Canon) on Feb 13, 2004 at 06:02 UTC
    You're confused about \b. It is an anchor. It matches a POSITION, it does not actually match any CHARACTERS.
    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: Prog. Perl 3rd Ed. Regex Question
by leriksen (Curate) on Feb 13, 2004 at 07:20 UTC
    Maybe it would help if I told you that regexes can match two quite different things
    • things such as characters e.g. \w, \s, \d etc
    • things between characters, like \b, ^, $ etc

    The things between characters have zero width; characters have width (generally 1, can't think of a wider example, even in Unicode).

    So \b is the zero-width bit between a word character and 'something' that is 'not-word', like a space, or a '-' or a '%' etc.
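The zero width can be seen directly by asking Perl where each \b match starts and ends. This is an illustrative sketch only: the string and the names @positions and @widths are made up, and it uses the built-in @- and @+ arrays, which hold the start and end offsets of the last match.

```perl
use strict;
use warnings;

my $s = "THE spring";
my (@positions, @widths);

# Walk over every place where \b matches and record offset and width.
while ($s =~ /\b/g) {
    push @positions, $-[0];           # where the match starts
    push @widths,    $+[0] - $-[0];   # end minus start: always 0 for \b
}

print "boundaries at offsets: @positions\n";
print "widths: @widths\n";

# For contrast, \w+ consumes real characters.
$s =~ /\w+/;
print "\\w+ matched ", $+[0] - $-[0], " characters\n";
```

The boundaries fall before "T", after "E", before "s" and after "g", and none of them takes up any characters of the string.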

    +++++++++++++++++
    #!/usr/bin/perl
    use warnings;use strict;use brain;

Re: Prog. Perl 3rd Ed. Regex Question
by artist (Parson) on Feb 13, 2004 at 05:23 UTC
    \b is a word boundary, not a space. You can learn more about regexes by putting use re 'debug' at the top of your code.
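As a rough sketch of that suggestion, the pragma can also be limited to one block, since it is lexically scoped. The variable names $text and $found are invented here. The trace goes to STDERR and its exact format varies between perl versions, so no output is shown.

```perl
use strict;
use warnings;

my $text = "Paris in THE THE spring";
my $found;

{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    $found = $text =~ /\b(\w+) \1\b/i ? 1 : 0;
}

print "found a doubled word\n" if $found;
```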
Re: Prog. Perl 3rd Ed. Regex Question
by Anonymous Monk on Feb 13, 2004 at 08:20 UTC
    $ perl -MYAPE::Regex::Explain -le'die YAPE::Regex::Explain->new(qr/(\b\w+)\1\b/)->explain'
    The regular expression:

    (?-imsx:(\b\w+)\1\b)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        \b                       the boundary between a word char (\w)
                                 and something that is not a word char
    ----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                                 more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      \1                       what was matched by capture \1
    ----------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

Node Type: perlquestion [id://328729]
Approved by Paladin