
Regex expression to match...

by Anonymous Monk
on Jun 17, 2011 at 04:35 UTC (#910068=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all
I am seeking knowledge for a regular expression to match either of the following:


I want to identify the repeating phrase 'one/' or 'water/'
I set up a loop that would break down each line and search for repeats, but I am sure there are folks who can identify a repeat using a single expression. Since I am a beginner, I would be very grateful for some help so I can understand regular expression matching much better.

Thank you!

Re: Regex expression to match...
by wind (Priest) on Jun 17, 2011 at 04:58 UTC
    The following will match two words in a row:
    use strict;
    use warnings;

    my @strings = qw(
        /help/one/one/one/something_here
        /buy/cash/buy/water/water/water/nothing_here
    );

    for (@strings) {
        if (m{\b(\w+)/\1\b}) {
            print "Dup is '$1'\n";
        }
    }
      Match a word if it follows itself:
      Correct the problem (remove the second word):

      \b matches a word boundary. It is zero-width (it consumes no characters) and lives on both sides of a word. In the words "|sides| |of| |a| |word|", each | shows the position of a \b.
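      To see those positions for yourself, here is a small sketch that inserts a | wherever \b matches. The helper name mark_boundaries is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a '|' at every position where \b matches.
# \b is zero-width, so the substitution only adds characters and removes none.
sub mark_boundaries {
    my ($text) = @_;
    (my $marked = $text) =~ s/\b/|/g;
    return $marked;
}

print mark_boundaries("sides of a word"), "\n";   # |sides| |of| |a| |word|
```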

      The regex matches a word boundary (there is also one before the first and after the last word in a string or on a line), followed by a word (\w+) made of one or more of [a-zA-Z0-9_], followed by a single non-word character (\W), which is anything except the characters just listed.

      You know $1, which holds the contents of the first ( ) of the last successful match? \1 is the same thing, but usable inside the current regex: it holds the word we just captured with \w+.

      Adding another word boundary at the end keeps us from matching /in/information/. Without it, (\w+)\W\1 would match there: \w+ captures "in" into \1, \W matches the /, and \1 then matches the "in" at the start of "information".
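      A quick way to check this is to run both variants against /in/information/. This is only a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $path = '/in/information/';

# Without the trailing \b, "in" pairs up with the first two letters of "information".
my $loose  = $path =~ /\b(\w+)\W\1/   ? $1 : undef;

# With the trailing \b, the backreference has to end at a word boundary.
my $strict = $path =~ /\b(\w+)\W\1\b/ ? $1 : undef;

print "loose:  ", (defined $loose  ? "matched '$loose'"  : "no match"), "\n";
print "strict: ", (defined $strict ? "matched '$strict'" : "no match"), "\n";
```

      The first line should report a match on 'in', the second no match.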

      The first regex can be used on your whole text; there is no need to split it into lines.
      The second regex removes duplicate words, converting /help/one/one/one/something to /help/one/something.
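      The two snippets this explanation refers to are not shown in the thread, so what follows is a reconstruction from the description, not the poster's original code. The match uses \b(\w+)\W\1\b exactly as described. For the substitution I have assumed a (?: ... )+ group around the repeated part so that a run of three or more collapses in a single pass.

```perl
use strict;
use warnings;

# ASSUMPTION: rebuilt from the prose explanation, not the original post.
my $dup_re = qr/\b(\w+)\W\1\b/;

my $text = '/help/one/one/one/something';

# Match a word if it follows itself.
my $dup = $text =~ $dup_re ? $1 : undef;
print "Dup is '$dup'\n" if defined $dup;

# Remove the repeats. The (?:...)+ group is my addition so that
# "one/one/one" becomes "one" in one pass instead of "one/one".
(my $fixed = $text) =~ s/\b(\w+)(?:\W\1\b)+/$1/g;
print "$fixed\n";
```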

      Speaking of texts, some suggestions:

    • you may want to add a "+" after \W, which would also match "text, text"
    • replace the \W by \/ to match only / instead of every non-word char
    • replace \w by [^\/] to allow anything between two / that is not itself a /, but remember to also replace \b by \/, because \b will no longer work
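      As a rough sketch of the last two suggestions combined: the pattern below anchors on / instead of \b and allows any non-slash characters in a component. The helper name repeated_component and the trailing lookahead (?=/|\z) are my own choices; the lookahead stands in for the closing / so that a duplicate at the very end of the path still matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first path component that is immediately
# repeated, or undef if there is none. A component is any run of non-'/' chars.
sub repeated_component {
    my ($path) = @_;
    # Leading '/' replaces the leading \b; the lookahead replaces the trailing \b.
    return $path =~ m{/([^/]+)/\1(?=/|\z)} ? $1 : undef;
}

for my $path ('/my-dir/my-dir/file.txt', '/a.b/a.bc/x', '/x/y/y') {
    my $dup = repeated_component($path);
    print "$path: ", (defined $dup ? "dup is '$dup'" : "no dup"), "\n";
}
```

      Because the components no longer have to be \w characters, names such as my-dir are handled too.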
      Using a quantifier can help you avoid repeating the pattern:
      if ($string=~m@(\w+/){2,}@) { print "Dup is $1\n"; }
        Quantifiers used in that way do not do what you think they will:
        use strict;
        use warnings;

        my @strings = qw(
            /help/one/one/one/bar/something_here
            /buy/cash/buy/water/water/water/baz/nothing_here
        );

        for (@strings) {
            if (m{(\w+/){2,}}) {
                print "Dup is '$1'\n";
            }
        }

        =prints
        bar/
        baz/
        =cut

        Even if it did work, you'd still need to add boundary conditions so that a subdirectory that is a suffix of the previous directory wouldn't match. Also, if the duplicate is at the end, it wouldn't have a trailing /.

        Don't worry, at one point, I also thought a quantifier should work in that way, but that's specifically why they allow \1 in the LHS.
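        For what it's worth, a quantifier does become useful once it is applied to the backreference instead of to \w+. The sketch below is one way to do that; the helper name first_run is invented here, and the \b on both sides covers the boundary cases mentioned above, including a duplicate at the end with no trailing /.

```perl
use strict;
use warnings;

# Hypothetical helper: return (word, count) for the first run of repeated
# path components, or an empty list if nothing repeats.
sub first_run {
    my ($path) = @_;
    # $1 is the word; $2 holds every further "/word" copy that follows it.
    if ($path =~ m{\b(\w+)((?:/\1\b)+)}) {
        my ($word, $rest) = ($1, $2);
        my $count = 1 + ($rest =~ tr{/}{});   # tr with an empty replacement just counts
        return ($word, $count);
    }
    return;
}

for my $path (qw(
    /help/one/one/one/bar/something_here
    /buy/cash/buy/water/water/water/baz/nothing_here
)) {
    my ($word, $count) = first_run($path);
    print "Dup is '$word' ($count times)\n" if defined $word;
}
```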

Re: Regex expression to match...
by planetscape (Chancellor) on Jun 17, 2011 at 05:05 UTC
Re: Regex expression to match...
by Anonymous Monk on Jun 17, 2011 at 04:49 UTC
    Great, first post your code :)

Node Type: perlquestion [id://910068]
Approved by planetscape