PerlMonks  

Regex expression to match...

by Anonymous Monk
on Jun 17, 2011 at 04:35 UTC (#910068)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
I am looking for a regular expression that matches either of the following:

/help/one/one/one/something_here
/buy/cash/buy/water/water/water/nothing_here

I want to identify the repeating phrase 'one/' or 'water/'.
I set up a loop that breaks down each line and searches for repeats, but I am sure there are folks who can identify a repeat using a single expression. Since I am a beginner, I would be very grateful for some help so I can understand regular expression matching much better.

Thank you!

Re: Regex expression to match...
by Anonymous Monk on Jun 17, 2011 at 04:49 UTC
    Great, first post your code :)
Re: Regex expression to match...
by wind (Priest) on Jun 17, 2011 at 04:58 UTC
    The following will match the same word twice in a row:

    use strict;
    use warnings;

    my @strings = qw(
        /help/one/one/one/something_here
        /buy/cash/buy/water/water/water/nothing_here
    );

    for (@strings) {
        # \1 must repeat whatever (\w+) captured
        if (m{\b(\w+)/\1\b}) {
            print "Dup is '$1'\n";
        }
    }
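      Match a word if it follows itself:
      /\b(\w+)\W\1\b/
      Correct the problem (remove the repeated copies):
      s/\b(\w+)(?:\W\1\b)+/$1/g
      The (?:\W\1\b)+ group lets one match cover a run of three or more copies; a plain \W\1 pair would leave one repeat behind in /one/one/one/.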

      \b matches a word break. It is zero-width (it consumes no characters) and lives on both sides of a word. In the words "|sides| |of| |a| |word|", each | shows the position of a \b.
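
To see those positions for yourself, here is a minimal sketch that substitutes a visible marker at every place \b matches. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Insert a visible marker wherever \b matches.
# \b is zero-width, so the substitution only adds characters.
my $text = 'sides of a word';
(my $marked = $text) =~ s/\b/|/g;

print "$marked\n";    # |sides| |of| |a| |word|
```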

      The regex matches a word break (which also occurs before the first and after the last word in a string or on a line), followed by a word (\w+) made of one or more characters from [a-zA-Z0-9_], followed by a non-word character (\W), which is any character not in that set.

      You know $1, which holds the contents of the first ( ) from the last successful match? \1 is the same thing, but it is used inside the current regex and holds the word we just found with \w+.
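
A small sketch of the difference, using one of the paths from the question: two independent groups accept any two adjacent words, while \1 insists on a repeat of the first capture. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $path = '/buy/cash/buy/water/water/water/nothing_here';

# Two independent groups accept any two adjacent words.
my ($first, $second) = $path =~ m{\b(\w+)/(\w+)\b};
print "Any two words: $first, $second\n";    # buy, cash

# \1 inside the pattern must repeat whatever the first ( ) captured.
my $same;
if ($path =~ m{\b(\w+)/\1\b}) {
    $same = $1;    # after the match, $1 holds the captured word
    print "Same word twice: $same\n";    # water
}
```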

      Adding another word break at the end keeps us from matching /in/information/. Without it, (\w+)\W\1 would match there: the first "in" is captured by (\w+), the / matches \W, and the "in" at the start of "information" matches \1, because \1 holds the "in" captured by \w+.
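
Here is a quick way to check that claim, as a sketch: the same pattern is run with and without the trailing \b against the /in/information/ example.

```perl
use strict;
use warnings;

my $path = '/in/information/';

# Without the trailing \b, \1 is satisfied by the "in" that starts "information".
my $without = $path =~ m{\b(\w+)\W\1}   ? 1 : 0;

# With the trailing \b, the repeated text must end at a word break, so nothing matches.
my $with    = $path =~ m{\b(\w+)\W\1\b} ? 1 : 0;

print "Without trailing \\b: ", ($without ? "match" : "no match"), "\n";    # match
print "With trailing \\b:    ", ($with    ? "match" : "no match"), "\n";    # no match
```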

      The first regex can be used on your whole text; there is no need to split it into lines.
      The second regex removes any duplicate words, converting /help/one/one/one/something to /help/one/something
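
A sketch of the substitution on that path. It compares a pair-wise pattern with a variant that wraps the separator and backreference in a (?:...)+ group. The grouped variant is used here because, under /g, a pair-wise match consumes two copies at a time and leaves one repeat behind when a word occurs three times.

```perl
use strict;
use warnings;

my $path = '/help/one/one/one/something';

# Pair-wise: each match covers exactly two copies, and /g resumes after the match.
(my $pairwise = $path) =~ s/\b(\w+)\W\1\b/$1/g;

# Grouped: one match covers the whole run of copies.
(my $collapsed = $path) =~ s/\b(\w+)(?:\W\1\b)+/$1/g;

print "$pairwise\n";     # /help/one/one/something
print "$collapsed\n";    # /help/one/something
```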

      Speaking of texts, some suggestions:

    • you may want to add a "+" after \W, so that "text, text" (comma plus space) also matches
    • replace \W with \/ to match only / instead of every non-word char
    • replace \w with [^\/] to allow anything between two slashes that is not itself a /, but remember to also replace \b with \/, because \b no longer marks the edges once the segment can contain non-word chars
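
The last two suggestions could be combined roughly like this. It is only a sketch: the helper name find_dup_segment is made up, and the (?=/|\z) lookahead is my own assumption for the closing boundary, so that the trailing / is not consumed and a duplicate at the very end of the path still counts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first path segment that is immediately
# repeated, or undef. Segments may contain anything except '/'.
sub find_dup_segment {
    my ($path) = @_;
    return $path =~ m{/([^/]+)/\1(?=/|\z)} ? $1 : undef;
}

for my $path ('/docs/my file/my file/readme', '/in/information/', '/a/dup/dup') {
    my $dup = find_dup_segment($path);
    print defined $dup ? "Dup in '$path' is '$dup'\n" : "No dup in '$path'\n";
}
```

Using alternate delimiters (m{...}) avoids having to escape every / in the pattern.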
      Using a quantifier can help you avoid repeating the pattern:
      if ($string=~m@(\w+/){2,}@) { print "Dup is $1\n"; }
        Quantifiers used in that way do not do what you think they will:
        use strict;
        use warnings;

        my @strings = qw(
            /help/one/one/one/bar/something_here
            /buy/cash/buy/water/water/water/baz/nothing_here
        );

        for (@strings) {
            if (m{(\w+/){2,}}) {
                print "Dup is '$1'\n";
            }
        }

        # prints:
        #   Dup is 'bar/'
        #   Dup is 'baz/'

        Even if it did work, you'd still need to add boundary conditions so that a subdirectory whose name is a suffix of the previous directory's name wouldn't match. Also, if the duplicate is at the end of the path, it won't have a trailing /.
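
Both conditions can be checked with a short sketch. The helper names and test paths below are invented for illustration; the anchored pattern is the \b(\w+)/\1\b form from earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the repeated word, or undef.
sub unanchored_dup { my ($p) = @_; return $p =~ m{(\w+)/\1\b}   ? $1 : undef; }
sub anchored_dup   { my ($p) = @_; return $p =~ m{\b(\w+)/\1\b} ? $1 : undef; }

# "bar" is a suffix of "foobar": only the unanchored pattern is fooled.
my $suffix_path = '/foobar/bar/x';
print "unanchored: ", unanchored_dup($suffix_path) // 'none', "\n";    # bar
print "anchored:   ", anchored_dup($suffix_path)   // 'none', "\n";    # none

# A duplicate at the end has no trailing '/', but \b still matches there.
print "at end:     ", anchored_dup('/a/dup/dup') // 'none', "\n";      # dup
```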

        Don't worry, at one point I also thought a quantifier should work that way, but that's specifically why \1 is allowed in the LHS (the pattern side of the match).

Re: Regex expression to match...
by planetscape (Canon) on Jun 17, 2011 at 05:05 UTC

Node Type: perlquestion [id://910068]
Approved by planetscape