PerlMonks  

Re: Regex expression to match...

by wind (Priest)
on Jun 17, 2011 at 04:58 UTC (#910073)


in reply to Regex expression to match...

The following will match the same word twice in a row, separated by a slash:

use strict;
use warnings;

my @strings = qw(
    /help/one/one/one/something_here
    /buy/cash/buy/water/water/water/nothing_here
);

for (@strings) {
    if (m{\b(\w+)/\1\b}) {
        print "Dup is '$1'\n";
    }
}


Re^2: Regex expression to match...
by Sewi (Friar) on Jun 17, 2011 at 07:16 UTC
    Match a word if it follows itself:
    /\b(\w+)\W\1\b/
    Correct the problem (remove the second word):
    s/\b(\w+)\W\1\b/$1/g

    \b matches a word boundary. It is zero-width (it consumes no characters) and lives on both sides of a word. In the words "|sides| |of| |a| |word|", each | shows the position of a \b.
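The boundary positions can be made visible with a one-line substitution. This is only a small sketch; mark_boundaries is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Insert a visible "|" at every position where \b matches.
# mark_boundaries is a hypothetical name used only for this illustration.
sub mark_boundaries {
    my ($text) = @_;
    (my $marked = $text) =~ s/\b/|/g;   # \b is zero-width, so nothing is replaced
    return $marked;
}

print mark_boundaries("sides of a word"), "\n";   # |sides| |of| |a| |word|
```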

    The regex matches a word boundary (there is also one before the first and after the last word in a string or on a line), followed by a word (\w+) made of one or more characters from [a-zA-Z0-9_], followed by a single non-word character (\W), which is anything not in that set.

    You know $1, which holds the contents of the first ( ) of the last successful match? \1 is the same idea, but it is used inside the current regex and holds the word that \w+ has just captured.

    Adding another word boundary at the end keeps us from matching /in/information/. Without it, (\w+)\W\1 would match there: \w+ captures "in" into \1, \W matches the /, and \1 then matches the "in" at the start of "information".
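A short check of the /in/information/ case, as a sketch: the two patterns differ only in the trailing \b. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $path = '/in/information/';

# Without the trailing \b, "in" is found again at the start of "information".
my $loose  = $path =~ m{\b(\w+)\W\1}   ? "matched '$1'" : 'no match';

# With the trailing \b, the repeated word must end at a word boundary.
my $strict = $path =~ m{\b(\w+)\W\1\b} ? "matched '$1'" : 'no match';

print "without trailing \\b: $loose\n";    # matched 'in'
print "with trailing \\b:    $strict\n";   # no match
```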

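    The first regex can be used on your whole text; there is no need to split it into lines.
    The second regex removes a duplicated word, converting /help/one/one/something to /help/one/something. A run of three (/help/one/one/one/something) only shrinks to /help/one/one/something in a single pass, because s///g does not rescan text it has already replaced. To collapse a whole run at once, use s/\b(\w+)(?:\W\1\b)+/$1/g.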
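To see how the substitution behaves on runs of different lengths, here is a small sketch. The sub names dedupe_once and dedupe_run are hypothetical, and dedupe_run uses a variant pattern that is an assumption of this example: it wraps the repeat in (?:...)+ so that every further copy is consumed by one match.

```perl
use strict;
use warnings;

# One pass of the substitution: each match removes one repeated copy.
sub dedupe_once {
    my ($text) = @_;
    (my $out = $text) =~ s/\b(\w+)\W\1\b/$1/g;
    return $out;
}

# Hypothetical variant: (?:\W\1\b)+ swallows all further repeats in one match.
sub dedupe_run {
    my ($text) = @_;
    (my $out = $text) =~ s/\b(\w+)(?:\W\1\b)+/$1/g;
    return $out;
}

print dedupe_once('/help/one/one/something'), "\n";       # /help/one/something
print dedupe_once('/help/one/one/one/something'), "\n";   # /help/one/one/something
print dedupe_run('/help/one/one/one/something'), "\n";    # /help/one/something
```

The middle line shows why a triple survives one pass: after "one/one" is replaced, matching resumes after the replaced text, and the remaining "one" is followed by "something", not by another "one".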

    Speaking of texts, some suggestions:

  • you may want to add a "+" after \W so that it also matches "text, text"
  • replace \W with \/ to match only / instead of every non-word character
  • replace \w with [^\/] to allow anything except / between two slashes, but remember to also replace \b with \/, because \b no longer marks the ends of such a name
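The three suggestions can be sketched side by side. The variable names and the dup_in helper are hypothetical, and the sample strings are made up for the illustration.

```perl
use strict;
use warnings;

my $any_sep    = qr{\b(\w+)\W+\1\b};   # "+" after \W: one or more separator chars
my $slash_only = qr{\b(\w+)/\1\b};     # only "/" may separate the two words
my $any_name   = qr{/([^/]+)/\1/};     # anything except "/" between the slashes

# dup_in is a hypothetical helper: returns the repeated text or a placeholder.
sub dup_in {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : '(no match)';
}

print dup_in($any_sep,    'text, text'),        "\n";   # text
print dup_in($slash_only, 'text, text'),        "\n";   # (no match)
print dup_in($slash_only, '/a/b/b/c'),          "\n";   # b
print dup_in($slash_only, '/my-dir/my-dir/x'),  "\n";   # (no match)
print dup_in($any_name,   '/my-dir/my-dir/x'),  "\n";   # my-dir
```

Note that the last pattern, written exactly as suggested, needs a slash after the repeated name, so a duplicate at the very end of a path without a closing slash would not be found.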
Re^2: Regex expression to match...
by dxxd116 (Beadle) on Jun 17, 2011 at 18:15 UTC
    Using a quantifier can help you avoid repeating the pattern:
    if ($string=~m@(\w+/){2,}@) { print "Dup is $1\n"; }
      Quantifiers used in that way do not do what you think they will:
      use strict;
      use warnings;

      my @strings = qw(
          /help/one/one/one/bar/something_here
          /buy/cash/buy/water/water/water/baz/nothing_here
      );

      for (@strings) {
          if (m{(\w+/){2,}}) {
              print "Dup is '$1'\n";
          }
      }

      # Prints:
      #   Dup is 'bar/'
      #   Dup is 'baz/'

      Even if it did work, you'd still need to add boundary conditions so that a subdirectory whose name is only a suffix of the previous directory (for example /foobar/bar/) wouldn't match. Also, if the duplicate is at the end of the path, it wouldn't have a trailing / and would be missed.

      Don't worry, at one point I also thought a quantifier should work that way, but that's specifically why backreferences such as \1 are allowed in the pattern (the LHS).
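The difference between the quantifier and the backreference, and the two boundary problems mentioned above, can be checked with a short sketch. The sub names and the sample paths /a/b/c/d, /foobar/bar/x and /a/b/b are made up for this illustration.

```perl
use strict;
use warnings;

# Quantified group: repeats the PATTERN, so any two segments in a row match,
# and $1 holds only the last repetition.
sub last_group {
    my ($s) = @_;
    return $s =~ m{(\w+/){2,}} ? $1 : '(no match)';
}

# Backreference without boundaries: repeats the matched TEXT, slash included.
sub naive_dup {
    my ($s) = @_;
    return $s =~ m{(\w+/)\1} ? $1 : '(no match)';
}

# Backreference with \b on both sides, slash kept outside the capture.
sub bounded_dup {
    my ($s) = @_;
    return $s =~ m{\b(\w+)/\1\b} ? $1 : '(no match)';
}

print last_group('/a/b/c/d'),        "\n";   # c/  (no duplicate in the path at all)
print naive_dup('/foobar/bar/x'),    "\n";   # bar/  (suffix false positive)
print bounded_dup('/foobar/bar/x'),  "\n";   # (no match)
print naive_dup('/a/b/b'),           "\n";   # (no match): no trailing slash
print bounded_dup('/a/b/b'),         "\n";   # b
```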
