
Re^2: Regex expression to match...

by Sewi (Friar)
on Jun 17, 2011 at 07:16 UTC (#910080)


in reply to Re: Regex expression to match...
in thread Regex expression to match...

Match a word if it follows itself:
/\b(\w+)\W\1\b/
Correct the problem (remove the second word):
s/\b(\w+)\W\1\b/$1/g

\b matches a word boundary. It is zero-width (it consumes no characters) and sits on both sides of a word. In the words "|sides| |of| |a| |word|", each | shows the position of a \b.
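
To see those positions for yourself, a small sketch like the following may help. It simply substitutes a | at every \b; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = 'sides of a word';

# \b is zero-width, so the substitution inserts a marker at each
# boundary without removing any character of the original text.
(my $marked = $text) =~ s/\b/|/g;

print "$marked\n";    # |sides| |of| |a| |word|
```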

The regex matches a word boundary (there is also one before the first and after the last word in a string or on a line), followed by a word (\w+) made of one or more word characters ([a-zA-Z0-9_] for plain ASCII text), followed by a single non-word character (\W), which is anything except the characters just listed.

You probably know $1, which holds whatever the first ( ) captured in the last successful match. \1 is the same thing, but usable inside the regex itself: it stands for the word that (\w+) has just matched.
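
A minimal sketch of the difference, assuming a made-up sample line: \1 is used inside the pattern, while $1 is used after the match (or in the replacement part of s///).

```perl
use strict;
use warnings;

my $line = 'this is is a test';

# Inside the pattern, \1 must match the same text that (\w+) captured.
my ($word) = $line =~ /\b(\w+)\W\1\b/;
print "doubled word: $word\n";    # doubled word: is

# In the replacement, $1 holds that captured word.
(my $fixed = $line) =~ s/\b(\w+)\W\1\b/$1/g;
print "$fixed\n";                 # this is a test
```

Note that the "is" inside "this" is not picked up, because there is no \b in the middle of a word.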

The second \b at the end keeps us from matching /in/information/. Without it, (\w+)\W\1 would match there: \w+ captures "in" (which becomes \1), \W matches the /, and \1 then matches the "in" at the start of "information". With the trailing \b, the match fails because the "in" of "information" is followed by another word character instead of a word boundary.
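
Here is a quick check of that case, as a sketch with illustrative variable names:

```perl
use strict;
use warnings;

my $path = '/in/information/';

# Without the trailing \b, "in" followed by "/in" (the start of
# "information") counts as a repeated word.
my $without_b = ($path =~ /\b(\w+)\W\1/)   ? 1 : 0;

# With the trailing \b, the second "in" must end at a word boundary.
my $with_b    = ($path =~ /\b(\w+)\W\1\b/) ? 1 : 0;

print "without trailing \\b: $without_b\n";    # 1 (false positive)
print "with trailing \\b:    $with_b\n";       # 0
```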

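The first regex can be used on your whole text; there is no need to split it into lines.
The second regex removes duplicate words, converting /help/one/one/something to /help/one/something. Note that each match only collapses one pair, and /g does not rescan the replaced text, so three copies in a row (/help/one/one/one/something) still leave two behind. To collapse a whole run in one pass, repeat the second half: s/\b(\w+)(?:\W\1\b)+/$1/g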
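
One thing worth testing is a word that appears three times in a row. As far as I can tell, a single s///g pass with the pair pattern collapses only the first pair, because the search continues after the replaced text. The sketch below compares it with a variant that repeats the "separator plus same word" part; that variant is my suggestion, not part of the regex shown above.

```perl
use strict;
use warnings;

my $path = '/help/one/one/one/something';

# Pair pattern: "one/one" becomes "one", then matching resumes
# after it, so the third "one" has no partner left.
(my $pairs = $path) =~ s/\b(\w+)\W\1\b/$1/g;

# Run pattern: (?:\W\1\b)+ swallows every further copy at once.
(my $runs = $path) =~ s/\b(\w+)(?:\W\1\b)+/$1/g;

print "$pairs\n";    # /help/one/one/something
print "$runs\n";     # /help/one/something
```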

Speaking of texts, some suggestions:

  • add a "+" after \W (giving \W+) to also match "text, text", where more than one non-word character separates the two words
  • replace \W by \/ to match only a / instead of any non-word character
  • replace \w by [^\/] to allow anything except a / between two slashes, but remember to also replace \b by \/, because \b no longer marks the edges of what you are matching
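
The three suggestions could look roughly like this. Treat it as a sketch: the sample strings are invented, and the third pattern is only one way to read "replace \b by \/" (it consumes the surrounding slashes and puts them back in the replacement).

```perl
use strict;
use warnings;

# 1) \W+ allows several separator characters between the two copies.
(my $prose = 'some text, text here') =~ s/\b(\w+)\W+\1\b/$1/g;
print "$prose\n";    # some text here

# 2) \/ restricts the separator to a single slash.
(my $slash = '/help/one/one/something') =~ s/\b(\w+)\/\1\b/$1/g;
print "$slash\n";    # /help/one/something

# 3) [^/]+ accepts anything but a slash (spaces included), with
#    literal slashes taking over the job of \b on both ends.
(my $any = '/help/my dir/my dir/x') =~ s{/([^/]+)/\1/}{/$1/}g;
print "$any\n";      # /help/my dir/x
```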
