
Re^3: regex search for words with one digit (updated)

by AnomalousMonk (Bishop)
on Sep 21, 2020 at 18:20 UTC

in reply to Re^2: regex search for words with one digit
in thread regex search for words with one digit

Why has everybody else kept the \b?

I can't answer for others, but for me, throwing in boundary assertions like this is a reflexive, defensive (and possibly cargo-cultish) move I tend to use when I'm dealing with "words".

The string the Anonymous Monk gives as an example is fairly straightforward: it's delimited by whitespace and the beginning and end of the string. As you say, \b will not help here (update: No! See tybalt89's reply), though it does no harm.

Unfortunately, "words" can be tricky. Is "word's" one word or two? If it's supposed to be one word, then Anonymous Monk's /\b\w*\d\w*\b/ or /\w*\d\w*/ or, I think, any of the other solutions I've seen so far will fail to match it with or without \bs. Words like "t'other", "wouldn't've", "words'" or "left-handed" can be difficult to deal with. I'm sure one could give many other examples, and that's just in English!
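
A quick sketch of why such words are awkward, assuming a naive /\b\w+\b/ tokenizer (an illustration only, not code from the thread): the apostrophe and the hyphen are not \w characters, so \b finds a boundary on either side of them and the "word" falls apart.

```perl
use strict;
use warnings;

# Hypothetical sample text built from the examples above.
my $text = "wouldn't've left-handed";

# \w does not include ' or -, so each of them splits a "word" into pieces.
my @pieces = $text =~ /\b\w+\b/g;

print join('|', @pieces), "\n";   # wouldn|t|ve|left|handed
```

Whether five pieces or two words is the right answer depends entirely on what the data is supposed to mean.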

In general, I think (?<! \S) and (?! \S) would serve better than \b as word boundaries (update: in the OPed case). But once again, it is unfortunately true that there are few generalities in human language.
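
As a rough comparison of the two kinds of boundary, here is a sketch that applies the one-digit pattern from this thread with each of them. The token 'a1-b1' in the sample string is a made-up addition, not part of the original data, and the /x flag is what allows the spaces inside the lookarounds.

```perl
use strict;
use warnings;

# Sample names from this thread, plus a hypothetical hyphenated token 'a1-b1'.
my $text = "John P5ete a1-b1 Nic4k";

# Exactly one digit among non-digit word characters, bounded by \b.
my @with_b = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;

# Same core pattern, bounded by "no non-whitespace before or after".
# Under /x the spaces inside the pattern are insignificant.
my @with_ws = $text =~ /(?<! \S) [^\W\d]*\d[^\W\d]* (?! \S)/xg;

print "\\b:          @with_b\n";    # P5ete a1 b1 Nic4k
print "lookarounds: @with_ws\n";   # P5ete Nic4k
```

With \b the hyphenated token is treated as two words and both halves match; with the whitespace lookarounds it is treated as one word, which then fails the pattern as a whole.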

Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^4: regex search for words with one digit
by tybalt89 (Prior) on Sep 21, 2020 at 19:37 UTC

    Once you add in the restriction of "only one digit", the \b is required.

    my $text  = "John P5ete Andrew Richard58 Nic4k Le7on5";
    my @names = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;
    print "@names\n";


    P5ete Nic4k

    but without the \b's

    my $text  = "John P5ete Andrew Richard58 Nic4k Le7on5";
    my @names = $text =~ /[^\W\d]*\d[^\W\d]*/g;
    print "@names\n";

    it outputs

    P5ete Richard5 8 Nic4k Le7on 5

    It's pulling patterns out of the middle of "words".

      Ah — I completely missed this point! Maybe a bit of occasional cargo-culting isn't entirely bad. :)

      I still think I would use (?<! \S) (?! \S) as boundary assertions instead of \b though, to properly handle "words" like 'a1-b1'. Anon gives no hint that such things may appear in the data, but an abundance of caution inclines me this way. More reflex defensiveness I guess.

      Give a man a fish:  <%-{-{-{-<

        Anon gives no hint that such things may appear in the data

        Furthermore, Anon gives no hint whether they should be treated as one word or two if they did. It's great to have both tools in the box but blindly applying one or other in the absence of a relevant spec becomes a guessing game.


      Very instructive! I hadn't looked closely at your code, embarrassingly, and this actually happens to be the first example I can remember where a \b is actually required. And I really can't think up an alternative without it. (It's not that I somehow don't like it, just that I've never had to use it.)

      On reflection, I would perhaps have used split and then something with grep {/^...$/} -- much clumsier and also amounting to the same thing in disguise, namely string anchors. I never realised that \b is a kind of ^ and $, but within the text.

      (The point I was trying to make -- only half convincingly -- was that I think it's a good habit not to throw in things "for good measure" and move on as soon as it works, because that approach doesn't teach you a lot, and you sometimes walk away with a wrong or (worse!) fuzzy impression of what actually did the trick. Or both. I have experienced this a thousand times.)
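
The split-and-grep alternative mentioned in the last reply might look roughly like this. It is a sketch that reuses the sample string and character classes from tybalt89's code; the exact pattern inside the grep is an assumption, since the reply elides it.

```perl
use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# split ' ' breaks the string on runs of whitespace;
# ^ and $ then anchor the one-digit pattern to each whole token.
my @names = grep { /^[^\W\d]*\d[^\W\d]*$/ } split ' ', $text;

print "@names\n";   # P5ete Nic4k
```

Because every token is tested as a whole, this behaves like the whitespace-lookaround approach rather than like \b: a token such as 'a1-b1' would be rejected outright instead of yielding its two halves.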
