Re^2: regex search for words with one digit

by Bruder Savigny (Initiate)
on Sep 21, 2020 at 20:40 UTC (#11122035)

in reply to Re: regex search for words with one digit
in thread regex search for words with one digit

I guess many of us have asked "silly" questions (I certainly have, and still do). The documentation is often intimidatingly vast. The upshot here is that regex character classes, anchors and the like may overlap, and you unwittingly used a class that was more inclusive than you assumed ("word characters" -- well, who can blame you? I have ignored the 'numeric' part in 'alphanumeric' myself more than once because of the 'word'). You were therefore also unaware that \d is not the complement of \w, but a subset of it. (This is simply not what we intuitively expect.) And, as tybalt89 has already demonstrated, negated character classes ([^ ... whatever ...]) are often massively helpful, especially if you follow Athanasius' suggestion and define the character classes yourself (which makes your own code more transparent to you).
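A quick way to convince yourself of the subset relationship is to test single characters against each class. This is only a minimal sketch; the variable names and the handful of sample characters are my own choice, not anything from the original question.

```perl
use strict;
use warnings;

# Every ASCII digit is also a "word" character, so \d is a subset of \w.
my @digits_in_w = grep { /^\w$/ } 0 .. 9;
print scalar(@digits_in_w), " of 10 digits match \\w\n";

# [^\W\d] reads "neither a non-word character nor a digit",
# which leaves letters and the underscore.
my @kept = grep { /^[^\W\d]$/ } qw(a Z _ 5 -);
print "@kept\n";
```

The second test keeps the two letters and the underscore and drops the digit and the hyphen.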

Guessing from how you've written the regex, it seems to me that the following advice may also be helpful:

Perl does not "understand what you mean", but, like any other programming language, slavishly follows the rules you have given it, and, since rules are rules, does not need to be told things twice. With that in mind, you learn a lot if you try to design regexes (like other code) as "thinly" as possible:

my @names = $text =~ /\w*\d\w*/g;

does exactly the same as your original regex: matching all "words" (in the above definition) which contain at least one digit, somewhere. (Not what you wanted, I know, but it's still instructive.) Why?

Although the \b anchors do match what you mean them to match, they are redundant here: their meaning is "Match a \w\W or \W\w boundary" (man perlre). But your \w* already matches everything that falls under the definition of \w (* is "greedy", as you may know), and it necessarily keeps going until it hits a character that does not, which is precisely the definition of \W. In other words, until it hits "a \w\W boundary". (As \d is a subset of \w, it will never match anything that matches \W either; in other words, the match will stop at a \d\W border if no further \w follows the digit.)
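To check this claim rather than take my word for it, one can run both variants side by side. The sketch below assumes the sample string used elsewhere in this thread; the array names are made up for the comparison.

```perl
use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# "At least one digit" regex, with and without the \b anchors.
my @with_b    = $text =~ /\b\w*\d\w*\b/g;
my @without_b = $text =~ /\w*\d\w*/g;

print "with \\b:    @with_b\n";
print "without \\b: @without_b\n";
```

For this particular pattern both lists come out the same, because the greedy \w* on either side already runs to the edges of each "word".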

(Or am I somehow mistaken? Why has everybody else kept the \b?)

Re^3: regex search for words with one digit (updated)
by AnomalousMonk (Bishop) on Sep 21, 2020 at 22:20 UTC
    Why has everybody else kept the \b?

    I can't answer for others, but for me, throwing in boundary assertions like this is a reflexive, defensive (and possibly cargo-cultish) move I tend to use when I'm dealing with "words".

    The string the Anonymous Monk gives as an example is fairly straightforward: it's delimited by whitespace and the beginning and end of the string. As you say, \b will not help here (update: No! See tybalt89's reply), though it does no harm.

    Unfortunately, "words" be tricky. Is "word's" one word or two? If it's supposed to be one word, then Anonymous Monk's /\b\w*\d\w*\b/ or /\w*\d\w*/ or, I think, any of the other solutions I've seen so far will fail to match it with or without \bs. Words like "t'other", "wouldn't've", "words'" or "left-handed" can be difficult to deal with. I'm sure one could give many other examples, and that's just in English!

    In general, I think (?<! \S) and (?! \S) would serve better than \b as word boundaries (update: in the OPed case). But once again, it is unfortunately true that there are few generalities in human language.

    Give a man a fish:  <%-{-{-{-<

      Once you add in the restriction of "only one digit", the \b is required.

      my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
      my @names = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;
      print "@names\n";


      P5ete Nic4k

      but without the \b's

      my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
      my @names = $text =~ /[^\W\d]*\d[^\W\d]*/g;
      print "@names\n";

      it outputs

      P5ete Richard5 8 Nic4k Le7on 5

      It's pulling patterns out of the middle of "words".

        Ah — I completely missed this point! Maybe a bit of occasional cargo-culting isn't entirely bad. :)

        I still think I would use (?<! \S) (?! \S) as boundary assertions instead of \b though, to properly handle "words" like 'a1-b1'. Anon gives no hint that such things may appear in the data, but an abundance of caution inclines me this way. More reflex defensiveness I guess.
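Here is a rough sketch of the difference between the two kinds of boundary. The input string is hypothetical: I have simply dropped an 'a1-b1' token into a shortened version of the sample data, and the array names are invented for the comparison.

```perl
use strict;
use warnings;

# Hypothetical data containing a hyphenated token.
my $text = "John P5ete a1-b1 Nic4k";

# \b sees a boundary on either side of the hyphen, so it splits 'a1-b1'.
my @with_b = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;

# Whitespace-based boundaries treat 'a1-b1' as one token and reject it.
my @with_ws = $text =~ /(?<! \S) [^\W\d]* \d [^\W\d]* (?! \S)/xg;

print "\\b version:          @with_b\n";
print "lookaround version:  @with_ws\n";
```

With \b the pieces 'a1' and 'b1' are reported as separate one-digit words; with the lookarounds the whole token is skipped.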

        Give a man a fish:  <%-{-{-{-<

        Very instructive! I hadn't looked closely at your code, embarrassingly, and this actually happens to be the first example I can remember where a \b is actually required. And I really can't think up an alternative without it. (It's not that I somehow don't like it, just that I've never had to use it.)

        On reflection, I would perhaps have used split and then something with grep {/^...$/} -- much clumsier and also amounting to the same thing in disguise, namely string anchors. I never realised that \b is a kind of ^ and $, but within the text.
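For what it's worth, the split-and-grep idea might look something like the following. This is only a sketch: it assumes the words are separated by whitespace, as in the sample string from this thread, and it reuses the character classes from the regex above.

```perl
use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# Split on whitespace, then keep only tokens that consist of
# non-digit word characters around exactly one digit.
my @names = grep { /^[^\W\d]*\d[^\W\d]*$/ } split ' ', $text;

print "@names\n";
```

Here ^ and $ do for each token what \b does inside the unsplit text.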

        (The point I was trying to make -- only half convincingly -- was that I think it's a good habit not to throw in things "for good measure" and move on as soon as it works, because that approach doesn't teach you a lot, and you sometimes walk away with a wrong or (worse!) fuzzy impression of what actually did the trick. Or both. I have experienced this a thousand times.)