Re: regex search for words with one digit
by Athanasius (Bishop) on Sep 21, 2020 at 16:03 UTC
The character class \w matches an alphanumeric character, so it matches a digit as well as a letter (or underscore). You need a character class which excludes digits. But \D matches anything that is not a digit, so it also matches whitespace. A negated character class [^\d\s] will match a character that is neither a digit nor a whitespace character:
my @names = $text =~ /\b[^\d\s]*\d[^\d\s]*\b/g;
Or, more simply, specify the letters you want to match explicitly (note the /i modifier to make the regex case-insensitive):
my @names = $text =~ /\b[A-Z]*\d[A-Z]*\b/gi;
See the section “Character Classes and other Special Escapes” in perlre#Regular-Expressions.
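To see the difference on sample data, here is a minimal sketch that runs the \w-based pattern and the two alternatives above side by side. The sample string is the one used elsewhere in this thread; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# \w also matches digits, so any word with at least one digit is accepted
my @any_digit = $text =~ /\b\w*\d\w*\b/g;

# Exactly one digit, surrounded by characters that are neither digits nor whitespace
my @negated   = $text =~ /\b[^\d\s]*\d[^\d\s]*\b/g;

# Exactly one digit, surrounded by explicitly listed letters (case-insensitive)
my @explicit  = $text =~ /\b[A-Z]*\d[A-Z]*\b/gi;

print "@any_digit\n";   # P5ete Richard58 Nic4k Le7on5
print "@negated\n";     # P5ete Nic4k
print "@explicit\n";    # P5ete Nic4k
```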
Hope that helps,
|
use strict;
use warnings;
my $text = "John P5ete 1 Andrew Richard58 Nic4k Le7on5 Ab5%&=-/zz.";
my @words = split /\s+/, $text;
my @names;
for my $word (@words)
{
    my @chars   = $word =~ /[A-Z]/gi;
    my @digits  = $word =~ /\d/g;
    my @symbols = $word =~ /\W/g;
    push @names, $word if @chars && @digits == 1 && !@symbols;
}
print "@names\n";
Output:
19:27 >perl 2057_SoPW.pl
P5ete Nic4k
19:27 >
This may or may not be exactly what the OP intended, but breaking down the code into separate parts like this at least makes it easier to tweak as and when the requirements are clarified.
To the OP:
- \W matches any non-word character; but, as the original string was split on whitespace, there are no whitespace characters in any $word and so within the for loop \W matches the sort of non-alphanumeric symbols identified by AnomalousMonk.
- if @chars is Perlish shorthand for if scalar(@chars) != 0; similarly, if ... !@symbols is a shorter way of saying if ... scalar(@symbols) == 0.
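As a rough illustration of the second point, the sketch below writes the same test twice, once with the shorthand and once with explicit scalar() comparisons, and shows that both select the same words. The helper names is_name_short and is_name_long are hypothetical and only exist for this comparison.

```perl
use strict;
use warnings;

# Shorthand: an array in boolean or numeric context yields its element count
sub is_name_short {
    my ($word) = @_;
    my @chars   = $word =~ /[A-Z]/gi;
    my @digits  = $word =~ /\d/g;
    my @symbols = $word =~ /\W/g;
    return (@chars && @digits == 1 && !@symbols) ? 1 : 0;
}

# The same test with every count spelled out
sub is_name_long {
    my ($word) = @_;
    my @chars   = $word =~ /[A-Z]/gi;
    my @digits  = $word =~ /\d/g;
    my @symbols = $word =~ /\W/g;
    return (scalar(@chars) != 0 && scalar(@digits) == 1 && scalar(@symbols) == 0) ? 1 : 0;
}

my @words = split /\s+/, "John P5ete 1 Andrew Richard58 Nic4k Le7on5 Ab5%&=-/zz.";
my @short = grep { is_name_short($_) } @words;
my @long  = grep { is_name_long($_) }  @words;

print "@short\n";   # P5ete Nic4k
print "@long\n";    # P5ete Nic4k
```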
Cheers,
Re: regex search for words with one digit
by haj (Curate) on Sep 21, 2020 at 16:08 UTC
Digits are, in Perl regular expressions, word characters.
If you want to exclude digits, you can use character classes: Either those defined by POSIX (only if you don't have Unicode characters), or using Unicode properties in a recent Perl.
Here's a Unicode-aware example:
use strict;
use warnings;
my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
my @names = $text =~ /\b\p{Alphabetic}*\d\p{Alphabetic}*\b/g;
print "@names\n";
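A small sketch of why the Unicode property can matter, assuming the input may contain letters outside A-Z. The sample names are made up for illustration, and the non-ASCII letter is written as an \x{...} escape (U+0141, "Ł") so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings when printing

# "\x{141}" is LATIN CAPITAL LETTER L WITH STROKE
my $text = "Zo4e \x{141}u5kasz Ann";

# \p{Alphabetic} accepts any alphabetic character, ASCII or not
my @unicode = $text =~ /\b\p{Alphabetic}*\d\p{Alphabetic}*\b/g;

# [A-Z] with /i only accepts ASCII letters, so the second name is missed
my @ascii   = $text =~ /\b[A-Z]*\d[A-Z]*\b/gi;

print scalar(@unicode), " match(es) with \\p{Alphabetic}: @unicode\n";
print scalar(@ascii),   " match(es) with [A-Z]: @ascii\n";
```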
Re: regex search for words with one digit
by tybalt89 (Prior) on Sep 21, 2020 at 17:22 UTC
The exclusion trick: \w and [^\W] match exactly the same thing, so to
match a \w but not a \d, just use [^\W\d]
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11122003
use warnings;
my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
my @names = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;
print "@names\n";
Outputs:
P5ete Nic4k
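One thing worth knowing about [^\W\d]: since \w includes the underscore, so does this class. The sketch below uses a made-up sample string to show the difference from an explicit letter class; whether underscores should count is up to the requirements.

```perl
use strict;
use warnings;

my $text = "P5ete snake_1case Richard58";

# [^\W\d] means "a word character that is not a digit": letters and underscore
my @word_no_digit = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;

# An explicit letter class rejects the underscore
my @letters_only  = $text =~ /\b[A-Za-z]*\d[A-Za-z]*\b/g;

print "@word_no_digit\n";   # P5ete snake_1case
print "@letters_only\n";    # P5ete
```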
|
Rereading the spec, the name '1' is valid. :)
If the spec is changed, I'll change my regex, but it will cost extra...
Re: regex search for words with one digit
by Anonymous Monk on Sep 21, 2020 at 16:45 UTC
Thank you, and forgive me my rather silly question.
|
I guess many of us have asked "silly" questions (I certainly have, and still do). The documentation is often intimidatingly vast. The upshot here is that regex character classes, anchors and the like may well overlap, and you unwittingly used a class that was more inclusive than you assumed ("word characters": well, who can blame you? I have ignored the 'numeric' part of 'alphanumeric' myself more than once because of the 'word'), and were therefore also unaware that \d is not the complement of \w but a subset of it. (This is simply not what we intuitively expect.) And, as tybalt89 has already demonstrated, negated character classes ([^ ... whatever ...]) are often massively helpful, especially if you follow Athanasius' suggestion and define the character classes yourself (which makes your own code more transparent to you).
Guessing from how you've written the regex, it seems to me that the following advice may also be helpful:
Perl does not "understand what you mean", but, like any other programming language, slavishly follows the rules you have given it, and, since rules are rules, does not need to be told things twice. With that in mind, you learn a lot if you try to design regexes (like other code) as "thinly" as possible:
my @names = $text =~ /\w*\d\w*/g;
does exactly the same as your original regex: matching all "words" (in the above definition) which contain at least one digit, somewhere. (Not what you wanted, I know, but it's still instructive.) Why?
Although the \b anchors do match what you mean them to match, they are redundant: their meaning is "Match a \w\W or \W\w boundary" (man perlre). But your \w* already matches everything that falls under the definition of \w (* is "greedy", as you may know), and it necessarily does so until it hits a character that does not, which is precisely the definition of \W. In other words, until it hits "a \w\W boundary". (As \d is a subset of \w, it will never match anything that matches \W either; in other words, the match will stop at a \d\W boundary if there isn't a \w in between.)
(Or am I somehow mistaken? Why has everybody else kept the \b?)
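For the \w-based pattern specifically, the claim is easy to check with a short sketch on the thread's sample string (the variable names are illustrative). It says nothing about patterns built on narrower character classes.

```perl
use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

my @with_b    = $text =~ /\b\w*\d\w*\b/g;   # with boundary assertions
my @without_b = $text =~ /\w*\d\w*/g;       # without them

print "@with_b\n";      # P5ete Richard58 Nic4k Le7on5
print "@without_b\n";   # P5ete Richard58 Nic4k Le7on5
```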
|
Why has everybody else kept the \b?
I can't answer for others, but for me, throwing in boundary
assertions like this is a reflexive, defensive (and possibly
cargo-cultish) move I tend to use when I'm dealing with "words".
The string the Anonymous Monk gives as an example is fairly
straightforward: it's delimited by whitespace and the beginning and
end of the string. As you say, \b will not help here
(update: No! See tybalt89's reply), though it does no
harm.
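A minimal sketch of a case where \b does make a difference: once \w is replaced by a class that excludes digits, such as [^\W\d], the greedy quantifier no longer runs to the end of the word, and without \b the regex happily matches fragments. This is only an illustration on the thread's sample string, not a restatement of any particular reply.

```perl
use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# With \b, words containing more than one digit are rejected as a whole
my @with_b    = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;

# Without \b, pieces of those words match on their own
my @without_b = $text =~ /[^\W\d]*\d[^\W\d]*/g;

print "@with_b\n";      # P5ete Nic4k
print "@without_b\n";   # P5ete Richard5 8 Nic4k Le7on 5
```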
Unfortunately, "words" be tricky. Is "word's" one word or two? If
it's supposed to be one word, then Anonymous Monk's /\b\w*\d\w*\b/ or
/\w*\d\w*/ or, I think, any of the other solutions I've seen
so far will fail to match it with or without \bs. Words like "t'other", "wouldn't've",
"words'" or "left-handed" can be difficult to deal with. I'm
sure one could give many other examples, and that's just in English!
In general, I think (?<! \S) and (?! \S) would serve
better than \b as word boundaries (update: in
the OPed case). But once again, it is
unfortunately true that there are few generalities in human
language.
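As a sketch of that idea, here are the whitespace-based lookarounds next to \b, using the longer sample string from earlier in the thread. The /x modifier lets the lookarounds be written with the internal space shown above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "John P5ete 1 Andrew Richard58 Nic4k Le7on5 Ab5%&=-/zz.";

# \b sees a boundary between "5" and "%", so the fragment "Ab5" is accepted
my @word_boundary = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;

# Requiring "no non-whitespace on either side" treats each
# whitespace-delimited token as a unit
my @space_boundary = $text =~ /(?<! \S) [^\W\d]* \d [^\W\d]* (?! \S)/gx;

print "@word_boundary\n";    # P5ete 1 Nic4k Ab5
print "@space_boundary\n";   # P5ete 1 Nic4k
```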
Give a man a fish: <%-{-{-{-<