I made the following assumptions, modeled on a basic interpretation of Perl identifier rules. This will not match most of Perl's special punctuation variables or capture variables, but it could be modified to do so if that's what you want:
- Identifiers may contain alphanumeric characters and underscores
- Underscores may appear anywhere in the identifier (including not at all, or as the only character)
- The first character must not be a digit
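The assumptions above can be written as a single anchored check. This is only a minimal sketch, assuming ASCII-only input; the helper name is_underscore_identifier is hypothetical and is not used by the benchmark code in this post:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word is an identifier under the rules above
# (letters, digits, underscores; no leading digit) and contains an underscore.
sub is_underscore_identifier {
    my ($word) = @_;
    return ( $word =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/ && $word =~ /_/ ) ? 1 : 0;
}

for my $word (qw( is_ _ some_sample 0_be plain )) {
    printf "%-12s %s\n", $word, is_underscore_identifier($word) ? 'yes' : 'no';
}
```

Here is_, _ and some_sample pass, while 0_be (leading digit) and plain (no underscore) are rejected.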
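The grep and regex options below produce the same output; the split option differs only in returning 0_be (leading digit kept) where the other two return _be: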
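use warnings;
use strict;
use Benchmark qw/:all/;

# Sample input; q{} does not interpolate, so $__code stays literal text.
my $code = q{this is_++ meant to 0_be some_sample program $__code
for testing whether the regex is(__0K__) ._._._};

# Run each snippet for at least 1 CPU second and compare the rates.
# Note that \w already includes "_", so [\w_] is the same as \w.
cmpthese(-1, {
# split on non-word characters, then keep tokens containing "_"
'split' => q{ grep { /_/ } split /[^\w_]/, $code },
# extract every identifier, then keep those containing "_"
'grep' => q{ grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g },
# match only identifiers that contain "_" in the first place
'regex' => q{ $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx },
});

# Extract the matches once more, in list context, to display them.
my @words = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;
print "\nWords: ", join(', ', @words);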
The regex clearly wins on performance, but if one of the others is more to your liking stylistically and your data is sufficiently small, you have options. Output:
            Rate split  grep regex
split 10581141/s    --  -22%  -49%
grep  13630260/s   29%    --  -34%
regex 20760455/s   96%   52%    --
Words: is_, _be, some_sample, __code, __0K__, _, _, _
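One thing worth checking when reproducing these numbers: Benchmark compiles string snippets in its own scope, where a file-scoped my $code is not visible, and m//g outside list context stops after the first match. A minimal sketch of a variant that sidesteps both, assuming code references are acceptable in place of strings (timings will differ from those above, so none are shown):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $code = q{this is_++ meant to 0_be some_sample program $__code
for testing whether the regex is(__0K__) ._._._};

# Code refs close over the lexical $code. Assigning to an array forces
# list context, so /g returns every match on each iteration.
my %tests = (
    split => sub { my @w = grep { /_/ } split /\W/, $code;         scalar @w },
    grep  => sub { my @w = grep { /_/ } $code =~ /[a-zA-Z_]\w*/g;  scalar @w },
    regex => sub { my @w = $code =~ /_\w* | [a-zA-Z]\w*_\w*/gx;    scalar @w },
);

# Run each sub for at least 1 CPU second and print the comparison table.
cmpthese(-1, \%tests);
```

Each sub returns the number of words it found, which makes it easy to confirm that all three variants are doing real work before trusting the rates.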
CAVEATS
If you are actually passing in program code and need to pull out the identifiers, none of the simple solutions posted so far in this thread will be adequate. For instance, they are trivially confounded by a statement such as my $foo = "bar_of_SOAP", where the string contents would match even though you probably don't want them if you are looking for variable or identifier names. Perl is notoriously difficult to parse.
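The false positive is easy to reproduce. A small sketch, using the same pattern as the regex option and a one-line sample statement chosen only for illustration:

```perl
use strict;
use warnings;

# A statement whose string literal merely looks like an identifier.
my $source = q{my $foo = "bar_of_SOAP";};

# Same pattern as the 'regex' option in the benchmark.
my @words = $source =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;

print join(', ', @words), "\n";   # prints: bar_of_SOAP
```

The pattern has no notion of quoting, so it reports the contents of the string literal while the actual variable, $foo, is skipped because it has no underscore.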