Re: extracting words with certain characters

by rjt (Deacon)
on Dec 04, 2012 at 11:24 UTC

in reply to extracting words with certain characters

I made the following assumptions, modeled on a basic interpretation of Perl's identifier rules. (The patterns below will not match most of Perl's special punctuation variables or capture variables, but could be modified to do so if that's what you want.)

  • Identifiers may contain alphanumeric characters and underscores
  • Underscores may appear anywhere in the identifier (including not at all, or as the only character)
  • The first character must not be a digit
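
The three rules above can be written as a single anchored pattern. This is only a minimal sketch, assuming ASCII-only identifiers; the names `$IDENT` and `is_identifier` are made up for illustration and do not appear elsewhere in this thread:

```perl
use strict;
use warnings;

# Hypothetical names, ASCII-only by assumption.
# First character: a letter or underscore (never a digit).
# Remaining characters: letters, digits, or underscores, possibly none.
my $IDENT = qr/[A-Za-z_][A-Za-z0-9_]*/;

sub is_identifier {
    my ($word) = @_;
    return $word =~ /\A$IDENT\z/ ? 1 : 0;
}

for my $word (qw(some_sample _ __0K__ 0_be is_++)) {
    printf "%-12s %s\n", $word, is_identifier($word) ? 'ok' : 'rejected';
}
```

Here `0_be` is rejected because of its leading digit, and `is_++` because of the trailing punctuation; a lone `_` is accepted.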

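All three of these options pull the same words out of the sample, with one exception: the split version returns 0_be rather than _be, because it does not enforce the leading-digit rule: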

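    use warnings;
    use strict;
    use Benchmark qw/:all/;

    # Package variable rather than a lexical, so that the string snippets
    # Benchmark evals (outside this file's lexical scope) can see it
    our $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

    cmpthese(-1, {
        'split' => q{ grep { /_/ } split /[^\w_]/, $code },
        'grep'  => q{ grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g },
        'regex' => q{ $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx },
    });

    # Either an underscore followed by word characters, or a letter-led
    # word that contains at least one underscore
    my @words = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;
    print "\nWords: ", join(', ', @words);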

The regex clearly wins on performance, but if one of the others is more to your liking stylistically and your data is sufficiently small, you have options. Output:

                Rate split  grep regex
    split 10581141/s    --  -22%  -49%
    grep  13630260/s   29%    --  -34%
    regex 20760455/s   96%   52%    --

    Words: is_, _be, some_sample, __code, __0K__, _, _, _
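
One thing to watch when timing snippets like these: Benchmark compiles string code inside its own module, where a file-scoped `my` variable from the calling script is not visible, and a bare `m//g` in void context matches only once per call. A possible variation, offered as a sketch rather than a replacement, passes code references that close over the data and assign to an array so every candidate runs in list context. The hash name `%candidates` is made up for illustration; the patterns are the ones from the post:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

# Each candidate is a closure over $code. Assigning to @w forces list
# context, so the match or split returns every word; the sub then
# returns how many words it found.
my %candidates = (
    split => sub { my @w = grep { /_/ } split /[^\w_]/, $code;       scalar @w },
    grep  => sub { my @w = grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g; scalar @w },
    regex => sub { my @w = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;   scalar @w },
);

# Run each candidate for at least one CPU second and print the comparison.
cmpthese(-1, \%candidates);
```

Timings from this form will differ from run to run and from machine to machine, so treat any ranking as specific to your own data.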


If you are actually passing in program code, and need to pull out the identifiers, none of the simple solutions so far posted in this thread will be adequate. For instance, these will be trivially confounded by my $foo = "bar_of_SOAP", which you would probably not want to match if you are looking for variable or identifier names. Perl is notoriously difficult to parse.
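
To make that caveat concrete, here is a small sketch that feeds the example line to the same alternation used in the 'regex' option. The variable name `$source` is only for illustration:

```perl
use strict;
use warnings;

# The problem line, held as plain data (q{} does not interpolate).
my $source = q{my $foo = "bar_of_SOAP";};

# Same pattern as the 'regex' option in the benchmark.
my @words = $source =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;

print join(', ', @words), "\n";
```

The pattern reports bar_of_SOAP, which is the contents of a string literal, and skips $foo entirely because that name has no underscore. A regex has no notion of quoting, so it cannot tell the two cases apart.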

Re^2: extracting words with certain characters
by ColonelPanic (Friar) on Dec 04, 2012 at 11:30 UTC

    Thanks for this helpful, thorough answer.

    As for your caveat, the OP did say that only the identifiers contain underscores. As long as that is true, a simplistic solution should work fine.

