PerlMonks  


I made the following assumptions, modeled on a basic interpretation of Perl identifier rules. They will not match most of Perl's special punctuation variables or captures, but could be modified to do so if that's what you want:

  • Identifiers may contain alphanumeric characters and underscores
  • Underscores may appear anywhere in the identifier (including not at all, or as the only character)
  • The first character must not be a digit
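
As a minimal sketch of how these three rules translate into a pattern (the helper name `underscore_identifiers` is hypothetical, not something from the thread), a character class can enforce the first-character rule and a second pass can keep only tokens that contain an underscore. It assumes ASCII input, and like the patterns below it does not anchor on a word boundary, so `0_be` yields `_be`:

```perl
use strict;
use warnings;

# Hypothetical helper: returns identifier-like tokens that contain
# at least one underscore.
sub underscore_identifiers {
    my ($text) = @_;

    # [A-Za-z_] : first character is a letter or underscore, never a digit
    # \w*       : followed by any run of letters, digits, or underscores
    my @candidates = $text =~ /[A-Za-z_]\w*/g;

    # Keep only the candidates that actually contain an underscore.
    return grep { /_/ } @candidates;
}

my @found = underscore_identifiers('x1 0_be some_sample _ plain');
print join(', ', @found), "\n";   # prints: _be, some_sample, _
```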

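All three of these options produce essentially the same output (the split version does not enforce the first-character rule, so it returns 0_be where the other two return _be):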

    use warnings;
    use strict;
    use Benchmark qw/:all/;

    my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

    cmpthese(-1, {
        'split' => q{ grep { /_/ } split /[^\w_]/, $code },
        'grep'  => q{ grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g },
        'regex' => q{ $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx },
    });

    my @words = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;
    print "\nWords: ", join(', ', @words);

The regex clearly wins on performance, but if one of the others is more to your liking stylistically and your data is sufficiently small, you have options. Output:

                Rate split  grep regex
    split 10581141/s    --  -22%  -49%
    grep  13630260/s   29%    --  -34%
    regex 20760455/s   96%   52%    --

    Words: is_, _be, some_sample, __code, __0K__, _, _, _
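
One thing worth verifying when reproducing these numbers: Benchmark evaluates string snippets inside its own scope, so a lexical `my $code` in the calling script is not visible to them, and the snippets may end up being timed against an empty global `$code` instead. A hedged sketch of the same comparison using code references, which close over the lexical, follows. The `%approach` hash and the assignment to `@w` (to force list context so every match is collected) are my own assumptions, not part of the original post, and the rates will differ from those shown above.

```perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

# Code references close over the lexical $code. Assigning to an array
# forces list context, so each approach collects all of its matches.
my %approach = (
    split => sub { my @w = grep { /_/ } split /[^\w_]/, $code; return @w },
    grep  => sub { my @w = grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g; return @w },
    regex => sub { my @w = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx; return @w },
);

# Show what each approach extracts before timing it.
for my $name (sort keys %approach) {
    print "$name: ", join(', ', $approach{$name}->()), "\n";
}

# Run each approach for at least one CPU second and print the comparison.
cmpthese(-1, \%approach);
```

Printing the extracted words first makes it easy to confirm that each snippet is really operating on the sample text before trusting the timings.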

CAVEATS

If you are actually passing in program code and need to pull out the identifiers, none of the simple solutions posted so far in this thread will be adequate. For instance, they are trivially confounded by my $foo = "bar_of_SOAP", where the contents of the string literal match even though you would probably not want them if you are looking for variable or identifier names. Perl is notoriously difficult to parse.
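
To make the caveat concrete, here is a small sketch that runs the regex from above over a line of Perl source. The `$source` sample, including the `$real_name` variable, is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input: one string literal and one genuine variable name.
my $source = q{my $foo = "bar_of_SOAP"; my $real_name = 1;};

# Same pattern as in the benchmark above.
my @words = $source =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;

print join(', ', @words), "\n";   # prints: bar_of_SOAP, real_name
```

The pattern has no notion of quoting, so the string contents are reported alongside the real identifier.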


In reply to Re: extracting words with certain characters by rjt
in thread extracting words with certain characters by geek09
