PerlMonks  


I made the following assumptions, modeled on a basic interpretation of Perl's identifier rules. They will not match most of Perl's special punctuation variables or capture variables, but the patterns could be modified to do so if that's what you want:

  • Identifiers may contain alphanumeric characters and underscores
  • Underscores may appear anywhere in the identifier (including not at all, or as the only character)
  • The first character must not be a digit
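
As a rough sketch of those three assumptions in one pattern (the helper name is_identifier is hypothetical and not part of the benchmark in this post), a whole-word check might look like this:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: true if $word satisfies the
# three assumptions listed above (letters, digits and underscores only,
# underscores anywhere, first character not a digit).
sub is_identifier {
    my ($word) = @_;
    return $word =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/ ? 1 : 0;
}

for my $candidate (qw/some_sample _ __0K__ is_ 0_be/) {
    printf "%-12s %s\n", $candidate,
        is_identifier($candidate) ? 'ok' : 'rejected';
}
```

Only 0_be is rejected here, because it starts with a digit. Note that this checks a complete word, whereas the patterns in this post search inside a longer string and additionally require an underscore.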

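The three options benchmarked here produce the same output on this sample, with one exception: the split variant keeps the leading digit and returns 0_be where the other two return _be.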

    use warnings;
    use strict;
    use Benchmark qw/:all/;

    my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

    cmpthese(-1, {
        'split' => q{ grep { /_/ } split /[^\w_]/, $code },
        'grep'  => q{ grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g },
        'regex' => q{ $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx },
    });

    my @words = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;
    print "\nWords: ", join(', ', @words);

The regex clearly wins on performance, but if one of the others is more to your liking stylistically and your data is sufficiently small, you have some options. Output:

                Rate split  grep regex
    split 10581141/s    --  -22%  -49%
    grep  13630260/s   29%    --  -34%
    regex 20760455/s   96%   52%    --

    Words: is_, _be, some_sample, __code, __0K__, _, _, _
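
The absolute rates may be worth double-checking. As far as I can tell, Benchmark evals string snippets inside its own module, so a file-scoped my $code is not visible to them, and the snippets are not called in list context. The following is only a sketch of one way to guard against both: it passes code refs that close over $code and assigns each result to an array. The %variant hash and the wrapper subs are my own additions, and \W and \w* stand in for [^\w_] and [\w_]* because \w already includes the underscore. Timings will differ from the table above and from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

# Each variant is a closure over the lexical $code and returns its matches.
my %variant = (
    split => sub { grep { /_/ } split /\W/, $code },
    grep  => sub { grep { /_/ } $code =~ /[a-zA-Z_]\w*/g },
    regex => sub { $code =~ /_\w* | [a-zA-Z]\w*_\w*/gx },
);

# Wrap each variant so the timed call assigns to an array (list context),
# which makes every variant build its full result list.
my %timed;
for my $name (keys %variant) {
    my $finder = $variant{$name};
    $timed{$name} = sub { my @words = $finder->(); return };
}

cmpthese(-1, \%timed);
```

If the ranking stays the same with this version, that is better evidence for the regex than the string-eval numbers alone.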

CAVEATS

If you are actually passing in program code and need to pull out the identifiers, none of the simple solutions posted so far in this thread will be adequate. For instance, they will be trivially confounded by my $foo = "bar_of_SOAP", where the contents of the string literal match even though you probably only want variable or identifier names. Perl is notoriously difficult to parse.
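
To make the caveat concrete, here is a small sketch that runs the same regex from the benchmark over that one-line example (the variable names $source and @words are mine):

```perl
use strict;
use warnings;

# A single line of Perl source whose string literal contains underscores.
my $source = q{my $foo = "bar_of_SOAP";};

# Same pattern as the 'regex' variant in the benchmark.
my @words = $source =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;

print join(', ', @words), "\n";   # prints: bar_of_SOAP
```

The only match comes from inside the string literal, while the actual variable, $foo, is not reported at all because its name has no underscore.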


In reply to Re: extracting words with certain characters by rjt
in thread extracting words with certain characters by geek09
