Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I made the following assumptions (modeled after the following basic interpretation of Perl identifier rules--will not match most special Perl punctuation variables or captures, but could be modified to do so if that's what you want):

  • Identifiers may contain alphanumeric characters and underscores
  • Underscores may appear anywhere in the identifier (including not at all, or as the only character)
  • The first character must not be a digit

All three of these options produce the same output:

use warnings; use strict; use Benchmark qw/:all/; my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._}; cmpthese(-1, { 'split' => q{ grep { /_/ } split /[^\w_]/, $code }, 'grep' => q{ grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g }, 'regex' => q{ $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx }, }); my @words = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx; print "\nWords: ", join(', ', @words);

The regex clearly wins in performance, but if one of the other is more to your style liking and your data is sufficiently small, you have some options. Output:

Rate split grep regex split 10581141/s -- -22% -49% grep 13630260/s 29% -- -34% regex 20760455/s 96% 52% -- Words: is_, _be, some_sample, __code, __0K__, _, _, _


If you are actually passing in program code, and need to pull out the identifiers, none of the simple solutions so far posted in this thread will be adequate. For instance, these will be trivially confounded by my $foo = "bar_of_SOAP", which you would probably not want to match if you are looking for variable or identifier names. Perl is notoriously difficult to parse.

In reply to Re: extracting words with certain characters by rjt
in thread extracting words with certain characters by geek09

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others exploiting the Monastery: (4)
    As of 2018-08-19 02:19 GMT
    Find Nodes?
      Voting Booth?
      Asked to put a square peg in a round hole, I would:

      Results (186 votes). Check out past polls.