 
PerlMonks  

extracting words with certain characters

by geek09 (Initiate)
on Dec 04, 2012 at 10:27 UTC ( #1007053=perlquestion )
geek09 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am a beginner in Perl and came across this problem. I have a few lines of code containing some instance names at various positions, say X_Y_blah_blah. I need to extract these "instance names". Only these instance names contain the underscore character "_", so maybe the code can be directed to extract words containing _. Please let me know if I am not clear.

Re: extracting words with certain characters
by space_monk (Chaplain) on Dec 04, 2012 at 10:44 UTC

    Update: Just had to make a correction for multiple matches on one line...

    perl -ne 'while (/(\w*_\w*)/g) { print "$1\n";}' code_file(s)

    You may want to change the regex so that it accepts only names with a leading alphabetic character, or requires at least one alphabetic character, or save the names into a hash to remove duplicates, so YMMV from what I've produced here...
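A minimal sketch of the first and last of those variations, written as a script instead of a one-liner. The helper name `instance_names` and the sample text are made up for illustration; the regex assumes a name must start with a letter at a word boundary and contain at least one underscore.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the unique
# underscore-containing names in a string, in first-seen order.
sub instance_names {
    my ($text) = @_;
    my (%seen, @names);
    # \b[A-Za-z] : the name must start with a letter at a word boundary
    # \w*_\w*    : and contain at least one underscore
    while ($text =~ /\b([A-Za-z]\w*_\w*)/g) {
        push @names, $1 unless $seen{$1}++;
    }
    return @names;
}

# Made-up sample input; "_tmp" and "9_lives" are skipped because they
# do not start with a letter, and the repeated name is reported once.
my @names = instance_names("X_Y_blah u1 = X_Y_blah + _tmp + 9_lives + a_b;");
print "$_\n" for @names;
```

The `%seen` hash does the same de-duplication as the hash one-liner, but keeps the names in the order they were first found.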

    Mapping into a hash using:

    perl -ne 'while (/(\w*_\w*)/g) { $hash{$1} = 1 } END { print map { "$_\n" } keys %hash }' code_file(s)

    Gives all the unique names, e.g.

    no_wait ZERO_TABLE_SIZE subpart_name segment_config __END__ skip_table_list rnc_dspp_dspresu range_end dry_run gp_partition_drop keep_empty ignore_table_list get_lock FULL_DATE_FORMAT get_summarised_days log_init tv_interval get_partition_row_count drop_agg_level table_name total_table_size keep_summarised summarisation_log DATE_FORMAT get_config MAX_CACHE_TIME lock_table skip_tables GPM_BIN day_count drop_daily_agg_level lock_attempt get_dbh get_drop_partition_list agg_level lock_type schema_name _ site_perl empty_only partition_name range_start drop_partition keep_unclassified time_zone pm_nsn_3g_ran row_count GPI_RECOVER_BACKLOG
    A Monk aims to give answers to those who have none, and to learn from those who know more.

      In progress...

      perl -ne '/([A-Za-z_]+)/ && print "$1\n";' code_file(s)

      Just had to be first, didn't we? :-)

        Not really; it's a slow work day and I was bored, so I kept changing the answer... :-P
Re: extracting words with certain characters
by Ratazong (Prior) on Dec 04, 2012 at 10:47 UTC

    The following code may get you started:

    my $words = "y____x_z dddetr x_y erre yyy_";
    while ($words =~ /(\S*_\S*)/g) {
        print "$1\n";
    }

    It looks for any words containing an underscore character and prints them. Thanks to the loop and the /g modifier, it will even find all words containing an underscore.

    I intentionally wrote get you started, as there are some basic assumptions inside, e.g.:

    • the "instance names" are separated by whitespace (and not commas ...)
    • more than one underscore is OK, and the underscores may follow each other
    • an instance-name consisting only of underscores is fine
    This webpage may help you understand the regex.

    HTH, Rata

      Note: this will match "words" containing anything that is not whitespace. This may or may not be what you want. For example, should this be two different words?

      word_1,word_2

      If so, you could use a regex that matches a more traditional definition of a word:

      /(\w*_\w*)/g

      \w matches "word characters": alphanumeric characters plus underscores.
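To make the difference concrete, here is a small sketch that runs both patterns over the same made-up line (the `connect(...)` text is only an illustration, not from the original question):

```perl
use strict;
use warnings;

my $line = 'connect(word_1,word_2); plain';

# \S* spans every non-whitespace character, so punctuation is swallowed.
my @by_nonspace = $line =~ /(\S*_\S*)/g;

# \w* stops at anything that is not a letter, digit, or underscore.
my @by_wordchar = $line =~ /(\w*_\w*)/g;

print "non-space: @by_nonspace\n";
print "word char: @by_wordchar\n";
```

With the \S version the whole `connect(word_1,word_2);` chunk comes back as a single "word"; with the \w version you get `word_1` and `word_2` separately.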



      When's the last time you used duct tape on a duct? --Larry Wall
Re: extracting words with certain characters
by rjt (Deacon) on Dec 04, 2012 at 11:24 UTC

    I made the following assumptions (modeled after the following basic interpretation of Perl identifier rules--will not match most special Perl punctuation variables or captures, but could be modified to do so if that's what you want):

    • Identifiers may contain alphanumeric characters and underscores
    • Underscores may appear anywhere in the identifier (including not at all, or as the only character)
    • The first character must not be a digit

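    All three of these options extract the words containing underscores. The split variant does not enforce the leading-digit rule, though, so it returns 0_be where the other two return _be: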

    use warnings;
    use strict;
    use Benchmark qw/:all/;

    my $code = q{this is_++ meant to 0_be some_sample program $__code for testing whether the regex is(__0K__) ._._._};

    # Caveat: Benchmark evals these strings outside this lexical scope, so
    # "my $code" is not visible to them. Declare it with "our", or pass
    # code refs (sub { ... }), to time the matches against the real data.
    cmpthese(-1, {
        'split' => q{ grep { /_/ } split /[^\w_]/, $code },
        'grep'  => q{ grep { /_/ } $code =~ /[a-zA-Z_][\w_]*/g },
        'regex' => q{ $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx },
    });

    my @words = $code =~ /_[\w_]* | [a-zA-Z]\w*_\w*/gx;
    print "\nWords: ", join(', ', @words);

    The regex clearly wins in performance, but if one of the others is more to your liking stylistically and your data is sufficiently small, you have some options. Output:

                Rate split  grep regex
    split 10581141/s    --  -22%  -49%
    grep  13630260/s   29%    --  -34%
    regex 20760455/s   96%   52%    --

    Words: is_, _be, some_sample, __code, __0K__, _, _, _

    CAVEATS

    If you are actually passing in program code, and need to pull out the identifiers, none of the simple solutions so far posted in this thread will be adequate. For instance, these will be trivially confounded by my $foo = "bar_of_SOAP", which you would probably not want to match if you are looking for variable or identifier names. Perl is notoriously difficult to parse.
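A short sketch of that failure mode, using the simple \w-based pattern from earlier in the thread. The sample source line is invented, and the quote-stripping step is only a crude illustration of one partial workaround, not a parser:

```perl
use strict;
use warnings;

# Single-quoted, so $foo and $row_count are literal text, not interpolated.
my $source = 'my $foo = "bar_of_SOAP"; my $row_count = 0;';

# The naive pattern cannot tell an identifier from string contents.
my @hits = $source =~ /\w*_\w*/g;
print "naive:    @hits\n";

# Crude mitigation (an assumption for illustration only): drop
# double-quoted strings before matching. Heredocs, q{}, comments,
# POD and so on would still get through.
(my $stripped = $source) =~ s/"(?:[^"\\]|\\.)*"//g;
my @filtered = $stripped =~ /\w*_\w*/g;
print "filtered: @filtered\n";
```

The naive match reports both `bar_of_SOAP` and `row_count`; after stripping the quoted string only `row_count` remains.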

      Thanks for this helpful, thorough answer.

      As for your caveat, the OP did say that only the identifiers contain underscores. As long as that is true, a simplistic solution should work fine.



