Re: how to extract string by possible groupings?

by muba (Priest)
on Jun 02, 2014 at 15:42 UTC


in reply to how to extract string by possible groupings?

There are six things obviously wrong with your regex:

  1. \s matches a single whitespace character, but as far as I can tell from your sample input, there could be multiple spaces between the columns. \s should be written \s+.
  2. You have included the \s+ inside the parens, meaning that the whitespace separating the columns is part of the data you're trying to capture (in other words, $match[0] won't be "test1.cpp", it will actually be "test1.cpp     ", and likewise $match[1] will have trailing spaces).
  3. A percent sign doesn't carry any special meaning inside regular expressions, and thus it doesn't need to be escaped.
  4. You use the /g modifier even though you don't need it.
  5. Your grouping and capturing is a little off, and way too complex.
  6. A good practice is DRY, or Don't Repeat Yourself. A good way to adhere to the DRY principle is to generalize stuff as much as possible. You violate this principle, though.
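
To make the first three points concrete, here is a minimal sketch. The sample line and the variable names are made up for illustration, and the column spacing is an assumption based on the description above, not the original input.

```perl
use strict;
use warnings;

# Hypothetical sample line; the exact column spacing is an assumption.
my $line = "test1.cpp     0.00% of 21";

# \s+ inside the parens: the separator becomes part of the capture.
my ($with_ws) = $line =~ m/(.*\.cpp\s+)/;

# \s+ outside the parens: only the file name is captured.
my ($without_ws) = $line =~ m/(.*\.cpp)\s+/;

# An unescaped % matches a literal percent sign.
my ($percent) = $line =~ m/(\d+\.\d+%)/;

print "[$with_ws]\n";     # [test1.cpp     ]
print "[$without_ws]\n";  # [test1.cpp]
print "[$percent]\n";     # [0.00%]
```

The square brackets in the output are only there to make the trailing whitespace visible.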

Regarding grouping and capturing, remember that every pair of parens inside a regex creates a capturing group, and captured substrings are returned in order of appearance (added: as LanX++ beautifully illustrated). Consider the following snippet:

    $string = "foo bar";
    @match = $string =~ m/(f(oo)) (b(ar))/;
    print "$match[0]\n";  # prints "foo" (captured by /(f(oo))/)
    print "$match[1]\n";  # prints "oo"  (captured by /(oo)/)
    print "$match[2]\n";  # prints "bar" (captured by /(b(ar))/)
    print "$match[3]\n";  # prints "ar"  (captured by /(ar)/)

Likewise, you seem to think that your @match variable will contain three elements, but as a matter of fact it will contain 8 (eight!): one for every pair of parens in your regex, some of which only surround non-data such as the word "of" or just whitespace \s+.

Don't believe me? Do me a favour and run this snippet (in which I only fixed the \s vs \s+ issue)

    use Data::Dumper;

    while (chomp(my $line = <DATA>)) {
        @match = $line =~ m/((.*\.c\s+)|(.*\.h\s+)|(.*\.cpp\s+))|(\s+(.*) +\%\s+(of)\s+\d+\s)|(\bNone\b)/;
        print "$line\n";
        print Dumper \@match;
    }

    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None

The output I get:

    [... snip ...]
    test1.cpp 0.00% of 21 0.00% of 16
    $VAR1 = [
              'test1.cpp ',
              undef,
              undef,
              'test1.cpp ',
              undef,
              undef,
              undef,
              undef
    [... snip ...]

This neatly demonstrates at least three things:

  1. You've captured the filename twice (once because of the outer group, once because of the extension-specific group for .cpp).
  2. The matched file name includes the trailing white space, which I don't think is part of the filename anyway.
  3. Your @match array contains way more elements than you think it does - nearly three times as many!

As for the DRY principle, you violate this for example in the chunk of the regex where you try to capture the file names. What you have written is: "match any number of characters, a literal period, a literal 'c', whitespace; OR match any number of characters, a literal period, a literal 'cpp', whitespace; OR match any (...)" I'm sure you get the pattern.

The way I would have written it reads as: "match any number of characters, a literal period, one of these literal strings ('c', 'cpp', 'h'), whitespace."

    /(.*\.(?:c|cpp|h))\s+/    # Use (?:...) to create a non-capturing group.
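
As a quick check, the pattern can be run against the three data lines from the sample input. This is only a sketch: the @lines and @names arrays are names invented here, and the lines use single spaces between columns as they appear in this post.

```perl
use strict;
use warnings;

# Data lines copied from the sample input in this thread.
my @lines = (
    "test1.cpp 0.00% of 21 0.00% of 16",
    "test2.c None 16.53% of 484",
    "test3.h 0.00% of 138 None",
);

my @names;
for my $line (@lines) {
    # One capturing group, so a list-context match returns one element.
    my @match = $line =~ /(.*\.(?:c|cpp|h))\s+/;
    push @names, @match;
    print scalar(@match), " capture: '$match[0]'\n";
}
```

Each line should yield exactly one capture, the file name without trailing whitespace.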

The readability of your script could use some work too. Here's how I would've written it:

    # I always start my script with these two lines.
    # They prevent you from making various mistakes
    # and make debugging a whole lot easier.
    use strict;
    use warnings;

    # Regular expressions have the tendency to become long
    # strings of near-undecipherable line noise. To avoid
    # that, I usually like to split them up in smaller
    # logical chunks.
    # In this case, I'd write one regex to capture the
    # file names and one regex to capture percentages.
    my $title_re   = qr/.*\.(?:c|cpp|h)/;
    my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;

    # Next thing is to combine them into a single
    # regex to match the input against.
    # I use the /x modifier so that I can use
    # white space and comments inside the regex.
    my $line_re = qr/
        ($title_re)   \s+   # Match and capture file names, match whitespace
        ($percent_re) \s+   # Match and capture Percent2, match non-data
        ($percent_re)       # Match and capture Percent3
    /x;

    <DATA>; # Read and discard the first line, as this contains non-data.

    # Read input line by line, cut off newline
    # characters from the end.
    while (my $line = <DATA>) {
        chomp $line;

        # Match input against the regex, capture
        # the stuff into separate variables.
        # I mean, I find a "$title" much more
        # comprehensible than "$match[0]".
        my ($title, $percent2, $percent3) = $line =~ $line_re;

        print "$line\n";
        print "Title: $title\n";
        print "Percent2: $percent2\n";
        print "Percent3: $percent3\n";
        print "\n";
    }

    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None
    C:\Users\Lona\Desktop>perl x.pl
    test1.cpp 0.00% of 21 0.00% of 16
    Title: test1.cpp
    Percent2: 0.00% of 21
    Percent3: 0.00% of 16

    test2.c None 16.53% of 484
    Title: test2.c
    Percent2: None
    Percent3: 16.53% of 484

    test3.h 0.00% of 138 None
    Title: test3.h
    Percent2: 0.00% of 138
    Percent3: None

Replies are listed 'Best First'.
Re^2: how to extract string by possible groupings?
by Laurent_R (Canon) on Jun 02, 2014 at 16:54 UTC
    I wish I could upvote more than once such a useful, detailed and complete post.

      As much as those warm words are appreciated, I do think I could've been even more complete by including links to relevant sections of the documentation, but I didn't feel like it ;)

Re^2: how to extract string by possible groupings?
by adrive (Scribe) on Jun 03, 2014 at 02:24 UTC
    thanks! this is really clear and easy to understand. although, what does the symbol ":?" mean? also..i didn't even know qr can prepare regex pattern.. I guess I'm too rusty in perl!!
      > what does the symbol ":?" mean

      it's (?:...) not :?

      see (like already mentioned) perlre#Extended-Patterns

      Cheers Rolf

      (addicted to the Perl Programming Language)

      (?:...) is used for non capturing parentheses. This is useful when you need to regroup a subpattern (for example for an alternation or a quantification), but are not interested in capturing the content in $1, $2, etc.
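
      A small sketch of the difference, using a file name from the sample data. The variable names are made up for this example.

```perl
use strict;
use warnings;

my $file = "test1.cpp";

# Capturing parens around the alternation: two groups, two results.
my @capturing = $file =~ /(.*\.(c|cpp|h))$/;

# Non-capturing (?:...): still groups the alternation, one result.
my @non_capturing = $file =~ /(.*\.(?:c|cpp|h))$/;

print scalar(@capturing), ": @capturing\n";          # 2: test1.cpp cpp
print scalar(@non_capturing), ": @non_capturing\n";  # 1: test1.cpp
```

      Both patterns match the same text; only the number of captured substrings differs.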
