Re: how to extract string by possible groupings?

There are six things obviously wrong with your regex:

\s matches a single whitespace character, but as far as I can tell from your sample input, there could be multiple spaces between the columns. \s should be written \s+.
You have included the \s+ inside the parens, meaning that the white spaces separating the columns are part of the data you're trying to capture (in other words, $match[0] won't be "test1.cpp", it will actually be "test1.cpp ", and likewise $match[1] will have trailing spaces).
A percent sign doesn't carry any special meaning inside regular expressions, and thus it doesn't need to be escaped.
You use the /g modifier even though you don't need it.
Your grouping and capturing is a little off, and way too complex.
A good practice is DRY, or Don't Repeat Yourself. A good way to adhere to the DRY principle is to generalize stuff as much as possible. You violate this principle, though.

Regarding grouping and capturing, remember that every pair of parens inside a regex creates a capturing group, and captured substrings are returned in order of appearance (added: as LanX++ beautifully illustrated). Consider the following snippet:

$string = "foo bar";
@match = $string =~ m/(f(oo)) (b(ar))/
print "$match[0]\n";    # prints "foo" (captured by /(f(oo))/
print "$match[1]\n";    # prints "oo"  (captured by /(oo)/
print "$match[2]\n";    # prints "bar" (captured by /(b(ar))/
print "$match[3]\n";    # prints "ar" (captured by  /(ar)/
[download]

Likewise, you seem to think that your @match variable will contain three elements, but as a matter of fact it will contain 8 (eight!): one for every pair of parens in your regex, some of which only surround non-data such as the word "of" or just whitespace \s+.

Don't believe me? Do me a favour and run this snippet (in which I only fixed the \s vs \s+ issue)

use Data::Dumper;

while (chomp(my $line = <DATA>)) {
    @match =  $line =~ m/((.*\.c\s+)|(.*\.h\s+)|(.*\.cpp\s+))|(\s+(.*)
+\%\s+(of)\s+\d+\s)|(\bNone\b)/;
    print "$line\n";
    print Dumper \@match;
    
} 
    
__DATA__
Title               Percent2   Percent3
test1.cpp     0.00% of 21     0.00% of 16
test2.c     None   16.53% of 484
test3.h         0.00% of 138    None
[download]

The output I get:

[... snip ...]
test1.cpp     0.00% of 21     0.00% of 16
$VAR1 = [
          'test1.cpp     ',
          undef,
          undef,
          'test1.cpp     ',
          undef,
          undef,
          undef,
          undef
[... snip ...]
[download]

This neatly demonstrates at least three things:

You've captured the filename twice (once because of the outer group, once because of the extension-specific group for .cpp).
The matched file name includes the trailing white space, which I don't think is part of the filename anyway.
Your @match array contains way more elements than you think it does - nearly three times as much!

As for the DRY principle, you violate this for example in the chunk of the regex where you try to capture the file names. What you have written is: "match any number of characters, a literal period, a literal 'c', white space; OR match any number of characters, a literal period, a literal 'cpp', white space space; OR match any (...)" I'm sure you get the pattern.

The way I would have written it, would read as: "match any number of characters, a literal period, one of these literal strings ('c', 'cpp', 'h'), whitespace."

/(.*\.(?:c|cpp|h))\s+/   # Use (?:...) to create a non-capturing group
+.
[download]

The readability of your script could use some work too. Here's how I would've written it:

# I always start my script with these two lines.
# They prevent you from making various mistakes
# and make debugging a whole lot easier.
use strict;
use warnings;

# Regular expressions have the tendency to become long
# strings of near-undecipherable line noise. To avoid
# that, I usually like to split them up in smaller
# logical chunks.
# In this case, I'd write one regex to capture the
# file names and one regex to capture percentages.
my $title_re = qr/.*\.(?:c|cpp|h)/;
my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;

# Next thing is to combine them into a single
# regex to match the input against.
# I use the /x modifier so that I can use
# white space and comments inside the tegex.
my $line_re = qr/
        ($title_re) \s+     # Match and capture file names, match whit
+espace
        ($percent_re) \s+   # Match and capture Percent2, match non-da
+ta
        ($percent_re)       # Match and capture Percent3
    /x;
        
<DATA>; # Read and discard the first line, as this contains non-data.

# Read input line by line, cut off newline
# characters from the end. 
while (my $line = <DATA>) {
    chomp $line;
    
    # Match input against the regex, capture
    # the stuff into separate variables.
    # I mean, I find a "$title" much more
    # comprehensible than "$match[0]".
    my ($title, $percent2, $percent3) = $line =~ $line_re;
    print "$line\n";
    print "Title:    $title\n";
    print "Percent2: $percent2\n";
    print "Percent3: $percent3\n";
    print "\n";
    
    
} 
    
__DATA__
Title               Percent2   Percent3
test1.cpp     0.00% of 21     0.00% of 16
test2.c     None   16.53% of 484
test3.h         0.00% of 138    None
[download]

test1.cpp     0.00% of 21     0.00% of 16
Title:    test1.cpp
Percent2: 0.00% of 21
Percent3: 0.00% of 16
test2.c     None   16.53% of 484
Title:    test2.c
Percent2: None
Percent3: 16.53% of 484
test3.h         0.00% of 138    None
Title:    test3.h
Percent2: 0.00% of 138
Percent3: None

C:\Users\Lona\Desktop>perl x.pl
test1.cpp     0.00% of 21     0.00% of 16
Title:    test1.cpp
Percent2: 0.00% of 21
Percent3: 0.00% of 16

test2.c     None   16.53% of 484
Title:    test2.c
Percent2: None
Percent3: 16.53% of 484

test3.h         0.00% of 138    None
Title:    test3.h
Percent2: 0.00% of 138
Percent3: None
[download]

Comment on Re: how to extract string by possible groupings? Select or Download Code

Replies are listed 'Best First'.
Re^2: how to extract string by possible groupings? by Laurent_R (Canon) on Jun 02, 2014 at 16:54 UTC
I wish I could upvote more than once such a useful, detailed and complete post.	[reply]
Re^3: how to extract string by possible groupings? by muba (Priest) on Jun 02, 2014 at 18:45 UTC
As much as those warm words are appreciated, I do think I could've been even more complete by including links to relevant sections of the documentation, but I didn't feel like it ;)	[reply]
Re^2: how to extract string by possible groupings? by adrive (Scribe) on Jun 03, 2014 at 02:24 UTC
thanks! this is really clear and easy to understand. although, what does the symbol ":?" mean? also..i didn't even know qr can prepare regex pattern.. I guess I'm too rusty in perl!!	[reply]
Re^3: how to extract string by possible groupings? by LanX (Saint) on Jun 03, 2014 at 02:37 UTC
> what does the symbol ":?" mean its `(?:...)` not `:?` see (like already mentioned) `perlre#Extended-Patterns` Cheers Rolf (addicted to the Perl Programming Language)	[reply] [d/l] [select]
Re^3: how to extract string by possible groupings? by Laurent_R (Canon) on Jun 03, 2014 at 06:49 UTC
(?:...) is used for non capturing parentheses. This is useful when you need to regroup a subpattern (for example for an alternation or a quantification), but are not interested in capturing the content in $1, $2, etc.	[reply]