

by Anonymous Monk
on Jun 21, 2000 at 17:11 UTC ( #19228=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

After putting brackets around the entire match, it eliminated the problem of printing everything else on the line; however, it isn't printing the entire date match.
    $dir='C:/texts/';
    opendir(directory,$dir) or die "cant";
    while($file=readdir directory){
        next if $file=~/^\./;
        $rfname=$dir.$file;
        # print "Found file: '$rfname'\n";
        open (CONT, $rfname);
        while (<CONT>){
            if($_=~m/([0-3]?[0-9(th)?(st)?(nd)?(rd)?]\s+Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?\s+[0-9]?[0-9]?[0-9][0-9])/ig){
                print "$file\t $1\n";
            }
            elsif($_=~m/(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?\s+[1-3]?[0-9](th)?(nd)?(st)?(rd)?\s+[0-9]?[0-9]?[0-9][0-9])/ig){
                print "$file\t $1\n";
            }
        }
    }
If someone can solve this problem, please help.

Replies are listed 'Best First'.
Re: $1
by jjhorner (Hermit) on Jun 21, 2000 at 17:28 UTC

    You should really pick up a few O'Reilly perl books.

    From Programming Perl, by O'Reilly & Associates, the online edition:

    The fine print

    As mentioned above, \1, \2, \3, and so on, are equivalent to whatever the corresponding set of parentheses matched, counting opening parentheses from left to right. (If the particular pair of parentheses had a quantifier such as * after it, such that it matched a series of substrings, then only the last match counts as the backreference.) Note that such a backreference matches whatever actually matched for the subpattern in the string being examined; it's not just a shorthand for the rules of that subpattern. Therefore, (0|0x)\d*\s\1\d* will match "0x1234 0x4321", but not "0x1234 01234", since subpattern 1 actually matched "0x", even though the rule 0|0x could potentially match the leading 0 in the second number.

    Outside of the pattern (in particular, in the replacement of a substitution operator) you can continue to refer to backreferences by using $ instead of \ in front of the number. The variables $1, $2, $3 ... are automatically localized, and their scope (and that of $`, $&, and $' below) extends to the end of the enclosing block or eval string, or to the next successful pattern match, whichever comes first. (The \1 notation sometimes works outside the current pattern, but should not be relied upon.)

    $+ returns whatever the last bracket match matched.
    $& returns the entire matched string.
    $` returns everything before the matched string. [24]
    $' returns everything after the matched string.

    For more explanation of these magical variables (and for a way to write them in English), see the section "Special Variables" at the end of this chapter.

    [24] In the case of something like s/pattern/length($`)/eg, which does multiple replacements if the pattern occurs multiple times, the value of $` does not include any modifications done by previous replacement iterations. To get the other effect, say:

    1 while s/pattern/length($`)/e;

    For example, to change all tabs to the corresponding number of spaces, you could say:

    1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

    You may have as many parentheses as you wish. If you have more than nine pairs, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, and so on, refer back to substrings if there have been at least that many left parentheses before the backreference. Otherwise (for backward compatibility) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)


    s/^([^ ]+) +([^ ]+)/$2 $1/;   # swap first two words

    /(\w+)\s*=\s*\1/;             # match "foo = foo"

    /.{80,}/;                     # match line of at least 80 chars

    /^(\d+\.?\d*|\.\d+)$/;        # match valid number

    if (/Time: (..):(..):(..)/) { # pull fields out of a line
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    Hint: instead of writing patterns like /(...)(..)(.....)/, use the unpack function. It's more efficient.

    A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary.)

    Normally, the ^ character is guaranteed to match only at the beginning of the string, the $ character only at the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by ^ or $. However, you may wish to treat a string as a multi-line buffer, such that the ^ will also match after any newline within the string, and $ will also match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice is now deprecated.) \A and \Z are just like ^ and $ except that they won't match multiple times when the /m modifier is used, while ^ and $ will match at every internal line boundary. To match the actual end of the string, not ignoring newline, you can use \Z(?!\n). There's an example of a negative lookahead assertion.
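As a rough illustration of the anchor behaviour described in the passage above, the following sketch counts matches with and without /m. The buffer contents and variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical two-line buffer, as if two lines had been read into one string.
my $buf = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $buf =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $buf =~ /^(\w+)/mg;

# \Z ignores /m and matches only at the end of the string
# (or just before a final newline); $ with /m matches at every line end.
my $z_count      = () = $buf =~ /line\Z/mg;
my $dollar_count = () = $buf =~ /line$/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
print "\\Z matches: $z_count, \$ matches: $dollar_count\n";
```

The script in the question reads one line at a time, so /m is not needed there; it only matters once a whole file is slurped into a single string.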

    To facilitate multi-line substitutions, the . character never matches a newline unless you use the /s modifier, which tells Perl to pretend the string is a single line - even if it isn't. (The /s modifier also overrides the setting of $*, in case you have some (badly behaved) older code that sets it in another module.) In particular, the following leaves a newline on the $_ string:

    $_ = <STDIN>;
    s/.*(some_string).*/$1/;

    If the newline is unwanted, use any of these:

    s/.*(some_string).*/$1/s;
    s/.*(some_string).*\n/$1/;
    s/.*(some_string)[^\0]*/$1/;
    s/.*(some_string)(.|\n)*/$1/;
    chop; s/.*(some_string).*/$1/;
    /(some_string)/ && ($_ = $1);

    Note that all backslashed metacharacters in Perl are alphanumeric, such as \b, \w, and \n. Unlike some regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This makes it simple to quote a string that you want to use for a pattern but that you are afraid might contain metacharacters. Just quote all the non-alphanumeric characters:

    $pattern =~ s/(\W)/\\$1/g;

    You can also use the built-in quotemeta function to do this. An even easier way to quote metacharacters right in the match operator is to say:

    /$unquoted\Q$quoted\E$unquoted/
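A minimal sketch of the two quoting approaches the passage mentions, quotemeta and \Q...\E, compared with interpolating the string unquoted. The file name and log line are hypothetical values chosen because they contain metacharacters.

```perl
use strict;
use warnings;

# Hypothetical literal text containing regex metacharacters (+, parens, dot).
my $literal = 'C:/texts/a+b(1).txt';
my $line    = 'opened C:/texts/a+b(1).txt ok';

# quotemeta backslashes every non-word character.
my $quoted = quotemeta $literal;

print "quotemeta: ", ($line =~ /$quoted/        ? 'match' : 'no match'), "\n";
print "\\Q..\\E:    ", ($line =~ /\Q$literal\E/ ? 'match' : 'no match'), "\n";

# Unquoted, "a+" means "one or more a" and "(1)" is a group,
# so the pattern no longer describes the literal text.
print "unquoted:  ", ($line =~ /$literal/       ? 'match' : 'no match'), "\n";
```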


    Remember that the first and last alternatives (before the first | and after the last one) tend to gobble up the other elements of the regular expression on either side, out to the ends of the expression, unless there are enclosing parentheses. A common mistake is to ask for:

    /^fee|fie|foe$/


    when you really mean:

    /^(fee|fie|foe)$/


    The first matches "fee" at the beginning of the string, or "fie" anywhere, or "foe" at the end of the string. The second matches any string consisting solely of "fee" or "fie" or "foe".
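To make the alternation rule above concrete, here is a small sketch comparing the two forms on a few made-up strings, followed by a cut-down date pattern. The cut-down pattern is hypothetical, but it has the same shape as the one in the question.

```perl
use strict;
use warnings;

# Unparenthesised: ^ belongs only to "fee" and $ only to "foe".
my $loose  = qr/^fee|fie|foe$/;
# Parenthesised: both anchors apply to every alternative.
my $strict = qr/^(fee|fie|foe)$/;

for my $word ('fee', 'feeble', 'a fie b', 'big foe', 'fum') {
    printf "%-8s loose=%-3s strict=%s\n",
        $word,
        ($word =~ $loose  ? 'yes' : 'no'),
        ($word =~ $strict ? 'yes' : 'no');
}

# Same precedence problem in a simplified date pattern: the day is tied
# to "Jan" only and the year to "Mar" only, so "Feb" matches by itself.
my ($got) = '23 Feb 63' =~ /(\d+\s+Jan|Feb|Mar\s+\d+)/;
print "captured: '$got'\n";
```

In the question's regex the day is likewise attached only to the Jan alternative and the year only to the Dec alternative, which would explain why only part of the date ends up in $1.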

    You should be able to use this to find your solution.

    NOTE: This was reproduced without permission, and if someone doesn't like it (someone official, that is), I will happily remove it.

    J. J. Horner
    Linux, Perl, Apache, Stronghold, Unix
Re: $1
by Perl-chick (Initiate) on Jun 21, 2000 at 17:29 UTC
    Well, this is strange... for the line "23 feb 63" it printed "23 feb", and for the line "23 mar 45" it printed "mar". That has me stumped!
Re: $1
by Odud (Pilgrim) on Jun 21, 2000 at 17:42 UTC
    You have misunderstood what

    [0-9(th)?(st)?(nd)?(rd)?]

    is doing. It says "any one of these characters", so if you say 20th June then it is matching only "h" and not "20th". I guess what you meant was something like

    [0-9](?:th|st|nd|rd)?

    With a complicated regular expression it is worth using /x and putting lots of white space in so that you can easily see what is going on.
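The difference between the character class and a grouped suffix can be checked with a short sketch like the one below. The input string is the "20th June" case from the explanation above; the variable names are made up.

```perl
use strict;
use warnings;

my $text = '20th June';

# Inside [...], the letters, parentheses and ? are just members of the set,
# so the class matches exactly one character before the whitespace.
my ($class) = $text =~ /([0-9(th)?(st)?(nd)?(rd)?]\s+)/;

# Digits followed by an optional grouped suffix.
my ($group) = $text =~ /([0-9]+(?:th|st|nd|rd)?\s+)/;

print "class: '$class'\n";
print "group: '$group'\n";
```

The first capture holds only the "h" and the following space, while the second holds the whole "20th " prefix.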
      I should have given an example of /x, here's how my version of your code that I used to try out my solution looks:

      while (<>){
          if(/(
              [0-3]?[0-9](?:th|st|nd|rd)?\s+
              (?:Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|
              Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+
              [0-9]?[0-9]?[0-9][0-9])
              /ix) {
              print "$1\n";
          }
      }

        I like that regex more, but there are still a few minor changes I would make.
        while (<>){
            if(/
                (
                [0-3]?[0-9](?:th|st|nd|rd)?\s+   # Get day
                (?:                              # Get month
                    Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|
                    Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|
                    Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|
                    Nov(?:ember)?|Dec(?:ember)?
                )\s+
                [0-9]{2,4}                       # Get Year
                )
                /ix) {
                print "$1\n";
            }
        }
        The (?:) things are just like normal ()'s except that they don't capture to a variable; they only group, which helps the interpreter optimize the regex.

        The {x,y} means require at least x of the previous item, and at most y. For reference, you can also say {x}, which means require exactly x of them, or {x,}, which means at least x.
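A small sketch of both points, using a made-up input line: capturing parentheses each take a slot in the returned list, (?:...) does not, and {2,4} stops after four digits.

```perl
use strict;
use warnings;

my $line = 'Released 21 Jun 2000';   # hypothetical input

# Every () captures, so the optional "e" takes a slot of its own.
my @with = $line =~ /(\d{1,2})\s+(Jun(e)?)\s+(\d{2,4})/;

# (?:e)? only groups, so just the three wanted fields come back.
my @without = $line =~ /(\d{1,2})\s+(Jun(?:e)?)\s+(\d{2,4})/;

print scalar(@with), " captures vs ", scalar(@without), " captures\n";
print "day=$without[0] month=$without[1] year=$without[2]\n";

# {2,4} takes at least two and at most four of the previous item.
my ($yr) = 'id 12345' =~ /(\d{2,4})/;
print "year-like prefix: $yr\n";
```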
Re: $1
by Ovid (Cardinal) on Jun 21, 2000 at 22:50 UTC
    We've got a lot of duplication of data (since there are two possible regexes here), so let's pull the duplicate data into variables and have just one regex that tests both conditions. Also, we're going to use the /o modifier to ensure that the regex is compiled only once and therefore runs faster.
    $month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?
              |Apr(?:il)?|May|Jun(?:e)?
              |Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?
              |Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+';
    $day   = '[0-3]?[0-9](?:th|st|nd|rd)?,?\s+'; # I tossed an optional comma (,?)
                                                 # in there in case we have June 1st, 2000
    $year  = '[0-9]{2,4}';

    opendir(DIRECTORY, $dir) or die "Can't open $dir: $!\n";
    while($file = readdir DIRECTORY){
        next if $file=~/^\./;
        $rfname=$dir . $file;
        open (CONT, $rfname);
        while (<CONT>){
            if(/(
                 (?:${day}${month}${year})
                 |
                 (?:${month}${day}${year})
                )
               /ixo) {
                print "$1\n";
            }
        }
    }
    Regexes are bad enough. If we can use variable interpolation in them, it makes them much easier to grok.
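One possible variation on the interpolation idea above is to build the pieces with qr// instead of plain strings, assuming a Perl that has qr// (5.005 or later). Each piece is then compiled once and carries its own flags, so /o is not needed. The helper name find_date and the \b anchors are additions for this sketch, not part of the code above.

```perl
use strict;
use warnings;

# Building blocks as precompiled regex objects; each keeps its own /i and /x.
my $month = qr/(?: Jan(?:uary)? | Feb(?:ruary)? | Mar(?:ch)?     | Apr(?:il)?
                 | May          | Jun(?:e)?     | Jul(?:y)?      | Aug(?:ust)?
                 | Sep(?:tember)? | Oct(?:ober)? | Nov(?:ember)? | Dec(?:ember)? )/ix;
my $day   = qr/[0-3]?[0-9](?:th|st|nd|rd)?/i;
my $year  = qr/[0-9]{2,4}/;

# Either "day month year" or "month day, year"; the outer parens hold both
# alternatives so that $1 is always the whole date.
my $date = qr/\b($day,?\s+$month\s+$year|$month\s+$day,?\s+$year)\b/;

# Hypothetical helper: returns the first date found in a line, or undef.
sub find_date {
    my ($line) = @_;
    return $line =~ $date ? $1 : undef;
}

for my $line ('seen on 23 feb 63 at noon', 'Due June 1st, 2000', 'no date here') {
    my $found = find_date($line);
    print defined $found ? "found: $found\n" : "nothing found\n";
}
```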
      But what about Y10k compliance!
Re: $1
by raflach (Pilgrim) on Jun 21, 2000 at 17:31 UTC
    $& may be more what you're looking for, though I'm not sure. With it you should be able to remove the all-enclosing parens and just use $&, which holds the entire matched string. You might also want to know about $` (everything before the matched string) and $' (everything after the matched string). Those last two add a lot of overhead, though, so I wouldn't use them unless absolutely necessary.
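For readers on a newer Perl, a hedged alternative to the variables above: since Perl 5.10 (which postdates this thread) the /p modifier makes ${^MATCH}, ${^PREMATCH} and ${^POSTMATCH} available for that one match. The sample line and the deliberately loose date pattern are assumptions for this sketch.

```perl
use strict;
use warnings;

my $line = 'Filed on 23 Feb 1963 by the clerk';   # hypothetical input

my ($match, $before, $after);

# /p asks Perl to keep the match pieces for this pattern only.
if ($line =~ /\d{1,2}\s+[A-Za-z]{3,9}\s+\d{2,4}/p) {
    ($match, $before, $after) = (${^MATCH}, ${^PREMATCH}, ${^POSTMATCH});
    print "matched: '$match'\n";
    print "before:  '$before'\n";
    print "after:   '$after'\n";
}
```

On the perls of the time, wrapping the whole pattern in parentheses and reading $1 achieves the same result as ${^MATCH}.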
Re: $1
by raflach (Pilgrim) on Jun 21, 2000 at 17:23 UTC
    What is it printing?
