http://www.perlmonks.org?node_id=79923

stuffy has asked for the wisdom of the Perl Monks concerning the following question:

I have a date in this format:
Sun Apr 1 10:27:03 CDT 2001
I want to match the "Sun Apr 1" part of it. I tried to match using
/\w{3}\s+\w{3}\s+\d+/. The way I understand it, this should match a three-letter word, whitespace, another three-letter word, whitespace, and a number.
It doesn't work for me. On to the questions: is there an easier way to match the date, and why won't what I'm trying to do work?

Stuffy

Edit: chipmunk 2001-05-12

Re: regex-matching the date
by Trimbach (Curate) on May 12, 2001 at 16:44 UTC
    There are a couple of ways to do this, using either a regex or (if your dates are always well-formed) 'split'. If you use a regex you need to do two things: 1) Add a ^ character to anchor your regex to the beginning of your scalar. As it is, your regex will match ANYWHERE in the string, not just at the beginning. With the ^ character you restrict the match to the beginning of the string, which is what you want. 2) You'll probably need to add capturing parentheses around the parts of the regex you're interested in. A matching regex without capturing parentheses only returns "true" or "false" depending on whether a match is found. It does NOT return the match itself. (Well, it does, but only if you use some funny variables... not recommended.) Like this:
    #!/usr/bin/perl -w
    use strict;

    my $date = 'Sun Apr 1 10:27:03 CDT 2001';
    if ($date =~ m/^(\w{3}\s+\w{3}\s+\d+)/) {
        print "Matched $1\n";
    }
    Alternatively, you can use 'split' (this would be my choice.) Split will split the string up into space-divided chunks... so long as the order of the chunks doesn't change it's all good:
    my $date = 'Sun Apr 1 10:27:03 CDT 2001';
    my ($weekday, $month, $day) = split " ", $date;
    print "$weekday $month $day\n";
    perlfunc has more details on split if you're interested. Enjoy!
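As a minimal sketch of the two approaches described above (not Trimbach's own code): a match in list context returns its captured groups directly, and split accepts an optional limit. The variable names and the `\d{1,2}\b` day pattern are choices made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $date = 'Sun Apr 1 10:27:03 CDT 2001';

# In list context a successful match returns the captured groups,
# so each field lands in its own variable without going through $1..$3.
my ($weekday, $month, $day) = $date =~ /^(\w{3})\s+(\w{3})\s+(\d{1,2})\b/
    or die "Not a recognised date: $date\n";
print "$weekday $month $day\n";

# split ' ' with a limit of 4 stops after the third field;
# whatever is left of the string ends up in $rest.
my ($wd, $mon, $d, $rest) = split ' ', $date, 4;
print "$wd $mon $d\n";
```

Both forms give the same three fields for a well-formed line; the difference is that the regex refuses input that does not fit the pattern, while split takes whatever the first three whitespace-separated chunks happen to be.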

    Gary Blackburn
    Trained Killer

Re: regex-matching the date
by Albannach (Monsignor) on May 12, 2001 at 17:43 UTC
    The regular expression you choose for any situation depends a great deal on how much you can trust your data to follow a pattern. Just for some examples (untested but should serve to illustrate):

    • Your original attempt is vague (it matches 'ABCD xyz 123') but works better if modified slightly by adding an anchor to the front, or a word boundary if you don't want to be stuck to that position:

      /\b\w{3}\s+\w{3}\s+\d+/

    • If you are certain that the date always starts the line, then the split is certainly a nice option as Trimbach said.

    • If you want to be more certain that you get a real date, you could do something like:

      /(Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2}/i

      with the i option used if you can't trust the case of the letters. The alternations in this example can make it noticeably slower than a simpler pattern, however, so if you run it over a lot of lines, that may cause problems. This should find a valid date anywhere in the line (anchor it with ^ if you don't want that, as Trimbach said), but it will also match text that is not followed by the time, timezone and year, so you might want to extend the regex to match those too as an extra validity check. Even with all that specificity, this will still match "Wed Mar 98", which clearly isn't a date. To fix that, the numeric match could be changed to (0?[1-9]|[12][0-9]|3[01])\b (the trailing \b stops it from accepting just the "9" of "98"), but this is getting pretty messy!

    • For another more reasonable regex, but less precise, try:

      /[A-Z][a-z]{2}\s+[A-Z][a-z]{2}\s+\d{1,2}/

      or the slightly more specific but definitely funny looking

      /[SMTWF][uoehra][neduit]\s+[JFMASOND][aepuco][nbrylgtvc]\s+\d{1,2}/i

    So in summary, a regex will just match what you are telling it to look for (if present), which may very well not be a date. It may be wise to do a validation after the match, using something like Time::ParseDate, in which case you can choose a much simpler less-specific regex.
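To illustrate the "loose regex, then validate" idea, here is a rough sketch that checks the captured fields against lookup tables. It stands in for a real date parser such as Time::ParseDate and does not use that module; `find_date`, `%is_weekday` and `%days_in` are hypothetical names invented for this example.

```perl
use strict;
use warnings;

# Hypothetical lookup tables for this sketch. February allows 29
# because the year is not part of the matched text.
my %is_weekday = map { $_ => 1 } qw(Sun Mon Tue Wed Thu Fri Sat);
my %days_in = (
    Jan => 31, Feb => 29, Mar => 31, Apr => 30, May => 31, Jun => 30,
    Jul => 31, Aug => 31, Sep => 30, Oct => 31, Nov => 30, Dec => 31,
);

# Returns "Www Mmm D" for the first plausible date in $text, else undef.
sub find_date {
    my ($text) = @_;
    # Loose pattern first; /g lets the loop move on to later candidates.
    while ($text =~ /\b([A-Z][a-z]{2})\s+([A-Z][a-z]{2})\s+(\d{1,2})\b/g) {
        my ($wd, $mon, $day) = ($1, $2, $3);
        next unless $is_weekday{$wd} && exists $days_in{$mon};
        next unless $day >= 1 && $day <= $days_in{$mon};
        return "$wd $mon $day";
    }
    return undef;
}

for my $sample ('Sun Apr 1 10:27:03 CDT 2001', 'Wed Mar 98', 'ABCD xyz 123') {
    my $found = find_date($sample);
    print "$sample => ", (defined $found ? $found : 'no date'), "\n";
}
```

This keeps the regex readable and moves the fussy rules into ordinary code, at the cost of not checking that the weekday actually agrees with the calendar date.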

    --
    I'd like to be able to assign to an luser

      Correct me if I'm wrong: in order to use the split function, I need to use the regex to find where the date is in the file, place it into a variable, and then split it? Even so, I think that is something I will use for formatting purposes. I finally got it working. I found I was making a newbie mistake: I was testing for the match, but I wasn't assigning it to a variable. My question now is about how I assigned it to a variable.
      if (/(\w{3}\s+\w{3}\s+\d+)/) { $foo = $&; }

      How does this differ from using
      $foo = $1;

      If I am running through a long file, and $foo is changing frequently, will one work better than the other?

      By the way, I like the last solution you used. The date will always be in the same format, and I am pretty sure that there will never be anything else with the same pattern, but then again never say never.

      Thanks for all the help... I'm currently struggling through other regex problems, but so far I have worked most of them out on my own, which I prefer to do before asking the monks.

      Stuffy

        from perldoc perlre:
          WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program.
        In other words, you can use $& (I used to; I'm an ex-sed hacker), but it will slow things down and it's kind of unmaintainable. $1, $2, $3, et cetera are really shinier, happier codelets.
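A small sketch of how the two variables differ, for illustration only: $& holds everything the pattern matched, while $1 holds only what the first pair of parentheses captured. The sample line and variable names are made up for this example; the last assignment uses the @- and @+ offset arrays to recover the whole match without mentioning $&.

```perl
use strict;
use warnings;

my $line = 'logged at Sun Apr 1 10:27:03 CDT 2001';

my ($whole, $weekday, $same);
if ($line =~ /(\w{3})\s+\w{3}\s+\d+/) {
    $whole   = $&;   # everything the pattern matched
    $weekday = $1;   # only the first parenthesised group
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $same    = substr($line, $-[0], $+[0] - $-[0]);
}

print "whole match: $whole\n";
print "first group: $weekday\n";
print "via offsets: $same\n";
```

When the parentheses wrap the entire pattern, as in the question, $1 and $& contain the same text, so $1 gives the same result without the cost described in perlre.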

        brother dep.

        --
        Laziness, Impatience, Hubris, and Generosity.

        Brother dep has covered the downside of $&, but on your split question, the beauty there is that you don't need to match the (sometimes) complex target of your interest, just the separators that mark where your interest ends, and that's often a lot easier. In this case, if you're going to verify the date anyway, there is not much sense in going to great lengths to do that in the regex, so you can just split on whitespace instead.

        As you noted, split won't be able to find your dates at all. It is a great option if you are parsing some sort of log file in which the lines always start with that date format, but if you want to get that date out of the middle of a lot of other text, a specific regex would be my choice, and instead of split you can use $1 etc. to get your date components, like:

        if (/(\w{3})\s+(\w{3})\s+(\d+)/) {
            ($day, $month, $daynum) = ($1, $2, $3);
        }
        Finally, while we're talking about OWTDI, you might also consider unpack for jobs like this as it is usually faster, though it is even more fussy about the format of the data being consistent. It is however ideal for fixed-width columns of data (anyone else still dealing with data in card images?).
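For the unpack suggestion, a minimal sketch might look like the following. It assumes truly fixed-width input in which the day of the month is padded to two columns (note the two spaces before the 1), which is an assumption of this example and not something the original sample line shows.

```perl
use strict;
use warnings;

# Assumption: fixed-width columns, day of month padded to two characters.
my $line = 'Sun Apr  1 10:27:03 CDT 2001';

# A3 = three characters, x = skip one byte, A2 = two characters.
my ($weekday, $month, $daynum) = unpack 'A3 x A3 x A2', $line;
$daynum =~ s/^\s+//;   # "A" strips trailing spaces only, so drop the leading pad

print "$weekday $month $daynum\n";
```

If the columns can shift by even one character, the fields come out wrong without any error, which is why the regex or split versions are safer for loosely formatted input.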

        --
        I'd like to be able to assign to an luser

Re: regex-matching the date
by Chady (Priest) on May 12, 2001 at 11:46 UTC

    Looks OK to me... maybe you need to anchor it, as in /^\w{3}\s\w{3}\s\d/, or check out time if you are doing this based on the time the script is running.


    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/
Re: regex-matching the date
by Eureka_sg (Monk) on May 12, 2001 at 11:41 UTC

    Note that the '+' modifier is greedy so your regex will match the entire string

    You can use '?' to make it non-greedy  /(\w{3}\s+?\w{3}\s+?\d+)/ and the result will be stored in $1.

    UPDATE: Ignore this post.

Re: regex-matching the date
by stuffy (Monk) on May 12, 2001 at 11:28 UTC
    had a typo, my regex should be
    /\w{3}\s+\w{3}\s+\d+/



    Stuffy

      Hrmmm ... it seems to work for me. I'm not entirely sure in what context you are trying to use it. Try this:
      $var = "Sun Apr 1 10:27:03 CDT";
      if ($var =~ /(\w{3}\s+\w{3}\s+\d+)/) {
          print $1."\n";
      }
      This worked fine for me in my testing - Note the additional brackets around the regex to allow the result to be pulled from $1. If you are still having problems, post again with a bit more context as to where you are using this regex and someone with more experience than myself may be able to help further.