PerlMonks  

Re^2: My code seems messy...

by PerlGrrl (Sexton)
on May 14, 2006 at 12:09 UTC


in reply to Re: My code seems messy...
in thread My code seems messy...

My text files include dates in string format such as the following:
...the next social club meeting is on April 15, 1994...

...September 21-23, 1994 we will be hosting visitors...

...submissions should be made by 11 February 1994...

...Mail sent 7 Feb 1994...

...On the 16th September 1994 Mr X will be giving a talk on ...

...unconfirmed conference dates are March 4, 5 and 6, 1994...

(...is MY emphasis - for clarity's sake)
I have absolutely no knowledge of the surrounding boundaries, nor of where the strings can occur in the text; I just have to pull them all out so that I can then do some date normalisation. Also, I've found that some of my regexes sometimes match plain numeric strings...

Replies are listed 'Best First'.
Re^3: My code seems messy...
by wfsp (Abbot) on May 14, 2006 at 14:13 UTC
    Here's my attempt. I've used two regexes for the general cases of day month and month day.

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>){
      if (
        # month day
        /
          (
            [JFMASOND][a-z]{2,8}  # full or 3 letter month name
            \s
            [123]?\d              # day number
            (?:-[123]?\d)?        # optional dash and day number
            (?:,\s[123]?\d\s)*    # optional list of day numbers
            (?:and\s)?            # optional and
            (?:[123]?\d,?\s)?     # optional end of list
            (?:19|20)\d{2}        # year starting 19 or 20
          )
        /x
      )
      {
        print "1: $1\n";
      }
      elsif (
        # day month
        /
          (
            [123]?\d              # day number
            (?:st|nd|rd|th)?      # optional st, nd etc.
            \s+
            [JFMASOND][a-z]{2,8}  # month name
            \s+
            (?:19|20)\d{2}        # year starting 19 or 20
          )
        /x
      )
      {
        print "2: $1\n";
      }
    }
    __DATA__
    ...the next social club meeting is on April 15, 1994...
    ...September 21-23, 1994 we will be hosting visitors...
    ...submissions should be made by 11 February 1994...
    ...Mail sent 7 Feb 1994...
    ...On the 16th September 1994 Mr X will be giving a talk on ...
    ...unconfirmed conference dates are March 4, 5 and 6, 1994...
    output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    1: April 15, 1994
    1: September 21-23, 1994
    2: 11 February 1994
    2: 7 Feb 1994
    2: 16th September 1994
    1: March 4, 5 and 6, 1994
    > Terminated with exit code 0.
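A variation on the two-regex idea, offered as a rough sketch rather than a drop-in replacement: spelling the month names out as an explicit alternation means capitalised non-month words and bare numbers are not picked up, and matching with /g in list context returns every date on a line instead of only the first. The names `extract_dates`, `$month`, `$day`, `$year` and `$date_re` are made up for this illustration, and the pattern only covers the formats shown in the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month names as an explicit alternation (full names and 3-letter forms).
my $month = qr/(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?
               |Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?
               |Dec(?:ember)?)/x;
my $day   = qr/(?:[12]\d|3[01]|0?[1-9])/;   # 1..31, optional leading zero
my $year  = qr/(?:19|20)\d{2}/;             # assumption: years 1900-2099 only

my $date_re = qr/
    \b
    (
        $month \s+ $day            # month day
        (?: - $day )?              # optional range, e.g. 21-23
        (?: , \s* $day )*          # optional list of further days
        (?: \s+ and \s+ $day )?    # optional final day after the word and
        , \s* $year
      |
        $day (?:st|nd|rd|th)?      # day with optional ordinal suffix
        \s+ $month \s+ $year
    )
    \b
/x;

# Returns every date string found in $text (call it in list context).
sub extract_dates {
    my ($text) = @_;
    return $text =~ /$date_re/g;
}

my @lines = (
    '...the next social club meeting is on April 15, 1994...',
    '...Mail sent 7 Feb 1994...',
    '...unconfirmed conference dates are March 4, 5 and 6, 1994...',
);
print "$_\n" for map { extract_dates($_) } @lines;
```

Because the month must be one of the listed names and must sit next to a day and a four-digit year, a line such as "call 555 1994" yields nothing.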
Re^3: My code seems messy...
by Miguel (Friar) on May 14, 2006 at 15:56 UTC
    Here I start by checking whether there's a valid month name in the string. If there is, I extract the dates from the string; otherwise I skip that line.
    Note that I had to add a space after the first '...' because '...September' is not a valid month name.
    Updated to return valid dates
    #!/usr/bin/perl
    # filename: extract_dates.pl
    use strict;
    use warnings;
    use Data::Dumper;
    $Data::Dumper::Indent = 1;

    my @dates;
    push @dates, findMonth($_) while (<DATA>);
    print Dumper \@dates; # or do something else with your dates

    sub findMonth {
      my @words = split / /,shift;
      my %months = map {$_ => 1 } qw/
        january jan february feb march mar april apr may may june jun
        july jul august aug september sep october oct november nov december dec
      /;
      foreach (@words) {
        if (exists $months{lc($_)} ) {
          return extractDate( { MONTH => $_, STRING => "@words" } );
          last;
        }
      }
      return;
    }

    sub extractDate {
      my $self = shift;
      return makeValidDate( {STRING=>$self->{STRING},DATE=>$1} )
        if ($self->{STRING} =~ /
          (
            (?:
              ( [123]?\d (st|nd|rd|th)? \s+ )?
              $self->{MONTH}
              \s\d{1,4}
              (
                (-[123]?\d)?
                (,\s[123]?\d\s)*
                (and\s\d+)?
                (,\s\d{4})?
              )?
            )
          )
        /x );
      return;
    }

    sub makeValidDate {
      my $self = shift;
      my ($string) = $self->{STRING} =~/^(.+)$/;
      $self->{DATE} =~s/-/X/g;
      $self->{DATE} =~s/(^-|\W|and|th|st|nd|rd)/ /gi;
      my @date = split /\s+/,$self->{DATE};
      my $date = {};
      foreach (@date) {
        if ($_=~/^\d{1,2}$/) { push @{$date->{days}},$_ }
        elsif ($_=~/^\d{4}$/) { $date->{year} = $_ }
        elsif ($_=~/(\d+)X(\d+)/) { push @{$date->{days}},$1 .. $2 }
        else { $date->{month} = lc($_) }
      }
      use Date::Manip;
      my $out_date = {};
      foreach (@{$date->{days}}) {
        my $temp_date = ParseDate( $_ . " " . $date->{month} . " " . $date->{year} );
        $temp_date = &UnixDate($temp_date,"%D");
        push @{$out_date->{dates}},$temp_date;
        $out_date->{string} = $string;
      }
      return $out_date;
    }
    __DATA__
    ... the next social club meeting is on April 15, 1994...
    ... September 21-23, 1994 we will be hosting visitors...
    ... submissions should be made by 11 February 1994...
    ... Mail sent 7 Feb 1994...
    ... On the 16th September 1994 Mr X will be giving a talk on ...
    ... unconfirmed conference dates are March 4, 5 and 6, 1994...
    $VAR1 = [
      {
        'dates' => [
          '04/15/94'
        ],
        'string' => '... the next social club meeting is on April 15, 1994...'
      },
      {
        'dates' => [
          '09/21/94',
          '09/22/94',
          '09/23/94'
        ],
        'string' => '... September 21-23, 1994 we will be hosting visitors...'
      },
      {
        'dates' => [
          '02/11/94'
        ],
        'string' => '... submissions should be made by 11 February 1994...'
      },
      {
        'dates' => [
          '02/07/94'
        ],
        'string' => '... Mail sent 7 Feb 1994...'
      },
      {
        'dates' => [
          '09/16/94'
        ],
        'string' => '... On the 16th September 1994 Mr X will be giving a talk on ...'
      },
      {
        'dates' => [
          '03/04/94',
          '03/05/94',
          '03/06/94'
        ],
        'string' => '... unconfirmed conference dates are March 4, 5 and 6, 1994...'
      }
    ];
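On the note about '...September' not being a valid month name: one possible way around editing the data is to trim non-letter characters from each word before the hash lookup. This is only a sketch of that idea; `find_month` and `%is_month` are hypothetical names, not part of the script above, and the lookup shares its weakness that the ordinary word "may" counts as a month.

```perl
use strict;
use warnings;

# Lower-case month names, full and 3-letter forms.
my %is_month = map { $_ => 1 } qw(
    january february march april may june july august
    september october november december
    jan feb mar apr jun jul aug sep oct nov dec
);

# Returns the first month name found in $line, or nothing if there is none.
sub find_month {
    my ($line) = @_;
    for my $word (split ' ', $line) {
        # Trim leading and trailing non-letters, so '...September' and
        # 'February,' reduce to the bare month name.
        (my $bare = $word) =~ s/^[^A-Za-z]+|[^A-Za-z]+$//g;
        return $bare if exists $is_month{ lc $bare };
    }
    return;
}

for my $line (
    '...September 21-23, 1994 we will be hosting visitors...',
    '...Mail sent 7 Feb 1994...',
) {
    my $found = find_month($line);
    print defined $found ? "month: $found\n" : "no month found\n";
}
```

Splitting on `' '` (a string containing a single space, not the pattern `/ /`) also discards leading whitespace and the trailing newline, so the last word on a line is not polluted by "\n".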
