Finding dates in unstructured text

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Re: Finding dates in unstructured text by toolic (Bishop) on Jan 12, 2011 at 20:04 UTC
Maybe Regexp::Common::time. I found this using Super Search where title contains all of "regex", "date": ?node_id=3989;HIT=regex%20date;re=N --> Re: Calling macros in Perl
Re^2: Finding dates in unstructured text by cormanaz (Deacon) on Jan 12, 2011 at 20:11 UTC
D'oh! Of course I did all my searching before writing the thing about doing it with a regex so I didn't search for THAT :-( Thanks for the link; I shall investigate.
Re: Finding dates in unstructured text by ambrus (Abbot) on Jan 12, 2011 at 20:03 UTC
Oh no! The terrorist must have hidden a bomb in the stadium! But to find when it'll explode we'd have to search through 200 MB of emails looking for anything that looks like a date! Anyway, I heard Perl 6 may help you here, because it has a better way to write regular expressions. (Just kidding.)
Re^2: Finding dates in unstructured text by ikegami (Patriarch) on Jan 12, 2011 at 20:22 UTC
I suspect most people assume the new skill being illustrated is Perl regular expressions, but it's really speed typing code with one hand. :)
Re^3: Finding dates in unstructured text by ambrus (Abbot) on Jan 12, 2011 at 20:45 UTC
On the US-English layout, I can type backslash, parenthesis, vertical bar, splat, question mark, colon, plus, brackets all with one hand, so regular expressions are well suited to one-hand typing.
Re^3: Finding dates in unstructured text by DrHyde (Prior) on Jan 13, 2011 at 10:31 UTC
Bet it fails to find "jeudi 13 janvier" :-)
Re: Finding dates in unstructured text by philipbailey (Curate) on Jan 12, 2011 at 21:10 UTC
Date::Manip is good at parsing arbitrarily formatted dates, if its performance is good enough for your data volumes. But you should think hard about whether you trust the results, whichever parser you use to generate them. We once had a daily business report which our users loaded into Excel, which kindly parsed the value in some field, let us say "MAR6", as a date, when it actually represented something else altogether.
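One way to act on the "do you trust the results" caution is to run every parsed candidate through a plausibility filter before accepting it. The following is only a sketch under stated assumptions: `plausible_date` and its year-range arguments are hypothetical names, not part of Date::Manip or any other module, and the caller is assumed to have already split the parser's answer into year, month and day.

```perl
use strict;
use warnings;

# Hypothetical filter (not from any CPAN module): accept a parsed
# (year, month, day) only if it is a real calendar date and the year lies
# inside the range the data could plausibly contain.
sub plausible_date {
    my ($y, $m, $d, $min_year, $max_year) = @_;

    # A token such as "MAR6" carries no year; refuse to guess one.
    return 0 unless defined $y && defined $m && defined $d;

    return 0 if $y < $min_year || $y > $max_year;
    return 0 if $m < 1 || $m > 12 || $d < 1;

    # Days per month, with the Gregorian leap-year rule for February.
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    $days[1] = 29 if ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;

    return $d <= $days[$m - 1] ? 1 : 0;
}
```

A filter like this does not make a parser right, but it turns "the parser returned something" into "the parser returned something that could actually be a date in this data set", which is a cheaper thing to spot-check by hand.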
Re: Finding dates in unstructured text by chrestomanci (Priest) on Jan 12, 2011 at 23:14 UTC
Care to be more specific on what looks like a date? 20110112 ? jan twelve ? 12 Janvier ? XII 1 MMXI ? OK, the last couple were a bit silly, but they illustrate the point. If you have a clear idea of what a date looks like, then a series of regular expressions is probably the way to go. If not, then I would start by training a Bayesian classifier, e.g. Algorithm::NaiveBayes, to find the bits of text, and then using them as examples to write regular expressions from.
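A minimal sketch of the "series of regular expressions" idea, using only core Perl. The pattern names, the `find_dates` helper and the particular formats covered (English month names, years starting 19 or 20 for the compact form) are assumptions made for illustration; a real data set would need its own list.

```perl
use strict;
use warnings;

# English month names and common abbreviations (an assumption; "12 Janvier"
# would need its own list).
my $MONTH = qr/(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)/i;

# Hypothetical pattern table: one entry per notion of "looks like a date".
my @DATE_PATTERNS = (
    [ compact_ymd    => qr/\b(?:19|20)\d\d(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])\b/ ],
    [ iso_ymd        => qr/\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b/ ],
    [ month_day_year => qr/\b$MONTH\.?\s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4}\b/ ],
    [ day_month_year => qr/\b\d{1,2}\s+$MONTH\.?,?\s+\d{4}\b/ ],
);

# Return [pattern name, matched text, offset] for every hit, in text order.
sub find_dates {
    my ($text) = @_;
    my @hits;
    for my $entry (@DATE_PATTERNS) {
        my ($name, $re) = @$entry;
        while ($text =~ /($re)/g) {
            push @hits, [ $name, $1, $-[1] ];
        }
    }
    @hits = sort { $a->[2] <=> $b->[2] } @hits;
    return @hits;
}

# Usage: print each candidate with the pattern that found it.
for my $hit (find_dates('Sent Jan 12, 2011; ref 20110112.')) {
    printf "%-15s %s (offset %d)\n", @$hit;
}
```

Keeping the patterns in a named table makes it easy to see which rule produced a given hit, and to add or tighten one rule without touching the others.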
Re^2: Finding dates in unstructured text by educated_foo (Vicar) on Jan 12, 2011 at 23:58 UTC
"If not, then I would start by training a Bayesian classifier, eg: Algorithm::NaiveBayes to find the bits of text, and then using them as examples to write regular expressions from."
Actually, this would be a bad idea for at least two reasons: First, you have to segment the text before determining whether or not the segments are dates. Second, you have to have labeled data to train a classifier. A better approach would be to look through your data by hand and generalize to create a set of regular expressions (or, more generally, date-identifying functions). Once you have some of these, run them on more of your data, and refine them to include dates that they missed, and to exclude non-dates that they picked up. Keep doing this until you get the performance you need.
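The run-and-refine loop described above can be sketched as a small evaluation harness. Everything here is an assumption for illustration: the `extract` and `evaluate` helpers, the sample format (a text plus the strings a human marked as dates), and the two rounds of patterns.

```perl
use strict;
use warnings;

# Run every pattern over $text and return the distinct matched strings.
sub extract {
    my ($patterns, $text) = @_;
    my %seen;
    for my $re (@$patterns) {
        $seen{$1}++ while $text =~ /($re)/g;
    }
    my @found = sort keys %seen;
    return @found;
}

# Compare the patterns against hand-checked samples. Each sample is
# { text => ..., dates => [ strings a human marked as dates ] }.
sub evaluate {
    my ($patterns, $samples) = @_;
    my (@missed, @spurious);
    for my $s (@$samples) {
        my %want = map { $_ => 1 } @{ $s->{dates} };
        my %got  = map { $_ => 1 } extract($patterns, $s->{text});
        push @missed,   grep { !$got{$_} }  sort keys %want;   # dates not found
        push @spurious, grep { !$want{$_} } sort keys %got;    # non-dates found
    }
    return { missed => \@missed, spurious => \@spurious };
}

my @samples = (
    { text => 'Shipped 2011-01-12, invoice 4821-07-19933', dates => ['2011-01-12'] },
    { text => 'Call me on 12 Jan 2011',                    dates => ['12 Jan 2011'] },
);

# Round 1 is deliberately naive; round 2 is the refinement after looking at
# what round 1 missed and what it picked up by mistake.
my @round1 = ( qr/\d{4}-\d{2}-\d{2}/ );
my @round2 = ( qr/\b\d{4}-\d{2}-\d{2}\b/,
               qr/\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b/ );

for my $round (\@round1, \@round2) {
    my $r = evaluate($round, \@samples);
    printf "missed: [%s]  spurious: [%s]\n",
        join(', ', @{ $r->{missed} }), join(', ', @{ $r->{spurious} });
}
```

The hand-checked samples double as a regression suite: every time a pattern is tightened to exclude a non-date, rerunning `evaluate` shows whether a previously found date has been lost.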