http://www.perlmonks.org?node_id=881971

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy bros. I have an application where I need to search through unstructured text and output anything that looks like a date. Is there an existing Perl solution for this? It seems like I have seen one in the past, but after about 20 min of searching I can't locate anything.

If there isn't one, does anyone have advice on how to approach the task? I guess a person could write a regex for it, but it would be pretty hairy given all the different ways a date can be expressed.

TIA

Steve

Re: Finding dates in unstructured text
by toolic (Bishop) on Jan 12, 2011 at 20:04 UTC
      D'oh! Of course I did all my searching before writing the thing about doing it with a regex so I didn't search for THAT :-( Thanks for the link; I shall investigate.
Re: Finding dates in unstructured text
by ambrus (Abbot) on Jan 12, 2011 at 20:03 UTC
      I suspect most people assume the new skill being illustrated is Perl regular expressions, but it's really speed typing code with one hand. :)

        On the US-English layout, I can type backslash, parenthesis, vertical bar, splat, question mark, colon, plus, brackets all with one hand, so regular expressions are well suited to one-hand typing.

        Bet it fails to find "jeudi 13 janvier" :-)
Re: Finding dates in unstructured text
by philipbailey (Curate) on Jan 12, 2011 at 21:10 UTC

    Date::Manip is good at parsing arbitrarily formatted dates, if its performance is good enough for your data volumes.

    But you should think hard about whether you trust the results, whichever parser you use to generate them. We once had a daily business report that our users loaded into Excel, which kindly parsed the value in one field, let us say "MAR6", as a date, when it actually represented something else altogether.

Re: Finding dates in unstructured text
by chrestomanci (Priest) on Jan 12, 2011 at 23:14 UTC

    Care to be more specific on what looks like a date?

    20110112 ?

    jan twelve ?

    12 Janvier ?

    XII 1 MMXI ?

    OK, the last couple were a bit silly, but they illustrate the point. If you have a clear idea of what a date looks like, then a series of regular expressions is probably the way to go.

    If not, then I would start by training a Bayesian classifier (e.g. Algorithm::NaiveBayes) to find the candidate bits of text, and then use those as examples from which to write regular expressions.
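For the "series of regular expressions" route, a minimal sketch might look like the following. The pattern list, the `$month` helper and the `find_dates` name are all assumptions made up for illustration; they cover only a few English and numeric formats and would need extending for real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: English month names and common abbreviations.
my $month = qr/(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)/i;

# Hypothetical pattern set: each entry encodes one narrow idea of
# "what a date looks like". None of them validates the calendar.
my @date_patterns = (
    qr/\b\d{4}-\d{2}-\d{2}\b/,                                        # 2011-01-12
    qr/\b(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])\b/,   # 20110112
    qr/\b\d{1,2}\/\d{1,2}\/(?:\d{4}|\d{2})\b/,                        # 1/12/2011
    qr/\b$month\.?\s+\d{1,2}(?:,\s*\d{4})?\b/,                        # Jan 12, 2011
    qr/\b\d{1,2}\s+$month\.?(?:\s+\d{4})?\b/,                         # 12 January 2011
);

# Return every substring matched by any pattern, in order of appearance.
sub find_dates {
    my ($text) = @_;
    my @hits;
    for my $re (@date_patterns) {
        while ($text =~ /($re)/g) {
            push @hits, [ $-[1], $1 ];    # [ start offset, matched text ]
        }
    }
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @hits;
}

my $sample = 'Meeting on Jan 12, 2011, invoice 20110112, '
           . 'shipped 12 January 2011; ref MAR6.';
print "$_\n" for find_dates($sample);
```

Note that a pattern like the month-then-day one will also fire on text such as "may 12 times", which is exactly the kind of false positive that has to be weeded out against real data.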

      If not, then I would start by training a Bayesian classifier (e.g. Algorithm::NaiveBayes) to find the candidate bits of text, and then use those as examples from which to write regular expressions.
      Actually, this would be a bad idea for at least two reasons: First, you have to segment the text before determining whether or not the segments are dates. Second, you have to have labeled data to train a classifier. A better approach would be to look through your data by hand and generalize to create a set of regular expressions (or, more generally, date-identifying functions). Once you have some of these, run them on more of your data, and refine them to include dates that they missed, and to exclude non-dates that they picked up. Keep doing this until you get the performance you need.
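
The run-and-refine loop described above could be supported by a small checking script along these lines. It is only a sketch: the sample texts, the starting pattern, and the `find_dates` and `evaluate` names are invented for illustration, and the "expected" dates stand in for whatever you marked while reading through your own data by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical starting point: one deliberately narrow pattern.
my @patterns = ( qr/\b\d{4}-\d{2}-\d{2}\b/ );

# Invented sample: each text plus the dates a human found in it.
my @checked = (
    { text => 'Due 2011-01-12, see ticket 1234-56-78.', dates => ['2011-01-12'] },
    { text => 'Posted on 12 Jan 2011.',                 dates => ['12 Jan 2011'] },
);

# Collect the distinct substrings matched by any of the given patterns.
sub find_dates {
    my ($text, @res) = @_;
    my (%seen, @hits);
    for my $re (@res) {
        while ($text =~ /($re)/g) {
            push @hits, $1 unless $seen{$1}++;
        }
    }
    return @hits;
}

# Compare matches against the hand-marked dates and report
# what was missed and what was picked up wrongly.
sub evaluate {
    my ($samples, @res) = @_;
    my (@missed, @spurious);
    for my $item (@$samples) {
        my %want = map { $_ => 1 } @{ $item->{dates} };
        my %got  = map { $_ => 1 } find_dates($item->{text}, @res);
        push @missed,   grep { !$got{$_} }  sort keys %want;
        push @spurious, grep { !$want{$_} } sort keys %got;
    }
    return (\@missed, \@spurious);
}

my ($missed, $spurious) = evaluate(\@checked, @patterns);
print "missed:   @$missed\n";
print "spurious: @$spurious\n";
```

Each round, the missed list suggests a pattern to add and the spurious list suggests one to tighten; rerun after every change until both lists are short enough for your purposes.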