Re: Finding dates in unstructured text
by toolic (Bishop) on Jan 12, 2011 at 20:04 UTC
| [reply] |
D'oh! Of course I did all my searching before writing the thing about doing it with a regex so I didn't search for THAT :-( Thanks for the link; I shall investigate.
| [reply] |
Re: Finding dates in unstructured text
by ambrus (Abbot) on Jan 12, 2011 at 20:03 UTC
| [reply] |
I suspect most people assume the new skill being illustrated is Perl regular expressions, but it's really speed typing code with one hand. :)
| [reply] |
On the US-English layout, I can type backslash, parenthesis, vertical bar, splat, question mark, colon, plus, brackets all with one hand, so regular expressions are well suited to one-hand typing.
| [reply] |
Bet it fails to find "jeudi 13 janvier" :-)
| [reply] |
Re: Finding dates in unstructured text
by philipbailey (Curate) on Jan 12, 2011 at 21:10 UTC
Date::Manip is good at parsing arbitrarily formatted dates, if its performance is good enough for your data volumes.
But you should think hard about whether you trust the results, whichever parser you use to generate them. We once had a daily business report that our users loaded into Excel, which kindly parsed a value in one field, let us say "MAR6", as a date, when it actually represented something else altogether.
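To make the "MAR6" hazard concrete, here is a minimal sketch (in Python, though the same patterns work in Perl). The pattern names and the helper looks_like_date are made up for illustration; they are not from Date::Manip or Excel. It only shows that a lenient pattern accepts such a value, while one that demands whitespace and a four-digit year does not.

```python
import re

# Hypothetical month list for illustration (English abbreviations only).
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Lenient: a month abbreviation followed (optionally after spaces) by a day.
LENIENT = re.compile(rf"\b(?:{MONTHS})\s*\d{{1,2}}\b", re.IGNORECASE)

# Stricter: month name, mandatory whitespace, day, and a four-digit year.
STRICT = re.compile(
    rf"\b(?:{MONTHS})[a-z]*\.?\s+\d{{1,2}},?\s+\d{{4}}\b", re.IGNORECASE
)


def looks_like_date(text, pattern):
    """Return True if the pattern finds a date-like substring in text."""
    return pattern.search(text) is not None
```

With these definitions, looks_like_date("MAR6", LENIENT) is true while looks_like_date("MAR6", STRICT) is false. The stricter pattern will of course miss genuine dates written without a year, so which trade-off is right depends on your data.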
| [reply] |
Re: Finding dates in unstructured text
by chrestomanci (Priest) on Jan 12, 2011 at 23:14 UTC
Care to be more specific on what looks like a date?
20110112 ?
jan twelve ?
12 Janvier ?
XII 1 MMXI ?
OK, the last couple were a bit silly, but they illustrate the point. If you have a clear idea of what a date looks like, then a series of regular expressions is probably the way to go.
If not, then I would start by training a Bayesian classifier, e.g. Algorithm::NaiveBayes, to find the bits of text that look like dates, and then use those as examples to write regular expressions from.
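As a rough sketch of the "series of regular expressions" route (written in Python here; the patterns port directly to Perl), something like the following could be a starting point. The pattern names, the PATTERNS table and find_dates are all hypothetical, and the formats covered are just my guess at what might turn up.

```python
import re

# Month prefix plus any trailing letters, so "Jan", "January" and even
# "Janvier" match. This is deliberately loose and will also accept
# non-months such as "maybe"; tighten it for real data.
MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"

# One named pattern per date shape; add or remove entries as needed.
PATTERNS = [
    ("compact_ymd", re.compile(
        r"\b(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])\b")),
    ("iso_ymd", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("day_month_year", re.compile(
        rf"\b\d{{1,2}}\s+{MONTH}\.?(?:\s+\d{{4}})?\b", re.IGNORECASE)),
    ("month_day_year", re.compile(
        rf"\b{MONTH}\.?\s+\d{{1,2}}(?:,?\s+\d{{4}})?\b", re.IGNORECASE)),
]


def find_dates(text):
    """Return (offset, pattern_name, matched_text) tuples in text order."""
    hits = []
    for name, pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), name, match.group(0)))
    return sorted(hits)
```

Run on "Shipped 20110112, invoiced 12 Janvier 2011, due Jan 20, 2011." this picks up all three dates, while "jan twelve" and "XII 1 MMXI" are not matched, since nothing here handles spelled-out or Roman numerals.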
| [reply] |
"If not, then I would start by training a Bayesian classifier, e.g. Algorithm::NaiveBayes, to find the bits of text, and then using them as examples to write regular expressions from."
Actually, this would be a bad idea for at least two reasons: First, you have to segment the text before determining whether or not the segments are dates. Second, you have to have labeled data to train a classifier. A better approach would be to look through your data by hand and generalize to create a set of regular expressions (or, more generally, date-identifying functions). Once you have some of these, run them on more of your data, and refine them to include dates that they missed, and to exclude non-dates that they picked up. Keep doing this until you get the performance you need.
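The refine-and-rerun loop above can be sketched as follows (Python, but nothing here is language-specific). The function evaluate and the sample data are hypothetical; the idea is just to keep a small hand-checked sample and, after each change to the patterns, list what was missed and what was wrongly picked up.

```python
import re


def evaluate(patterns, samples):
    """Check patterns against hand-checked (text, contains_date) pairs.

    Returns two lists: texts with a date that no pattern matched, and
    texts without a date that some pattern matched anyway.
    """
    missed, false_hits = [], []
    for text, contains_date in samples:
        matched = any(p.search(text) for p in patterns)
        if contains_date and not matched:
            missed.append(text)
        elif matched and not contains_date:
            false_hits.append(text)
    return missed, false_hits


# Hypothetical hand-checked sample.
SAMPLES = [
    ("due 2011-01-12", True),
    ("due 12 Jan 2011", True),
    ("part 1234-56-78", False),
]

# First attempt: any digits in a yyyy-mm-dd shape.
V1 = [re.compile(r"\b\d{4}-\d{2}-\d{2}\b")]

# Second attempt: plausible year, month and day ranges, plus a
# day/month-name/year form that V1 missed.
V2 = [
    re.compile(r"\b(?:19|20)\d{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b"),
    re.compile(r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"),
]
```

Here evaluate(V1, SAMPLES) reports "due 12 Jan 2011" as missed and "part 1234-56-78" as a false hit; evaluate(V2, SAMPLES) comes back clean. In practice you would keep adding lines from fresh data to the sample and repeat.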
| [reply] |