PerlMonks
Optimising Lingua::EN::NamedEntity for Very Strings
by Anonymous Monk on May 08, 2006 at 02:11 UTC
I'm using some modules to parse my mbox files, but found that certain messages caused a CPU spike and hung the process. I narrowed the problem down to Lingua::EN::NamedEntity, which one of the modules uses internally. It was choking on a message with a large attachment, which, for the uninitiated, consists of many, many lines like B/cCltBeBOMyzktNthjoXjIHOCsJvMkKk2u1Tcjlo6mAiwJmhwN6FT9iL... (I'd been removing the attachments before passing messages to Lingua::EN::NamedEntity, but that one was corrupted, so it remained inline).

It strikes me that Lingua::EN::NamedEntity could be modified to better handle garbage input such as this, but I'm not sure of the best approach. Strings over a certain length just aren't useful for entity extraction, IMO. Any suggestions so I can send the maintainer a patch?
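To make the "strings over a certain length" idea concrete, here is a minimal sketch of a pre-filter that could run before the text reaches the entity extractor. It is only one possible approach, not how Lingua::EN::NamedEntity currently behaves. The helper name strip_long_tokens and the 40-character threshold are assumptions made up for illustration; base64 attachment lines are normally far longer than any English word, so the exact cutoff should not matter much.

```perl
use strict;
use warnings;

# Hypothetical threshold: a whitespace-free run longer than this is
# assumed to be encoded data rather than an English word. Tune as needed.
my $MAX_TOKEN_LENGTH = 40;

# Hypothetical helper (not part of Lingua::EN::NamedEntity): delete every
# run of non-whitespace characters longer than $max, leaving the rest of
# the text, including line breaks, untouched.
sub strip_long_tokens {
    my ($text, $max) = @_;
    $max = $MAX_TOKEN_LENGTH unless defined $max;
    my $limit = $max + 1;                 # shortest run that gets removed
    (my $clean = $text) =~ s/\S{$limit,}//g;
    return $clean;
}

# Intended use, assuming the module's exported extract_entities():
#   use Lingua::EN::NamedEntity;
#   my @entities = extract_entities( strip_long_tokens($message_body) );
```

Removing only the overlong tokens, rather than whole lines, keeps any ordinary prose that happens to share a line with a long URL or similar. The same check could presumably live inside the module itself, which is what a patch to the maintainer would do.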