PerlMonks  

Optimising Lingua::EN::NamedEntity for Very Strings

by Anonymous Monk
on May 08, 2006 at 02:11 UTC ( [id://547951]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using some modules to parse my mbox files, but found that certain messages caused a CPU spike and made the process hang. I narrowed the problem down to Lingua::EN::NamedEntity, which one of the modules uses internally. It was choking on a message with a large attachment which, for the uninitiated, consists of many, many lines like:

B/cCltBeBOMyzktNthjoXjIHOCsJvMkKk2u1Tcjlo6mAiwJmhwN6FT9iL...

(I'd been removing the attachments before passing the messages to Lingua::EN::NamedEntity, but that one was corrupted, so it remained inline.)

It strikes me that Lingua::EN::NamedEntity could be modified to better handle garbage input such as this, but I'm not sure of the best approach. Strings over a certain length just aren't useful for entity extraction, IMO. Any suggestions so I can send the maintainer a patch?
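One possible stopgap, sketched here without touching the module itself, is to drop overlong whitespace-delimited tokens before the text ever reaches the extractor. The helper name strip_long_tokens and the default threshold of 64 are assumptions made for illustration and are not part of Lingua::EN::NamedEntity. Base64 lines in MIME messages are normally at most 76 characters wide, so the threshold would need to sit below that to catch them.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::NamedEntity):
# remove every run of non-whitespace characters longer than $max_len.
sub strip_long_tokens {
    my ($text, $max_len) = @_;
    $max_len = 64 unless defined $max_len;   # arbitrary illustrative default

    my $min = $max_len + 1;                  # shortest run that gets removed
    (my $clean = $text) =~ s/\S{$min,}//g;   # work on a copy, leave $text alone
    return $clean;
}

# Intended use (not run here):
#   my @entities = extract_entities( strip_long_tokens($message_body, 64) );
```

Ordinary words and names are far shorter than the threshold, so normal prose passes through unchanged and only the attachment-like runs are removed.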

Replies are listed 'Best First'.
Re: Optimising Lingua::EN::NamedEntity for Very Strings
by EdwardG (Vicar) on May 10, 2006 at 14:19 UTC

           Any suggestions so I can send the maintainer a patch?

    Perhaps parameterisation, as in

    use Lingua::EN::NamedEntity;
    my @entities = extract_entities($some_text, $max_string_length);

    or filter the output (though this would not solve your problem):

    my @entities = extract_entities($some_text, $max_entity_length);

    A reasonable default for either option might be 92 characters, which would accommodate a variant spelling of the name of a hill in my country of origin:

    Tetaumatawhakatangihangakoauaotamateaurehaeaturipukapihimaungahoronukupokaiwhenuaakitanarahu (link goes to image).
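    Until such a parameter exists, the output-filtering variant can be approximated outside the module. This is only a sketch: filter_entities_by_length is a hypothetical helper, and it assumes each returned item is a hash reference with an entity key holding the matched text. As noted above, it trims the results but does nothing for the CPU cost of extraction.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only entities whose text is at most $max_len
# characters long. Assumes each element is a hashref with an 'entity' key.
sub filter_entities_by_length {
    my ($max_len, @entities) = @_;
    return grep { length($_->{entity}) <= $max_len } @entities;
}

# Intended use with the module (not run here):
#   use Lingua::EN::NamedEntity;
#   my @kept = filter_entities_by_length(92, extract_entities($some_text));
```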
