PerlMonks  

Re^2: End of sentence regex excluding " i.e." and " e.g."

by jabowery (Beadle)
on Feb 06, 2017 at 18:05 UTC


in reply to Re: End of sentence regex excluding " i.e." and " e.g."
in thread End of sentence regex excluding " i.e." and " e.g."

In general, for a corpus like this, I'd split it into known good, known bad, and grey, and then use test-driven development in order to build out my filter.
That's what I'm already doing, but I got stuck at the early stage of handling just the " e.g." and " i.e." cases. I'm asking how to get unstuck so I can keep following that approach.
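For concreteness, here is a minimal sketch of that known good / known bad / grey setup using Test::More. The `split_sentences` helper, the separator regex, and all the sample sentences are assumptions made up for illustration, not taken from the real corpus or script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

# Assumed separator: sentence-ending punctuation plus whitespace,
# unless it directly follows "e.g" or "i.e".
my $sep = qr/(?<!e\.g)(?<!i\.e)[.!?]\s{1,2}(?=[A-Z0-9])/;

# Hypothetical helper: split text into pieces on $sep.
# split consumes the separator (punctuation and whitespace).
sub split_sentences {
    my ($text) = @_;
    return split $sep, $text;
}

# Known good: plain sentence boundaries that must split.
my @known_good = (
    [ 'It rained. We stayed in.' => 2 ],
    [ 'Really? Yes! Fine.'       => 3 ],
);

# Known bad: abbreviations that must not split.
my @known_bad = (
    [ 'Some bunnies are fluffy, e.g. Peter.' => 1 ],
    [ 'Use a pager, i.e. Less or more.'      => 1 ],
);

for my $case (@known_good, @known_bad) {
    my ($text, $expected) = @$case;
    my @parts = split_sentences($text);
    is scalar(@parts), $expected, "pieces for: $text";
}

# Grey: no agreed answer yet, so just show the result for review.
my @grey = ('See Fig. 3 for details.');
note join ' | ', split_sentences($_) for @grey;

done_testing();
```

Each time the regex changes, rerunning this shows immediately whether an " e.g." or " i.e." case regressed, and grey cases can be promoted to one of the other two lists once a decision is made.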
Re^3: End of sentence regex excluding " i.e." and " e.g."
by kennethk (Abbot) on Feb 06, 2017 at 18:35 UTC
    Did the negative look-ahead for a capital help? I should mention I think you have a typo in your real script (as opposed to what you posted) because the following script behaves well for me:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Stream;

    my ($handler, $stream) = File::Stream->new(
        \*DATA,
        read_length => 1024,
        separator   => qr/(?<!\b[A-Z])(?<!e\.g)(?<!i\.e)[.!?]\s{1,2}(?=[A-Z0-9])/,
    );

    while (<$stream>) {
        print "*$_\n\n";
    }

    __DATA__
    Perl filehandles are streams, but sometimes they just aren't powerful enough.
    This module offers to have streams from filehandles searched with regexes and allows the global input record separator variable to contain regexes.
    Thus, readline() and the <> operator can now return records delimited by regular expression matches.
    There are some very important gripes with applying regular expressions to (possibly infinite) streams.
    Please read the CAVEATS section of this documentation carfully.
    Some bunnys are fluffy, e.g. Peter.
    H.G. Wells was a great author.
    Some sports require specialized equipment, e.g. baseball.

    Debugging is hard without particular examples from your corpus.
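One way to produce such examples is to run the separator against suspect lines directly, without File::Stream, and report where it matches. This is only a sketch: `split_offsets` is a hypothetical helper, and the sample lines are placeholders to be replaced with lines from the real corpus.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same separator as in the File::Stream script.
my $sep = qr/(?<!\b[A-Z])(?<!e\.g)(?<!i\.e)[.!?]\s{1,2}(?=[A-Z0-9])/;

# Hypothetical helper: return the offset of every place in $text
# where the separator would match.
sub split_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /$sep/g) {
        push @offsets, $-[0];    # offset where the match starts
    }
    return @offsets;
}

# Placeholder lines; substitute the ones that misbehave.
my @samples = (
    'Some bunnys are fluffy, e.g. Peter. H.G. Wells was a great author.',
    'Some sports require specialized equipment, e.g. baseball.',
);

for my $line (@samples) {
    my @at = split_offsets($line);
    printf "%d split(s) at [%s]: %s\n", scalar @at, join(',', @at), $line;
}
```

A line that reports a split where none is wanted (or none where one is wanted) is exactly the kind of example worth posting.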


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
