++ for noticing the problem with abbreviations. Short of the ability to parse and comprehend grammar, it's going to be very difficult to separate
"We sold the division to MegaTech, Ltd. in Asia last week,
who flipped the sale to someone else."
"We sold the division to MegaTech Industries. In Asia last week, they flipped the sale to someone else."
other than by the fact that a new sentence is supposed to start with an upper-case letter. There will be cases where the word following an abbreviation is a proper noun, however -- and those are going to be a very hard nut to crack.
If, however, you only care about the "typical" case (because this is going to be a one-shot tool), you could:
- Split the text on /[.]\s+(?=[A-Z])/ to get sentences. (The lookahead keeps the capital letter with its sentence instead of consuming it.)
- Grep the sentences for /[aA]sia/, or for /\bAsia\b/ if you don't want the word "Asian" to count.
- Split the sentences that pass on ' ' to get words.
- Use the words you get from that split as keys to a hash, and increment a count in each bin.
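If it helps, the steps above can be sketched in a few lines of Python (Perl would look much the same; the hash becomes a Counter, and the sample text here is invented for illustration):

```python
import re
from collections import Counter

text = ("We sold the division to MegaTech Industries. "
        "In Asia last week, they flipped the sale to someone else. "
        "Asia remains a growth market. Asian markets rallied.")

# Step 1: split on a period followed by whitespace and a capital letter;
# the lookahead keeps the capital with its own sentence.
sentences = re.split(r'[.]\s+(?=[A-Z])', text)

# Step 2: keep only sentences mentioning "Asia" as a whole word,
# so "Asian" does not count.
hits = [s for s in sentences if re.search(r'\bAsia\b', s)]

# Steps 3-4: split the surviving sentences on ' ' and tally the
# words in a hash (a Counter here).
counts = Counter()
for s in hits:
    for word in s.split(' '):
        counts[word] += 1
```

As noted above, this only handles the "typical" case: "MegaTech, Ltd." still splits in the wrong place, and the word counts include punctuation stuck to words ("week," vs. "week") unless you strip it first.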