PerlMonks  

Re^2: Perl & Unicode: state of the art?

by BrowserUk (Pope)
on Oct 07, 2013 at 16:25 UTC (#1057279)


in reply to Re: Perl & Unicode: state of the art?
in thread Perl & Unicode: state of the art?

Define words and sentences,

How about starting with the simplest possible definitions:

  1. Words: whitespace delimited sequences of letters.
  2. Sentences: sets of words delimited by a full stop.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^3: Perl & Unicode: state of the art?
by choroba (Abbot) on Oct 07, 2013 at 20:43 UTC
    Words are usually delimited by punctuation, not only whitespace. Therefore, the following script counts runs of letters delimited by non-letters as words, and counts every dot as the end of a sentence.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use open IO => ':utf8', ':std';

    my ($words, $sentences);
    while (<>) {
        $words++ for m/\p{L}+/g;
        $sentences++ for m/\./g;
    }
    print "$words $sentences\n";

    Tested on the following text:

    Огонь XXII Зимних олимпийских игр в Сочи во второй раз погас в понедельник в Москве, во время этапа эстафеты олимпийского огня. После нескольких безуспешных попыток снова его зажечь, факел был заменен, передает портал Sports.ru.
    Казус произошел на Раушской набережной, недалеко от Кремля. Видно, как зрители приветствуют факелоносца, он машет в ответ, и через какое-то время факел гаснет.
    
    Output:
    59 5
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      According to that, this:

      .................................................................................................................

      Contains a lot of sentences.


        Sure. Lots of empty sets of words. ;-)
        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
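One way to keep a bare run of dots from counting as sentences is to count only dot-delimited segments that contain at least one letter. This is a minimal sketch, not anything proposed in the thread; the helper name count_nonempty_sentences is made up for illustration.

```perl
use warnings;
use strict;

# Hypothetical helper: split on literal dots and count only the
# segments that contain at least one letter, so "empty sets of
# words" are not counted as sentences.
sub count_nonempty_sentences {
    my ($text) = @_;
    my $count = 0;
    for my $segment (split /\./, $text) {
        $count++ if $segment =~ /\p{L}/;
    }
    return $count;
}

print count_nonempty_sentences('.' x 100), "\n";                 # prints 0
print count_nonempty_sentences('I like pie. So do you.'), "\n";  # prints 2
```

Note that this still treats "Sports.ru" as two segments, and it counts a final segment even when no dot follows it, so it only addresses the empty-sentence objection.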

      Let's assume Czech. Dots are used after numbers to denote ordinals (1. = 1st, 8. = 8th, etc.). Dots are used between numbers in dates (3.9.1975 or 3. září 1975 = September 3rd, 1975). Dots are used at the end of abbreviations, though to make things harder, some abbreviations are so common that they do not need the dot, and if an abbreviation falls at the end of a sentence, you do not double the dot. And of course some sentences end with a question mark or an exclamation mark.

      Anything that would not take into account the language would fail on any nontrivial text. Even if it did take the language into account, there would be "false positives" and missed sentences.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.
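To make the "false positives and missed sentences" point concrete, here is a rough sketch of one language-blind heuristic: count ., ? and ! as terminators, but skip any dot that directly follows a digit. The rule and the name count_terminators are assumptions made for illustration only.

```perl
use warnings;
use strict;

# Hypothetical heuristic: count sentence terminators (. ? !), but
# skip a dot that directly follows a digit, as in ordinals and dates.
sub count_terminators {
    my ($text) = @_;
    my $count = () = $text =~ /(?<!\d)\.|[?!]/g;
    return $count;
}

print count_terminators('Born 3.9.1975 in Prague.'), "\n";  # prints 1
print count_terminators('It cost 2.30.'), "\n";             # prints 0
print count_terminators('Really? Yes!'), "\n";              # prints 2
```

The date no longer inflates the count, but the sentence that ends in a number is now missed entirely, and abbreviations are not handled at all.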

Re^3: Perl & Unicode: state of the art?
by DrHyde (Prior) on Oct 08, 2013 at 10:18 UTC

    How many sentences are there in these examples?

    • He said "I like pie. I also like tickles."
    • He said "The pie cost me 2.30."
    • I like pie
    • Who watches the watchers?

    Some will argue that something like "court martial" is a single word, despite having a space in.

    And then there are all those pesky non-European languages. Apparently Chinese uses the same space between words as between characters.

    So the answer is "probably not", in the general case.
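For reference, this is what the counting rules from choroba's script earlier in the thread (runs of letters are words, every dot ends a sentence) report for the four examples. It is a small sketch; the helper name naive_counts is made up for illustration.

```perl
use warnings;
use strict;

my @examples = (
    'He said "I like pie. I also like tickles."',
    'He said "The pie cost me 2.30."',
    'I like pie',
    'Who watches the watchers?',
);

# Hypothetical helper: words are runs of letters, and every
# literal dot is counted as the end of a sentence.
sub naive_counts {
    my ($text) = @_;
    my $words     = () = $text =~ /\p{L}+/g;
    my $sentences = () = $text =~ /\./g;
    return ($words, $sentences);
}

for my $example (@examples) {
    my ($words, $sentences) = naive_counts($example);
    print "$words words, $sentences sentences: $example\n";
}
```

The reported sentence counts are 2, 2, 0 and 0. The second example gets 2 only because the decimal point is counted, and the last two get 0 because one has no terminator and the other ends in a question mark.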
