PerlMonks  

Re: Perl & Unicode: state of the art?

by farang (Hermit)
on Oct 07, 2013 at 22:43 UTC ( #1057320=note )


in reply to Perl & Unicode: state of the art?

Is it possible to write a script that, when fed a file containing properly formed Unicode text, will count the number of words and sentences it contains?
No! Languages of the world are way too complex. Unicode deals with text at the character and grapheme level, which is hard enough. It is silent on what constitutes a word or sentence. It is certainly possible in many cases to define "words" and "sentences" in a way appropriate to some particular expected text format in some known language, but even then there are usually exceptions. Take choroba's code which satisfies a given spec. Is Sports.ru one word or two? Is какое-то two words, as the code determines, or just one as Russian linguists would probably contend? Do all other languages handle hyphenated text similarly? Almost certainly not, as a general rule. The more text considered, the more edge cases and ambiguities arise, even within a single language.
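To make the ambiguity concrete, here is a minimal sketch of two equally defensible "word" definitions applied to the two examples above. This is not choroba's code; both subroutine names are hypothetical, and neither definition is claimed to be linguistically correct.

```perl
use strict;
use warnings;
use utf8;                              # this source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Definition A: a word is a maximal run of \w characters.
sub count_words_simple {
    my ($text) = @_;
    my $n = () = $text =~ /\w+/g;
    return $n;
}

# Definition B: runs of \w joined by a single internal hyphen or
# full stop are treated as one word.
sub count_words_joined {
    my ($text) = @_;
    my $n = () = $text =~ /\w+(?:[-.]\w+)*/g;
    return $n;
}

for my $sample ('какое-то', 'Sports.ru') {
    printf "%s: A=%d B=%d\n",
        $sample, count_words_simple($sample), count_words_joined($sample);
}
```

Definition A counts two words in each sample and definition B counts one. Note that B has its own failure mode: it would also glue together "end.Start" when a space is missing after a sentence.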

I am slowly but steadily working to handle Thai and Lao text in Perl. For these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules. Code can be, and has been, written to count individual Thai words, but it is considerably different from, and more complicated than, counting the number of character strings between spaces.
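A small sketch of why the between-spaces approach does not carry over. The sample phrase is my own choice: "ภาษาไทย" ("Thai language") is two words, ภาษา and ไทย, written with no space between them. Real Thai segmentation needs a dictionary or a model, which this does not attempt.

```perl
use strict;
use warnings;
use utf8;                              # this source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Two Thai words written without a separating space.
my $thai = 'ภาษาไทย';

# What a whitespace-based word counter sees: a single chunk.
my @chunks = split ' ', $thai;

# What Unicode itself can tell us: code points and grapheme clusters,
# neither of which says anything about word boundaries.
my $codepoints = length $thai;
my @graphemes  = $thai =~ /\X/g;

printf "chunks=%d code points=%d graphemes=%d\n",
    scalar @chunks, $codepoints, scalar @graphemes;
```

The whitespace split reports one "word" where a Thai reader sees two, so the count is off before any harder case comes up.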


Re^2: Perl & Unicode: state of the art?
by BrowserUk (Pope) on Oct 08, 2013 at 00:25 UTC
    Thai and Lao text ... these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules.

    So, is it fair to say that the first requirement for processing Unicode 'text' is to determine the language?

    Then the question becomes: given a file of Unicode text, can the language be determined?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      > can the language be determined?

      You know the answer: only with statistical certainty, and depending on the length of the text and on how closely related the candidate languages are.

      Hand and finger (en) <=> Hand und Finger (de)
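
      A toy sketch of what "statistical" means here, using the pair above. The function-word lists are hypothetical and deliberately tiny; a usable detector would need much more data (character n-grams, for instance), and score_languages is just a name picked for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical, very small function-word lists per language.
      my %function_words = (
          en => { map { $_ => 1 } qw(the and of to is) },
          de => { map { $_ => 1 } qw(der das und von ist) },
      );

      # Return a hash: language code => number of function-word hits.
      sub score_languages {
          my ($text) = @_;
          my %score = map { $_ => 0 } keys %function_words;
          for my $word (map { lc($_) } $text =~ /\w+/g) {
              for my $lang (keys %function_words) {
                  $score{$lang}++ if $function_words{$lang}{$word};
              }
          }
          return %score;
      }

      for my $text ('Hand and finger', 'Hand und Finger') {
          my %s = score_languages($text);
          print "$text: en=$s{en} de=$s{de}\n";
      }
      ```

      With three words of input, the whole decision hangs on "and" versus "und". Drop that one word and both scores are zero, which is the length dependence mentioned above.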

      Whether the same script implies the same delimiters can only be answered by someone who knows all 6000 languages of the world.

      But Arabic words are already likely to be a problem, perhaps less so if transliterated. Chinese even more so.

      see also Word_divider and Word#Word_boundaries

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        You know the answer

        Nope. If I knew, I wouldn't be asking.


      Again, in the general case, no. There exist texts which are in multiple languages, which may have different syntactic rules. Sometimes the two languages are in separate volumes, or at least separate halves of a volume, but sometimes you'll get the two languages on opposite pages, or in two columns on each page, or even line-by-line translations. And very occasionally you'll even see line-by-line translations in more than two languages. I have a book at home that is trilingual Greek/Latin/English, for example.
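
      For a line-by-line text like that, one could at least classify each line by script. A rough sketch, assuming only the Greek and Latin scripts matter; the sample lines are my own and dominant_script is a hypothetical helper.

      ```perl
      use strict;
      use warnings;
      use utf8;                              # this source contains UTF-8 literals
      binmode STDOUT, ':encoding(UTF-8)';

      # Candidate scripts (an assumption: only these two are of interest).
      my %script_re = (
          Greek => qr/\p{Script=Greek}/,
          Latin => qr/\p{Script=Latin}/,
      );

      # Return the candidate script with the most characters on the line,
      # or 'none' if no character belongs to any candidate.
      sub dominant_script {
          my ($line) = @_;
          my ($best, $best_n) = ('none', 0);
          for my $name (sort keys %script_re) {
              my $re = $script_re{$name};
              my $n  = () = $line =~ /$re/g;
              ($best, $best_n) = ($name, $n) if $n > $best_n;
          }
          return $best;
      }

      my @lines = (
          'ἐν ἀρχῇ ἦν ὁ λόγος',
          'in principio erat verbum',
          'in the beginning was the word',
      );
      print dominant_script($_), ": $_\n" for @lines;
      ```

      This separates the Greek line from the other two, but the Latin and English lines both come out as "Latin". Script is not language, so the problem is only narrowed, not solved.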
