PerlMonks  

Re^2: Perl & Unicode: state of the art?

by BrowserUk (Pope)
on Oct 08, 2013 at 00:25 UTC (#1057331)


in reply to Re: Perl & Unicode: state of the art?
in thread Perl & Unicode: state of the art?

Thai and Lao text ... these languages, sentences are generally delimited by whitespace, and individual words are not delimited at all in the text, but instead are delimited by syntactic rules.
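
The quoted point is easy to see in practice. The following is only a minimal Python sketch: the Thai sentence is an assumed sample (roughly "I love the Thai language"), not text from the thread. Splitting on whitespace recovers the English words but returns the whole Thai sentence as a single token.

```python
# Whitespace splitting finds word boundaries in English but not in Thai.
english = "I love the Thai language"
thai = "ฉันรักภาษาไทย"  # assumed sample sentence, written without spaces

english_tokens = english.split()
thai_tokens = thai.split()

print(len(english_tokens))  # 5
print(len(thai_tokens))     # 1
```

Finding the individual Thai words would need a dictionary or language-specific rules, which is where knowing the language comes in.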

So, fair to say that the first requirement for processing Unicode 'text' is to determine the language.

So then the question becomes: given a file of Unicode text, can the language be determined?
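
One partial answer is to look at which script the characters belong to. The Python sketch below is a rough heuristic under a stated assumption: it takes the first word of each letter's Unicode character name (LATIN, THAI, LAO, ...) as a stand-in for the script, since the standard library does not expose the Script property directly. The names `script_counts` and `dominant_script` are made up for this illustration.

```python
import unicodedata
from collections import Counter

def script_counts(text):
    """Count letters by the first word of their Unicode character name.

    Assumption: that first word (LATIN, THAI, LAO, GREEK, ...) roughly
    matches the character's script. It is a heuristic, not the real
    Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

def dominant_script(text):
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("ฉันรักภาษาไทย"))    # THAI
print(dominant_script("Hand and finger"))  # LATIN
print(dominant_script("Hand und Finger"))  # LATIN
```

The script narrows the candidates (Thai script strongly suggests Thai), but it cannot separate languages that share a script, such as English and German.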


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^3: Perl & Unicode: state of the art?
by LanX (Canon) on Oct 08, 2013 at 00:45 UTC
    > can the language be determined?

    You know the answer: only with statistical certainty, and that certainty depends on the length of the text and the distance between the languages.

    Hand and finger (en) <=> Hand und Finger (de)
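
A toy version of such a statistical guess, sketched in Python. Everything here is an assumption for illustration: the tiny function-word lists and the `language_scores` helper are hypothetical, and a real detector would use far larger models (for example character n-gram frequencies).

```python
# Tiny hypothetical function-word lists, for illustration only.
STOPWORDS = {
    "en": {"and", "the", "of", "is", "in", "to"},
    "de": {"und", "der", "die", "das", "ist", "in", "zu"},
}

def language_scores(text):
    """Return, per language, the fraction of words found in its word list."""
    words = text.lower().split()
    if not words:
        return {lang: 0.0 for lang in STOPWORDS}
    return {
        lang: sum(w in vocab for w in words) / len(words)
        for lang, vocab in STOPWORDS.items()
    }

for sample in ("Hand and finger", "Hand und Finger", "Hand in Hand"):
    print(sample, language_scores(sample))
```

For the two phrases above, the whole decision rests on a single word ("and" versus "und"), and "Hand in Hand" scores the same for both languages. Longer texts give the counts more to work with, and more distant languages share fewer words.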

    Whether the same script leads to the same delimiters can only be answered by someone who knows all 6000 languages of the world.

    But Arabic words should already be a problem, maybe less so if transcribed. Chinese even more so.

    see also Word_divider and Word#Word_boundaries

    Cheers Rolf

    (addicted to the Perl Programming Language)

      You know the answer

      Nope. If I knew, I wouldn't be asking.


        Welcome back to Babel, brothers...

        Languages are living things; poetry is a valid form of a language.

        Processors are mechanical things: no way to cover all the cases.

        Perl is digital and my brain is analog.

        no hope, sorry


        there are no rules, there are no thumbs..
Re^3: Perl & Unicode: state of the art?
by DrHyde (Prior) on Oct 08, 2013 at 10:35 UTC
    Again, in the general case, no. There exist texts which are in multiple languages, which may have different syntactic rules. Sometimes the two languages are in separate volumes, or at least separate halves of a volume, but sometimes you'll get the two languages on opposite pages, or in two columns on each page, or even line-by-line translations. And very occasionally you'll even see line-by-line translations in more than two languages. I have a book at home that is trilingual Greek/Latin/English, for example.
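
To illustrate why a single per-file answer breaks down on such texts, here is a minimal Python sketch that splits a line into runs of words sharing a script. It is a heuristic under assumptions: the script label is taken from the first word of each letter's Unicode character name, `script_of` and `script_runs` are hypothetical helpers, and the sample line is made up for the example rather than taken from any particular book.

```python
import unicodedata

def script_of(ch):
    """First word of the character name, or None for non-letters (assumed heuristic)."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else None

def script_runs(text):
    """Group consecutive whitespace-separated words that share a script label."""
    runs = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        if len(scripts) == 1:
            label = scripts.pop()
        else:
            label = "MIXED" if scripts else "NONE"
        if runs and runs[-1][0] == label:
            runs[-1][1].append(word)   # extend the current run
        else:
            runs.append((label, [word]))  # start a new run
    return [(label, " ".join(words)) for label, words in runs]

# Made-up trilingual sample line: Greek, then Latin, then English.
for label, chunk in script_runs("ἐν ἀρχῇ in principio in the beginning"):
    print(label, "|", chunk)
```

The Greek is separated cleanly, but the Latin and English words land in one LATIN run because they share a script. Telling those two apart still needs a statistical or dictionary-based guess for each stretch of text.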
