Re^7: Mixed Unicode and ANSI string comparisons?

by Anonymous Monk
on Dec 15, 2015 at 02:16 UTC


in reply to Re^6: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?

Because in ISO-8859-x, the 8-bit chars vary depending upon the x... and they're all mixed together
Oh. Yes, it's probably better to decline that offer...

Re^8: Mixed Unicode and ANSI string comparisons?
by BrowserUk (Patriarch) on Dec 15, 2015 at 02:53 UTC
    Oh. Yes, it's probably better to decline that offer...

    Hm. I think I may have given a wrong impression here.

    Think of the description lines in FASTA files. They can contain anything useful to the researcher, and often contain stuff that only makes sense to the originator; thus it was often written in a local code page. Each individual string makes sense in the context of its file and origin.

    Now take a bunch of legacy FASTA files that originate from all over the world, bring them together into a central DB, and index them by their descriptions. Then try to merge that index of legacy descriptions with more modern files whose descriptions are in Unicode. Now sort them together to provide a single index.

    That's pretty close to the problem.

    Ideally, the descriptions would all be converted into Unicode; but that requires a huge effort entailing a bunch of translators working in many different languages to translate technical terms, abbreviations, and anything else the originating researchers felt important to put there in their own languages. Basically, an impossible task.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Given that description, any sense of "sorting" seems pretty meaningless. Is there some other term that might better describe a sequencing of elements that is better than random?

      If the overall data is (close to) what you describe, my first inclination would be to partition or segregate the data, by checking for the following conditions in the order shown:

      1. chunks that contain null bytes (these are probably UTF16 or UCS2)
      2. chunks that are entirely comprised of 7-bit ASCII
      3. chunks with some non-ASCII that are properly utf8 encoded
      4. chunks with some non-ASCII that are not proper utf8
      5. chunks that are not utf8 but are mostly comprised of bytes in the range 128-255, except for carriage-returns and line-feeds and maybe tabs (some pre-Unicode Asian encodings could behave this way, even though all such encodings could also accommodate ASCII bytes interspersed with the non-ASCII byte pairs that make up 16-bit characters).

      Obviously, you have to start by using plain old binmode to read the input as raw bytes. In case you didn't look it up yet, the test for step 3 is:

      use Encode qw(decode);   # decode() and the FB_CROAK constant come from Encode
      eval { decode( "utf8", $input, Encode::FB_CROAK ) };
      If the eval succeeds, it's utf8 data.
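
      Putting it all together, here's a minimal sketch of the whole five-way partition, assuming each description has already been slurped as raw bytes into $chunk (via binmode, as above); classify_chunk() and the 0.9 threshold are just illustrative names and numbers, not anything blessed:

      use strict;
      use warnings;
      use Encode qw(decode);

      sub classify_chunk {
          my ($chunk) = @_;

          # 1. null bytes: probably UTF16 or UCS2
          return 1 if $chunk =~ /\x00/;

          # 2. entirely 7-bit ASCII
          return 2 if $chunk =~ /\A[\x00-\x7F]*\z/;

          # 3. some non-ASCII, but decodes cleanly as utf8
          #    (decode a copy so the original bytes are left untouched)
          my $copy = $chunk;
          return 3 if eval { decode( "utf8", $copy, Encode::FB_CROAK ); 1 };

          # 5. not utf8, but mostly bytes in 128-255 once carriage-returns,
          #    line-feeds and tabs are ignored; tested before 4 because it is
          #    the more specific case
          my $rest = $chunk;
          $rest =~ tr/\r\n\t//d;
          my $high = $rest =~ tr/\x80-\xFF//;
          return 5 if length($rest) && $high / length($rest) > 0.9;

          # 4. some non-ASCII that is not proper utf8
          return 4;
      }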

      Default sorting within some of those partitions would make sense. For the others, it's not so much a matter of making sense, but rather just behaving in some consistent, predictable way.
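
      For instance, reusing the hypothetical classify_chunk() sketched above, each partition could be sorted with Perl's default bytewise cmp and the groups concatenated in a fixed (if arbitrary) order; %by_group, @descriptions and the group order are all assumptions made up for illustration:

      my @descriptions = ();                # raw-byte chunks, read in elsewhere
      my %by_group;                         # group number => array ref of chunks
      push @{ $by_group{ classify_chunk($_) } }, $_ for @descriptions;

      my @unified_index;
      for my $group ( 2, 3, 1, 4, 5 ) {     # fixed but arbitrary group order
          # default sort is bytewise cmp: consistent and predictable,
          # even where it isn't linguistically meaningful
          push @unified_index, sort @{ $by_group{$group} || [] };
      }

      Group 3 could instead be decoded and handed to Unicode::Collate for a proper Unicode ordering; plain byte order is shown only to keep the sketch short.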

      Note that group 2 could actually qualify as a subset of groups 3-5 - and that's a good reason to keep it distinct from those others.

      Apart from that, if there's some desire to "classify" or "cluster" the non-ASCII, non-Unicode strings, statistics on byte ngrams can help a fair bit with that (but it remains a bit of a research task, with some training of models required for classification).
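
      For what it's worth, the statistics half is cheap to collect; here's a toy sketch of byte-bigram counts for one raw-byte chunk (bigram_counts() is a made-up name, and turning these counts into a classifier still needs labelled training data, as noted above):

      sub bigram_counts {
          my ($chunk) = @_;
          my %count;
          $count{ substr( $chunk, $_, 2 ) }++ for 0 .. length($chunk) - 2;
          return \%count;                   # byte pair => frequency
      }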

      (updated to amend the conditions for set 5)

        Thanks for taking the time to consider the implications of my question seriously.

        Given that description, any sense of "sorting" seems pretty meaningless.

        Hm. They want a unified index. Generally, they want similar things to be found roughly together; and given something specific to look for, a rough idea of where to start looking. 'Ordered'? 'Collated'? I'm not sure that any other term is much better?

        Apart from that, if there's some desire to "classify" or "cluster" the non-ASCII, non-Unicode strings, statistics on byte ngrams can help a fair bit with that (but it remains a bit of a research task, with some training of models required for classification).

        With enough time and knowledge and money, I've no doubt that something along those lines could be done, but they do not have the money to fund such a project. They were looking for a quick fix and I was basically thinking aloud when I asked my question. I didn't anticipate the hostility people would show toward answering such a simple question.


      Now sort them together to provide a single index.
      Ok, at this point it's not clear to me what 'sorting' even means here :) Sort in alphabetical order? According to what alphabet? In codepoint order (which only makes sense for a couple of languages)? How about some examples :)

      Also, I don't see what decoding has to do with translating from one language to another.

        I don't see what decoding has to do with translating from one language to another.

        The data is. They are free-form descriptions produced by researchers from many countries. Parts of most of them will be in Latin (the language, not the encoding); parts will be in the researchers' own languages.

        It's not a case of "translating from one language to another"; it is a case of having someone who understands what is in the file so that you can decide how to decode it. The files go back decades; researchers move on. The data continues to exist.


