http://www.perlmonks.org?node_id=1150322


in reply to Re^4: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?

The problem is that the legacy set makes use of the extended ASCII character set (8-bit chars), which doesn't convert to Unicode (easily).
Hmm... why not?
My take when asked about it was: don't! Keep two lists for lookup and don't mix them, because they cannot logically be sorted together. They countered by sorting two small subsets together (using Java) and saying that it was easier for their people to do lookups in a single list.
Well, that doesn't look too difficult? Why not decode their legacy set (as in map Encode::decode( 'LEGACY_SET', $_ ), @set) and sort that? And if their set happens to be ISO-8859-1 (aka Latin-1), then decoding isn't even necessary (and that's the deal with utf8-off strings in Perl: they're assumed to be in THAT encoding, although some people say it just looks like it :)
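Something along these lines is a minimal sketch of what I mean ('LEGACY_SET' above, and the 'ISO-8859-1' plus the sample strings below, are just stand-ins for whatever the legacy data actually uses):

use strict;
use warnings;
use Encode qw(decode);

# Hypothetical data: the legacy strings are raw 8-bit bytes, the others
# are already Perl character (Unicode) strings.
my @legacy  = ("\xE9tude", "caf\xE9");
my @unicode = ("\x{00FC}ber", "na\x{00EF}ve");

# Decode the legacy bytes into character strings, then sort the lot
# together as characters rather than as raw bytes.
my @decoded = map { decode('ISO-8859-1', $_) } @legacy;
my @sorted  = sort @decoded, @unicode;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;

Plain sort compares by code point; a linguistically sensible ordering needs a collator (more on that below).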
The result of this thread is so depressing that I'm going to turn the work down and let them find someone else. (Shame. Could have been a nice in.)
Shame indeed, because Perl is actually very good for Unicode stuff... Unicode::Collate::Locale, for example... but yeah, Perl's strings are a source of much confusion.
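For instance, a quick sketch of locale-tailored sorting (the locale and the sample words are purely illustrative):

use strict;
use warnings;
use Unicode::Collate::Locale;

# Locale-tailored collation: under the German phonebook tailoring,
# U+00E4 (a-umlaut) is treated like "ae" for sorting purposes.
my $collator = Unicode::Collate::Locale->new(locale => 'de__phonebook');

my @words  = ("Bar", "B\x{00E4}r", "Baz");
my @sorted = $collator->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print join(", ", @sorted), "\n";

Once everything has been decoded to Perl character strings, collation like this works the same whether the data started life as legacy bytes or as UTF-8.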

Re^6: Mixed Unicode and ANSI string comparisons?
by BrowserUk (Patriarch) on Dec 15, 2015 at 02:09 UTC
    Hmm... why not?

    Because in ISO-8859-x, the 8-bit chars vary depending upon the x.

    To see what I mean, view Re: How to replace extended ascii ctrs with \xnn strings? and see how the characters in the text between the two code blocks change as you switch the View->Encoding from Cyrillic to Arabic to Hebrew to Japanese to Korean etc.

    They cannot be translated automatically without knowing what the original code page is; and they're all mixed together.
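    To make that concrete, a small sketch: the very same byte decodes to a different code point depending on which ISO-8859-x you assume.

    use strict;
    use warnings;
    use Encode qw(decode);

    # One and the same byte, 0xE0, interpreted under three different
    # ISO-8859 code pages: the resulting code point differs every time.
    my $byte = "\xE0";
    for my $cp (qw(iso-8859-1 iso-8859-5 iso-8859-7)) {
        printf "%-10s -> U+%04X\n", $cp, ord decode($cp, $byte);
    }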


      Because in ISO-8859-x, the 8-bit chars vary depending upon the x... and they're all mixed together
      Oh. Yes, it's probably better to decline that offer...
        Oh. Yes, it's probably better to decline that offer...

        Hm. I think I may have given a wrong impression here.

        Think of the description lines in FASTA files. They can contain anything useful to the researcher, and often contain stuff that only makes sense to the originator; thus they were often written in a local code page. Each individual string makes sense in the context of its file and origin.

        Now take a bunch of legacy FASTA files that originate from all over the world, bring them together into a central DB, and index them by their descriptions. Then try to merge that index of legacy descriptions with more modern ones whose descriptions are in Unicode, and sort the whole lot to provide a single index.

        That's pretty close to the problem.
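        A rough sketch of the easy half of the job (pulling the '>' description lines out of a batch of FASTA files and indexing them); the hard-coded 'ISO-8859-1' guess is exactly the part that cannot be automated without knowing each file's code page:

        use strict;
        use warnings;
        use Encode qw(decode);

        # Collect the '>' description lines from a batch of FASTA files
        # and index them.  The per-file encoding is the unknown: hard-coding
        # a guess like this is precisely what mixed legacy data doesn't allow.
        my %index;
        for my $file (@ARGV) {
            open my $fh, '<:raw', $file or die "$file: $!";
            while (my $line = <$fh>) {
                next unless $line =~ /^>(.*)/;
                my $desc = decode('ISO-8859-1', $1);   # assumed code page
                push @{ $index{$desc} }, $file;
            }
            close $fh;
        }

        binmode STDOUT, ':encoding(UTF-8)';
        print "$_\n" for sort keys %index;   # code-point sort; real collation needs more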

        Ideally, the descriptions would all be converted into Unicode; but that would require a huge effort, entailing a bunch of translators working in many different languages to handle technical terms, abbreviations, and anything else the originating researchers felt it important to put there in their own languages. Basically an impossible task.

