
Re^4: Mixed Unicode and ANSI string comparisons?

by BrowserUk (Pope)
on Dec 15, 2015 at 01:14 UTC (#1150318)


in reply to Re^3: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?

As Ricardo Signes said: "Right now, you can write programs in Perl that handle all this correctly, using only one tool: extreme vigilance."

That's the source of my depression!

The "situation" I referred to is the desire of a customer to sort two sets of data together: one legacy set stored in ASCII/ANSI/ISO-8859-x, and another newer set stored in Unicode. The problem is that the legacy set makes use of the extended ASCII character set (8-bit chars), which doesn't convert to Unicode (easily).

My take when asked about it was: don't! Keep two lists for lookup and don't mix them, because they cannot logically be sorted together. They countered by sorting two small subsets together (using Java) and saying that it was easier for their people to do lookups in a single list.

It was at that point I asked my question here. My expectation was that sort would either throw an error; or sort them into two distinct groups, but I didn't know. (Or know how to check without doing a shitload of reading and trial and error.)

The result of this thread is so depressing that I'm going to turn the work down and let them find someone else. (Shame. Could have been a nice in.)


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^5: Mixed Unicode and ANSI string comparisons?
by Your Mother (Bishop) on Dec 15, 2015 at 18:49 UTC
    newer set stored in Unicode

    This doesn't mean anything. Unicode is the complete standard, not a character set or encoding.
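
To make that distinction concrete, here is a minimal sketch (not from the thread) that serializes one Unicode code point under three encodings using the core Encode module. The helper name `hex_bytes` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# One character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
my $char = "\x{E9}";

# The same code point, three different byte sequences on disk.
for my $encoding (qw(UTF-8 UTF-16BE iso-8859-1)) {
    printf "%-10s %s\n", $encoding, hex_bytes(encode($encoding, $char));
}
```

So "stored in Unicode" only becomes actionable once the actual encoding (UTF-8, UTF-16, and so on) is named.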

Re^5: Mixed Unicode and ANSI string comparisons?
by Anonymous Monk on Dec 15, 2015 at 01:44 UTC
    The problem is that the legacy set makes use of the extended ASCII character set (8-bit chars), which doesn't convert to Unicode (easily).
    Hmm... why not?
    My take when asked about it was: don't! Keep two lists for lookup and don't mix them, because they cannot logically be sorted together. They countered by sorting two small subsets together (using Java) and saying that it was easier for their people to do lookups in a single list.
    Well, that doesn't look too difficult? Why not decode their legacy set (as in map Encode::decode( 'LEGACY_SET', $_ ), @set) and sort that? And if their set happens to be ISO-8859-1 (aka Latin-1), then decoding isn't even necessary (and that's the deal with utf8-off strings in perl; they're assumed to be in THAT encoding, although some people say it just looks like it :)
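
A minimal sketch of that suggestion, assuming the whole legacy set really is in one known encoding (ISO-8859-1 here). The sample data and variable names are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical data: the legacy list arrives as raw ISO-8859-1 bytes,
# the newer list is already decoded character strings.
my @legacy_bytes = ("\xE9clair", "zebra");        # 0xE9 is e-acute in Latin-1
my @newer_chars  = ("\x{430}\x{431}", "apple");   # two Cyrillic letters

# Decode the legacy bytes once, naming the encoding explicitly.
my @legacy_chars = map { decode('iso-8859-1', $_) } @legacy_bytes;

# Both lists now hold character strings, so a single sort covers them.
# Plain sort orders by code point, not by any language's alphabet.
my @merged = sort(@legacy_chars, @newer_chars);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @merged;
```

This only works when the encoding name passed to `decode` is correct for every string in the legacy set.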
    The result of this thread is so depressing that I'm going to turn the work down and let them find someone else. (Shame. Could have been a nice in.)
    Shame indeed, because Perl is actually very good for Unicode stuff... Unicode::Collate::Locale, for example... but yeah, Perl's strings are a source of much confusion.
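
As a rough illustration of why a collation module matters, the sketch below compares plain `sort` with the default ordering from the core Unicode::Collate module. The word list is invented. Unicode::Collate::Locale is used the same way but takes a `locale` argument for language-specific tailoring.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample: already-decoded character strings.
my @words = ("zebra", "\x{E9}clair", "apple");

# Code point order puts U+00E9 after every ASCII letter.
my @by_code_point = sort @words;

# Default Unicode collation treats e-acute as a variant of "e".
my @by_collation = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point: @by_code_point\n";
print "collation:  @by_collation\n";
```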
      Hmm... why not?

      Because in ISO-8859-x, the 8-bit chars vary depending upon the x.

      To see what I mean, view Re: How to replace extended ascii ctrs with \xnn strings? and see how the characters in the text between the two code blocks change as you switch the View->Encoding from Cyrillic to Arabic to Hebrew to Japanese to Korean etc.

      They cannot be translated automatically without knowing what the original code page is; and they're all mixed together.
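
The same point can be sketched without a browser: decode a single byte under three of the ISO-8859 code pages and see which character comes out. The byte value and the choice of code pages are arbitrary examples.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte from the upper half of the 8-bit range.
my $byte = "\xE4";

# Record the code point each code page assigns to that byte.
my %code_point_for;
for my $encoding (qw(iso-8859-1 iso-8859-5 iso-8859-7)) {
    $code_point_for{$encoding} = ord decode($encoding, $byte);
    printf "%-11s U+%04X\n", $encoding, $code_point_for{$encoding};
}
```

The byte is a Latin letter in one code page, a Cyrillic letter in the next, and a Greek letter in the third, so nothing in the data itself says which one was intended.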


        Because in ISO-8859-x, the 8-bit chars vary depending upon the x... and they're all mixed together
        Oh. Yes, it's probably better to decline that offer...
