
Re^12: Mixed Unicode and ANSI string comparisons?

by BrowserUk (Patriarch)
on Dec 15, 2015 at 12:13 UTC ( [id://1150365]=note )


in reply to Re^11: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?

The first module covers 10 European languages; the small sample I saw contained Cyrillic, Arabic, Urdu, and what I think (but can't swear to) were Korean and Japanese.

The second appears to be completely undocumented, but given its author, I'm guessing it is designed to try and determine which of the multitude of Unicrap encodings a file contains, rather than anything to do with ISO-8859-x stuff.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^13: Mixed Unicode and ANSI string comparisons?
by soonix (Canon) on Dec 16, 2015 at 08:36 UTC

    The missing comments and description, in connection with the author's reputed name, made me stop short, and I had a (short) look at the source.
    It seems to try to distinguish several ISO-8859-x variants and codepages, and that seemed relevant enough for the problem at hand. Otherwise I wouldn't have mentioned it.

    But more important was my other half sentence: Would it be feasible to build a list of researchers' names (or other type of ID) and their preferred encodings? Or did most of them author only one or two records?

      But more important was my other half sentence: Would it be feasible to build a list of researchers' names (or other type of ID) and their preferred encodings? Or did most of them author only one or two records?

      On the basis of the very small sample I've seen, there are no authorship -- individual or institution -- identifiers. The only semi-consistent things are the species names, in Latin (the language) and mostly in Latin-1 encoding; but:

      1. They appear in comment cards that are freeform and also contain 8-bit chars that represent different code pages depending on where they originate.
      2. Often the species names are abbreviated.
      3. At least 2 of the small sample also used 8-bit chars in the species name. Specifically, the character that combines a & e into a single char.
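To make the ambiguity in points 1 and 3 concrete, here is a minimal sketch in Python (used purely for illustration; it is not Perl and not anything from the actual data). The byte string `l\xe6vis` is a made-up stand-in for a species name containing the single-byte a+e ligature, and `decode_under` is a hypothetical helper name. The same bytes read as a Latin ligature under Latin-1 but as a Cyrillic letter under Windows-1251, and nothing in the bytes themselves says which was intended.

```python
def decode_under(raw, encodings):
    """Decode the same raw bytes under each candidate encoding."""
    return {enc: raw.decode(enc) for enc in encodings}

# Hypothetical sample: byte 0xE6 is the ae ligature in Latin-1.
sample = b"l\xe6vis"

views = decode_under(sample, ["latin-1", "cp1251", "koi8_r"])
for enc, text in views.items():
    print(f"{enc:8} {text}")
```

Without an out-of-band hint about where a card came from, every one of those readings is equally valid as far as the program can tell.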

      There are many, many of these files. The comment cards are easy to locate and extract; and the desire is to build a single index to them all, legacy and new; but the institute commissioning the work has neither the skills nor the funding to pay people with the appropriate skills (languages and science) to inspect and translate/convert them in order to unify them.

      They were hoping to throw the problem at a (cheap) computer program and have it magically fix the problem. Like many of those in research they've heard of AI, but don't have any appreciation of what's really involved.

      I quite literally had no idea what would happen if I threw a bunch of non-Unicode & Unicode strings at Perl's sort. I half hoped that it might do something sensible with the mix; hence I asked my question.
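As a rough illustration of why a mixed sort cannot be left to chance, here is a small Python sketch. Python is only a stand-in here and behaves differently from Perl: it refuses outright to compare byte strings with text strings. The names in `mixed` are invented, `to_text` is a hypothetical helper, and the Latin-1 default is an assumption that, as discussed above, will be wrong for some records.

```python
def to_text(item, assumed_encoding="latin-1"):
    """Return text; decode raw bytes using an ASSUMED encoding."""
    if isinstance(item, bytes):
        return item.decode(assumed_encoding)
    return item

# Invented sample: two byte strings mixed with two text strings.
mixed = [b"\xe6sculus", "Zea", "Abies", b"Quercus"]

try:
    sorted(mixed)
except TypeError as exc:
    # Python will not order bytes against str at all.
    print("mixed sort refused:", exc)

# After normalising to text, the sort is well defined (by code point).
print(sorted(to_text(x) for x in mixed))
```

The result is only as good as the guessed encoding: a well-defined order is not the same thing as a meaningful one.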

      Personally, I've reached the point in my career where I am able to choose what work I take on; and this is simply not something I can be bothered with.


        no authorship -- individual or institution -- identifiers. … freeform and also contain 8-bit chars that represent different code pages
        So the only possibility would be to "view in Latin-16, ISO-8859-8, KOI8-R, Windows-1256, etc. and decide which one looks right". Urgh. Well, perhaps some reCAPTCHA-like system could work over time...
        species name. Specifically, the character that combines a & e into a single char.
        If it's only æ and œ, I think these have been used in Latin texts since medieval times.
        I am able to choose what work I take on;
        Lucky you. That means, if you took on this project, it wouldn't be because of the money, but because of the challenge :-)
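The "view it under several encodings and decide which one looks right" idea could be sketched roughly as follows, again in Python purely for illustration. The candidate list, the helper name `candidate_views`, and the sample bytes are all assumptions; the script label is just the first word of each letter's Unicode character name, which is a crude hint for a human reviewer and not a detector.

```python
import unicodedata
from collections import Counter

# Assumed candidate list; a real one would depend on where the files came from.
CANDIDATES = ["latin-1", "iso8859_8", "koi8_r", "cp1251", "cp1256"]

def candidate_views(raw):
    """Return (encoding, decoded text, dominant script) for each encoding that decodes."""
    views = []
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        # First word of the Unicode name, e.g. LATIN, CYRILLIC, HEBREW.
        scripts = Counter(
            unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text
            if ch.isalpha()
        )
        dominant = scripts.most_common(1)[0][0] if scripts else "NONE"
        views.append((enc, text, dominant))
    return views

# Invented sample: a Russian word stored as Windows-1251 bytes.
raw = "привет".encode("cp1251")
for enc, text, script in candidate_views(raw):
    print(f"{enc:10} {script:10} {text}")
```

Several candidates decode without error and yield plausible-looking letters, so the final choice still needs someone who can read the language, which is exactly the skill the institute lacks.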
