PerlMonks  

Mapping ACCEPT_LANG, USER_AGENT & GeoIP to Encode's character sets

by cosmicperl (Chaplain)
on Jun 22, 2012 at 02:29 UTC ( [id://977749]=perlquestion )

cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  For a while now it's been my job to deal with user uploads. The users are globally dispersed, often have very limited computer skills, and upload text files that were created in various locales.
  It's very challenging to "Do The Right Thing". Encode::Detect helps a lot, and works great a lot of the time. I convert everything to UTF-8, so from then on there aren't any issues... Well, until the user exports and doesn't know how to open the file in UTF-8 mode (depending on what program they are using). But I'm not worried much about exports right now.
  I'm well aware that where the user is from doesn't strictly matter, since the file they upload could come from any locale. But I've found that, for our users at least, where they are from is a fairly consistent predictor of the encoding their uploads tend to be in. For example, Norwegian users on Macs tend to upload files in MacIcelandic, Russian Windows users in Windows-1251, etc.
  So what I'm going to do is use HTTP_USER_AGENT, GeoIP and HTTP_ACCEPT_LANGUAGE to give me a best guess at the encoding for when Encode::Detect gets it wrong. This will likely be displayed to the user with translation examples so that they can choose the charset that works.
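  To make the idea concrete, here is a rough sketch of what such a mapping might look like. The table %GUESS and the subs primary_language, platform and guess_encodings are hypothetical names invented for illustration. Only the two table rows come from the observations above; everything else would have to be filled in from real upload data. GeoIP is left out here, though a country code could be added to the lookup key in the same way.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical lookup table: "language:platform" => candidate encodings.
# Only these two rows come from the observations in this post.
my %GUESS = (
    'no:mac'     => ['MacIcelandic'],
    'ru:windows' => ['cp1251'],
);

# Return the primary language subtag with the highest q-value from an
# Accept-Language header such as "nb-NO,nb;q=0.9,en;q=0.8".
sub primary_language {
    my ($header) = @_;
    my ($best, $best_q) = (undef, -1);
    for my $item (split /\s*,\s*/, $header // '') {
        my ($tag, @params) = split /\s*;\s*/, $item;
        next unless defined $tag && $tag =~ /^([A-Za-z]{2,3})\b/;
        my $lang = lc $1;
        my $q    = 1;    # a missing q parameter means q=1
        for my $param (@params) {
            $q = $1 if $param =~ /^q=([01](?:\.\d{0,3})?)$/;
        }
        ($best, $best_q) = ($lang, $q) if $q > $best_q;
    }
    # Assumption: treat Bokmal (nb) and Nynorsk (nn) as Norwegian (no).
    $best = 'no' if defined $best && $best =~ /^n[bn]$/;
    return $best;
}

# Very crude platform sniffing from the User-Agent string.
sub platform {
    my ($ua) = @_;
    $ua //= '';
    return 'windows' if $ua =~ /Windows/i;
    return 'mac'     if $ua =~ /Macintosh|Mac OS X/i;
    return 'other';
}

# Return the candidate encodings for a request, best guess first.
sub guess_encodings {
    my ($accept_language, $user_agent) = @_;
    my $lang = primary_language($accept_language) // 'unknown';
    my $key  = join ':', $lang, platform($user_agent);
    # Keep only names that Encode actually recognises.
    return grep { Encode::find_encoding($_) } @{ $GUESS{$key} || [] };
}

my @guesses = guess_encodings(
    'nb-NO,nb;q=0.9,en;q=0.8',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4)',
);
print "Candidates: @guesses\n";
```

  An empty result just means the table has no opinion, in which case whatever Encode::Detect reported would still be the only guess available.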
  For the life of me I cannot find any examples on Google of people doing this, or any modules on CPAN for this kind of mapping. Am I missing something? Otherwise I may as well create a new CPAN module for this, so that others in my situation may benefit.


Lyle

Replies are listed 'Best First'.
Re: Mapping ACCEPT_LANG, USER_AGENT & GeoIP to Encode's character sets
by Your Mother (Archbishop) on Jun 22, 2012 at 04:26 UTC

    This might get you started–

    use strictures;
    use Encode;

    # Encoding name and sample input come from the command line.
    my $name  = shift || die "Give an encoding!\n";
    my $input = shift || "Some string...";

    # find_encoding returns false for names Encode does not recognize.
    my $encoding = find_encoding($name)
        or die "No encoding found for $name\n";

    binmode STDOUT, ":encoding(UTF-8)";
    print $encoding->decode($input), $/;
    __END__
    perl pm-977749 MacIcelandic "OHAI Ƌ"
    OHAI ∆
    
    perl pm-977749 MacRoman  "OHAI Ƌ"
    OHAI ∆
    
    perl pm-977749 UTF-8 "OHAI Ƌ"
    OHAI �
    

    Basically, just call find_encoding on the encoding name declared by the client, reject unknown names (or add custom handling for them), and then decode. For customizing, see the Pod for Encode. Of the thousands of named encodings out there, most line up with the stock list Encode is aware of; you just might have to do some mapping of your own. I seem to recall the EUC-KR set having several different names in various standards, for example.
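    As a rough illustration of "some mapping of your own", the sketch below checks a small private alias table before falling back to find_encoding. The %EXTRA_ALIAS table, its single entry, and the subs resolve_encoding and decode_upload are made up for this example; Encode may already resolve some names you would otherwise list, so treat the table as a place for whatever your clients actually send.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical extra names mapped onto encodings that Encode ships with.
my %EXTRA_ALIAS = (
    'x-mac-icelandic' => 'MacIcelandic',
);

# Return an Encode encoding object, or nothing for unknown names.
sub resolve_encoding {
    my ($name) = @_;
    return unless defined $name && length $name;
    my $canonical = $EXTRA_ALIAS{ lc $name } // $name;
    return find_encoding($canonical);
}

# Decode $bytes using the named encoding. Returns undef when the name is
# unknown or when the bytes are not valid in that encoding.
sub decode_upload {
    my ($name, $bytes) = @_;
    my $enc = resolve_encoding($name) or return;
    # FB_CROAK makes decode die on malformed input instead of silently
    # substituting replacement characters; eval turns that into undef.
    my $text = eval { $enc->decode($bytes, Encode::FB_CROAK) };
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
my $text = decode_upload('Windows-1251', "\xCF\xF0\xE8\xE2\xE5\xF2");
print defined $text ? "$text\n" : "could not decode\n";
```

    Keeping the table separate from Encode's own alias list means an unknown name can be logged and added later without touching global state; Encode::Alias also offers define_alias if a process-wide alias is preferred.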
