Re^3: Arabic Encodding Problem

by graff (Chancellor)
on Feb 11, 2014 at 05:16 UTC

in reply to Re^2: Arabic Encodding Problem
in thread Arabic Encodding Problem

The non-ASCII content in that HTML data is also non-UTF8. Treating it as CP-1256 will probably yield suitable results.

If there are a bunch of HTML files like this (and also a bunch that really are utf8), and you don't want to waste too much time sorting them out, you can add a subroutine like this to your program:

    use Encode;

    sub check_encoding {
        my ( $inp_name ) = @_;
        open( my $fh, '<:raw', $inp_name )
            or return "$inp_name: open failed: $!";
        # Read up to the first line holding a non-ASCII byte; stop at EOF
        # so that a pure-ASCII file cannot loop forever.
        my $str = '';
        while ( defined( my $line = <$fh> ) ) {
            $str = $line;
            last if $str =~ /[^[:ascii:]]/;
        }
        close $fh;
        return "ascii" if $str !~ /[^[:ascii:]]/;
        # FB_CROAK makes decode() die on malformed utf8 input.
        eval { decode( 'utf8', $str, Encode::FB_CROAK ) };
        if ( $@ ) {
            return "cp1256";  # We assume Arabic only, so if not utf8, then cp1256
        }
        else {
            return "utf8";
        }
    }
(update: removed hyphen from "cp1256")

Call that subroutine for each file name, and it will return the string that you should use for the encoding spec when you open the file for parsing. If you handle data for any language other than Arabic and encounter the same problem, you'll need to tweak this to return some other non-Unicode encoding, depending on the language.
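
As a rough sketch of how the returned string could be plugged into open(), something like the following might work. The helper name open_detected is made up for illustration, and this compact check_encoding returns undef on an open failure instead of an error string (that is my assumption, not part of the subroutine above), so the caller can tell a failure from an encoding name.

```perl
use strict;
use warnings;
use Encode;

# Compact variant of check_encoding(); returns undef if the file cannot be
# opened (an assumption made for this sketch).
sub check_encoding {
    my ( $inp_name ) = @_;
    open( my $fh, '<:raw', $inp_name ) or return;
    my $str = '';
    while ( defined( my $line = <$fh> ) ) {
        $str = $line;
        last if $str =~ /[^[:ascii:]]/;    # found a non-ASCII byte
    }
    close $fh;
    return 'ascii' if $str !~ /[^[:ascii:]]/;
    my $ok = eval { decode( 'utf8', $str, Encode::FB_CROAK ); 1 };
    return $ok ? 'utf8' : 'cp1256';        # Arabic-only assumption
}

# Hypothetical helper: reopen the file with the detected encoding layer.
sub open_detected {
    my ( $inp_name ) = @_;
    my $enc = check_encoding( $inp_name );
    defined $enc or die "$inp_name: open failed: $!";
    open( my $fh, "<:encoding($enc)", $inp_name )
        or die "$inp_name: open failed: $!";
    return ( $fh, $enc );
}

# Report the detected encoding for each file named on the command line.
for my $file ( @ARGV ) {
    my ( $fh, $enc ) = open_detected( $file );
    print "$file: $enc\n";
    # ... read decoded text from $fh and hand it to the parser here ...
    close $fh;
}
```

Lines read from the returned handle are already decoded to Perl characters, so the parser never has to care which of the three encodings the file was in.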

You'll want to read the man page for Encode, especially the part about "Handling Malformed Data".
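
To give a feel for what that section covers, here is a small sketch (not taken from the Encode docs; the variable names are mine) comparing the default behaviour of decode(), which substitutes U+FFFD for malformed input, with Encode::FB_CROAK, which dies instead. The sample bytes are an Arabic word encoded as cp1256, so they are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode;

# An Arabic word as cp1256 bytes; these bytes are malformed as UTF-8.
my $word         = "\x{645}\x{631}\x{62D}\x{628}\x{627}";
my $cp1256_bytes = encode( 'cp1256', $word );

# Default check value: malformed sequences become U+FFFD, no exception.
my $lenient      = decode( 'UTF-8', $cp1256_bytes );
my $replacements = () = $lenient =~ /\x{FFFD}/g;

# FB_CROAK: decode() dies on the first malformed sequence.
# LEAVE_SRC keeps decode() from modifying $cp1256_bytes in place.
my $strict = eval {
    decode( 'UTF-8', $cp1256_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
};
my $croaked = defined $strict ? 0 : 1;

# Decoding with the right encoding recovers the original characters.
my $recovered = decode( 'cp1256', $cp1256_bytes );

printf "replacement characters from lenient decode: %d\n", $replacements;
printf "strict decode croaked: %s\n", $croaked ? 'yes' : 'no';
printf "cp1256 decode round-trips: %s\n", $recovered eq $word ? 'yes' : 'no';
```

The croaking form is what makes the eval-based test in check_encoding possible: without FB_CROAK, a wrong guess silently produces replacement characters instead of an error you can catch.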
