Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Re^3: Arabic Encodding Problem

by graff (Chancellor)
on Feb 11, 2014 at 05:16 UTC ( #1074340=note: print w/ replies, xml ) Need Help??

in reply to Re^2: Arabic Encodding Problem
in thread Arabic Encodding Problem

The non-ASCII content in that HTML data is also non-UTF8. Treating it as CP-1256 will probably yield suitable results.

If there are a bunch of HTML files like this (and also a bunch that really are utf8), and you don't want to waste too much time sorting them out, you can add a subroutine like this to your program:

use Encode; sub check_encoding { my ( $inp_name ) = @_; open( my $fh, '<:raw', $inp_name ) or return "$inp_name: open fail +ed: $!"; my $str = ''; until ( $str =~ /[^[:ascii:]]/ ) { $str = <$fh>; } if ( $str =~ /^[[:ascii:]]+$/ ) { return "ascii"; } eval { $_ = decode( 'utf8', $str, Encode::FB_CROAK ) }; if ( $@ ) { return "cp1256"; # We assume Arabic only, so if not utf8, the +n cp1256 } else { return "utf8"; } }
(update: removed hyphen from "cp1256")

Call that subroutine for each file name, and it will return the string that you should use for the encoding spec when you open the file for parsing. If you handle data for any language other than Arabic, and encounter the same problem, you'll need to tweak this to return some other non-unicode encoding, depending on the language.

You'll want to read the man page for Encode, especially the part about "Handling Malformed Data".

Comment on Re^3: Arabic Encodding Problem
Download Code

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1074340]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (1)
As of 2016-05-05 02:09 GMT
Find Nodes?
    Voting Booth?