
Re^3: Arabic Encodding Problem

by graff (Chancellor)
on Feb 11, 2014 at 05:16 UTC (#1074340)

in reply to Re^2: Arabic Encodding Problem
in thread Arabic Encodding Problem

The non-ASCII content in that HTML data is also non-UTF8. Treating it as CP-1256 will probably yield suitable results.

If there are a bunch of HTML files like this (and also a bunch that really are utf8), and you don't want to waste too much time sorting them out, you can add a subroutine like this to your program:

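    use strict;
    use warnings;
    use Encode;

    sub check_encoding {
        my ( $inp_name ) = @_;
        open( my $fh, '<:raw', $inp_name )
            or return "$inp_name: open failed: $!";

        # Read raw lines until one contains a non-ASCII byte; stop at EOF.
        my $str;
        while ( defined( my $line = <$fh> ) ) {
            if ( $line =~ /[^[:ascii:]]/ ) {
                $str = $line;
                last;
            }
        }
        close $fh;

        # Reached EOF without seeing any non-ASCII byte.
        return "ascii" unless defined $str;

        # FB_CROAK makes decode() die on malformed input instead of
        # silently substituting replacement characters.
        eval { decode( 'utf8', $str, Encode::FB_CROAK ) };
        if ( $@ ) {
            return "cp1256";  # We assume Arabic only, so if not utf8, then cp1256
        }
        else {
            return "utf8";
        }
    }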
(update: removed hyphen from "cp1256")

Call that subroutine for each file name, and it will return the string that you should use for the encoding spec when you open the file for parsing. If you handle data for any language other than Arabic, and encounter the same problem, you'll need to tweak this to return some other non-unicode encoding, depending on the language.
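As a rough sketch of how the returned string might be used, the helper below plugs it into an `:encoding(...)` layer when reopening the file. The name `open_decoded` is hypothetical (it is not part of Encode or of the subroutine above), and the whitelist assumes check_encoding() only ever returns "ascii", "utf8", "cp1256", or its open-failure message.

```perl
use strict;
use warnings;

# Hypothetical helper: open $path with a decoding layer built from the
# label that check_encoding() returned for it.
sub open_decoded {
    my ( $path, $enc ) = @_;

    # Anything other than the three expected labels is assumed to be the
    # error message check_encoding() returns when its own open fails.
    die "cannot use '$enc' as an encoding\n"
        unless $enc =~ /^(?:ascii|utf8|cp1256)$/;

    open( my $fh, "<:encoding($enc)", $path )
        or die "$path: open failed: $!\n";
    return $fh;    # reads from this handle yield decoded characters
}
```

With check_encoding() defined in the same program, a call such as `my $fh = open_decoded( $file, check_encoding($file) );` for each file name would give the parser a handle that already delivers characters rather than raw bytes.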

You'll want to read the man page for Encode, especially the part about "Handling Malformed Data".
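To make the malformed-data point concrete, here is a small sketch of the difference between a strict and a lenient decode. The helper name `is_valid_utf8` and the sample byte strings are made up for illustration; the Encode calls and constants (`decode`, `FB_CROAK`, `LEAVE_SRC`) are the ones described in that section of the man page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes decodes cleanly as utf8.
sub is_valid_utf8 {
    my ( $bytes ) = @_;
    # FB_CROAK: die on malformed input. LEAVE_SRC: do not modify $bytes.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { decode( 'utf8', $bytes, $check ); 1 } ? 1 : 0;
}

# The same four-letter Arabic word in two encodings (sample data).
my $cp1256_bytes = "\xD3\xE1\xC7\xE3";
my $utf8_bytes   = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85";

printf "cp1256 bytes valid as utf8? %s\n",
    is_valid_utf8($cp1256_bytes) ? 'yes' : 'no';
printf "utf8 bytes valid as utf8?   %s\n",
    is_valid_utf8($utf8_bytes) ? 'yes' : 'no';

# Without a CHECK argument, decode() does not die on bad input; it
# substitutes U+FFFD replacement characters for the malformed bytes.
my $lenient = decode( 'utf8', $cp1256_bytes );
printf "lenient decode contains U+FFFD? %s\n",
    $lenient =~ /\x{FFFD}/ ? 'yes' : 'no';
```

The lenient form is why the check has to ask for FB_CROAK explicitly: without it, mis-labelled cp1256 data decodes "successfully" into replacement characters and the problem goes unnoticed.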
