PerlMonks  

Arabic to Hex and Hex to Arabic

by thanos1983 (Priest)
on Jul 28, 2017 at 16:12 UTC (#1196219)
thanos1983 has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks,

I am trying to convert an Arabic string to hexadecimal (the final form) and vice versa, but I am lost in the intermediate steps.

I successfully converted the string to UTF-8, and I was then thinking of converting it to ASCII characters and from there to hex with the help of String::HexConvert. If that worked, I would reverse the process to get back to the original form.

Sample of code:

#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use feature qw(say);
use Encode qw(decode encode);

binmode( STDOUT, ':utf8' );

my $arabic = 'ﻟﻠﺒﻴﻊ';                 # "For sale"
my $utf8 = encode( 'UTF-8', $arabic );
my $str = decode( 'UTF-8', $utf8);

say 'UTF-8';
say $arabic; # ﻟﻠﺒﻴﻊ
say $utf8;
say $str; # ﻟﻠﺒﻴﻊ
say '';

my $ucs2 = encode("UCS-2BE", $arabic);
my $decoded = decode("UCS-2BE", $ucs2);

say 'UCS-2';
say $arabic; # ﻟﻠﺒﻴﻊ
say $ucs2;
say $decoded; # ﻟﻠﺒﻴﻊ

__END__

$ perl test.pl
UTF-8
ﻟﻠﺒﻴﻊ
Ÿﻠ’ﻴŠ
ﻟﻠﺒﻴﻊ

UCS-2
ﻟﻠﺒﻴﻊ
’
ﻟﻠﺒﻴﻊ

I am lost on how to convert the UTF-8 string to ASCII. Am I thinking about this correctly, or completely wrong? The goal is to convert the Arabic string into hexadecimal and vice versa; it does not matter how many conversions I have to make, as long as it works. So far none of the conversions have worked.

Does anyone have an idea how to convert UTF-8 to ASCII for Arabic characters? Is this even possible? If not, any other ideas?
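No ASCII step is needed for this, since hex is only a display of the encoded bytes. A minimal sketch using core Encode, pack and unpack; the helper names str_to_hex and hex_to_str are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Characters -> bytes in the chosen encoding -> lowercase hex digits.
sub str_to_hex {
    my ($encoding, $string) = @_;
    return unpack('H*', encode($encoding, $string));
}

# Hex digits -> bytes -> characters.
sub hex_to_str {
    my ($encoding, $hex) = @_;
    return decode($encoding, pack('H*', $hex));
}

# The string from the question, written with escapes so the example
# does not depend on how this file is saved.
my $arabic = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

my $hex  = str_to_hex('UTF-8', $arabic);
my $back = hex_to_str('UTF-8', $hex);

print "$hex\n";    # efbb9fefbba0efba92efbbb4efbb8a
print $back eq $arabic ? "round trip ok\n" : "round trip failed\n";
```

Passing 'UCS-2BE' instead of 'UTF-8' gives the two-bytes-per-character form.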

Update: Finally managed to make it work. See the sample below:

#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use feature qw(say);
use Encode qw(decode encode);
use String::HexConvert ':all';

binmode( STDOUT, ':utf8' );

my $arabic = 'ﻟﻠﺒﻴﻊ';                 # "For sale"
say $arabic;
say 'UTF-8';
my $utf8 = encode( 'UTF-8', $arabic );

my $ascii2hexUTF8 = ascii_to_hex($utf8);
say $ascii2hexUTF8;

my $hex2ascciiUTF8 = hex_to_ascii($ascii2hexUTF8);
say $hex2ascciiUTF8;

my $strUTF8 = decode( 'UTF-8', $hex2ascciiUTF8);
say $strUTF8;
say '';
say 'UCS-2';
my $ucs2 = encode("UCS-2BE", $arabic);

my $ascii2hexUCS2 = ascii_to_hex($ucs2);
say $ascii2hexUCS2;

my $hex2ascciiUCS2 = hex_to_ascii($ascii2hexUCS2);
say $hex2ascciiUCS2;

my $strUCS2 = decode("UCS-2BE", $hex2ascciiUCS2);
say $strUCS2;

__END__

$ perl test.pl
ﻟﻠﺒﻴﻊ
UTF-8
efbb9fefbba0efba92efbbb4efbb8a
Ÿﻠ’ﻴŠ
ﻟﻠﺒﻴﻊ

UCS-2
fedffee0fe92fef4feca
’
ﻟﻠﺒﻴﻊ

Now the question is: how could I get the hexadecimal output in a form like this (sample):

00 31 00 2e
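The sample `00 31 00 2e` is what the two characters `1.` look like in UCS-2BE, printed one byte at a time. A sketch, assuming that is the layout wanted; unpack with the `(H2)*` template returns one two-digit string per byte:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "1." encoded as UCS-2BE is the four bytes 00 31 00 2e.
my $bytes  = encode('UCS-2BE', '1.');
my $spaced = join ' ', unpack('(H2)*', $bytes);
print "$spaced\n";        # 00 31 00 2e

# The same call on the Arabic string from the question.
my $arabic     = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";
my $arabic_hex = join ' ', unpack('(H2)*', encode('UCS-2BE', $arabic));
print "$arabic_hex\n";    # fe df fe e0 fe 92 fe f4 fe ca
```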

Update2: I added a sample using join and split, together with its output:

#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use feature qw(say);
use Encode qw(decode encode);
use String::HexConvert ':all';

binmode( STDOUT, ':utf8' );

my $arabic = 'ﻟﻠﺒﻴﻊ';                 # "For sale"

say 'UTF-8';
my $utf8 = encode( 'UTF-8', $arabic );

my $ascii2hexUTF8 = ascii_to_hex($utf8);
say join(' ', split(/(..)/, $ascii2hexUTF8));

my $hex2ascciiUTF8 = hex_to_ascii($ascii2hexUTF8);

my $strUTF8 = decode( 'UTF-8', $hex2ascciiUTF8);
say $strUTF8;

say '';

say 'UCS-2';
my $ucs2 = encode("UCS-2BE", $arabic);

my $ascii2hexUCS2 = ascii_to_hex($ucs2);
say join(' ', split(/(..)/, $ascii2hexUCS2));

my $hex2ascciiUCS2 = hex_to_ascii($ascii2hexUCS2);

my $strUCS2 = decode("UCS-2BE", $hex2ascciiUCS2);
say $strUCS2;

__END__

$ perl test.pl
UTF-8
 ef  bb  9f  ef  bb  a0  ef  ba  92  ef  bb  b4  ef  bb  8a
ﻟﻠﺒﻴﻊ

UCS-2
 fe  df  fe  e0  fe  92  fe  f4  fe  ca
ﻟﻠﺒﻴﻊ

I can split the hexadecimal string into pairs of digits, but is this correct? Is there any other way to display the hexadecimal in a different format? For example, in the hex documentation they represent hex as '0xAf'. Should I add the prefix manually, or is there another way to get the hex like this?
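The doubled spaces in the output above come from `split(/(..)/, ...)`: it returns the captured pairs and also the empty strings between them. A sketch of two alternatives; the `0x` prefix is simply added by hand here, which is one option and not something the hex documentation prescribes:

```perl
use strict;
use warnings;

my $hex = 'fedffee0fe92fef4feca';    # UCS-2BE hex from the output above

# A global match in list context returns only the pairs, no empty fields.
my @pairs = $hex =~ /../g;
print join(' ', @pairs), "\n";
# fe df fe e0 fe 92 fe f4 fe ca

# Prefix each byte with 0x.
my $prefixed = join ' ', map { "0x$_" } @pairs;
print "$prefixed\n";
# 0xfe 0xdf 0xfe 0xe0 0xfe 0x92 0xfe 0xf4 0xfe 0xca

# split works too once the empty fields are filtered out.
my @same = grep { length } split /(..)/, $hex;
print "same result\n" if "@same" eq "@pairs";
```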

Update3: Thanks to fellow monk choroba I modified the <code> to <pre>.

Thanks everyone in advance for their time and effort.

Seeking for Perl wisdom...on the process of learning...not there...yet!

Replies are listed 'Best First'.
Re: Arabic to Hex and Hex to Arabic
by Your Mother (Chancellor) on Jul 28, 2017 at 17:25 UTC

    It looks like you got it worked out to where you understand the issues, but it sounds like an XY problem. Is this just a learning exercise, or do you have some kind of requirements? Here is a simplistic, related tangent that arrives at the same hex values:

    #!/usr/bin/env perl
    use utf8;
    use strict;
    use warnings;
    use HTML::Entities "encode_entities";
    my $arabic = "ﻟﻠﺒﻴﻊ"; # "For sale"
    
    print encode_entities($arabic), $/;
    __END__
    &#xFEDF;&#xFEE0;&#xFE92;&#xFEF4;&#xFECA;
    

      Hello Your Mother,

      Thanks for the reply. I saw this possible solution somewhere, but I did not spend the time to apply it.

      Well, in my case it is exactly the problem that I have and wanted to resolve. I am working for a telecommunications company and wanted to debug / troubleshoot a customer's problem. So I wanted exactly this :D

      Thank you for your time and effort again. :D

Re: Arabic to Hex and Hex to Arabic
by Tux (Abbot) on Jul 28, 2017 at 19:58 UTC
    use 5.18.2;
    use warnings;
    
    use utf8;
    use Data::Peek;
    use Encode qw( decode encode );
    
    my $arabic = "ﻟﻠﺒﻴﻊ"; # "For sale"
    
    say join " " => map { sprintf "U+%06x", ord $_ } split m// => $arabic;
    
    say join " " => map { sprintf "U+%06x",     $_ } unpack "U*", $arabic;
    
    DHexDump $arabic;
    
    say join " " => map { sprintf "U+%02x", ord $_ } split m// => encode (utf8 => $arabic);
    
    U+00fedf U+00fee0 U+00fe92 U+00fef4 U+00feca
    U+00fedf U+00fee0 U+00fe92 U+00fef4 U+00feca
    0000 ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a ...............
    U+ef U+bb U+9f U+ef U+bb U+a0 U+ef U+ba U+92 U+ef U+bb U+b4 U+ef U+bb U+8a

    Enjoy, Have FUN! H.Merijn

      Hello Tux,

      Thank you very much, this is another great idea. Thank you again for your time and effort.

Re: Arabic to Hex and Hex to Arabic
by kcott (Chancellor) on Jul 29, 2017 at 06:07 UTC

    G'day thanos1983,

    You can get the code points, without firing up the regex engine, like this:

    $ perl -Mutf8 -C -E 'my $x = "ﻟﻠﺒﻴﻊ"; say sprintf "%x", ord substr $x, $_, 1 for 0 .. length($x) - 1'
    fedf
    fee0
    fe92
    fef4
    feca
    

    I don't speak, read or write Arabic; however, checking against Unicode's (PDF) code chart "Arabic Presentation Forms-B", these certainly appear correct.

    You asked about getting a "0x" prefix. You can do that with sprintf by changing "%x" to "%#x".

    $ perl -Mutf8 -C -E 'my $x = "ﻟﻠﺒﻴﻊ"; say sprintf "%#x", ord substr $x, $_, 1 for 0 .. length($x) - 1'
    0xfedf
    0xfee0
    0xfe92
    0xfef4
    0xfeca
    

    I don't know anything about UCS, so I might be missing something here. The output you show under "UCS-2", is just the code points, from my first one-liner, as pairs of hex digits (which, obviously, you could get with substr - still not needing a regex).

    I accidentally generated what you show as "UTF-8" output, when I initially wrote that first one-liner, because I forgot to add the utf8 pragma.

    $ perl -C -E 'my $x = "ﻟﻠﺒﻴﻊ"; say sprintf "%x", ord substr $x, $_, 1 for 0 .. length($x) - 1'
    ef
    bb
    9f
    ef
    bb
    a0
    ef
    ba
    92
    ef
    bb
    b4
    ef
    bb
    8a
    

    Anyway, knowing neither Arabic nor UCS, I don't want to draw any inferences from that output. It might, however, provide you with some insights.
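What likely happened in that last one-liner: without the utf8 pragma, Perl reads the literal as 15 separate bytes and not as 5 characters. A sketch of the same effect using core functions only; byte escapes stand in for the raw source text, so the example does not depend on the pragma or on the file encoding:

```perl
use strict;
use warnings;

# The UTF-8 bytes of the string, as Perl sees the literal without "use utf8".
my $bytes = "\xef\xbb\x9f\xef\xbb\xa0\xef\xba\x92\xef\xbb\xb4\xef\xbb\x8a";
print length($bytes), "\n";                           # 15
print sprintf('%x', ord(substr($bytes, 0, 1))), "\n"; # ef

# Decoding a copy in place turns the bytes into characters, which is
# what the pragma arranges for literals in the source.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";
print length($chars), "\n";                           # 5
print sprintf('%x', ord(substr($chars, 0, 1))), "\n"; # fedf
```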

    The second part of your title was "... Hex to Arabic". Just printing the hex output I first got, gives me the original Arabic string.

    $ perl -C -E 'say "\x{fedf}\x{fee0}\x{fe92}\x{fef4}\x{feca}"'
    ﻟﻠﺒﻴﻊ
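When the hex digits arrive as data and not as `\x{...}` escapes in the source, hex and chr do the same job. A minimal sketch, assuming the code points come as a whitespace-separated list:

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

my $input = 'fedf fee0 fe92 fef4 feca';

# hex() turns each group of digits into a number, chr() into a character.
my $string = join '', map { chr hex $_ } split ' ', $input;
print "$string\n";
```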
    

    P.S. I'm using 5.26.0.

    — Ken

      I don't know anything about UCS

      UCS is essentially a legacy set of encodings for Unicode. UCS-2 is a two byte encoding, UCS-4 uses four bytes.

      UCS-2 is very similar to UTF-16, except that only characters in the BMP are allowed. UCS-2 has no concept of surrogates. You can read UCS-2 like you would read UTF-16. And if you write UTF-16 without surrogates, you also have written UCS-2. UTF-16 with surrogates is not compatible with UCS-2.

      UCS-4 is very similar to UTF-32, capable of encoding 2^31 characters (sign bit is fixed to 0), but its definition is artificially limited to the range 0..0x10FFFF to stay compatible with other Unicode encodings. Because of this limitation, UCS-4 and UTF-32 encode all characters in an identical way.
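These relationships can be checked with Encode. A sketch, assuming the UCS-2BE, UTF-16BE and UTF-32BE encoding names that ship with the module; U+1F600 is picked here only as an example of a character outside the BMP:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The string from the question; every character is inside the BMP.
my $arabic = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

# For BMP characters UCS-2 and UTF-16 produce identical bytes.
my $ucs2  = unpack('H*', encode('UCS-2BE',  $arabic));
my $utf16 = unpack('H*', encode('UTF-16BE', $arabic));
print "$ucs2\n";                          # fedffee0fe92fef4feca
print "identical\n" if $ucs2 eq $utf16;

# Outside the BMP, UTF-16 needs a surrogate pair (two 16-bit units).
my $pair = unpack('H*', encode('UTF-16BE', "\x{1F600}"));
print "$pair\n";                          # d83dde00

# UTF-32 spends four bytes on every character.
my $utf32 = unpack('H*', encode('UTF-32BE', "\x{FEDF}"));
print "$utf32\n";                         # 0000fedf
```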

      See also Universal Character Set, "Unicode Encodings" and "Beyond Unicode code points" in perlunicode.

      More "Unicode and Perl" stuff: perlunicode, perlunicook, perlunifaq, perluniintro, perlunitut, Encode

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Hello kcott,

      Thanks, this is one of the reasons that I ask questions on this forum and not on any other. People come up with so many interesting answers and new ideas. Thank you for your time and effort. :D

