http://www.perlmonks.org?node_id=1199663

Hello fellow Monks,

I am looking for your advice on updating and my implemented module for encoding and decoding multiple formats. I wrote the module and tried to include as many formats I could. I know that there other formats that I have not added but in my case during the encoding decoding process has to be also converted to hex and vise versa, where I found problems with more formats that I have not included on my sample of code.

The whole idea behind the module, I am working for a telecommunication company and part of my daily job is to correct problems. The languages can vary globally since it is a live network with live customers and the format is in hex on a variety of encoding patterns. I had some cases that I had to create small scripts to process the packages before and after the nodes so I can observe encoding corruptions or not. Sample of previous questions that I was working that are similar with the module (Chinese to Hex and Hex to Chinese, Arabic to Hex and Hex to Arabic). After seeing my self that I need more and more encodings for more and more languages I end up saying that I need to write a simple module to do that for me instead of creating more or less the same code again and again.

So having said that, sample of code as the user would use the module based on the encodings that can be handled:

#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use Cwd qw();
use Data::Dumper;
use feature 'say';

=alternative
BEGIN {
    push ( @INC, Cwd::cwd());
}
=cut

use Cwd qw();
use lib (Cwd::cwd());

use Foo::Bar qw(ascii2hexEncode hex2ascciiDecode hexOutput);

binmode( STDOUT, ':utf8' );

my @lanquages = qw(
    Chinese
    Japanese
    Russian
    Greek
    Arabic
    );

my @strs = (
    "這是一個測試",
    "これはテストです",
    "Это тест",
    "Αυτό είναι ένα τεστ",
    "هذا اختبار"
    );

my @flags = (
    'UCS-2',
    'UCS-2BE',
    'UCS-2LE',
    'UTF-7',
    'UTF-8',
    'utf-8-strict',
    'UTF-16',
    'UTF-16BE',
    'UTF-16LE',
    'UTF-32',
    'UTF-32BE',
    'UTF-32LE',
    );

my %hashOutput;
while ( defined ( my $flag = shift @flags ) ) {
    for ( 0 .. $#lanquages ) {
	my $hexEncoded = ascii2hexEncode($flag, $strs$_);
	say $lanquages$_ . " " . $flag;
	print Dumper hexOutput($flag, $strs$_);
	say hex2ascciiDecode($flag, $hexEncoded);
	say "";
	# $hashOutput{$flag}{$lanquages$_} = {
	# 'hex' => hexOutput($flag, $strs$_),
	# 'ascci' => hex2ascciiDecode($flag, $hexEncoded),
        # }
    }

}
# print Dumper \%hashOutput;

__END__

Arabic UCS-2
$VAR1 = [
          '06 47 06 30 06 27 00 20 06 27',
          '06 2e 06 2a 06 28 06 27 06 31'
        ]
هذا اختبار

Arabic UTF-8
$VAR1 = [
          'd9 87 d8 b0 d8 a7 20 d8 a7 d8',
          'ae d8 aa d8 a8 d8 a7 d8 b1'
        ]
هذا اختبار

Arabic utf-8-strict
$VAR1 = [
          'd9 87 d8 b0 d8 a7 20 d8 a7 d8',
          'ae d8 aa d8 a8 d8 a7 d8 b1'
        ]
هذا اختبار

Arabic UTF-16
$VAR1 = [
          'fe ff 06 47 06 30 06 27 00 20',
          '06 27 06 2e 06 2a 06 28 06 27',
          '06 31'
        ]
هذا اختبار

Arabic UTF-16BE
$VAR1 = [
          '06 47 06 30 06 27 00 20 06 27',
          '06 2e 06 2a 06 28 06 27 06 31'
        ]
هذا اختبار

Arabic UTF-32
$VAR1 = [
          '00 00 fe ff 00 00 06 47 00 00',
          '06 30 00 00 06 27 00 00 00 20',
          '00 00 06 27 00 00 06 2e 00 00',
          '06 2a 00 00 06 28 00 00 06 27',
          '00 00 06 31'
        ]
هذا اختبار

Arabic UTF-32LE
$VAR1 = [
          '47 06 00 00 30 06 00 00 27 06',
          '00 00 20 00 00 00 27 06 00 00',
          '2e 06 00 00 2a 06 00 00 28 06',
          '00 00 27 06 00 00 31 06 00 00'
        ]
هذا اختبار

The actual module, that I still have not found a good name to apply. Any ideas for naming please feel free to propose.

package Foo::Bar; # use utf8; use strict; use warnings; use Exporter qw(import); use Encode qw(decode encode); use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS); $VERSION = 1.00; @ISA = qw(Exporter); @EXPORT = (); @EXPORT_OK = qw(ascii2hexEncode hex2ascciiDecode hexOutput); # binmode( STDOUT, ':utf8' ); sub _ascii2hex { return unpack("H*", $_[0]); } sub _hex2ascii { return pack("H*", $_[0]); } sub hexOutput { my ( $flag , $data ) = @_; my $octet = ascii2hexEncode( $flag , $data ); # trim leading and trailing white space # split string every two characters # join the splitted characters with white space $octet = join(' ', split(/(..)/, $octet)) =~ s/^\s+|\s+$//r =~ y/ / /rs; # insert new line character every 30 characters # join("\n", unpack('(A30)*', $octet)); push my @aref, , unpack('(A30)*', $octet); return \@aref; } sub ascii2hexEncode { my ( $flag , $data ) = @_; my $octet = encode( $flag , $data ); return _ascii2hex( $octet ); } sub hex2ascciiDecode { my ( $flag , $data ) = @_; my $hex2ascciiOctet = _hex2ascii( $data ); return decode( $flag , $hex2ascciiOctet ); } 1;

The module by it self is extremely simple, but at the same time on my position and for my colleagues is extremely useful. Any suggestions on code or any other improvement please feel free to suggest.

Hope this tiny module will help others also.

BR, Thanos

Seeking for Perl wisdom...on the process of learning...not there...yet!

Replies are listed 'Best First'.
Re: Encoding Decoding on multiple formats RFC
by Tux (Abbot) on Sep 21, 2017 at 14:17 UTC

    Just to compare, I used what already was available to achieve almost the same:

    use 5.18.2;
    use warnings;
    use utf8;
    use Encode qw( encode decode );
    use Data::Peek;
    
    binmode STDOUT, ":encoding(utf8)";
    
    my @lang = (
        [ Arabic   => "هذا اختبار"                    ],
        [ Chinese  => "這是一個測試"            ],
        [ Greek    => "Αυτό είναι ένα τεστ"         ],
        [ Japanese => "これはテストです"             ],
        [ Russian  => "Это тест"                  ],
        );
    
    my @encodings = (
        "UCS-2",
        "UCS-2BE",
        "UCS-2LE",
        "UTF-7",
        "UTF-8",
        "utf-8-strict",
        "UTF-16",
        "UTF-16BE",
        "UTF-16LE",
        "UTF-32",
        "UTF-32BE",
        "UTF-32LE",
        );
    
    my %hashOutput;
    for (@lang) {
        my ($lang, $str) = @$_;
        foreach my $enc (@encodings) {
            printf "--\n%-8s %s\n", $lang, $enc;
            DPeek $str;
            my $bytes = encode ($enc, $str);
            DHexDump $bytes;
            }
        }
    

    =>

    -- Arabic UCS-2 PV("\331\207\330\260\330\247 \330\247\330\256\330\252\330\250\330\247\ +330\261"\0) [UTF8 "\x{647}\x{630}\x{627} \x{627}\x{62e}\x{62a}\x{628} +\x{627}\x{631}"] 0000 06 47 06 30 06 27 00 20 06 27 06 2e 06 2a 06 28 .G.0.'. .'...* +.( 0010 06 27 06 31 .'.1 -- Arabic UCS-2BE PV("\331\207\330\260\330\247 \330\247\330\256\330\252\330\250\330\247\ +330\261"\0) [UTF8 "\x{647}\x{630}\x{627} \x{627}\x{62e}\x{62a}\x{628} +\x{627}\x{631}"] 0000 06 47 06 30 06 27 00 20 06 27 06 2e 06 2a 06 28 .G.0.'. .'...* +.( 0010 06 27 06 31 .'.1 -- Arabic UCS-2LE PV("\331\207\330\260\330\247 \330\247\330\256\330\252\330\250\330\247\ +330\261"\0) [UTF8 "\x{647}\x{630}\x{627} \x{627}\x{62e}\x{62a}\x{628} +\x{627}\x{631}"] 0000 47 06 30 06 27 06 20 00 27 06 2e 06 2a 06 28 06 G.0.'. .'...*. +(. 0010 27 06 31 06 '.1. : : : -- Russian UTF-32BE PV("\320\255\321\202\320\276 \321\202\320\265\321\201\321\202"\0) [UTF +8 "\x{42d}\x{442}\x{43e} \x{442}\x{435}\x{441}\x{442}"] 0000 00 00 04 2d 00 00 04 42 00 00 04 3e 00 00 00 20 ...-...B...>.. +. 0010 00 00 04 42 00 00 04 35 00 00 04 41 00 00 04 42 ...B...5...A.. +.B -- Russian UTF-32LE PV("\320\255\321\202\320\276 \321\202\320\265\321\201\321\202"\0) [UTF +8 "\x{42d}\x{442}\x{43e} \x{442}\x{435}\x{441}\x{442}"] 0000 2d 04 00 00 42 04 00 00 3e 04 00 00 20 00 00 00 -...B...>... . +.. 0010 42 04 00 00 35 04 00 00 41 04 00 00 42 04 00 00 B...5...A...B. +..

    Enjoy, Have FUN! H.Merijn

      Hello Tux,

      I had no clue about this module, thanks for pointing out.

      Indeed the output of both are very close to each other. The difference that I see is the decode process that the user can take the hex string and convert it back to the original format, where it is also necessary for me. When you do not have the string in original format and you need to see if the encoding occurred correctly.

      Again thank you for pointing out this great module as I am impressed by the hex output, I definitely want to add it as a reference so people can use it also.

      Update: Well you gave me the idea of that the user should be able to put pieces of DumperOutput and convert them to readable strings. So I added this method:

      sub hexDumperInput { my ( $unicodeFormat , $arrayRef ) = @_; my $hexString = join('', split(/ /, join('', @$arrayRef))); return hex2ascciiDecode($unicodeFormat, $hexString); }

      Sample of test.pl:

      my $hexArrayRef = [ '06 47 06 30 06 27 00 20 06 27',
      		    '06 2e 06 2a 06 28 06 27 06 31' ]
      
      say hexDumperInput('UCS-2', $hexArrayRef);
      
      __END__
      
      $ perl test.pl 
      هذا اختبار
      

      Thanks again for the inspiration and ideas :D. I will added on the module.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!

        hex2ascciiDecode

        BTW, I don't know if anyone ever noticed, but this method name has a typo, it should be hex2asciiDecode. I suggest renaming it, and maybe leaving the misspelled one as a wrapper so as to not break existing code (though at this point in time there's not likely to be a lot).

Re: Encoding Decoding on multiple formats RFC
by AppleFritter (Vicar) on Sep 21, 2017 at 08:18 UTC

    This sounds like a rather useful module. Thanks for sharing! (Perlmonks is interpreting square brackets in your code as links, however. Are you sure you're using proper code tags?)

    The actual module, that I still have not found a good name to apply. Any ideas for naming please feel free to propose.

    Random idea: how about Unicode::Peek, given that you're essentially peeking at how a Unicode string is looking under the hood?

      (Perlmonks is interpreting square brackets in your code as links, however. Are you sure you're using proper code tags?)

      thanos1983 is using <pre> tags instead because the <code> tags don't play happily with non-Latin-1 encodings (it's a known problem of long standing). The workaround is to replace the relevant characters (the square brackets) in the source with HTML entities before posting, but that's a faff.

      Hello AppleFritter,

      Thank you for time and effort reading and replying to my request for comments. Hmmmm Unicode::Peek not a bad idea, it is exactly as you describe. The module meant to take a quick peek on the encoding. Well since I am also converting into hex why not Unicode::HexPeek?

      Regarding the code tags, as fellow Monk hippo said I am using the <pre></pre> tags to encode the UTF-8 data.

      Again thanks for the interesting ideas, I appreciate it. BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!

        G'day Thanos,

        "... using the <pre></pre> tags to encode the UTF-8 data."

        That's quite correct to do that!

        You also need to replace special characters with entities:

        ForReplaceWith
        HTML&&amp;
        <&lt;
        >&gt;
        PerlMonks[&#91;
        ]&#93;

        [The "HTML" ones are generic (i.e. not just for this site); The "PerlMonks" ones are specific to this site to prevent automatic link generation.]

        These appear under the textarea when you're composing your message:

        "You may need to use entities for some characters, as follows. ..."

        Unfortunately, that list is no longer there when previewing, nor when editing after node creation.

        — Ken

        Well since I am also converting into hex why not Unicode::HexPeek?

        That also works. Personally I'd not get too specific however; perhaps you'll find yourself wanting to expand the module beyond hexadecimal "peeking" further down the road. Best not to paint oneself into a corner too early!