PerlMonks  

decoding a UTF-16B string found in an email subject

by neaj (Initiate)
on Oct 30, 2013 at 18:34 UTC ( [id://1060410]=perlquestion )

neaj has asked for the wisdom of the Perl Monks concerning the following question:

hello,

i am downloading email with perl from a pop server. until now I haven't had any problems being able to read subject lines, that is to say I can read the letters and numbers and so forth, even though sometimes subjects are complete gibberish.

anyhow, i've been getting several emails that contain what I believe is a UTF-16 string embedded in the subject header.

as it appears on the webmail's (hotmail) view source:

Subject: username, A Ne=?UTF-16?B?dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==?=
(all on one line)

which is also exactly how the data looks when downloaded via pop

when viewing that specific email in a browser on hotmail, the "A Ne*string*" appears as:

A New Credit Card Could Be Headed Your Way

now I just can't figure out for the life of me how I'm supposed to decode that string. I'll use regex to grab the ?UTF part and then decode it:

print decode( "UTF-16be", $string ), "\n";

but all that returns is a bunch of non english (japanese i think) characters and symbols

what is the proper way to decode that string so that i can actually read the proper subject?

Replies are listed 'Best First'.
Re: decoding a UTF-16B string found in an email subject
by Your Mother (Archbishop) on Oct 30, 2013 at 19:07 UTC

    The headers are base64 encoded so this gets you better gibberish anyway–

    perl -CSD -MEncode -MMIME::Base64 -le 'print decode"UTF-16be", decode_base64"Subject: bobbypin22, A Ne=?UTF-16?B?dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==?="'
    䫦쭛ꆶ⧶�
    

    You'll have to play around with the encoding stuff maybe but the MIME::Base64 is probably all you're missing.

    Update, looking at it more, it seems that it is doing something with the encoding I'm not familiar with… (embedded UTF-16?B stuff). So, someone will probably give a better answer, or I'll look more in a couple hours.

    That string on its own looks like possibly clean Chinese when decoded (I don't read any so, uh, hope it's not echoed spam)–

    眀 䌀爀攀搀椀琀 䌀愀爀搀 䌀漀甀氀搀 䈀攀 䠀攀愀搀攀搀 夀漀甀爀 圀愀礀

      When I first saw the post, I also thought it might look like Base64 encoding, but, trying to decode it, I could not make anything out of its decoding, so I gave up and did not answer. So, if I understand correctly, it is Base64 + Chinese UTF16, did I get it right?

      it's actually english, as i can see the mail on the webserver all decoded and pretty like.

      it is in fact spam; i echoed the subject line in my question above just to show what the line should be when decoded

      and thanks for looking. i'll see where MIME::Base64 leads me

        What skx says is more on point. But it doesn't work as is. Doing decode "MIME-Header" on it gets–

        UTF-16:Unrecognised BOM 7700
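The "7700" in that message is just the first two bytes of the base64-decoded payload. Encode's plain "UTF-16" decoder expects a byte order mark (FE FF or FF FE) at the start and complains when it finds something else. A minimal sketch to see those bytes, using only the first few characters of the payload from the question:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# First 20 base64 characters of the payload from the subject line.
my $bytes = decode_base64('dwAgAEMAcgBlAGQAaQB0');

# Hex of the first two bytes: this is where a BOM would normally sit.
my $first_two = unpack('H4', $bytes);
print "first two bytes: $first_two\n";

# Hex of everything decoded so far, to show the byte layout.
my $all_hex = unpack('H*', $bytes);
print "all bytes:       $all_hex\n";
```

The bytes alternate between an ASCII code and 00, with the ASCII code first, which is the layout of little-endian UTF-16 without a BOM.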
Re: decoding a UTF-16B string found in an email subject
by skx (Parson) on Oct 30, 2013 at 19:24 UTC

    The headers of messages are encoded as per RFC 2047.

    You can see sample code, in Perl, to decode such headers if you consult CPAN for Encode::MIME::Header
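For reference, a minimal sketch of what that looks like on a well-formed header. The sample subject here is made up for illustration (it is not the one from the question) and uses a UTF-8 encoded-word separated from the preceding text by whitespace:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical, well-formed RFC 2047 header: charset UTF-8, "B" (base64) encoding.
my $raw = 'Subject: =?UTF-8?B?SGVsbG8gV29ybGQ=?=';

# 'MIME-Header' is provided by Encode::MIME::Header, which Encode loads on demand.
my $decoded = decode('MIME-Header', $raw);

print "$decoded\n";
```

The subject in the question is harder because its declared charset does not match how the bytes are actually laid out.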

    Steve
    --
      This should be the correct answer, but I don't think the string is correctly encoded. This:
      use Encode qw(decode);
      my $str = 'username, A Ne=?UTF-16?B?dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==?=';
      my $chr = decode('MIME-Header', $str);
      print "$chr\n";
      Gets me:
      UTF-16:Unrecognised BOM 7700 at /.../Encode/MIME/Header.pm line 81.
      While this:
      use MIME::Base64;
      my $cstr = 'dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA';
      my $chk = decode_base64($cstr);
      print "$chk\n";
      Gets me:
      w Credit Card Could Be Headed Your Way
      So the part that is supposed to be UTF-16 appears to be just base64 encoded.

      UPDATE: And if you change 'UTF-16' in the first part to 'UTF-8', then it is correctly decoded without error.

        According to all docs I've found, a BOM is not necessary, and when a BOM is not present then big-endian is assumed. However the string you give seems to be little-endian (as is the case in the problem that got me to this page...). If you s/UTF-16/UTF-16LE/ then your string gets decoded correctly.
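A minimal sketch of that idea done by hand, assuming the payload really is BOM-less little-endian UTF-16 (which matches this message, not necessarily every "UTF-16" header you will meet):

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# Base64 payload of the encoded-word, copied from the subject in the question.
my $payload = 'dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==';

# Step 1: undo the base64 layer; the result is raw bytes ("w\0 \0C\0r\0...").
my $bytes = decode_base64($payload);

# Step 2: interpret the bytes as UTF-16 with the low byte first and no BOM.
my $text = decode('UTF-16LE', $bytes);

# "A Ne" is the plain part of the subject that precedes the encoded-word.
print "A Ne$text\n";
```

Decoding the same bytes as UTF-16BE pairs each ASCII code with the wrong zero byte, which is why the earlier attempts produced CJK-looking characters.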

      thanks, the doc for Encode::MIME::Header explained what i was doing wrong

      use MIME::Base64;
      #my $string = "=?UTF-16?B?dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==?=";
      my $string = "dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==";
      print MIME::Base64::decode( $string ), "\n";

      w Credit Card Could Be Headed Your Way

      i had to use base64 decoding on the encoded word, and not the whole string

      =?encoding?X?ENCODED WORD?=
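A rough sketch of pulling the pieces out of that pattern and decoding them in place. `decode_b_words` is a hypothetical helper name, it only handles the "B" (base64) form, and treating a BOM-less "UTF-16" word as little-endian is an assumption that fits this message rather than anything RFC 2047 promises:

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: replace each =?charset?B?payload?= with its decoded text.
sub decode_b_words {
    my ($header) = @_;
    $header =~ s{=\?([^?]+)\?[Bb]\?([^?]*)\?=}{
        my $charset = $1;
        my $bytes   = decode_base64($2);
        my $bom     = substr($bytes, 0, 2);
        # Assumption: "UTF-16" with no BOM is little-endian, as in this mail.
        $charset = 'UTF-16LE'
            if uc($charset) eq 'UTF-16' && $bom ne "\xFE\xFF" && $bom ne "\xFF\xFE";
        decode($charset, $bytes);
    }ge;
    return $header;
}

my $subject = 'username, A Ne=?UTF-16?B?dwAgAEMAcgBlAGQAaQB0ACAAQwBhAHIAZAAgAEMAbwB1AGwAZAAgAEIAZQAgAEgAZQBhAGQAZQBkACAAWQBvAHUAcgAgAFcAYQB5AA==?=';
my $clean = decode_b_words($subject);
print "$clean\n";
```

For real mail, "Q" encoded-words and folded multi-word headers also need handling, so a proper MIME header decoder is still the better long-term choice.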
        If you just substitute 'UTF-16' with 'UTF-8', then the entire line is correctly decoded with decode('MIME-Header', $str). The encoded part appears to be incorrectly encoded. Probably typical of spammers...

Approved by Laurent_R