
Convert strings with unknown encodings to html

by Pascal666 (Scribe)
on Jun 29, 2015 at 06:54 UTC

Pascal666 has asked for the wisdom of the Perl Monks concerning the following question:

I need to pull strings out of a database and format them for display on a web page. Individually this is not a problem, but the strings are in various formats in the database and I'm having trouble figuring out a sequence of conversions that will handle all inputs. The strings are mostly ascii, but some of them have special characters embedded.

The below program creates a sample array of characters from the database, then converts them to html. I manually figured out what needs to be done to each character. I need to replace the noted two lines with something that can automatically handle the various formats.

#!/usr/bin/perl -W
use strict;
use warnings;
use feature 'say';
use Encode qw(decode encode);
use HTML::Entities;

my @in = (chr(226).chr(152).chr(134), chr(195).chr(161), chr(150), chr(153),
          '&reg;', '&', '&AElig;', chr(63743), chr(991), chr(9760));
decode_entities($_) for @in;
#The below two lines need to be replaced
$in[$_] = decode ('utf8', $in[$_]) for 0..1;
$in[$_] = decode ('cp1252', $in[$_]) for 2..3;
say encode_entities($_) for @in;
output: ☆ á – ™ ® & Æ  ϟ ☠

Thank you for any assistance you can render.

Replies are listed 'Best First'.
Re: Convert strings with unknown encodings to html (fix your database)
by Anonymous Monk on Jun 29, 2015 at 07:10 UTC

      Encoding::FixLatin does appear to be what I was looking for. I replaced the two noted lines with:

      $in[$_] = fix_latin($in[$_]) for 0..9;

      and everything converted correctly. I integrated this fix into my main program and again everything seems to be converting correctly.

      Thank you for your assistance.
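      A minimal sketch of what that replacement does with a single string that mixes encodings. The sample string is made up for illustration; it assumes Encoding::FixLatin is installed and relies on its documented behaviour of passing valid UTF-8 sequences through and treating stray bytes in the 0x80-0x9F range as CP1252.

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# Hypothetical input: one byte string mixing a UTF-8 sequence (C3 A9)
# with two lone CP1252 bytes (0x96 and 0x99).
my $mixed = "caf\xC3\xA9 \x96 100\x99";

# By default fix_latin returns a Perl character string, not UTF-8 bytes.
my $chars = fix_latin($mixed);

# List the code points of the non-ASCII characters that came out.
printf "U+%04X\n", ord($_) for grep { ord($_) > 127 } split //, $chars;
```

      If UTF-8 bytes are wanted instead of characters, the module documents a bytes_only option: fix_latin($mixed, bytes_only => 1).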

        Great,

        next step should be to track down all the programs that are putting junk inside the database, fix them to put in the correct stuff

        then backup database, and go ahead and fix_latin the whole database, so fix_latin is no longer needed in your display program

        then your database will just have correct data

Re: Convert strings with unknown encodings to html
by Anonymous Monk on Jun 29, 2015 at 07:32 UTC
    the strings are in various formats
    Do you know all possible formats that the strings might be in, and if yes, what are they?
    The strings are mostly ascii, but some of them have special characters
    Please don't call them 'special characters': the characters are completely normal, it's your database that is 'special'.

      I included examples of each in the above test program. Note that some of the examples are multiple bytes (#1 below, for example, is two characters, one of three bytes and one of two). Best I can tell, the formats are:

      1. UTF-8: chr(226).chr(152).chr(134), chr(195).chr(161)
      2. CP1252: chr(150), chr(153)
      3. HTML: '&reg;', '&AElig;'
      4. ASCII: '&'
      5. Unicode codepoints: chr(63743), chr(991), chr(9760)

      Obviously the database is a bit 'special'. Unfortunately it is provided by a 3rd party, a very large company, and I have no control over their input sanitization.

        Obviously the database is a bit 'special'. Unfortunately it is provided by a 3rd party, a very large company, and I have no control over their input sanitization.

        :) complain

Re: Convert strings with unknown encodings to html
by graff (Chancellor) on Jun 30, 2015 at 03:57 UTC
    So, you're saying that you have this one database (just one table? multiple tables?), and when you query to get strings from it (from just one column? from multiple columns?), you sometimes get utf8 strings, and sometimes get cp1252 strings, and sometimes get character entity references like &reg; (and sometimes numeric references like &#254; or &#xFE;?). Can all the variation occur within a single column, or is it different depending on which column holds the string?

    And have you decided what format you want to normalize to? If so, what is that? (If not, why not?)

    If there's really no way to predict what sort of encoding is coming back from the database for a given query, then you really do have one totally fubar'd database. What a shame.

    I gather you've done some diagnosis of database contents, and have some idea about the scope of variation. Is stuff still being added to it? If so, does it continue to be as messy and uncontrolled as the stuff that's already there?

    Don't feel like you have to tell us the answers to all those questions - those are just the main things you have to think about because they affect what kind(s) of solution(s) are likely to be useful.

    Let's suppose you want your "normalized format" to be just utf8 characters (no entities like &amp; &reg; &#xf8ff etc.)

    In terms of checking what needs to be done to a given string in order to get to that normalized form, there are a few handy guidelines:

    • For any string that contains non-ASCII bytes, it's almost impossible for a non-utf8 string to be mistaken as utf8 data. Do this on every string from the database that has any non-ASCII content:
      eval { decode('utf8', $string, Encode::FB_CROAK) };
      and then check $@. If it's true (meaning that the eval block died), then the string is definitely not utf8; if it's false, the eval block didn't die, and you can be reasonably sure that the string is utf8-encoded.
    • If you happen to know that a non-ASCII string that is also non-utf8 is bound to be cp1252, then go ahead and decode from cp1252 to utf8.
    • Once the non-ASCII content (if any) is in utf8, decode character entities; you may have to do this more than once (I've seen text with stuff like &amp;aacute; - if that can happen in your database, repeat this step until the output string matches the input string).
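    The first two guidelines above can be sketched roughly as follows. The helper name decode_guess is hypothetical, the CP1252 fallback is the assumption stated in the second bullet, and the sketch uses the strict 'UTF-8' encoding name plus Encode::LEAVE_SRC so that decode() does not modify its input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a string of unknown encoding into Perl characters.
sub decode_guess {
    my ($bytes) = @_;

    # Code points above 255 cannot be bytes: treat as already-decoded text.
    return $bytes if $bytes =~ /[^\x00-\xFF]/;

    # Pure ASCII needs no decoding at all.
    return $bytes unless $bytes =~ /[^\x00-\x7F]/;

    # Strict UTF-8 attempt; the eval traps the croak on malformed input.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars if defined $chars;

    # Assumption: anything that is not valid UTF-8 is CP1252.
    return decode('cp1252', $bytes);
}

# Prints U+2606 (from UTF-8 bytes) and U+2013 (from a CP1252 byte).
printf "U+%04X\n", ord decode_guess($_) for "\xE2\x98\x86", "\x96";
```

    This decides per string, not per character, so a string that mixes UTF-8 and CP1252 bytes would be decoded entirely as CP1252.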

    Once the string is purely utf8 characters with no entity references, it should be pretty easy to convert that, if necessary, to any other form that you may need for a web display. Good luck.

      I mentioned the database primarily so that I would not get suggestions that involve open's encoding option. All of the examples above are from the same column. It may be possible to have multiple formats in the same string. Even if that is not the case today, I would like a solution that supports it in the future.

      I need to normalize to HTML. Whatever intermediate encoding gets the job done is fine by me. Passing html to encode_entities causes it to be double encoded, which does not display correctly on a web page, so first thing I do is run the input through decode_entities. I suppose it is possible that the input will already be double encoded html. I hadn't thought about running decode_entities multiple times before. Good idea.
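      A rough sketch of that decode-until-stable idea followed by a single encode. The helper name to_html and the cap of five passes are made up for illustration; it expects a string that has already been decoded to Perl characters.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper: character string in, entity-encoded HTML text out.
sub to_html {
    my ($text) = @_;

    # Decode repeatedly to unwrap double-encoded entities. Called in
    # scalar context, decode_entities returns a copy and leaves $text alone.
    # The limit of 5 passes is an arbitrary safety cap.
    for (1 .. 5) {
        my $decoded = decode_entities($text);
        last if $decoded eq $text;
        $text = $decoded;
    }

    # Encode exactly once so nothing ends up double encoded.
    return encode_entities($text);
}

print to_html('&amp;aacute;'), "\n";   # double-encoded input
print to_html('&'), "\n";              # bare ampersand
```

      One caveat: repeated decoding cannot tell accidental double encoding from text that legitimately contains something like &amp;lt;, so this is only safe if the data never holds intentionally escaped entities.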

      The database is only a few months old, is added to constantly, and is expected to be added to indefinitely. It is extremely unlikely the provider will do anything to improve their input sanitization. Had they the choice, they would not provide me the data in any format. They certainly aren't going to make it any easier to parse.

      I do try not to post and run. You took the time to read my message and reply; the least I can do is satisfy your curiosity.

      Thank you for the detailed analysis and suggestions. Had Encoding::FixLatin not been suggested first, I probably would have ended up using your "guidelines" as an outline for a solution.

        I hadn't heard of Encoding::FixLatin myself prior to seeing this thread, and I'm glad to have learned about it.
Re: Convert strings with unknown encodings to html
by soonix (Canon) on Jun 29, 2015 at 14:12 UTC
    as the others said, you have to know which encoding is used, otherwise you end up with Mojibake, which isn't half as delicious as it sounds …

      This does not appear to be correct. The very first reply suggested a module that will detect the encoding for each character individually. Graff also replied with a general outline for doing so. Using eval to catch when decode croaks was the main piece I was missing. It may sound obvious in hindsight, but my mind just wasn't getting there. I kept trying to figure out how to determine if it was safe to pass a string to decode.

Re: Convert strings with unknown encodings to html
by Anonymous Monk on Jun 29, 2015 at 13:21 UTC

      The test program above already uses that module. Unfortunately, it is insufficient.
