PerlMonks  

RFC: Text::FixLatin

by grantm (Parson)
on Apr 01, 2009 at 08:45 UTC ( [id://754651]=perlquestion )

grantm has asked for the wisdom of the Perl Monks concerning the following question:

A while back I wrote a script that solved a problem, and a number of colleagues have found it useful, so I thought I'd turn it into a module and release it to CPAN. You may be able to help me with the following questions:

  • Does something like this already exist on CPAN?
  • If not, what should I call it?

There are many use cases for the script but if I describe the original problem it will give you some context ...

The Problem

I had a Postgres database that contained plain ASCII data. I needed to convert the database to a Unicode encoding to support accented characters outside of the basic Latin-1 set. The process for converting the encoding of a Postgres database is:

  1. write the DB out to a dump file
  2. use a utility like iconv to convert the encoding of the dump file
  3. create a new (empty) database - specifying the new encoding
  4. restore the transcoded dump file into the new database

In my case step 2 was not necessary since I was converting from ASCII to UTF8 and ASCII is a subset of UTF8. So it really just boiled down to a dump and restore back into a database that had been created with the UTF8 encoding.

This is where my problems started. It turns out that the database did not just contain plain ASCII data. The Postgres 'SQL_ASCII' encoding basically just means take whatever bytes are given and store them in the DB. And apparently our application had been giving the database an interesting selection of bytes over the years. Originally our web frontend used the Apache default encoding of iso8859-1; later we fixed that so that it used utf-8. So originally accented characters were mostly arriving encoded as iso8859-1 bytes, but the data often included characters from Windows machines using 'win-latin-1' or CP1252 (especially the so-called 'smart quote' characters and em-dashes). After we fixed the web server config, the non-ASCII data was coming in as UTF-8 byte streams.

So it turns out I did need step 2, only I couldn't use iconv, because it converts from one encoding on the input side to one encoding on the output side and I had two or three encodings in my data dump.
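To make the single-encoding limitation concrete, here is a small illustration in Python (not the original Perl script). The sample bytes are hypothetical: one value stored as UTF-8 next to one stored as CP1252. Any tool that assumes one input encoding for the whole stream either garbles part of it or rejects it.

```python
# Hypothetical sample: "café" saved as UTF-8, followed by a smart-quoted
# word saved as CP1252, as might happen in a dump with mixed history.
mixed = "café".encode("utf-8") + b" " + "\u201cquoted\u201d".encode("cp1252")

# Treating the whole stream as Latin-1 never fails, but the two UTF-8
# bytes of 'é' come out as the two characters 'Ã©'.
as_latin1 = mixed.decode("latin-1")

# Treating the whole stream as UTF-8 fails on the lone CP1252 bytes.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Neither single-encoding reading recovers the intended text, which is why a byte-by-byte filter was needed.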

The Solution

So I wrote a script called 'fix_latin' which we piped our dump file through. The bytes were examined and filtered as follows:

  • plain ASCII characters (0x00-0x7F) were passed through untouched
  • well-formed UTF-8 multi-byte characters were also passed through untouched
  • any remaining lone bytes (0x80-0xFF) were assumed to be CP1252 (in practice a superset of iso8859-1) and converted to UTF-8
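The three rules above can be sketched roughly as follows. This is a minimal Python illustration, not the actual fix_latin code: the helper name utf8_sequence_length is made up for this example, and mapping the five byte values that CP1252 leaves undefined through Latin-1 is an assumption, not something the post specifies. The function returns a Python string, which would be encoded as UTF-8 on output.

```python
def utf8_sequence_length(data: bytes, i: int) -> int:
    """Return the length of a well-formed UTF-8 multi-byte sequence
    starting at index i, or 0 if there is none (hypothetical helper)."""
    lead = data[i]
    if 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0
    try:
        # Strict decode rejects truncated, overlong and surrogate sequences.
        data[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def fix_latin(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Rule 1: plain ASCII passes through untouched.
            out.append(chr(byte))
            i += 1
            continue
        n = utf8_sequence_length(data, i)
        if n:
            # Rule 2: well-formed UTF-8 passes through untouched.
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            continue
        # Rule 3: a lone high byte is assumed to be CP1252.
        lone = data[i:i + 1]
        try:
            out.append(lone.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes undefined in CP1252 fall back to Latin-1.
            out.append(lone.decode("latin-1"))
        i += 1
    return "".join(out)


sample = b"caf\xc3\xa9 \x93hi\x94 na\xefve"  # UTF-8 é, CP1252 quotes, Latin-1 ï
fixed = fix_latin(sample)
```

Note the built-in ambiguity: a Latin-1 string such as 'Ã©' is byte-for-byte identical to UTF-8 'é', so this approach will always read it as UTF-8.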

The script was used in a pipeline somewhat like this:

fix_latin < dump_file | psql -d database

The Short Story

So basically, the 'fix_latin' script is a filter taking input which may contain any mixture of ASCII, LATIN-1 (iso8859-1), WIN-LATIN-1 (CP1252) and UTF-8 encodings and producing UTF-8 as output.

The Proposal

So unless someone can point me at something on CPAN which already does this, I plan to rework the script into a module with essentially just one public function, fix_latin, which will take a byte string and return a UTF-8 string.

The distribution will also include a simple command-line filter script which will apply the fix_latin function to each line of input.
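A line-by-line filter of this kind might look like the following Python sketch. It is an illustration only, not the planned Perl script: the handler name cp1252_fallback and the function filter_stream are invented for this example. It uses a different technique from a hand-written byte loop, namely a decoder error handler that reinterprets any bytes the strict UTF-8 decoder rejects as CP1252 (with bytes undefined in CP1252 mapped through Latin-1, which is an assumption).

```python
import codecs
import io


def _cp1252_fallback(exc):
    # Hypothetical error handler: reinterpret rejected bytes as CP1252.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    text = "".join(
        # Bytes undefined in CP1252 decode to "" here, so fall back to
        # the Latin-1 code point (an assumption made for this sketch).
        bytes([b]).decode("cp1252", errors="ignore") or chr(b)
        for b in bad
    )
    return text, exc.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def filter_stream(src, dst):
    """Copy binary stream src to dst line by line, emitting UTF-8."""
    for line in src:
        fixed = line.decode("utf-8", errors="cp1252_fallback")
        dst.write(fixed.encode("utf-8"))


# Demo on an in-memory stream with one UTF-8 line and one CP1252 line.
src = io.BytesIO(b"caf\xc3\xa9\n\x93hi\x94\n")
dst = io.BytesIO()
filter_stream(src, dst)
result = dst.getvalue().decode("utf-8")
```

Splitting on newlines is safe here because the byte 0x0A never appears inside a UTF-8 multi-byte sequence. A real command-line filter would pass sys.stdin.buffer and sys.stdout.buffer instead of the in-memory streams.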

My initial thought on naming the module was 'Text::FixLatin'. It's possible that it might be more at home under the 'Encode' namespace (although from a Perl perspective it really 'decodes' bytes into Perl characters). I'm open to suggestions.

Replies are listed 'Best First'.
Re: RFC: Text::FixLatin
by mirod (Canon) on Apr 01, 2009 at 10:34 UTC

    Indeed, it might be useful to have 'Encode' or 'Encoding' somewhere in the name. When I saw the title of your node I thought you had written a tool that fixed Latin declensions in text (really!).

    But in any case, do release the code, that will be most useful. Would it make sense to specify a 'default legacy encoding', from which characters in the 0x80-0xFF range would be converted? I suspect outside of the latin-1 zone that would be appreciated.

    Thanks

Re: RFC: Text::FixLatin
by moritz (Cardinal) on Apr 02, 2009 at 08:11 UTC
    From a philosophical point of view I don't like your proposal, since detecting character encodings is always guesswork and error prone.

    That said, we all make mistakes, and I've had to do something very similar already, and I'd be happy if something like it were on CPAN.

    As for the name: you're dealing with mixed UTF-8 and Latin-*, maybe something like Encoding::FixMixed would be a good name?

Re: RFC: Text::FixLatin
by Anonymous Monk on Nov 27, 2012 at 23:13 UTC
    For anyone wondering, it ended up on CPAN as:

    Encoding::FixLatin takes mixed encoding input and produces UTF-8 output

    fix_latin filters a data stream that is predominantly utf8 and 'fixes' any latin (ie: non-ASCII 8 bit) characters
