PerlMonks |
RFC: Text::FixLatin by grantm (Parson)
on Apr 01, 2009 at 08:45 UTC [id://754651]
grantm has asked for the wisdom of the Perl Monks concerning the following question:

A while back I wrote a script that solved a problem, and a number of colleagues have found it useful, so I thought I'd turn it into a module and release it to CPAN. You may be able to help me with the following questions.
There are many use cases for the script, but describing the original problem will give you some context.

The Problem

I had a Postgres database that contained plain ASCII data. I needed to convert the database to a Unicode encoding to support accented characters outside of the basic Latin-1 set. The process for converting the encoding of a Postgres database is:

1. dump the database to a file
2. convert the dump file from the old encoding to the new one
3. restore the dump into a new database created with the new encoding
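For a single known source encoding, the conversion in step 2 is normally done with iconv. A sketch of the full sequence (database and file names here are hypothetical; the Postgres commands are shown commented out since they need a running server):

```shell
# Full sequence, for illustration (requires a running Postgres):
#   pg_dump old_db > dump.sql
#   createdb -E UTF8 new_db
#   psql -d new_db -f dump-utf8.sql
# Step 2 in isolation: recode an ISO-8859-1 dump file to UTF-8.
printf 'caf\351\n' > dump.sql            # 0xE9 = e-acute in ISO-8859-1
iconv -f ISO-8859-1 -t UTF-8 dump.sql > dump-utf8.sql
```

This works only when the whole file is in one consistent encoding, which is exactly the assumption that failed in my case.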
In my case step 2 was not necessary, since I was converting from ASCII to UTF-8 and ASCII is a subset of UTF-8. So it really just boiled down to a dump and a restore into a database that had been created with the UTF8 encoding.

This is where my problems started. It turned out that the database did not contain just plain ASCII data. The Postgres 'SQL_ASCII' encoding basically means "take whatever bytes you're given and store them in the DB", and apparently our application had been giving the database an interesting selection of bytes over the years. Originally our web frontend used the Apache default encoding of ISO-8859-1; later we fixed that so it used UTF-8. So originally accented characters were mostly arriving encoded as ISO-8859-1 bytes, but often included characters from Windows machines using 'Win-Latin-1', i.e. CP1252 (especially the so-called 'smart quote' characters and em dashes). After we fixed the web server config, the non-ASCII data was coming in as UTF-8 byte streams.

So it turned out I did need step 2 after all, only I couldn't use iconv, because iconv converts from one encoding on the input side to one encoding on the output side, and I had two or three encodings mixed together in my data dump.

The Solution

So I wrote a script called 'fix_latin', which we piped our dump file through. The bytes were examined and filtered as follows:

- plain ASCII bytes (0x00-0x7F) were passed through unchanged
- byte sequences forming valid UTF-8 characters were passed through unchanged
- any remaining bytes were treated as CP1252/ISO-8859-1 and converted to UTF-8
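The filtering just described can be sketched in Perl. This is a hypothetical reimplementation of the heuristic, not the actual fix_latin source:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Sketch of the byte-filtering heuristic: consume bytes left to right;
# ASCII and valid-looking UTF-8 sequences pass through, anything else
# is assumed to be CP1252 (which covers ISO-8859-1's printable range).
sub fix_latin_sketch {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        if ($bytes =~ s/\A([\x00-\x7F]+)//) {
            $out .= $1;                          # plain ASCII
        }
        elsif ($bytes =~ s/\A( [\xC2-\xDF][\x80-\xBF]
                             | [\xE0-\xEF][\x80-\xBF]{2}
                             | [\xF0-\xF4][\x80-\xBF]{3} )//x) {
            $out .= decode('UTF-8', $1);         # valid UTF-8 sequence
        }
        else {
            $bytes =~ s/\A(.)//s;                # stray high byte:
            $out .= decode('cp1252', $1);        # assume CP1252/Latin-1
        }
    }
    return $out;
}
```

Because every multi-byte UTF-8 sequence requires continuation bytes in the 0x80-0xBF range, a lone Latin-1 accented byte (say 0xE9 followed by a space) can't be mistaken for UTF-8, which is what makes the heuristic reliable in practice.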
The script was used in a pipeline somewhat like this:
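Something along these lines, presumably (database names are hypothetical; fix_latin is the author's script described above, so this is not runnable as-is):

```shell
# Dump the SQL_ASCII database, repair the mixed encodings in transit,
# and restore into a database created with the UTF8 encoding.
pg_dump old_db | fix_latin | psql -d new_utf8_db
```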
The Short Story

So basically, the 'fix_latin' script is a filter that takes input which may contain any mixture of ASCII, Latin-1 (ISO-8859-1), Win-Latin-1 (CP1252) and UTF-8 encodings, and produces UTF-8 as output.

The Proposal

Unless someone can point me at something on CPAN which already does this, I plan to rework the script into a module with essentially just one public function: fix_latin, which will take a byte string and return a UTF-8 string. The distribution will also include a simple command-line filter script which will apply the fix_latin function to each line of input.

My initial thought on naming the module was 'Text::FixLatin'. It's possible that it might be more at home under the 'Encode' namespace (although from a Perl perspective it really 'decodes' bytes into Perl characters). I'm open to suggestions.
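In use, the proposed one-function interface might look like this (the module name and import list are the proposal above, not a released API, so this is illustrative only):

```perl
use strict;
use warnings;
use Text::FixLatin qw(fix_latin);   # proposed module, not yet on CPAN

# A byte string mixing Latin-1 (0xE9) and UTF-8 (0xE2 0x82 0xAC) data.
my $bytes  = "caf\xE9 \xE2\x82\xAC";
my $string = fix_latin($bytes);     # returns a Perl character string
```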