
utf file to ansi, but doesn't work?

by ultranerds (Pilgrim)
on Feb 28, 2011 at 10:15 UTC (#890539)
ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got a UTF8 file (set in "UTF-8 without BOM" in Notepad++), and I'm trying to convert it automatically to ANSI, so I can read it and place it in my latin1 MySQL table.

Thus far I have:

system("iconv --from-code UTF-8 --to-code iso-8859-15 -c /var/home/site/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml > /var/home/site/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml.new");
print qq|Done converting... \n|;
system("perl -p -i -e 's|UTF-8|ISO-8859-15|g' *.new");
system("rm /var/home/site/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml");
print qq|Removed allVacations.xml... \n|;
system("mv /var/home/site/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml.new /var/home/site/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml");
print qq|Moved file ... \n|;
system("rm -f /var/home/site/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml.new");
print qq|Removed .new file... \n|;

This "seems" to work... but for some reason when I open this new file in Notepad++, it doesn't seem to recognise the encoding type (it also comes up weird in the DB, i.e. "Vols + transferts + h├ębergement en formule "tout ...")

Am I invoking `iconv` incorrectly, or maybe something else?

Been driving me nuts for hours trying to fix this :(

BTW: doing a conversion with a perl module isn't really ideal, because this file is 30+ MB in size and is an XML file (so it's not very efficient to keep using utf8($string)->latin; for every value in that XML file)

TIA!

Andy

Re: utf file to ansi, but doesn't work?
by moritz (Cardinal) on Feb 28, 2011 at 11:06 UTC
    This "seems" to work... but for some reason when I open this new file in Notepad++, it doesn't seem to recognise the encoding type

    So you don't know if the conversion failed, or if your text editor's auto detection failed.

    A sure way to find out is to open the file in a hex editor, and manually compare some bytes via encoding tables (for example on Wikipedia) to the characters in the original files.

    Shouldn't be a problem for a bunch of ultranerds :-)
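
    The byte comparison can also be scripted. What follows is a minimal sketch, assuming the core Encode module is available; hex_bytes is a hypothetical helper, and the sample word is taken from the garbled text in the question. An é is the two bytes C3 A9 in UTF-8 but the single byte E9 in iso-8859-15, so a file that still contains C3 A9 where an é belongs has not been converted. The two odd characters in "h├ębergement" are consistent with those two UTF-8 bytes being displayed through a single-byte code page.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # hex_bytes is a hypothetical helper: it encodes $text (a character string)
    # in $encoding and returns the resulting bytes as space-separated hex.
    sub hex_bytes {
        my ($text, $encoding) = @_;
        return join ' ',
            map { sprintf('%02X', ord($_)) } split //, encode($encoding, $text);
    }

    my $sample = "h\x{E9}bergement";    # "hebergement" with e-acute as an escape

    # Print the byte sequence of the same word in both encodings.
    for my $encoding (qw(UTF-8 iso-8859-15)) {
        printf "%-12s %s\n", $encoding, hex_bytes($sample, $encoding);
    }
    ```

    These are the same byte values that od -cx or a hex editor would show at the matching position in the converted file.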

      Hi, I don't have a hex editor ;) (tried using one before to edit the setting on my Blackberry, but couldn't get the knack of it ;))

      Is there a simple way I can check the header (i.e "type") of a file? Kinda like you can do with finding file types in images by opening them in Notepad, and then looking for stuff like "gif" etc)

      The weird bit though, is that when I run the commands manually via SSH, it updates the "encoding" properly in Notepad++!

      iconv --from-code UTF-8 --to-code iso-8859-15 -c /var/home/user/siteforum.com/www/admin/Plugins/Forum/Advertiser/Import/tmp/allVacations.xml.2 > /var/home/user/siteforum.com/www/admin/Plugins/GForum/Advertiser/Import/tmp/allVacations.xml.new

      It wouldn't be something related to the way perl invokes this would it? Not had problems going from non-utf8 --> utf8 before, so just wondering why it's having issues doing it this way around :(

      TIA

      Andy

        Hi, I don't have a hex editor ;)

        Then get one. No excuses.

        Is there a simple way I can check the header (i.e "type") of a file? Kinda like you can do with finding file types in images by opening them in Notepad, and then looking for stuff like "gif" etc)

        No. You suspect the automatic recognition of the encoding to be a problem, so you shouldn't trust it to diagnose your problem for you.

        It wouldn't be something related to the way perl invokes this would it?

        Well, you don't check if the command succeeds, that would be a first step. The documentation tells you how (though autodie is more convenient, if you ask me).
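
        Checking the result of system might look like the sketch below. It is only an illustration: run_checked is a hypothetical helper, not something from the original script, and it uses the list form of system, which bypasses the shell. The demonstration calls the running perl ($^X) so that it needs no external tools.

        ```perl
        use strict;
        use warnings;

        # run_checked is a hypothetical helper. It runs a command in list form
        # (no shell involved) and dies with the reason if the command fails.
        sub run_checked {
            my @cmd = @_;
            my $rc = system(@cmd);
            if ($rc == -1) {
                die "could not run '$cmd[0]': $!\n";
            }
            if ($? & 127) {
                die sprintf("'%s' was killed by signal %d\n", $cmd[0], $? & 127);
            }
            if ($? >> 8) {
                die sprintf("'%s' exited with status %d\n", $cmd[0], $? >> 8);
            }
            return 1;
        }

        # Demonstration: the first call succeeds, the second failure is caught.
        run_checked($^X, '-e', 'exit 0');
        eval { run_checked($^X, '-e', 'exit 3'); 1 } or print "caught: $@";
        ```

        For the conversion in the question, the call might look like run_checked('iconv', '-f', 'UTF-8', '-t', 'ISO-8859-15', '-c', '-o', $out, $in), where $in and $out are placeholders for the two paths. This assumes an iconv that accepts -o for the output file (GNU iconv does), because the list form cannot use the shell's > redirection.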

        Update: Since your files seem to be XML files: those usually begin with something like <?xml version="1.0" encoding="windows-1252"?>. If the encoding still says UTF-8 or is missing (it defaults to UTF-8), you need to adjust that so that XML processors later on will not complain.
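
        Adjusting the declaration could be limited to the declaration itself. The one-liner in the question replaces every occurrence of UTF-8 in every *.new file of the current directory. A minimal sketch of the narrower approach follows; set_xml_encoding is a hypothetical helper, and it assumes the declaration sits on the first line of the file with no BOM in front of it.

        ```perl
        use strict;
        use warnings;

        # set_xml_encoding is a hypothetical helper. Given the first line of an
        # XML file, it sets the encoding in the XML declaration, adding the
        # attribute after version="..." if it is missing. Lines that do not
        # start with an XML declaration come back unchanged.
        sub set_xml_encoding {
            my ($line, $encoding) = @_;

            # Case 1: an encoding is already declared (single or double quotes).
            return $line
                if $line =~ s/^(<\?xml\b[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2/$1$2$encoding$2/;

            # Case 2: no encoding declared, so insert one after the version.
            $line =~ s/^(<\?xml\s+version\s*=\s*(["'])[^"']*\2)/$1 encoding="$encoding"/;
            return $line;
        }

        print set_xml_encoding(qq{<?xml version="1.0" encoding="UTF-8"?>\n}, 'ISO-8859-15');
        print set_xml_encoding(qq{<?xml version="1.0"?>\n}, 'ISO-8859-15');
        ```

        Only the first line of the file would be passed through the helper; every other line is copied as it is, so a literal "UTF-8" in the data stays untouched.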

        I don't have a hex editor
        Sure you do! od -cx file

        Also, have you tried Encode?
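
        With Encode's PerlIO layers the conversion does not have to be done value by value. The sketch below is one possible shape, not a tested replacement for the iconv call: convert_file is a hypothetical helper, and it assumes the XML contains line breaks, so that reading line by line keeps only a small part of the 30+ MB file in memory at a time.

        ```perl
        use strict;
        use warnings;

        # convert_file is a hypothetical helper. It copies $in to $out line by
        # line, decoding UTF-8 on read and encoding to $to_encoding on write.
        sub convert_file {
            my ($in, $out, $to_encoding) = @_;
            open my $src, '<:encoding(UTF-8)', $in
                or die "Cannot read $in: $!\n";
            open my $dst, ">:encoding($to_encoding)", $out
                or die "Cannot write $out: $!\n";
            while (my $line = <$src>) {
                print {$dst} $line;
            }
            close $src or die "Cannot close $in: $!\n";
            close $dst or die "Cannot close $out: $!\n";
            return 1;
        }

        # Usage: perl convert.pl input.xml output.xml
        convert_file($ARGV[0], $ARGV[1], 'iso-8859-15') if @ARGV == 2;
        ```

        The script name convert.pl and the two file arguments are placeholders. The XML declaration would still need to be adjusted separately.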
        Is there a simple way I can check the header (i.e "type") of a file?

        There isn't in general any such thing. All there is in a text file is what you see; that's what makes it text. UTF-x files can (and sometimes must) have a BOM, but ISO-8859's won't.

        Your text editor either has to be told (by you, by default config, etc) what encoding to use, or it can try (and occasionally even succeed) to guess by the patterns of bytes in it. But it has no way to know unless you tell it. That's why text encoding is such a mess...
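
        Guessing from byte patterns can be illustrated with a small check. This is a sketch under assumptions, not how any particular editor works: looks_like_utf8 is a hypothetical helper that only asks whether the bytes form well-formed UTF-8. Pure ASCII passes too, which is exactly why such a guess cannot be trusted on its own.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode FB_CROAK);

        # looks_like_utf8 is a hypothetical helper. It returns 1 if $bytes is
        # well-formed UTF-8 and 0 otherwise. It cannot prove that the text was
        # meant to be UTF-8, only that the byte patterns are compatible with it.
        sub looks_like_utf8 {
            my ($bytes) = @_;    # a copy, so decode may safely consume it
            return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
        }

        print looks_like_utf8("h\xC3\xA9bergement") ? "valid\n" : "invalid\n";
        print looks_like_utf8("h\xE9bergement")     ? "valid\n" : "invalid\n";
        ```

        The first string holds the UTF-8 bytes for é and passes; the second holds the single iso-8859-15 byte E9, which is not a complete UTF-8 sequence and fails.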

Re: utf file to ansi, but doesn't work?
by ikegami (Pope) on Feb 28, 2011 at 16:54 UTC

    I'm trying to convert it automatically to ANSI

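    Then why are you converting to iso-8859-15? There is no ANSI code page equivalent to iso-8859-15. The closest is cp1252, whose printable characters are a superset of those in iso-8859-15, although the eight characters in which iso-8859-15 differs from iso-8859-1 (the euro sign among them) sit at different byte values in cp1252.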

    The real problem, though, is how you handle the characters not present in the new encoding. You should convert them into XML entities.
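
    One way to get that behaviour in Perl is Encode's FB_XMLCREF fallback, which writes a numeric character reference for anything the target encoding cannot represent. A minimal sketch, with to_latin9_with_entities as a hypothetical helper name:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode FB_XMLCREF);

    # to_latin9_with_entities is a hypothetical helper. It encodes a character
    # string as iso-8859-15; characters that the encoding lacks are emitted as
    # XML numeric character references instead of being dropped.
    sub to_latin9_with_entities {
        my ($text) = @_;    # a copy, because encode may modify its argument
        return encode('iso-8859-15', $text, FB_XMLCREF);
    }

    # U+00E9 exists in iso-8859-15; U+2014 (em dash) does not.
    my $bytes = to_latin9_with_entities("caf\x{E9} \x{2014} ok");
    print join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes), "\n";
    ```

    This differs from iconv's -c switch, which discards such characters. The references are only meaningful in text content and attribute values, so the sketch assumes element and attribute names are plain ASCII.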

    (This is not the problem you are asking about.)
