 
PerlMonks  


Some generic hints for encoding problems, especially with Unicode:

Look at input and output files with a hex dumper / hex editor.
Far too many editors convert encodings behind the scenes, and display garbage when their idea of the file's encoding differs from how it is really encoded. od is available on most Unix systems, and there are tons of other hex dumpers and hex editors.
Check the length of strings in perl.
length always returns the number of characters. If you mess up the encodings and make perl read a "Unicode string" (a string encoded as UTF-8, UTF-16, and so on) as bytes, process it, and write it out as bytes again, the code appears to work, but some things (e.g. matching characters) behave strangely. The string "AOUÄÖÜ", encoded as UTF-8, is the byte sequence 41 4F 55 C3 84 C3 96 C3 9C. Read with the proper encoding setting, length returns 6. Read as a byte stream, or with a "byte=character" encoding like ISO-8859-1, length returns 9.
Check the length of strings in the database.
(Lesson learned from patching DBD::ODBC.) When communicating with a database, encoding problems hide until you read / write with a different tool or check the lengths. This is essentially the same problem as with file I/O, but there is no simple way to get a hex dump.
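
The first two checks can be tried without any files or database. This is only a minimal sketch: hex_dump is a made-up helper name (not part of any module), and the byte string is the nine-byte example from above. It decodes with the core Encode module instead of an I/O layer, which should give the same lengths.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The nine UTF-8 bytes from the example: "AOU" followed by
# U+00C4, U+00D6 and U+00DC, two bytes each.
my $bytes = "\x41\x4F\x55\xC3\x84\xC3\x96\xC3\x9C";

# Hypothetical helper: show a byte string as space-separated hex,
# roughly what a hex dumper would print for the same data.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# Decode the bytes into a character string.
my $chars = decode('UTF-8', $bytes);

print hex_dump($bytes), "\n";
printf "length as bytes:      %d\n", length $bytes;
printf "length as characters: %d\n", length $chars;
```

If both lengths come out the same for data that you know contains non-ASCII characters, the string was most likely never decoded.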

Feel free to copy from the files t/40UnicodeRoundTrip.t, t/41Unicode.t, and t/UChelp.pm included in DBD::ODBC.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re: Problem with Unicode Characters while reading from oracle database in perl script by afoken
in thread Problem with Unicode Characters while reading from oracle database in perl script by venu_hs
