Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Thank you, Tux, for your reply.

If otoh 0xAE is just a placeholder for embedded newlines, that is easy to do (see below).

Yes, U+00AE (®, REGISTERED SIGN, 0xC2 0xAE in UTF-8) is used as a placeholder for literal newlines in quoted strings. The CSV records in Concordance DAT files are ordinary ones with standard EOL characters:  <CR><LF> pairs.

Another point of care is that Text::CSV_XS does not deal with BOM's, so you'll need File::BOM or other means to deal with that.

This would be a nice feature to add to Text::CSV_XS:  proper handling of Unicode byte order marks in UTF-8, UTF-16 and UTF-32 CSV files.

Note that the encoded U+00FE is 0xC3BE, which is two bytes, and two bytes cannot be used as a sep_char in Text::CSV_XS, which parses the data as bytes, so the stream has to be properly coded before parsing.

This settles it. It's not the answer I'd hoped for, but I'm glad to know now with certainty that Text::CSV_XS cannot parse a UTF-8 Concordance DAT file. I'll stop trying hopelessly to make it work. ;-)

How difficult would it be to enhance Text::CSV_XS to handle metacharacters in Unicode CSV files that are outside the Basic Latin block (i.e., not ASCII characters)? The Concordance DAT file is a de facto standard format for data interchange in the litigation support and e-discovery industry. As I've explained, the only thing special about it is the unusual and unfortunate characters it uses for metacharacters:  U+0014, which is a control code; U+00FE, which is word constituent character; and U+00AE, which is a common character in ordinary text.

Jim


In reply to Re^6: Peculiar Reference To U+00FE In Text::CSV_XS Documentation by Jim
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation by Jim

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2024-04-19 10:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found