PerlMonks  


Thank you again, Tux, for your thoughtful reply.

The newline placeholder convention is unique to the Concordance DAT file and doesn't fall within the scope of ordinary CSV parsing. In hindsight, I shouldn't have mentioned it here. You're right: it's trivial to convert REGISTERED SIGN characters to newlines after the CSV records are parsed.
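A minimal sketch of that post-parse conversion, assuming the fields have already been split out of the record and are decoded character strings. The helper name restore_newlines is hypothetical and not part of any module:

```perl
use strict;
use warnings;
use charnames qw( :full );

# Hypothetical helper: replace the Concordance newline placeholder
# (U+00AE REGISTERED SIGN) with a real newline in already-parsed fields.
sub restore_newlines {
    my @fields = @_;    # work on a copy so the caller's data is untouched

    s/\N{REGISTERED SIGN}/\n/g for @fields;

    return @fields;
}

my @fields = restore_newlines("line one\N{REGISTERED SIGN}line two", 'plain');
```

Doing this after parsing keeps the placeholder convention out of the CSV layer entirely.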

Imagine a fully Unicode-based finite state machine that only operates on Unicode code points (better) or Unicode extended grapheme clusters (best). It would tokenize only true Unicode strings, notionally like this in Perl:

for my $grapheme ($csv_stream =~ m/(\X)/g) { ... }

This probably isn't easily done in C, is it?
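To make the idea a little more concrete, here is a toy sketch in Perl of a state machine that walks a record one extended grapheme cluster at a time. It is purely notional and has nothing to do with how Text::CSV_XS is actually implemented. The sub name tokenize_record is hypothetical, and the sketch assumes the Concordance conventions: U+00FE as the quote and U+0014 as the field separator, with no escaping of quotes inside fields.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: a two-state machine (inside or outside quotes)
# driven by extended grapheme clusters rather than bytes.
sub tokenize_record {
    my ($record) = @_;
    my ($QUOTE, $SEP) = ("\x{00FE}", "\x{0014}");   # assumed Concordance defaults

    my @fields;
    my $field     = '';
    my $in_quotes = 0;

    for my $grapheme ($record =~ m/(\X)/g) {
        if ($grapheme eq $QUOTE) {
            $in_quotes = !$in_quotes;       # every bare quote toggles the state
        }
        elsif ($grapheme eq $SEP && !$in_quotes) {
            push @fields, $field;           # separator outside quotes ends a field
            $field = '';
        }
        else {
            $field .= $grapheme;            # anything else is field data
        }
    }
    push @fields, $field;                   # the last field has no trailing separator

    return @fields;
}
```

Because the comparison is per grapheme, a quote character followed by a combining mark is treated as data, not as a quote.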

You can still use Text::CSV_XS if you are sure that there are no U+0014 characters inside fields, but I bet you cannot be (binary fields tend to hold exactly what causes trouble).

In the particular case of the Concordance DAT records I'm working with right now, I'm simply using split. The CSV records are generated by our own software, so I know they will always be well-formed: they never contain literal CR or LF characters, and every string is enclosed in the "quote" character U+00FE. I expect it will be a decade or two before I'm unlucky enough to encounter a CSV record in a Concordance DAT file that the following Perl code won't handle correctly enough:

use utf8;
use charnames qw( :full );
use open qw( :encoding(UTF-8) :std );
use English qw( -no_match_vars );

# ...

    if ($INPUT_LINE_NUMBER == 1) {
        $record =~ s/^\N{BYTE ORDER MARK}//; # Remove Unicode BOM...

        # ...

        $record =~ s/^/\N{BYTE ORDER MARK}/; # Restore Unicode BOM...
    }

# ...

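sub parse {
    my $record = shift;

    chomp $record;

    # Strip the U+00FE quote character from both ends of the record.
    $record =~ s/^\x{00FE}//;
    $record =~ s/\x{00FE}$//;

    # Fields are separated by quote, U+0014, quote.
    # The -1 limit keeps trailing empty fields.
    return split m/\x{00FE}\x{0014}\x{00FE}/, $record, -1;
}

sub combine {
    my $record = join "\x{00FE}\x{0014}\x{00FE}", @{ $_[0] };

    # Restore the enclosing quotes and the record terminator.
    $record =~ s/^/\x{00FE}/;
    $record =~ s/$/\x{00FE}\n/;

    return $record;
}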
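Since the split approach leans entirely on the assumptions stated above, a cheap guard can confirm that a record really has the expected layout before it is split. This is only a sketch: the names $FIELD, $RECORD and checked_parse are hypothetical, and it assumes fields never contain U+00FE, U+0014, CR or LF, and that fields are joined by quote, U+0014, quote.

```perl
use strict;
use warnings;

# One quoted field: quote, any run of "safe" characters, quote.
my $FIELD = qr/\x{00FE}[^\x{00FE}\x{0014}\r\n]*\x{00FE}/;

# A whole record: one or more fields joined by U+0014, optional newline.
my $RECORD = qr/\A ${FIELD} (?: \x{0014} ${FIELD} )* \n? \z/x;

# Hypothetical wrapper: die on records that break the assumptions,
# otherwise split exactly as a plain split-based parser would.
sub checked_parse {
    my ($record) = @_;

    die "Record does not match the expected layout\n"
        unless $record =~ $RECORD;

    chomp $record;
    $record =~ s/\A\x{00FE}//;
    $record =~ s/\x{00FE}\z//;

    # -1 keeps trailing empty fields
    return split m/\x{00FE}\x{0014}\x{00FE}/, $record, -1;
}
```

The day a record finally violates the assumptions, this fails loudly instead of silently producing misaligned fields.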

Thanks again.

Jim


In reply to Re^8: Peculiar Reference To U+00FE In Text::CSV_XS Documentation by Jim
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation by Jim
