
Re^7: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Tux (Abbot)
on Dec 10, 2012 at 18:02 UTC

in reply to Re^6: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation

  1. Newlines

    If U+00AE is just a placeholder for newlines *inside* fields, my proposed solution works fine.

  2. BOM

    I have been playing with thoughts about BOM handling quite a few times already, but came to the same conclusion time after time: the advantage is not worth the performance penalty, which is huge.

    Text::CSV_XS is written for sheer speed, and having to check BOM on every record-start (yes, eventually that is what it turns out to be if one wants to support streams) is not worth it. It is relatively easy to

    • Do BOM handling before Text::CSV_XS starts parsing
    • Write a wrapper or a super-class that does BOM handling

  3. Non-ASCII characters for sep/quote/escape

    Any of these will imply a speed penalty, even if I would allow it and implement it. That is because the parser is a state machine, which means that the internal structure would have to change to both allow multi-byte characters and handle them (first check at the start of each of them, then read ahead to see if the next byte is part of the "character", and so on). I already allow this on eol up to 8 characters, which was a pain in the ass to do safely. I'm not saying it is impossible, but I'm not sure if it is worth the development time.

    You can still use Text::CSV_XS if you are sure that there are no U+0014 characters inside fields, but I bet you cannot be (binary fields tend to hold exactly what causes trouble).

Enjoy, Have FUN! H.Merijn

Re^8: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
by Jim (Curate) on Dec 10, 2012 at 21:24 UTC

    Thank you again, Tux, for your thoughtful reply.

    The newline placeholder convention is unique to the Concordance DAT file and doesn't fall within the scope of ordinary CSV parsing. In hindsight, I shouldn't have mentioned it here. You're right: it's trivial to convert REGISTERED SIGN characters to newlines after the CSV records are parsed.
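    For illustration, that conversion could look something like this. It is a minimal sketch; `restore_newlines` is a hypothetical name, and it assumes the fields have already been decoded to Perl character strings.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each REGISTERED SIGN (U+00AE) placeholder
# with a newline in every already-parsed field. Works on copies, so the
# caller's fields are left untouched.
sub restore_newlines {
    my @fields = @_;
    tr/\x{00AE}/\n/ for @fields;
    return @fields;
}
```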

    Imagine a fully Unicode-based finite state machine that only operates on Unicode code points (better) or Unicode extended grapheme clusters (best). It would tokenize only true Unicode strings, notionally like this in Perl:

    for my $grapheme ($csv_stream =~ m/(\X)/g) { ... }

    This probably isn't easily done in C, is it?
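    To make the idea a little more concrete, here is a rough Perl sketch of such a state machine. It is deliberately simplified and everything in it is an assumption for illustration: `parse_record` is a hypothetical name, there are only two states (inside or outside quotes), and there is no handling of escaped quote characters or of record terminators.

```perl
use strict;
use warnings;

# Hypothetical sketch: split one decoded record into fields by walking it
# one extended grapheme cluster at a time. $sep and $quote are compared as
# whole graphemes. Quote escaping is NOT handled.
sub parse_record {
    my ($record, $sep, $quote) = @_;
    my @fields   = ('');
    my $in_quote = 0;

    for my $grapheme ($record =~ m/(\X)/g) {
        if ($grapheme eq $quote) {
            $in_quote = !$in_quote;       # enter or leave a quoted field
        }
        elsif ($grapheme eq $sep && !$in_quote) {
            push @fields, '';             # unquoted separator: next field
        }
        else {
            $fields[-1] .= $grapheme;     # ordinary data
        }
    }
    return @fields;
}
```

    Called as `parse_record($record, "\x{0014}", "\x{00FE}")`, a separator that appears inside a quoted field is kept as data instead of splitting the field.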

    "You can still use Text::CSV_XS if you are sure that there are no U+0014 characters inside fields, but I bet you cannot be (binary fields tend to hold exactly what causes trouble)."

    In the particular case of the Concordance DAT records I'm working with right now, I'm simply using split. The CSV records are being generated by our own software, so I know they will always be well-formed, they'll never have literal CR or LF characters in them, and every string is enclosed in the "quote" character U+00FE. I expect it will be a decade or two before I'm unlucky enough to encounter a CSV record in a Concordance DAT file that the following Perl code won't handle correctly enough:

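    use utf8;
    use charnames qw( :full );
    use open qw( :encoding(UTF-8) :std );
    use English qw( -no_match_vars );
    # ...
        if ($INPUT_LINE_NUMBER == 1) {
            $record =~ s/^\N{BYTE ORDER MARK}//; # Remove Unicode BOM...
            # ...
            $record =~ s/^/\N{BYTE ORDER MARK}/; # Restore Unicode BOM...
        }
    # ...
    # Split one DAT record into fields. Every field is wrapped in þ (U+00FE)
    # and fields are separated by U+0014.
    sub parse {
        my $record = shift;
        chomp $record;
        $record =~ s/^þ//;                            # Strip leading quote
        $record =~ s/þ$//;                            # Strip trailing quote
        return split m/þ\x{0014}þ/, $record, -1;      # -1 keeps trailing empty fields
    }
    # Join an array ref of fields back into one DAT record.
    sub combine {
        my $record = join "þ\x{0014}þ", @{ $_[0] };
        $record =~ s/^/þ/;                            # Add leading quote
        $record =~ s/$/þ\n/;                          # Add trailing quote and newline
        return $record;
    }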

    Thanks again.

