PerlMonks  

Re: Best Way To Parse Concordance DAT File Using Modern Perl?

by space_monk (Chaplain)
on Dec 10, 2012 at 14:46 UTC (#1008106)


in reply to Best Way To Parse Concordance DAT File Using Modern Perl?

If it's a UTF-8 file, isn't it meant to have a 3 byte BOM? Your BOM indicates that it's a UTF-16 file, not UTF-8.

Anyway, the node "UTF-8 text files with Byte Order Mark" discussed this, and the comments there may be helpful.

See the File::BOM module, which was mentioned there as a way to open files that may contain a BOM.
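
To make the idea concrete, here is a minimal core-Perl sketch of BOM sniffing. It is not File::BOM's implementation or API; the helper names sniff_bom and open_with_bom are made up for illustration, and the sketch assumes the file is seekable and that a default encoding is acceptable when no BOM is present.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::BOM: map the leading bytes of a
# buffer to the encoding its BOM implies. Returns (name, BOM length in bytes).
sub sniff_bom {
    my ($bytes) = @_;
    # Longer signatures first: the UTF-32LE BOM starts with the UTF-16LE BOM.
    my @signatures = (
        [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
        [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
        [ "\xFE\xFF"         => 'UTF-16BE' ],
        [ "\xFF\xFE"         => 'UTF-16LE' ],
    );
    for my $sig (@signatures) {
        my ($bom, $name) = @$sig;
        return ($name, length $bom) if index($bytes, $bom) == 0;
    }
    return (undef, 0);    # no BOM found
}

# Hypothetical helper: open $path for reading, decode it according to its BOM
# (or $default when there is none), and leave the handle just past the BOM.
sub open_with_bom {
    my ($path, $default) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    read $fh, $head, 4;                       # enough for the longest BOM
    my ($enc, $len) = sniff_bom($head);
    seek $fh, $len, 0 or die "Cannot seek in $path: $!";
    binmode $fh, sprintf ':encoding(%s)', $enc // $default;
    return $fh;
}
```

Something like `my $fh = open_with_bom($path, 'UTF-8');` would then hand back a decoding filehandle whose first character is real data rather than U+FEFF. For production use, a tested module such as File::BOM is probably the safer choice.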

A Monk aims to give answers to those who have none, and to learn from those who know more.


Re^2: Best Way To Parse Concordance DAT File Using Modern Perl?
by Jim (Curate) on Dec 10, 2012 at 22:09 UTC
    If it's a UTF-8 file, isn't it meant to have a 3 byte BOM? Your BOM indicates that it's a UTF-16 file, not UTF-8.

    It is a Unicode BOM encoded in three bytes in the UTF-8 character encoding scheme. But it's just one character (one Unicode code point), represented in Perl as \x{FEFF} or \N{BYTE ORDER MARK}. In a decoded, abstract Unicode string, distinctions between various encodings (serializations) of the string don't exist.
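
The point about one code point with several serializations can be checked with the core Encode module. This is only an illustrative sketch; the helper name bom_hex is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the bytes of U+FEFF in the given encoding as a hex string.
sub bom_hex {
    my ($encoding) = @_;
    my $bom   = "\x{FEFF}";                   # one character, one code point
    my $bytes = encode($encoding, $bom);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

print bom_hex('UTF-8'),    "\n";    # EF BB BF
print bom_hex('UTF-16BE'), "\n";    # FE FF
print bom_hex('UTF-16LE'), "\n";    # FF FE

# Decoding either serialization yields the same single character.
my $utf8_bytes  = "\xEF\xBB\xBF";
my $utf16_bytes = "\xFF\xFE";
my $from_utf8   = decode('UTF-8',    $utf8_bytes);
my $from_utf16  = decode('UTF-16LE', $utf16_bytes);
print length($from_utf8), ' ',
    ($from_utf8 eq $from_utf16 ? 'same' : 'different'), "\n";    # 1 same
```

So a three-byte BOM on disk and a two-byte BOM on disk both become the single character \x{FEFF} once the data has been decoded.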

    Jim

      I realize it is probably impossible because the file contains evidence and attorney work product, but could you isolate and anonymize a few example records from a properly formatted file that cause the Text::CSV or Text::CSV_XS modules to fail?
