PerlMonks  

Text::xSV -- how to filter only first line?

by blahblahblah (Priest)
on Feb 14, 2007 at 04:22 UTC (id://599851)

blahblahblah has asked for the wisdom of the Perl Monks concerning the following question:

I wrote some CSV-reading code a few weeks ago and used the Text::xSV module instead of the old home-grown logic that we use in other parts of our application. One of the things our old code does is strip out a possible UTF-8 Byte Order Mark (BOM) at the beginning of the file. I played around with various ways of getting that same result using Text::xSV, which lets you specify a custom pre-processing filter that is applied to each line of data. Short of modifying the module, I was never able to figure out a way to apply that filter to only the first line, and eventually settled on applying it to all lines (and hoping that no other line would start with /^\x{EF}\x{BB}\x{BF}/).

I had forgotten about it until I saw the thread "UTF-8 text files with Byte Order Mark" recently.

Just curious if anyone can think of a clever way to make a Text::xSV filter work on only the first line of the file it's reading?

Thanks,
Joe

Re: Text::xSV -- how to filter only first line?
by graff (Chancellor) on Feb 14, 2007 at 05:18 UTC
    Well, if you are in fact dealing with utf8-encoded "xSV" files, then you shouldn't worry about applying a BOM-removal filter on all lines. Just in case there might be any more BOMs scattered throughout the file, you'll want to remove them all, because they should not be treated as if they were part of the actual table data.

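    In other words, it is OKAY to have a pre-processing filter that removes BOM characters from every line -- something like s/\x{feff}//g as a filter is perfectly sane. (That pattern assumes the lines have already been decoded into characters; if you are working with raw bytes, the utf8 BOM is the three-byte sequence \xEF\xBB\xBF, as in your original regex.)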

    Note that there are some situations that really can create a text file with a BOM at the start of every line (I've seen it happen), so having logic that applies that filter to every line might just save you from some real trouble. (And of course, on lines that don't have a BOM, such a filter doesn't do anything at all, so it's quite harmless.)
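
    As a rough sketch of the "strip it from every line" approach described above, here are two small filter subs, one for decoded character data and one for raw bytes. The sub names are made up for illustration. Passing one of them as the Text::xSV filter is only shown in a comment, since the exact constructor arguments should be checked against the module documentation for your version.

```perl
use strict;
use warnings;

# Hypothetical helper: for lines already decoded to characters,
# where the BOM is the single code point U+FEFF.
sub strip_bom_chars {
    my ($line) = @_;
    return $line unless defined $line;
    $line =~ s/\x{FEFF}//g;
    return $line;
}

# Hypothetical helper: for undecoded lines, where the utf8 BOM
# shows up as the three bytes EF BB BF.
sub strip_bom_bytes {
    my ($line) = @_;
    return $line unless defined $line;
    $line =~ s/\xEF\xBB\xBF//g;
    return $line;
}

# Assumed usage (verify against the Text::xSV docs):
#   my $csv = Text::xSV->new(filename => $file, filter => \&strip_bom_bytes);
```

    Only use the byte version on undecoded input. On decoded text the same pattern would match the three characters U+00EF, U+00BB, U+00BF, which could be legitimate data.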

    The particular Unicode character called "BOM" is intended to serve no other purpose than to be the byte-order mark. It simply gets in the way and makes trouble if you happen to treat it as if it were data, and it is of course logically useless in a utf8 file anyway (even though some MS-Windows apps insert it routinely when creating utf8 text files -- and, heaven help us, Redmond or MS-centric tool developers may start using file-initial BOM as a kind of "signature" or "magic number" that they "need" to use for identifying files as being utf8 text).*

    As for having Text::xSV do anything special with just the first line of a file, it already has logic to treat the first line as containing the "column headings", as opposed to containing actual data. If you need anything special beyond that for just the first line, you'd need to be a little more specific about your intended usage and the nature of what you are trying to accomplish -- e.g. what you've tried, how it failed, etc.
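
    If a first-line-only filter is still wanted, one possible approach (a sketch, not something taken from the Text::xSV docs) is a closure that remembers whether it has been called before. The factory name below is hypothetical, and the sketch assumes the module calls the filter once per line, in file order. It also assumes that supplying your own filter replaces the module's default one, so any default clean-up you rely on would need to be re-added.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a filter that removes a leading BOM
# (character form or raw utf8 byte form) from the first line only.
sub make_first_line_bom_filter {
    my $seen_first = 0;    # private to the returned closure
    return sub {
        my ($line) = @_;
        return $line unless defined $line;
        unless ($seen_first++) {
            $line =~ s/^(?:\x{FEFF}|\xEF\xBB\xBF)//;
        }
        return $line;
    };
}

# Stand-alone demonstration on two lines that both start with a BOM:
my $filter = make_first_line_bom_filter();
my @lines  = ("\x{FEFF}name,age\n", "\x{FEFF}bob,42\n");
my @out    = map { $filter->($_) } @lines;
# $out[0] has lost its BOM; $out[1] is untouched.

# Assumed usage (verify against the Text::xSV docs):
#   my $csv = Text::xSV->new(filename => $file,
#                            filter   => make_first_line_bom_filter());
```

    Each reader needs its own filter from the factory, since the "first line seen" state lives inside the closure.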

    (* update/footnote: Now that I think of it, Notepad, which is one of those apps that automatically puts BOM at the beginning of every plain-text utf8 file it creates, appears to be already depending on the BOM as a "magic number" for identifying utf8 text files -- if you use perl to create a utf8 file with wide characters but without an initial BOM, then open that file in Notepad, it's likely not to display the wide-character text correctly.)

      Thanks for the thorough explanation.

      It just felt kind of wrong to be doing a substitution on all lines when I knew it was only the first that would have any effect, but it's comforting to know that it should be harmless.

      Joe
