Text::xSV -- how to filter only first line?

blahblahblah has asked for the wisdom of the Perl Monks concerning the following question:

I wrote some CSV-reading code a few weeks ago and used the Text::xSV module instead of our old home-grown logic that we use in other parts of our application. One of the things our old code does is strip out a possible UTF Byte Order Mark at the beginning of the file. I played around with various ways of getting that same result using Text::xSV, which lets you specify a custom pre-processing filter for each line of data. Short of modifying the module, I was never able to figure out a way to apply that filter to only the first line, and eventually settled on applying it to all lines (and hoping that no other line would start with /^\x{EF}\x{BB}\x{BF}/).

I had forgotten about it until I saw UTF-8 text files with Byte Order Mark recently.

Just curious if anyone can think of a clever way to make a Text::xSV filter work on only the first line of the file it's reading?

Thanks,
Joe

Re: Text::xSV -- how to filter only first line? by graff (Chancellor) on Feb 14, 2007 at 05:18 UTC
Well, if you are in fact dealing with utf8-encoded "xSV" files, then you shouldn't worry about applying a BOM-removal filter on all lines. Just in case there might be any more BOMs scattered throughout the file, you'll want to remove them all, because they should not be treated as if they were part of the actual table data.

In other words, it is OKAY to have a pre-processing filter that removes BOM characters from every line -- something like `s/\x{feff}//g` as a filter is perfectly sane. Note that there are some situations that really can create a text file with a BOM at the start of every line (I've seen it happen), so having logic that applies that filter to every line might just save you from some real trouble. (And of course, on lines that don't have a BOM, such a filter doesn't do anything at all, so it's quite harmless.)

The particular unicode character called "BOM" serves no other purpose than to be the byte-order-mark -- at least, that is the purpose it was chosen for. It simply gets in the way and makes trouble if you happen to treat it as if it were data, and it is of course logically useless in a utf8 file anyway (even though some MS-Windows apps insert it routinely when creating utf8 text files -- and, heaven help us, Redmond or MS-centric tool developers may start using file-initial BOM as a kind of "signature" or "magic number" that they "need" to use for identifying files as being utf8 text).*

As for having Text::xSV do anything special with just the first line of a file, it already has logic to treat the first line as containing the "column headings", as opposed to containing actual data. If you need anything special beyond that for just the first line, you'd need to be a little more specific about your intended usage and the nature of what you are trying to accomplish -- e.g. what you've tried, how it failed, etc.

(* update/footnote: Now that I think of it, Notepad, which is one of those apps that automatically puts a BOM at the beginning of every plain-text utf8 file it creates, appears to be already depending on the BOM as a "magic number" for identifying utf8 text files -- if you use perl to create a utf8 file with wide characters but without an initial BOM, then open that file in Notepad, it's likely not to display the wide-character text correctly.)
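
A minimal sketch of the every-line filter described above, with one caveat spelled out: `s/\x{feff}//g` only matches once the line has been decoded into characters (for instance, when the file handle was opened with an `:encoding(UTF-8)` layer). On undecoded input the BOM shows up as the three octets EF BB BF, which is what the pattern in the original question targets. The sub names `strip_bom_chars` and `strip_bom_bytes` are made up for illustration and are not part of Text::xSV.

```perl
use strict;
use warnings;

# Hypothetical helper: for lines already decoded to characters,
# where the BOM is the single character U+FEFF.
sub strip_bom_chars {
    my ($line) = @_;
    $line =~ s/\x{FEFF}//g;
    return $line;
}

# Hypothetical helper: for raw, undecoded lines, where a UTF-8 BOM
# is the three-octet sequence EF BB BF.
sub strip_bom_bytes {
    my ($line) = @_;
    $line =~ s/\xEF\xBB\xBF//g;
    return $line;
}

# Raw bytes as they would come from a handle with no encoding layer.
my @raw_lines = ("\xEF\xBB\xBFname,age\n", "Joe,42\n");
print strip_bom_bytes($_) for @raw_lines;
```

Either sub has the shape the question describes for a Text::xSV filter (one line in, one line out). Which one applies depends on whether the data reaching the filter has been decoded, so that is worth checking before picking one.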
Re^2: Text::xSV -- how to filter only first line? by blahblahblah (Priest) on Feb 14, 2007 at 15:19 UTC
Thanks for the thorough explanation. It just felt kind of wrong to be doing a substitution on all lines when I knew it was only the first that would have any effect, but it's comforting to know that it should be harmless.

Joe
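
For anyone who still wants the filter to act on the first line only, one possible approach is a closure that remembers whether it has been called before. This is only a sketch: `make_first_line_bom_filter` is a hypothetical name, and it assumes the filter is called once per line, in file order, with the line as its only argument. How the resulting code ref is handed to Text::xSV (its filter setting) should be checked against the module's documentation.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a filter that strips a leading BOM
# (decoded U+FEFF or raw octets EF BB BF) from the first line it is
# given, and passes every later line through untouched.
sub make_first_line_bom_filter {
    my $seen_first = 0;    # private to the returned closure
    return sub {
        my ($line) = @_;
        return $line if $seen_first++;    # not the first call: no-op
        $line =~ s/^(?:\x{FEFF}|\xEF\xBB\xBF)//;
        return $line;
    };
}

# Stand-alone demonstration, no Text::xSV required.
my $filter = make_first_line_bom_filter();
my @lines  = ("\xEF\xBB\xBFid,name\n", "1,Joe\n");
print $filter->($_) for @lines;
```

The closure keeps its own counter, so a fresh filter has to be created for each file; reusing one across files would leave the second file's BOM in place.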

