http://www.perlmonks.org?node_id=863031


in reply to Re^4: Multiline CSV and XML
in thread Multiline CSV and XML

The first thing I want to point out is that splitting on commas is not enough to parse CSV. Your parser will break on any field that contains a comma, such as "Bring your water, beer, and trash bag." That alone is enough of a reason for me to say: use Text::CSV_PP instead so you get a real parser.
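To make the failure concrete, here is a minimal sketch comparing a naive split with Text::CSV_PP. The sample line and the variable names are made up for illustration; only new, parse, fields and error_diag come from the module itself.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# Hypothetical sample line: the second field contains commas, so it is quoted.
my $line = 'Pirates,"Bring your water, beer, and trash bag",noon';

# Naive approach: split sees every comma, including the quoted ones.
my @naive = split /,/, $line;

# Real parser: quoted fields stay intact.
my $csv = Text::CSV_PP->new( { binary => 1 } )
    or die "Cannot create parser: " . Text::CSV_PP->error_diag;
$csv->parse($line) or die "Parse failed: " . $csv->error_diag;
my @fields = $csv->fields;

printf "split found %d fields, Text::CSV_PP found %d\n",
    scalar @naive, scalar @fields;
```

The split version cuts the quoted sentence into pieces and leaves the quote characters in the data, while the module returns the three fields you actually meant.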

Your XML generator is fragile and does lots of unnecessary work. You can see that it's fragile because you're already going through a lot of pain trying to change your generator when the requirements change. I have to second dHarry and recommend you use XML::Simple instead. Even when the hash structure changes, generating output can be as simple as print XMLout($hash_ref);
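As a rough sketch of what that looks like, assuming a flat hash of scalar values (the keys, values and root element name below are placeholders, not your real data):

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical record; the keys and values are placeholders.
my $hash_ref = {
    theme    => 'Pirates',
    location => 'Riverside Park',
};

# NoAttr emits each key as a child element instead of an attribute;
# RootName names the enclosing element.
my $xml = XMLout( $hash_ref, NoAttr => 1, RootName => 'package' );
print $xml;
```

One thing to watch: hash keys become element names, so a key with a space in it, like 'picnic theme', is not a valid XML name and would need to be renamed first.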

The bulk of the unnecessary work in the XML generator is that you assign a variable for each of the values in the hash when there's no reason to do so. Just use the hash directly...

    my $gXtext = <<"GXEOF";
    <?xml version="1.0" encoding="UTF-8"?>
    <package xmlns="http://greateventbulatine/event/organizer"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <theme>${$gXHHRef}{'picnic theme'}</theme>

For escaping HTML entities, you might want HTML::Entities.
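A small sketch of how that might be used here. The sample value is invented, and restricting the character set to the markup characters is my assumption about what you want for XML output, since named HTML entities for accented characters are not defined in plain XML.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical value containing characters that are special in markup.
my $theme = 'Surf & Turf <BYOB>';

# Second argument limits encoding to the characters listed.
my $safe = encode_entities( $theme, '<>&"' );
print "<theme>$safe</theme>\n";
```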

Now, if, after all that, you still want fragile code that's hard to maintain, well, you've already spelled out what you want to do, so do it. You have to store the last header and add a loop that keeps reading the file until your conditions are met. Assuming that each record is in a separate file, I suppose I'd go at it something like this...

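    use strict;
    use warnings;
    use XML::Simple;

    # Assumption: one record per file, file name given on the command line.
    my $file = shift or die "usage: $0 file.csv\n";
    open my $csv, '<', $file or die "Cannot open $file: $!";

    # scalar() matters: a bare <$csv> in an argument list slurps the whole file
    my @headers = parseCVSLine( scalar <$csv> );
    my @record  = parseCVSLine( scalar <$csv> );

    my @extra_records;
    while ( my $line = <$csv> ) {
        my @xrec = parseCVSLine($line);
        push @extra_records, \@xrec if @xrec;
    }
    close $csv;

    my %data;
    @data{@headers} = @record;

    # but I still don't know what to do with those extra records so ...
    print XMLout( \%data );

    # yet another lame CSV "parser", use Text::CSV
    sub parseCVSLine {
        my ($line) = @_;
        return unless defined $line;
        chomp $line;
        return split /,/, $line;
    }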

I still don't know what the boundary between records is. Normally, you would expect a line terminator but with multiline records, you need to use something other than the typical line terminator. In the code above, I assumed that the boundary was the file but perhaps that's wrong. In any case, you really ought to use a clear record boundary.
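For what it's worth, here is one way a clear boundary could look. This is only a sketch under my own assumption that a blank line ends each record; the sample data is invented, and your real files may call for a different separator.

```perl
use strict;
use warnings;

# Hypothetical data: each record is a header line plus one or more rows,
# and a blank line marks the end of the record.
my $data = <<'END';
theme,location
Pirates,Riverside Park
Pirates,Shelter 2

theme,location
Luau,Beach
END

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '';    # paragraph mode: read up to the next blank line
    while ( my $chunk = <$fh> ) {
        my @lines = grep { length } split /\n/, $chunk;
        push @records, \@lines;
    }
}
close $fh;

printf "record %d has %d lines\n", $_ + 1, scalar @{ $records[$_] }
    for 0 .. $#records;
```

With the input record separator set this way, each read hands back one whole record, however many lines it spans, so the code no longer has to guess where a record stops.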

I really think you should use the modules so you don't end up where you are now: with a collection of scripts that all have fragile parsers and you have to go tweak every script every time there's a minor change to the data format. Like, what happens when people want a link to the map to get to the picnic? Because you've locked yourself into this format, you'll have to tweak every file that touches that data. Try to take a more fluid approach where you can. My lame example script above doesn't care at all what data is in the files and will still just work.
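To illustrate the point about not caring what columns exist, here is a tiny sketch. The make_record helper, the column names and the map URL are all hypothetical; the only real technique is the hash slice.

```perl
use strict;
use warnings;

# Build a record from whatever headers the file happens to have.
sub make_record {
    my ( $headers, $values ) = @_;
    my %data;
    @data{@$headers} = @$values;
    return \%data;
}

# Hypothetical data before and after someone asks for a map link.
my $old = make_record( [qw(theme location)], [ 'Pirates', 'Riverside Park' ] );
my $new = make_record(
    [qw(theme location map)],
    [ 'Pirates', 'Riverside Park', 'http://example.com/map' ],
);

print join( ', ', sort keys %$new ), "\n";
```

The new column shows up in the hash without touching make_record, which is the kind of flexibility a hard-coded variable per field can't give you.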

Above all, my number one reason to recommend the modules is, you could have been done by now and you would have something that's stable, robust, easy to read and, therefore, easy to maintain.