http://www.perlmonks.org?node_id=862914


in reply to Re^2: Multiline CSV and XML
in thread Multiline CSV and XML

I can see where you might have trouble building Text::CSV_XS, but you should be able to use Text::CSV_PP.

I mean no disrespect and I realize that English is (probably) not your first language but I'm having a hard time trying to figure out what the problem is here. You said ...

I am not getting how to device the parseCSVLine and readCSV modules

I don't know what you mean when you say you want to "device" those "modules" (which are actually subroutines). If you can use Text::CSV_PP, will you still have the same problem? If not, can you ask the question again in a different way?

I don't think your language is Spanish but, For What It's Worth, puedes preguntarme en privado en Espanol (you can ask me privately in Spanish). (My apologies to any actual Spanish speakers with real keyboards.)

Re^4: Multiline CSV and XML
by sanju7 (Acolyte) on Oct 01, 2010 at 19:38 UTC

    Let me ask the question a bit more clearly. Consideration 1 (no use of external module):

    How can I have the readCSV subroutine read and populate multiple arrays if there is more than one row? I am fairly new to Perl and have yet to get over common mistakes.

    The readCSV sub does read the CSV files on the fly and creates each XML file in its directory. This works fine for XMLs where there is only one row. In readCSV I could read the row into an array, extract each element against its column heading (as the key) into a variable, and use those variables to create the XML with the generateXML sub. So I didn't need to worry about counting rows or populating more than one array.

    Now, when there are multiple rows in the CSV, the situation changes. readCSV not only has to sort the keys to get the values from the first array it reads; it also has to count the rows and, if there is more than one, read each row into its own separate array, dynamically, so that parseCSV can parse each array and do the regex substitutions and whatever other operations are required. This would do the cleanup and manipulations, apply the rules, and pass everything to generateXML in one go. (A sketch of this structure follows below.)
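    Roughly, the structure I am after could look like this sketch (the data and the naive comma split are just stand-ins for the real readCSV/parseCSVLine code):

    use strict;
    use warnings;

    # Stand-in data: the first line is the header row, later lines are records.
    my @lines = ( "theme,location", "summer bash,park", "winter gala,hall" );

    my @headers = split /,/, $lines[0];     # keys come from the header row
    my @rows;
    for my $line ( @lines[ 1 .. $#lines ] ) {
        my %row;
        @row{@headers} = split /,/, $line;  # hash slice: header => field
        push @rows, \%row;                  # one hash ref per CSV row
    }

    # generateXML could then be invoked once per element of @rows.
    printf "%s at %s\n", $_->{theme}, $_->{location} for @rows;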

    Consideration 2 (use of external module):

    As a personal preference I should probably use a module to do the job, which would make life a lot easier. But saying that is easier than doing it, because I have inherited a bunch of other scripts and I really don't want to see them break if I change the code entirely. I would rather build the whole thing from the ground up, but for now I thought fixing this would take less time than starting afresh with Text::CSV_PP or any other module.

      The first thing I want to point out is that splitting on commas is not enough to parse CSV. Your parser will break on data that includes a comma: "Bring your water, beer, and trash bag". That alone is enough of a reason for me to say: use Text::CSV_PP instead so you get a real parser. (A quick demonstration follows.)
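      Here is a side-by-side parse of one line, as a minimal sketch (the sample line is made up):

      use strict;
      use warnings;
      use Text::CSV_PP;

      my $line = 'picnic,"Bring your water, beer, and trash bag",Riverside';

      # Naive split: the quoted field is torn apart at its embedded commas.
      my @naive = split /,/, $line;                 # 5 fields -- wrong

      # A real parser understands the quoting rules.
      my $csv = Text::CSV_PP->new( { binary => 1 } );
      $csv->parse($line) or die "CSV parse failed";
      my @fields = $csv->fields;                    # 3 fields -- right

      print scalar @naive, " naive fields vs ", scalar @fields, " real fields\n";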

      Your XML generator is fragile and does lots of unnecessary work. You can see that it's fragile because you're already going through a lot of pain trying to change your generator when the requirements change. I have to second dHarry and recommend you use XML::Simple instead. Even when the hash structure changes, generating output can be as simple as print XMLout($hash_ref);
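      For example (a minimal sketch; the keys and the options are only stand-ins for your real data):

      use strict;
      use warnings;
      use XML::Simple;

      # Whatever keys the hash happens to contain end up in the output,
      # so adding a column to the CSV does not require touching this code.
      my $data = {
          theme    => 'company picnic',
          location => 'Riverside Park',
      };
      print XMLout( $data, RootName => 'package', NoAttr => 1 );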

      The bulk of the unnecessary work in the XML generator is that you assign a variable for each of the values in the hash when there's no reason to do so. Just use the hash directly...

      my $gXtext = <<"GXEOF";
      <?xml version="1.0" encoding="UTF-8"?>
      <package xmlns="http://greateventbulatine/event/organizer"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <theme>${$gXHHRef}{'picnic theme'}</theme>

      For escaping HTML entities, you might want HTML::Entities.
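      For instance (a minimal sketch with made-up data):

      use strict;
      use warnings;
      use HTML::Entities;

      # encode_entities escapes characters that would break the markup,
      # e.g. & < > and quotes.
      my $theme = 'fish & chips <outdoors>';
      print encode_entities($theme), "\n";   # fish &amp; chips &lt;outdoors&gt;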

      Now, if, after all that, you still want fragile code that's hard to maintain, well, you've already spelled out what you want to do, so do it. You have to store the last header and add a conditional that loops through the file until your conditions are met. Assuming that each record is in a separate file, I suppose I'd go at it something like this...

      use strict;
      use warnings;
      use XML::Simple;

      # assumes a CSV filehandle opened elsewhere;
      # scalar forces a single-line read (a bare <CSV> in a sub call
      # would be in list context and slurp the whole file)
      my @headers = parseCSVLine( scalar <CSV> );
      my @record  = parseCSVLine( scalar <CSV> );

      my @extra_records;
      while ( my $line = <CSV> ) {
          my @xrec = parseCSVLine($line);
          push @extra_records, \@xrec if @xrec;
      }

      my %data;
      @data{@headers} = @record;
      # but I still don't know what to do with those extra records so ...
      print XMLout( \%data );

      # yet another lame CSV "parser", use Text::CSV
      sub parseCSVLine {
          my $line = shift;
          return unless defined $line;
          chomp $line;
          return split /,/, $line;
      }

      I still don't know what the boundary between records is. Normally you would expect a line terminator, but with multiline records you need something other than the typical line terminator to mark the end of a record. In the code above, I assumed that the boundary was the file, but perhaps that's wrong. In any case, you really ought to use a clear record boundary.
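      One clean convention, if you can change the data format, is to separate records with a blank line and read the file in Perl's paragraph mode. A sketch under that assumption (the filename is made up):

      use strict;
      use warnings;

      # Setting $/ to the empty string puts readline in paragraph mode:
      # each read returns everything up to the next run of blank lines.
      local $/ = "";
      open my $fh, '<', 'events.csv' or die "events.csv: $!";
      while ( my $record = <$fh> ) {
          chomp $record;                    # drop the trailing newlines
          my @lines = split /\n/, $record;  # the lines of one record
          # ... hand @lines to the CSV parser, one record at a time ...
      }
      close $fh;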

      I really think you should use the modules so you don't end up where you are now: with a collection of scripts that all have fragile parsers, where you have to go tweak every script every time there's a minor change to the data format. Like, what happens when people want a link to a map to get to the picnic? Because you've locked yourself into this format, you'll have to tweak every file that touches that data. Try to take a more fluid approach where you can. My lame example script above doesn't care at all what data is in the files and will still just work.

      Above all, my number one reason to recommend the modules is, you could have been done by now and you would have something that's stable, robust, easy to read and, therefore, easy to maintain.