Multiline CSV and XML

sanju7 has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a event card case. All data related to these events are stored in CSVs identified by event's name as i "filename.csv" in a directory.The detail include the event's description, detail of place etc and filename of a corresponding pic. The script to read the directory containing these CSVs read individual values there against each columns, on certain case(when same event is happening at different location. eg suppose summer picnic happening at two location) when there are multiple rows read them too. Depending on the value/s at particular columns create a xml file that represents the event, puts the xml in a separate directory named as the value in csv column indicating pic file.

The idea is to parse the csv files on the go and keep processing csvs and once done delete csvs, so very minimum footprint(no need to open all files in memory) allowing processing 1000s of csvs. I am not aware any external module can help this way so i used bare minimum external modules. At the end of parsing all csvs i will be having all event files created in individual directories.

Here is my code:

#!/usr/bin/perl -w

use strict;
use Getopt::Std;
use URI::Escape;

if ($#ARGV==-1) {
    displayUsage();
    exit -1;
}

getopts('ndc:o:s:');

my $csvDir;
if ($::opt_c) {
    if (-d $::opt_c) {
        $csvDir = $::opt_c;
    } else {
        print STDERR "FATAL: Specified CSV directory nof found: ".$::o
+pt_c."\n";
        displayUsage();
        exit -1;
    }
} else {
    print STDERR "FATAL: A CSV directory needed. (-c)\n";
    displayUsage();
    exit -1;
}

my $outputDir;
if ($::opt_o) {
    if (-d $::opt_o) {
        $outputDir = $::opt_o;
    } else {
        print STDERR "FATAL: Specified output not found: ".$::opt_o."\
+n";
        displayUsage();
        exit -1;
    }
} else {
    print STDERR "FATAL: An output directory needed. (-o)\n";
    displayUsage();
    exit -1;
}

##

my $addSuffix;
$addSuffix = $::opt_s if ($::opt_s);

my $suppressFileTag;
$suppressFileTag++ if ($::opt_n);

my $deleteCSVs;
$deleteCSVs++ if ($::opt_d);

# Sanitize slashes
if ($outputDir !~ /[\\\/]/) {
    $outputDir .= "\\";
} else {
    my $outputDirSlash = ($outputDir =~ /\\/ ? "\\" : "\/");
    $outputDir .= $outputDirSlash unless($outputDir =~ /$outputDirSlas
+h$/);
}

if ($csvDir !~ /[\\\/]/) {
    $csvDir .= "\\";
} else {
    my $csvDirSlash = ($csvDir =~ /\\/ ? "\\" : "\/");
    $csvDir .= $csvDirSlash unless($csvDir =~ /$csvDirSlash$/);
}

# Check directories

unless(-d $csvDir) {
    print STDERR "FATAL: Could not find specified CSV directory: ".$cs
+vDir."\n";
    exit -1;
}

unless(-d $outputDir) {
    print STDERR "FATAL: Could not find specified output directory: ".
+$outputDir."\n";
    exit -1;
}

# Read all CSVs
unless(opendir(CSV, $csvDir)) {
    print STDERR "FATAL: Unable to open CSV directory [".$csvDir."]: "
+.$!."\n";
    exit -1;
}

while(my $entry = readdir(CSV)) {
    next if ($entry =~ /^\.\.*$/);
    print STDERR "INFO: Processing CSV: ".$entry."\n";

    my %metaData;

    readCSV($csvDir.$entry, \%metaData);

    if ($::opt_d) {
        unless(unlink($csvDir.$entry)) {
            print STDERR "ERROR: Unable to remove CSV file [".$csvDir.
+$entry."]: ".$!."\n";
        }
    }
    
# Create metadata directories
    foreach my $metaEntry (sort keys %metaData) {
        my $entryDir = $outputDir.$metaEntry.".card";
        if (!-d $entryDir) {
            unless(mkdir($entryDir)) {
                print STDERR "FATAL: Unable to create directory [".$en
+tryDir."]: ".$!."\n";
                exit -1;
            }
        }
    }    
    
##
    # Generate XML
    foreach my $metaEntry (sort keys %metaData) {
        my $eventDataFile = $outputDir.$metaEntry.".card".($outputDir 
+=~ /\\/ ? "\\" : "\/")."eventDetail.xml";
                my %fileHash = %{$metaData{$metaEntry}};
        my $XML = generateXML(\%fileHash, $addSuffix, $suppressFileTag
+);
        print STDERR "INFO: Writing [".$eventDataFile."]...\n";
        unless(open(OUT, ">", $eventDataFile)) {
            print STDERR "FATAL: Unable to open file [".$eventDataFile
+."] for write: ".$!."\n";
            exit -1;
        }
        print OUT $XML."\n";
        close(OUT);
        print STDERR "INFO: Done!\n";
    }
    print STDERR "INFO: Done Processing CSV: ".$entry."\n";
}

closedir(CSV);

exit 0;

sub readCSV {
    return undef unless($#_==1);
    my $mKfile = shift(@_);
    my $mKHHRef = shift(@_);

    unless(open(CSV, "<", $mKfile)) {
        print STDERR "FATAL: Unable to open CSV file [".$mKfile."]: ".
+$!."\n";
        exit -1;
    }

    my $mKlineCnt = 0;

    # Fields Setup
    my @mKHDFields = parseCSVLine(scalar(<CSV>));

    my %mKHH;

    for(my $i=0;$i!=($#mKHDFields+1);$i++) {
        # Adjust Fields to Lower Case
        $mKHDFields[$i] =~ tr/A-Z/a-z/;
        # Clean out tags like <au> <gb>, etc.
        $mKHDFields[$i] =~ s/\<([^>]+)\>//g;
        # Clean leading and trailing spaces
        $mKHDFields[$i] =~ s/^\s+//g;
        $mKHDFields[$i] =~ s/\s+$//g;
    }

    # Load Data
    while(my $mKline = <CSV>) {
        $mKline =~ s/(\n|\r)//g;
        $mKlineCnt++;
        my $advance=5;
        if (((() = $mKline =~ /\"/g) % 2)!=0) {
            while(($advance)&&(((() = $mKline =~ /\"/g) % 2)!=0)) {
                my $temp = <CSV>;
                $temp =~ s/(\n|\r)//g;
                $advance--;
                $mKline .= $temp;
            }
        }
        my @mKrowFields = parseCSVLine($mKline);

        my %tempHash;

        my $mKfieldCnt = 0;
        foreach my $mKrowField (@mKrowFields) {
            my $key = $mKHDFields[$mKfieldCnt]; $mKfieldCnt++;
            next if ($mKrowField =~ /^(\s+)*None(\s+)*$/i); # Patch to
+ avoid 'None' values
            $tempHash{$key} = $mKrowField;
        }

        if ((!exists($tempHash{'filename'}))&&
            (exists($tempHash{'file name'}))) {
            $tempHash{'filename'} = $tempHash{'file name'};
        }

        if ($tempHash{'filename'}) {
            ${$mKHHRef}{${tempHash{'filename'}}} = \%tempHash;
        }
    }
    close(CSV);
    return 1;
}

sub parseCSVLine {
    return undef unless($#_>=0);
    my $zzzline = shift(@_);
    my $zzzfieldCnt = shift(@_);

    # Must have an even number of quotes
    return undef if (((() = $zzzline =~ /\"/g) % 2)!=0);

    my @zzzarray = split(/\,/,$zzzline);
    my @zzzreturn;

    while($#zzzarray!=-1) {
        my $zzzentry = shift(@zzzarray);
        if ($zzzentry =~ /\"/) {
            while (((() = $zzzentry =~ /\"/g) % 2)!=0) {

                $zzzentry .= ",".shift(@zzzarray);
            }
        }
        # Strip leading quotes
        $zzzentry =~ s/^\"//g;
        # Strip trailing quotes
        $zzzentry =~ s/\"$//g;
        # Strip leading spaces
        $zzzentry =~ s/^\s+//g;
        # Strip trailing spaces
        $zzzentry =~ s/\s+$//g;
        # Convert double quotes to single.
        $zzzentry =~ s/\"\"/\"/g;
        push(@zzzreturn,$zzzentry);
    }
    return (@zzzreturn) if ($#zzzreturn!=-1);
}

## trim down the sub xml etc

sub generateXML {
    return undef unless(ref($_[0]) eq 'HASH');

    my $gXHHRef = shift(@_);

    my $gXfileSuffix; # optional
    if ($#_>=0) {
        my $suffix = shift(@_);

        $gXfileSuffix = ".".$suffix if ($suffix);
    }

    my $gXsuppressFileTag;
    if ($#_>=0) {
        $gXsuppressFileTag = shift(@_);
    }

    

    my $gXpthe = ${$gXHHRef}{'picnic theme'};
    my $gXorgname = ${$gXHHRef}{'organizer name'};
    my $gXvaddr = ${$gXHHRef}{'venue address'};
    my $gXeLoc = ${$gXHHRef}{'event location'};
    my $gXpFull = html_escape(${$gXHHRef}{'event detail'});
    my $gXtot = html_escape(${$gXHHRef}{'total allowed heads'});
    my $gXgpar = html_escape(${$gXHHRef}{'maximum game participants'})
+;
    my $gXecont = ${$gXHHRef}{'star attraction'};

    my $gXpdate = reformatDate(${$gXHHRef}{'picnic date'});




  

    my $gXtext = <<"GXEOF";
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://greateventbulatine/event/organizer" xmlns:xsi="
+http://www.w3.org/2001/XMLSchema-instance">
    <theme>$gXpthe</theme>
    <video>
        <xml line1>Event Detail</xml line1>
        <xml line2>$gXpthe</xml line2>
        <xml line3>$gXorgname</xml line3>
        <xml line4>$gXvaddr</xml line4>
        <xml line5>$gXpFull</xml line5>
        <xml line6>$gXtot</xml line6>
        <xml line7>$gXgpar</xml line7>
        <xml line8>$gXecont</xml line8>
        <xml line9>$gXpdate</xml line9>
        <xml line10>$gXeLoc</xml line10>

GXEOF

    unless($gXsuppressFileTag) {
    $gXtext .= <<"GXEOF";
        <data_file>
            <file_name>$gXorgname$gXfileSuffix</file_name>
        </data_file>
GXEOF
    }

    return $gXtext;
}



sub reformatDate {
    my $rDdate = shift(@_);
    if ($rDdate =~ /([^\/+]+)\/([^\/+]+)\/(\d\d\d\d)/) {
        my ($rDmon,$rDday,$rDyear) = ($1,$2,$3);
        $rDdate = sprintf('%-04.04d-%-02.02d-%-02.02d', $rDyear, $rDmo
+n, $rDday);
    }
    return $rDdate;
}


sub html_escape {
   my $value = shift(@_);
   $value =~ s/\&/\&amp\;/g;
   $value =~ s/([\x00-\x1f\x21-\x25\x27-\x2b\x2f\x3a\x3c-\x40\x5b-\x5e
+\x60\x7b-\xff])/"&#".uc(sprintf "%lu" , unpack("C", $1)).";"/eg;
#    Checking ..
    print "variable [value] has = $value \n" ;   
   return $value;
}


sub displayUsage {
    print STDERR <<"DU_EOF";
Usage: $0 [-d] [-c csv dir] [-o output dir] 
        -d      Delete CSV
        -c      CSV dir
        -o      Output (usually current dir)
        
DU_EOF
}
[download]

The above code successfully parces the csv and create an xml that represent the event inside a separare directory called "eventname.card". Below is the input data and output data.

Input file name "Info_venuename_Eventdata.csv". file content below (Cat "Info_venuename_Eventdata.csv")

PICNIC THEME,ORGANIZER NAME,TYPE,VENUE ADDRESS,EVENT LOCATION,EVENT DE
+TAIL,TOTAL ALLOWED HEADS,MAXIMUM GAME PARTICIPANTS,STAR ATTRACTION,PI
+CNIC DATE,FILENAME,bbbb,cccc,dddd,eeee,gggg,ffff,hhhh,iiii,jjjj,llll,
+kkkk,mmmm,nnnn,oooo,pppp,
summer,xyz corp,web,address-w-zip,san pedro,Summer picnic with celibri
+ty. Be creating and bring your idea. There is sporting event. One mea
+l served at picnic. Bring your water and trash bag. No camera allowed
+.,100,20,Ben Stiller,11/11/2010,summer_picnic,bb,cccc,dddd,eeee,gggg,
+ffff,hhhh,iiii,jjjj,llll,kkkk,mmmm,nnnn,oooo,pppp,
,,,,west harlem,,,,Chris Rock,10/10/2010,,,,,,,,,,,,,,,,,
,,,,san redo,,,,Mr Miller,9/9/2010,,,,,,,,,,,,,,,,,
[download]

Output xml file (cat /.../eventdir/"eventDetail.xml")

<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://greateventbulatine/event/organizer" xmlns:xsi="
+http://www.w3.org/2001/XMLSchema-instance">
    <theme>summer</theme>
    <video>
        <xml line1>Event Detail</xml line1>
        <xml line2>summer</xml line2>
        <xml line3>xyz corp</xml line3>
        <xml line4>address-w-zip</xml line4>
        <xml line5>Summer picnic with celibrity. Be creating and bring
+ your idea. There is sporting event. One meal served at picnic. Bring
+ your water and trash bag. No camera allowed.</xml line5>
        <xml line6>100</xml line6>
        <xml line7>20</xml line7>
        <xml line8>Ben Stiller</xml line8>
        <xml line9>2010-11-11</xml line9>
        <xml line10>san pedro</xml line10>

        <data_file>
            <file_name>xyz corp.jpg</file_name>
        </data_file>
[download]

Puzzle area :

I am not getting how to device the parseCSVLine and readCSV subroutines in this regard to get values from second row and third row and onwards. This was relatively easy to read each row one at time, however when there is more than 1 row present then ideally i need to create a decrementing loop =number of rows and read each row in separate hash table. However i am kind of not making through it yet to read 2nd row onwards. Ideally readCSV need to do the following:

<1> count the number of rows. <2> if more than 1 --first(primary; only this row has filename of pic ) row read first row to a temphash . <3> read rest of rows (secondary; 2nd row onward only have addnl location specif info only) to a different temphashes. <4> Once i get each row in separate temphash i would be fine getting values from them as needed. <5> there could be as many rows . Your suggestion is deeply appreciated.(code posted here is tested)

Comment on Multiline CSV and XML Select or Download Code

Replies are listed 'Best First'.
Re: Multiline CSV and XML by dHarry (Abbot) on Sep 29, 2010 at 12:41 UTC
I have difficulty understanding what exactly you do in your code. You seem to over-complicate things. There are several modules available for parsing CSV files, no need to reinvent the weel. It would probably reduce your code by 30% or so. I would not worry too much about footprint, you control the opening/closing of files and wether you slurp them or not. Also for generating the XML files you can use a module. You could for example start by taking a look at XML::Simple which seems enough for what you need to do.	[reply]
Re^2: Multiline CSV and XML by sanju7 (Acolyte) on Sep 29, 2010 at 14:50 UTC
Hi dHarry, Its not about reinventing the wheel, the system hosting the code is a hybrid one and has an embedded perl (dds perl) running on them. That simplest of task like installing perl modules is not as straight forward and not necessarily all modules supported. It breaks the code and in my previous attempts had been failure with text::csv modules .	[reply]
Re^3: Multiline CSV and XML by sanju7 (Acolyte) on Sep 30, 2010 at 22:38 UTC
I am still to figure the puzzle. Is there a better way to construct the code even with using modules. I will try to install them if its possible. Can anyone tell me how can i access multiple lines as hash of hashes against 1st row fields(column heading) as key from a multiline csv	[reply]
Re^3: Multiline CSV and XML by rowdog (Curate) on Oct 01, 2010 at 06:46 UTC
I can see where you might have trouble building Text::CSV_XS but you should be able to use Text::CSV_PP. I mean no disrespect and I realize that English is (probably) not your first language but I'm having a hard time trying to figure out what the problem is here. You said ... I am not getting how to device the parseCSVLine and readCSV modules I don't know what you mean when you say you want to "device" those "modules" (which are actually subroutines). If you can use Text::CSV_PP will you still have the same problem? If not, can you ask the question again in a different way? I don't think your language is Spanish but, For What It's Worth, puedes preguntarme en privado en Espanol. (My apologies to any actual Spanish speakers with real keyboards.)	[reply]
Re^4: Multiline CSV and XML by sanju7 (Acolyte) on Oct 01, 2010 at 19:38 UTC
Re^5: Multiline CSV and XML by rowdog (Curate) on Oct 02, 2010 at 01:31 UTC


P is for Practical
	PerlMonks