I am working on a event card case. All data related to these events are stored in CSVs identified by event's name as i "filename.csv" in a directory.The detail include the event's description, detail of place etc and filename of a corresponding pic. The script to read the directory containing these CSVs read individual values there against each columns, on certain case(when same event is happening at different location. eg suppose summer picnic happening at two location) when there are multiple rows read them too.
Depending on the value/s at particular columns create a xml file that represents the event, puts the xml in a separate directory named as the value in csv column indicating pic file.
The idea is to parse the csv files on the go and keep processing csvs and once done delete csvs, so very minimum footprint(no need to open all files in memory) allowing processing 1000s of csvs. I am not aware any external module can help this way so i used bare minimum external modules. At the end of parsing all csvs i will be having all event files created in individual directories.
Here is my code:
#!/usr/bin/perl -w
use strict;
use Getopt::Std;
use URI::Escape;
if ($#ARGV==-1) {
displayUsage();
exit -1;
}
getopts('ndc:o:s:');
my $csvDir;
if ($::opt_c) {
if (-d $::opt_c) {
$csvDir = $::opt_c;
} else {
print STDERR "FATAL: Specified CSV directory nof found: ".$::o
+pt_c."\n";
displayUsage();
exit -1;
}
} else {
print STDERR "FATAL: A CSV directory needed. (-c)\n";
displayUsage();
exit -1;
}
my $outputDir;
if ($::opt_o) {
if (-d $::opt_o) {
$outputDir = $::opt_o;
} else {
print STDERR "FATAL: Specified output not found: ".$::opt_o."\
+n";
displayUsage();
exit -1;
}
} else {
print STDERR "FATAL: An output directory needed. (-o)\n";
displayUsage();
exit -1;
}
##
my $addSuffix;
$addSuffix = $::opt_s if ($::opt_s);
my $suppressFileTag;
$suppressFileTag++ if ($::opt_n);
my $deleteCSVs;
$deleteCSVs++ if ($::opt_d);
# Sanitize slashes
if ($outputDir !~ /[\\\/]/) {
$outputDir .= "\\";
} else {
my $outputDirSlash = ($outputDir =~ /\\/ ? "\\" : "\/");
$outputDir .= $outputDirSlash unless($outputDir =~ /$outputDirSlas
+h$/);
}
if ($csvDir !~ /[\\\/]/) {
$csvDir .= "\\";
} else {
my $csvDirSlash = ($csvDir =~ /\\/ ? "\\" : "\/");
$csvDir .= $csvDirSlash unless($csvDir =~ /$csvDirSlash$/);
}
# Check directories
unless(-d $csvDir) {
print STDERR "FATAL: Could not find specified CSV directory: ".$cs
+vDir."\n";
exit -1;
}
unless(-d $outputDir) {
print STDERR "FATAL: Could not find specified output directory: ".
+$outputDir."\n";
exit -1;
}
# Read all CSVs
unless(opendir(CSV, $csvDir)) {
print STDERR "FATAL: Unable to open CSV directory [".$csvDir."]: "
+.$!."\n";
exit -1;
}
while(my $entry = readdir(CSV)) {
next if ($entry =~ /^\.\.*$/);
print STDERR "INFO: Processing CSV: ".$entry."\n";
my %metaData;
readCSV($csvDir.$entry, \%metaData);
if ($::opt_d) {
unless(unlink($csvDir.$entry)) {
print STDERR "ERROR: Unable to remove CSV file [".$csvDir.
+$entry."]: ".$!."\n";
}
}
# Create metadata directories
foreach my $metaEntry (sort keys %metaData) {
my $entryDir = $outputDir.$metaEntry.".card";
if (!-d $entryDir) {
unless(mkdir($entryDir)) {
print STDERR "FATAL: Unable to create directory [".$en
+tryDir."]: ".$!."\n";
exit -1;
}
}
}
##
# Generate XML
foreach my $metaEntry (sort keys %metaData) {
my $eventDataFile = $outputDir.$metaEntry.".card".($outputDir
+=~ /\\/ ? "\\" : "\/")."eventDetail.xml";
my %fileHash = %{$metaData{$metaEntry}};
my $XML = generateXML(\%fileHash, $addSuffix, $suppressFileTag
+);
print STDERR "INFO: Writing [".$eventDataFile."]...\n";
unless(open(OUT, ">", $eventDataFile)) {
print STDERR "FATAL: Unable to open file [".$eventDataFile
+."] for write: ".$!."\n";
exit -1;
}
print OUT $XML."\n";
close(OUT);
print STDERR "INFO: Done!\n";
}
print STDERR "INFO: Done Processing CSV: ".$entry."\n";
}
closedir(CSV);
exit 0;
sub readCSV {
return undef unless($#_==1);
my $mKfile = shift(@_);
my $mKHHRef = shift(@_);
unless(open(CSV, "<", $mKfile)) {
print STDERR "FATAL: Unable to open CSV file [".$mKfile."]: ".
+$!."\n";
exit -1;
}
my $mKlineCnt = 0;
# Fields Setup
my @mKHDFields = parseCSVLine(scalar(<CSV>));
my %mKHH;
for(my $i=0;$i!=($#mKHDFields+1);$i++) {
# Adjust Fields to Lower Case
$mKHDFields[$i] =~ tr/A-Z/a-z/;
# Clean out tags like <au> <gb>, etc.
$mKHDFields[$i] =~ s/\<([^>]+)\>//g;
# Clean leading and trailing spaces
$mKHDFields[$i] =~ s/^\s+//g;
$mKHDFields[$i] =~ s/\s+$//g;
}
# Load Data
while(my $mKline = <CSV>) {
$mKline =~ s/(\n|\r)//g;
$mKlineCnt++;
my $advance=5;
if (((() = $mKline =~ /\"/g) % 2)!=0) {
while(($advance)&&(((() = $mKline =~ /\"/g) % 2)!=0)) {
my $temp = <CSV>;
$temp =~ s/(\n|\r)//g;
$advance--;
$mKline .= $temp;
}
}
my @mKrowFields = parseCSVLine($mKline);
my %tempHash;
my $mKfieldCnt = 0;
foreach my $mKrowField (@mKrowFields) {
my $key = $mKHDFields[$mKfieldCnt]; $mKfieldCnt++;
next if ($mKrowField =~ /^(\s+)*None(\s+)*$/i); # Patch to
+ avoid 'None' values
$tempHash{$key} = $mKrowField;
}
if ((!exists($tempHash{'filename'}))&&
(exists($tempHash{'file name'}))) {
$tempHash{'filename'} = $tempHash{'file name'};
}
if ($tempHash{'filename'}) {
${$mKHHRef}{${tempHash{'filename'}}} = \%tempHash;
}
}
close(CSV);
return 1;
}
sub parseCSVLine {
return undef unless($#_>=0);
my $zzzline = shift(@_);
my $zzzfieldCnt = shift(@_);
# Must have an even number of quotes
return undef if (((() = $zzzline =~ /\"/g) % 2)!=0);
my @zzzarray = split(/\,/,$zzzline);
my @zzzreturn;
while($#zzzarray!=-1) {
my $zzzentry = shift(@zzzarray);
if ($zzzentry =~ /\"/) {
while (((() = $zzzentry =~ /\"/g) % 2)!=0) {
$zzzentry .= ",".shift(@zzzarray);
}
}
# Strip leading quotes
$zzzentry =~ s/^\"//g;
# Strip trailing quotes
$zzzentry =~ s/\"$//g;
# Strip leading spaces
$zzzentry =~ s/^\s+//g;
# Strip trailing spaces
$zzzentry =~ s/\s+$//g;
# Convert double quotes to single.
$zzzentry =~ s/\"\"/\"/g;
push(@zzzreturn,$zzzentry);
}
return (@zzzreturn) if ($#zzzreturn!=-1);
}
## trim down the sub xml etc
sub generateXML {
return undef unless(ref($_[0]) eq 'HASH');
my $gXHHRef = shift(@_);
my $gXfileSuffix; # optional
if ($#_>=0) {
my $suffix = shift(@_);
$gXfileSuffix = ".".$suffix if ($suffix);
}
my $gXsuppressFileTag;
if ($#_>=0) {
$gXsuppressFileTag = shift(@_);
}
my $gXpthe = ${$gXHHRef}{'picnic theme'};
my $gXorgname = ${$gXHHRef}{'organizer name'};
my $gXvaddr = ${$gXHHRef}{'venue address'};
my $gXeLoc = ${$gXHHRef}{'event location'};
my $gXpFull = html_escape(${$gXHHRef}{'event detail'});
my $gXtot = html_escape(${$gXHHRef}{'total allowed heads'});
my $gXgpar = html_escape(${$gXHHRef}{'maximum game participants'})
+;
my $gXecont = ${$gXHHRef}{'star attraction'};
my $gXpdate = reformatDate(${$gXHHRef}{'picnic date'});
my $gXtext = <<"GXEOF";
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://greateventbulatine/event/organizer" xmlns:xsi="
+http://www.w3.org/2001/XMLSchema-instance">
<theme>$gXpthe</theme>
<video>
<xml line1>Event Detail</xml line1>
<xml line2>$gXpthe</xml line2>
<xml line3>$gXorgname</xml line3>
<xml line4>$gXvaddr</xml line4>
<xml line5>$gXpFull</xml line5>
<xml line6>$gXtot</xml line6>
<xml line7>$gXgpar</xml line7>
<xml line8>$gXecont</xml line8>
<xml line9>$gXpdate</xml line9>
<xml line10>$gXeLoc</xml line10>
GXEOF
unless($gXsuppressFileTag) {
$gXtext .= <<"GXEOF";
<data_file>
<file_name>$gXorgname$gXfileSuffix</file_name>
</data_file>
GXEOF
}
return $gXtext;
}
sub reformatDate {
my $rDdate = shift(@_);
if ($rDdate =~ /([^\/+]+)\/([^\/+]+)\/(\d\d\d\d)/) {
my ($rDmon,$rDday,$rDyear) = ($1,$2,$3);
$rDdate = sprintf('%-04.04d-%-02.02d-%-02.02d', $rDyear, $rDmo
+n, $rDday);
}
return $rDdate;
}
sub html_escape {
my $value = shift(@_);
$value =~ s/\&/\&\;/g;
$value =~ s/([\x00-\x1f\x21-\x25\x27-\x2b\x2f\x3a\x3c-\x40\x5b-\x5e
+\x60\x7b-\xff])/"&#".uc(sprintf "%lu" , unpack("C", $1)).";"/eg;
# Checking ..
print "variable [value] has = $value \n" ;
return $value;
}
sub displayUsage {
print STDERR <<"DU_EOF";
Usage: $0 [-d] [-c csv dir] [-o output dir]
-d Delete CSV
-c CSV dir
-o Output (usually current dir)
DU_EOF
}
The above code successfully parces the csv and create an xml that represent the event inside a separare directory called "eventname.card". Below is the input data and output data.
Input file name "Info_venuename_Eventdata.csv". file content below (Cat "Info_venuename_Eventdata.csv")
PICNIC THEME,ORGANIZER NAME,TYPE,VENUE ADDRESS,EVENT LOCATION,EVENT DE
+TAIL,TOTAL ALLOWED HEADS,MAXIMUM GAME PARTICIPANTS,STAR ATTRACTION,PI
+CNIC DATE,FILENAME,bbbb,cccc,dddd,eeee,gggg,ffff,hhhh,iiii,jjjj,llll,
+kkkk,mmmm,nnnn,oooo,pppp,
summer,xyz corp,web,address-w-zip,san pedro,Summer picnic with celibri
+ty. Be creating and bring your idea. There is sporting event. One mea
+l served at picnic. Bring your water and trash bag. No camera allowed
+.,100,20,Ben Stiller,11/11/2010,summer_picnic,bb,cccc,dddd,eeee,gggg,
+ffff,hhhh,iiii,jjjj,llll,kkkk,mmmm,nnnn,oooo,pppp,
,,,,west harlem,,,,Chris Rock,10/10/2010,,,,,,,,,,,,,,,,,
,,,,san redo,,,,Mr Miller,9/9/2010,,,,,,,,,,,,,,,,,
Output xml file (cat /.../eventdir/"eventDetail.xml")
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://greateventbulatine/event/organizer" xmlns:xsi="
+http://www.w3.org/2001/XMLSchema-instance">
<theme>summer</theme>
<video>
<xml line1>Event Detail</xml line1>
<xml line2>summer</xml line2>
<xml line3>xyz corp</xml line3>
<xml line4>address-w-zip</xml line4>
<xml line5>Summer picnic with celibrity. Be creating and bring
+ your idea. There is sporting event. One meal served at picnic. Bring
+ your water and trash bag. No camera allowed.</xml line5>
<xml line6>100</xml line6>
<xml line7>20</xml line7>
<xml line8>Ben Stiller</xml line8>
<xml line9>2010-11-11</xml line9>
<xml line10>san pedro</xml line10>
<data_file>
<file_name>xyz corp.jpg</file_name>
</data_file>
Puzzle area :
I am not getting how to device the parseCSVLine and readCSV subroutines in this regard to get values from second row and third row and onwards. This was relatively easy to read each row one at time, however when there is more than 1 row present then ideally i need to create a decrementing loop =number of rows and read each row in separate hash table. However i am kind of not making through it yet to read 2nd row onwards. Ideally readCSV need to do the following:
<1> count the number of rows.
<2> if more than 1 --first(primary; only this row has filename of pic ) row read first row to a temphash .
<3> read rest of rows (secondary; 2nd row onward only have addnl location specif info only) to a different temphashes.
<4> Once i get each row in separate temphash i would be fine getting values from them as needed.
<5> there could be as many rows .
Your suggestion is deeply appreciated.(code posted here is tested)