Hi Guys,
I'm pulling XML data from a REST API feed and I need to convert said data to CSV for entry in a database. The API is for web analytics data which spits out slightly different XML files depending upon the report chosen. I want my conversion script to be able to handle slight variations in the XML schema.
First, here's the current state of my code. It works for reports which have only 2 dimensions. If a report has more dimensions, additional 'DataRow' nodes are added as children. As the Xpaths change, my script breaks, printing only the first 2 dimensions and no data. I'm a using a loop per level, so I'm hoping there's a cleaner way to go about it than having 5+ nested loops.
perl <<'EOF'
use warnings;
use strict;
use XML::LibXML;
my($data) = './work/data.xml';
my($file) = './out/data.csv';
open(my $out, '>', $file) or die "Could not open file '$file' $!";
my($parser) = XML::LibXML->new();
my $xmldata = $parser->parse_file($data);
my @childarray;
#The first 2 loop blocks print column headings
#Dimensions
###########################
#Fixed Date Heading
###########################
push( @childarray, qq("WTDate"));
for my $node ( $xmldata->findnodes( '/DimensionalReport/ReportDefiniti
+on/list/Dimension' )) {
for my $dimension ($node->findnodes('*[2]')) {
my $child = $dimension->to_literal;
if ($dimension->nodeName eq "string") {
$child = qq("$child");
}
push( @childarray, $child);
}
}
#Measures
for my $node ( $xmldata->findnodes( '/DimensionalReport/ReportDefiniti
+on/list/Measure' )) {
for my $measure ($node->findnodes('*[1]')) {
my $child = $measure->to_literal;
if ($measure->nodeName eq "string") {
$child = qq("$child");
}
push( @childarray, $child);
}
}
print $out join(",", @childarray), "\n";
undef @childarray;
#Print Data
#Records
for my $node ( $xmldata->findnodes( '/DimensionalReport/list/DataRow'
+)) {
for my $lvldim ($node->findnodes('./list/list/DataRow')) {
my $dim1 = $node->getAttribute("name");
if ($node->nodeName eq "string" or $node->nodeName eq "list")
+{
$dim1 = qq("$dim1");
}
push( @childarray, $dim1);
my $child = $lvldim->getAttribute("name");
if ($lvldim->nodeName eq "string" or $lvldim->nodeName eq "lis
+t") {
$child = qq("$child");
}
push( @childarray, $child);
for my $lvlmeas ($lvldim->findnodes('./list/*')) {
my $child = $lvlmeas->to_literal;
if ($lvlmeas->nodeName eq "string" or $lvlmeas->nodeName e
+q "list") {
$child = qq("$child");
}
push( @childarray, $child);
}
print $out join(",", @childarray), "\n";
undef @childarray;
}
}
close $out;
EOF
}
Now I will provide two data examples. Here is a simple one:
https://ezcrypt.it/jk6n#4qRm2gc3F7f0RnOMqL5bPaYl
Expected output should be something like this:
"Time Period","Active Visits","Page Views","Clickthroughs","Daily Visi
+tors","Weekly Visitors","Monthly Visitors","Quarterly Visitors","Year
+ly Visitors","Single Page View Visits","Entry Page Visits","Bounce Ra
+te"
1/1/2013,9609.00,36456.00,604.00,9265.00,8948.00,9265.00,9265.00,9265.
+00,4444.00,9608.00,46.25
1/2/2013,22088.00,76762.00,10291.00,21460.00,21088.00,21263.00,21263.0
+0,21263.00,12022.00,22054.00,54.51
And, something with a few more dimensions:
https://ezcrypt.it/kk6n#eVINakVb1teCRmxKvdbtirFi
And the goal output:
"WTDate","DMA","Most Recent Campaign Demand Channel","Most Recent Camp
+aign Partner","Most Recent Campaign Marketing Program","Most Recent C
+ampaign Marketing Activity","Most Recent Campaign Description","Most
+Recent Campaign ID","Visits","Page Views","Clickthroughs","Orders","R
+evenue","Average Revenue per Order","Units","Average Units per Order"
+,"Average Visit Duration (Minutes)","Average Visit Page Views","Hits"
+,"Daily Campaign Visitors","Weekly Campaign Visitors","Monthly Campai
+gn Visitors","Quarterly Campaign Visitors","Yearly Campaign Visitors"
+,"New Campaign Visitors"
2013-01-01,"500 (Portland-Auburn:ME-NH)","online","null","null","null"
+,"null","ban_384938",384,4859,38,3,4958,50.00,4,2,2.3,3.1,3844,200,40
+0,500,593,2,203
2013-01-01,"500 (Portland-Auburn:ME-NH)","online","null","null","null"
+,"null","ban_384950",390,4859,38,3,4958,50.00,4,2,2.3,3.1,3844,200,40
+0,500,593,2,203
2013-01-01,"501 (New York:CT-NJ-NY-PA)","online","null","null","null",
+"null","ban_384950",35,200,38,3,4958,50.00,4,2,2.3,3.1,3844,200,400,5
+00,593,2,203
I apologize for my sloppy code; this is my first perl script. Any feedback to push me in the right direction would be much appreciated.
Thanks