Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling

LibXml - Convert XML to CSV from API Feed

by Devon (Acolyte)
on Apr 09, 2013 at 02:24 UTC ( #1027624=perlquestion: print w/replies, xml ) Need Help??
Devon has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,

I'm pulling XML data from a REST API feed and I need to convert said data to CSV for entry in a database. The API is for web analytics data which spits out slightly different XML files depending upon the report chosen. I want my conversion script to be able to handle slight variations in the XML schema.

First, here's the current state of my code. It works for reports which have only 2 dimensions. If a report has more dimensions, additional 'DataRow' nodes are added as children. As the Xpaths change, my script breaks, printing only the first 2 dimensions and no data. I'm a using a loop per level, so I'm hoping there's a cleaner way to go about it than having 5+ nested loops.
perl <<'EOF' use warnings; use strict; use XML::LibXML; my($data) = './work/data.xml'; my($file) = './out/data.csv'; open(my $out, '>', $file) or die "Could not open file '$file' $!"; my($parser) = XML::LibXML->new(); my $xmldata = $parser->parse_file($data); my @childarray; #The first 2 loop blocks print column headings #Dimensions ########################### #Fixed Date Heading ########################### push( @childarray, qq("WTDate")); for my $node ( $xmldata->findnodes( '/DimensionalReport/ReportDefiniti +on/list/Dimension' )) { for my $dimension ($node->findnodes('*[2]')) { my $child = $dimension->to_literal; if ($dimension->nodeName eq "string") { $child = qq("$child"); } push( @childarray, $child); } } #Measures for my $node ( $xmldata->findnodes( '/DimensionalReport/ReportDefiniti +on/list/Measure' )) { for my $measure ($node->findnodes('*[1]')) { my $child = $measure->to_literal; if ($measure->nodeName eq "string") { $child = qq("$child"); } push( @childarray, $child); } } print $out join(",", @childarray), "\n"; undef @childarray; #Print Data #Records for my $node ( $xmldata->findnodes( '/DimensionalReport/list/DataRow' +)) { for my $lvldim ($node->findnodes('./list/list/DataRow')) { my $dim1 = $node->getAttribute("name"); if ($node->nodeName eq "string" or $node->nodeName eq "list") +{ $dim1 = qq("$dim1"); } push( @childarray, $dim1); my $child = $lvldim->getAttribute("name"); if ($lvldim->nodeName eq "string" or $lvldim->nodeName eq "lis +t") { $child = qq("$child"); } push( @childarray, $child); for my $lvlmeas ($lvldim->findnodes('./list/*')) { my $child = $lvlmeas->to_literal; if ($lvlmeas->nodeName eq "string" or $lvlmeas->nodeName e +q "list") { $child = qq("$child"); } push( @childarray, $child); } print $out join(",", @childarray), "\n"; undef @childarray; } } close $out; EOF }
Now I will provide two data examples. Here is a simple one:

Expected output should be something like this:
"Time Period","Active Visits","Page Views","Clickthroughs","Daily Visi +tors","Weekly Visitors","Monthly Visitors","Quarterly Visitors","Year +ly Visitors","Single Page View Visits","Entry Page Visits","Bounce Ra +te" 1/1/2013,9609.00,36456.00,604.00,9265.00,8948.00,9265.00,9265.00,9265. +00,4444.00,9608.00,46.25 1/2/2013,22088.00,76762.00,10291.00,21460.00,21088.00,21263.00,21263.0 +0,21263.00,12022.00,22054.00,54.51

And, something with a few more dimensions:

And the goal output:
"WTDate","DMA","Most Recent Campaign Demand Channel","Most Recent Camp +aign Partner","Most Recent Campaign Marketing Program","Most Recent C +ampaign Marketing Activity","Most Recent Campaign Description","Most +Recent Campaign ID","Visits","Page Views","Clickthroughs","Orders","R +evenue","Average Revenue per Order","Units","Average Units per Order" +,"Average Visit Duration (Minutes)","Average Visit Page Views","Hits" +,"Daily Campaign Visitors","Weekly Campaign Visitors","Monthly Campai +gn Visitors","Quarterly Campaign Visitors","Yearly Campaign Visitors" +,"New Campaign Visitors" 2013-01-01,"500 (Portland-Auburn:ME-NH)","online","null","null","null" +,"null","ban_384938",384,4859,38,3,4958,50.00,4,2,2.3,3.1,3844,200,40 +0,500,593,2,203 2013-01-01,"500 (Portland-Auburn:ME-NH)","online","null","null","null" +,"null","ban_384950",390,4859,38,3,4958,50.00,4,2,2.3,3.1,3844,200,40 +0,500,593,2,203 2013-01-01,"501 (New York:CT-NJ-NY-PA)","online","null","null","null", +"null","ban_384950",35,200,38,3,4958,50.00,4,2,2.3,3.1,3844,200,400,5 +00,593,2,203

I apologize for my sloppy code; this is my first perl script. Any feedback to push me in the right direction would be much appreciated.


Replies are listed 'Best First'.
Re: LibXml - Convert XML to CSV from API Feed
by Anonymous Monk on Apr 09, 2013 at 03:49 UTC
Re: LibXml - Convert XML to CSV from API Feed
by Anonymous Monk on Apr 10, 2013 at 11:03 UTC
      I haven't had a chance to work on this in a while, but I just wanted to say I appreciate the responses.

      I should have time to flesh it out this week and will post back with the final product.

      Thanks again

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1027624]
Approved by marto
erix wonders what this question is really about: Best way to store/access large dataset?
[hippo]: Intriguing, isn't it?
[erix]: one would need the actuall query / queries
[erix]: he has obfuscated it effectively

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (10)
As of 2018-06-22 14:40 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (124 votes). Check out past polls.