PerlMonks  

Re: XML Parsing from URL

by Anonymous Monk
on Jun 26, 2015 at 21:51 UTC ( [id://1132235]=note )


in reply to SOLVED: XML Parsing from URL

That's an interesting question. Unfortunately I don't have time to test it right now, but I think XML::Twig might be able to help you: it processes documents piece by piece, and it's supposed to be able to read from an IO::Handle object. I just don't (yet) know of an HTTP client that provides one...

Replies are listed 'Best First'.
Re^2: XML Parsing from URL
by jshank (Initiate) on Jul 03, 2015 at 04:30 UTC

    I went with XML::Twig and got a little further (thanks!). Unfortunately it's still upset that the input isn't quite well-formed XML. Example:

    --boundary
    Content-Type: application/xml; charset="UTF-8"
    Content-Length: 478
    <EventNotificationAlert version="1.0" xmlns="http://www.hikvision.com/ver10/XMLSchema">
    <ipAddress>10.1.10.23</ipAddress>
    <portNo>80</portNo>
    <protocol>HTTP</protocol>
    <macAddress>c4:2f:90:00:00:00</macAddress>
    <channelID>1</channelID>
    <dateTime>2015-06-24T19:37:22--8:00</dateTime>
    <activePostCount>0</activePostCount>
    <eventType>videoloss</eventType>
    <eventState>inactive</eventState>
    <eventDescription>videoloss alarm</eventDescription>
    </EventNotificationAlert>
    --boundary
    Content-Type: application/xml; charset="UTF-8"
    Content-Length: 514
    <EventNotificationAlert version="1.0" xmlns="http://www.hikvision.com/ver10/XMLSchema">
    <ipAddress>10.1.10.23</ipAddress>
    <portNo>80</portNo>
    <protocol>HTTP</protocol>
    <macAddress>c4:2f:90:00:00:00</macAddress>
    <channelID>1</channelID>
    <dateTime>2015-06-24T19:37:22--8:00</dateTime>
    <activePostCount>1</activePostCount>
    <eventType>VMD</eventType>
    <eventState>active</eventState>
    <eventDescription>Motion alarm</eventDescription>
    <DetectionRegionList>
    </DetectionRegionList>
    </EventNotificationAlert>

      This "boundary" stuff and the two "Content" lines look like HTTP multipart POST data (RFC 2388) to me. On the other hand, HTTP POST data should also have a "Content-Disposition" header with a name attribute after each boundary.

      Is this real data or shortened? Where does the data come from?

      In an HTTP context, I would expect some library to parse the HTTP data and provide it in a more accessible form. For example, using the classic CGI module, each XML document would be available by its parameter name using the param() or upload() methods.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Thanks Alexander. This is real data coming from a Hikvision IP camera. There is a goo.gl shortlink in the original code if you want to look at the documentation for the interface. LWP was able to handle the multipart properly, and I then parse out the XML portion.
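For readers who want to see what "parse out the XML portion" can look like without any HTTP library, here is a minimal sketch in core Perl. It assumes the whole multipart body is already in a string, that the boundary string is literally "boundary" as in the sample above, and that each part has a blank line between its headers and its body. The helper name split_multipart and the shortened sample documents are made up for illustration; this is not the code jshank used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a multipart body into its parts.
# $boundary is the boundary string WITHOUT the leading "--".
sub split_multipart {
    my ($raw, $boundary) = @_;
    my @parts;
    # Every "--boundary" delimiter line starts a new part.
    for my $chunk (split /^--\Q$boundary\E(?:--)?[ \t]*\r?\n/m, $raw) {
        next unless $chunk =~ /\S/;    # skip the empty leading field
        # Headers end at the first empty line (assumed to be present).
        my ($head, $body) = split /\r?\n\r?\n/, $chunk, 2;
        next unless defined $body;
        my %header;
        for my $line (split /\r?\n/, $head) {
            my ($name, $value) = split /:\s*/, $line, 2;
            $header{ lc $name } = $value if defined $value;
        }
        # Trust Content-Length when present; it counts bytes as sent.
        my $len = $header{'content-length'};
        $body = substr $body, 0, $len
            if defined $len && $len <= length $body;
        push @parts, { headers => \%header, body => $body };
    }
    return @parts;
}

# Made-up, shortened sample with CRLF line endings.
my $crlf = "\r\n";
my @docs = (
    '<EventNotificationAlert><eventType>videoloss</eventType></EventNotificationAlert>',
    '<EventNotificationAlert><eventType>VMD</eventType></EventNotificationAlert>',
);
my $raw = join '', map {
      "--boundary$crlf"
    . "Content-Type: application/xml; charset=\"UTF-8\"$crlf"
    . 'Content-Length: ' . length($_) . $crlf . $crlf
    . $_ . $crlf
} @docs;

my @parts = split_multipart($raw, 'boundary');
printf "%d bytes: %s\n", length $_->{body}, $_->{body} for @parts;
```

Each element of @parts then holds one complete document in {body}, which can be handed to any XML parser on its own. On a live stream that never ends you would have to do the same thing incrementally instead of on one big string.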

      This is a very interesting problem, but difficult to experiment with. Anyway, you can try to use twig_roots, or you can try to preprocess your input.
      In fact I see a declared length in the header: it should be possible to read only as many bytes as Content-Length declares and pass that chunk to XML::Twig to be processed.
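As a rough illustration of the twig_roots idea, here is a minimal sketch that assumes one document has already been isolated from the stream. The sample is shortened from the camera data in this thread, and the %event hash and its keys are hypothetical names chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# One already-isolated document (shortened from the camera sample).
my $xml = <<'XML';
<EventNotificationAlert version="1.0" xmlns="http://www.hikvision.com/ver10/XMLSchema">
<ipAddress>10.1.10.23</ipAddress>
<eventType>VMD</eventType>
<eventState>active</eventState>
</EventNotificationAlert>
XML

my %event;    # hypothetical container for the values we care about
my $twig = XML::Twig->new(
    # Only the elements named here are built into twigs;
    # each handler receives the twig and the matched element.
    twig_roots => {
        eventType  => sub { $event{type}  = $_[1]->text },
        eventState => sub { $event{state} = $_[1]->text },
    },
);
$twig->parse($xml);

print "$event{type} is $event{state}\n";
```

With twig_roots the rest of the document is skipped rather than kept in memory, which matters more for large inputs than for alerts of a few hundred bytes like these.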

      Maybe you can post a specific XML::Twig question as a new SOPW; the author of the module lurks here sometimes.
      L*

      UPDATE: you can also read this interesting article

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
        Maybe something ugly like this (I modified the data to get rid of the --boundary lines):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Twig;

        $|++;    # unbuffered output

        # The sample is read from DATA; to read from a file instead,
        # open it and use that handle in place of DATA.
        # Content-Length counts bytes as sent. The 478 and 514 below appear
        # to assume CRLF line endings, so keep those in the sample or
        # read() will run past the end of the document.
        while (<DATA>) {
            chomp;
            if (/Content-Length: (\d+)/) {
                my $len = $1;
                my $xml;
                # Read exactly the declared number of bytes.
                my $read = read(DATA, $xml, $len, 0);
                print "read: tried $len got $read: [$xml]\n";
                my $t = XML::Twig->new(
                    pretty_print  => 'indented',
                    twig_handlers => {
                        'ipAddress' => sub { print "\t\tIP ADDRESS:\t", $_[1]->text, "\n"; },
                    },
                );
                $t->parse($xml);
            }
        }
        __DATA__
        Content-Type: application/xml; charset="UTF-8"
        Content-Length: 478
        <EventNotificationAlert version="1.0" xmlns="http://www.hikvision.com/ver10/XMLSchema">
        <ipAddress>10.1.10.23</ipAddress>
        <portNo>80</portNo>
        <protocol>HTTP</protocol>
        <macAddress>c4:2f:90:00:00:00</macAddress>
        <channelID>1</channelID>
        <dateTime>2015-06-24T19:37:22--8:00</dateTime>
        <activePostCount>0</activePostCount>
        <eventType>videoloss</eventType>
        <eventState>inactive</eventState>
        <eventDescription>videoloss alarm</eventDescription>
        </EventNotificationAlert>
        Content-Type: application/xml; charset="UTF-8"
        Content-Length: 514
        <EventNotificationAlert version="1.0" xmlns="http://www.hikvision.com/ver10/XMLSchema">
        <ipAddress>10.1.10.23</ipAddress>
        <portNo>80</portNo>
        <protocol>HTTP</protocol>
        <macAddress>c4:2f:90:00:00:00</macAddress>
        <channelID>1</channelID>
        <dateTime>2015-06-24T19:37:22--8:00</dateTime>
        <activePostCount>1</activePostCount>
        <eventType>VMD</eventType>
        <eventState>active</eventState>
        <eventDescription>Motion alarm</eventDescription>
        <DetectionRegionList>
        </DetectionRegionList>
        </EventNotificationAlert>


        HtH
        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
