Re: Convert XML To Perl Data Structures Using XML::Twig

by mirod (Canon)
on May 25, 2011 at 07:19 UTC


in reply to Convert XML To Perl Data Structures Using XML::Twig

First, have a look at XML::Twig's simplify method, which is compatible with XML::Simple's XMLin; that might help.
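A minimal sketch of what that could look like, assuming a one-message document like the mock-up later in this thread. The variable names are illustrative, and the exact shape of the result depends on the XMLin-style options you pass, so dump it and check before relying on it:

```perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

# Sample document borrowed from the mock-up log later in the thread
my $xml = '<Message><Tag attribute="value">Answer</Tag></Message>';

my $twig = XML::Twig->new;
$twig->parse($xml);

# simplify accepts the same options as XML::Simple's XMLin;
# none are passed here, so the defaults apply
my $data = $twig->simplify;

# Inspect the structure rather than assuming its layout
print Dumper($data);
```

If the default layout is not what you need, the XML::Simple documentation for XMLin describes the options that change it.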

Then if simplify is not what you are looking for, the "official" way to pass parameters to a handler is to use a closure. I am not sure why you deem this to be "not very elegant". It is a widely used technique, described for example in Achieving Closure (nearly 9 years ago!).

The code would look like this:

my $data; # a ref to the data structure

XML::Twig->new( twig_handlers => { foo => sub { foo( @_, $data); } })
         ->parsefile( 'my.xml');

sub foo {
    my( $twig, $foo, $data)= @_;
    # update the data referenced by $data
}

It is even a FAQ: "I want to pass additional arguments to XML::Twig handlers, not just the twig and the element, and I'd rather not use global variables. Can I do this?" I will add a paragraph about it to the main docs of the module, though.

Re^2: Convert XML To Perl Data Structures Using XML::Twig
by Limbic~Region (Chancellor) on May 25, 2011 at 13:34 UTC
    mirod,
    I love closures. See How A Function Becomes Higher Order, and Understanding And Using Iterators for examples ;-)

    I believe I did a poor job of explaining my goal and my hangup. I am processing a log with millions of XML messages. Each message must be converted to a distinct perl data structure. While I can see several ways of accomplishing this, none of them seem to let me have my cake and eat it too.

    To use a closure in the way you describe, I would need a factory to create a brand new closure for each message, and then either create a new XML::Twig object for each message or call $twig->setTwigHandlers() between calls to $twig->parse(). The alternative would be to leave the XML::Twig object alone and perform a deep copy and "reset" of the reference that was closed over between messages.

    My compromise, which I am fine with, was to write my own dispatch table so that I could do something akin to:

    # ...
    my $twig = XML::Twig->new();
    while (<$fh>) {
        chomp;
        my $msg = {};
        $twig->parse($_);
        for my $child ($twig->root->children) {
            my $handler = $child->tag;
            if ($dispatch{$handler}) {
                $dispatch{$handler}->($child, $msg);
            }
            else {
                die "Haven't written handler for '$handler' yet";
            }
        }
        # do something with $msg
    }

    I asked for advice here to make sure I wasn't missing anything obvious. I will certainly check out simplify.

    Cheers - L~R

      Show me a few example messages and the desired datastructure and let's see how it goes with XML::Rules ... and whether you like the resulting code. This really looks like a perfect task for that module.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        mirod,
        I can't share the actual data (work) but I think the following might make things a little more clear. If not, then I will live happily with the solution that I am currently constructing.

        Mock up of the log file that I am working with:

        2011-04-28 13:25:47 INFO [main:114] <Message><Tag attribute="value">Answer</Tag></Message>
        2011-04-28 13:45:12 DEBUG [Populate::List:31] <Message><Tag attribute="value">Answer</Tag></Message>

        In other words, it is a standard Log4J log in which each log entry is an XML document. I am parsing the log with code similar to the following:

        while (<$fh>) {
            chomp;
            my ($date, $time, $log_lvl, $trace, $xml) = split ' ', $_, 5;
        }
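As a minimal sketch of what that split does to one line of the mock-up (the $line variable is only for illustration): the limit of 5 is what keeps the XML document in one piece, even though it contains a space of its own.

```perl
use strict;
use warnings;

# One line from the mock-up log; note the space inside the XML payload
my $line = '2011-04-28 13:25:47 INFO [main:114] '
         . '<Message><Tag attribute="value">Answer</Tag></Message>';

# split ' ' breaks on runs of whitespace. With a limit of 5, splitting stops
# after four separators, so the remainder of the line (the whole XML
# document, internal spaces included) ends up in $xml.
my ($date, $time, $log_lvl, $trace, $xml) = split ' ', $line, 5;

print "level: $log_lvl\n";
print "trace: $trace\n";
print "xml:   $xml\n";
```

Without the limit, the split would also break the payload at the space before the attribute, and $xml would hold only the first fragment.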

        For each XML document, I need to convert it to a perl data structure and do something with it. That would look something like:

        my $twig = XML::Twig->new();
        while (<$fh>) {
            chomp;
            my ($date, $time, $log_lvl, $trace, $xml) = split ' ', $_, 5;
            my %data_structure;
            $twig->parse($xml);
            # Build up %data_structure using $twig
        }

        I could easily change this code to be "elegant" as such:

        while (<$fh>) {
            chomp;
            my ($date, $time, $log_lvl, $trace, $xml) = split ' ', $_, 5;
            my $data_structure = extract_data($xml);
        }

        sub extract_data {
            my ($xml) = @_;
            my $data = {};
            my $twig = XML::Twig->new(
                twig_handlers => {
                    Message => sub { handle_message(@_, $data) }
                }
            );
            $twig->parse($xml);
            return $data;
        }

        sub handle_message {
            # ...
        }

        There is absolutely nothing wrong with this. I haven't profiled it to see whether it is fast enough, but speed is my concern: I would like to inline as much as possible. Now that I have laid it all out, I realize that if someone else were asking this question, I would tell them to quit being falsely lazy, write it in a clear, maintainable way, profile it, and only worry about performance if it turned out to be unacceptable.

        Cheers - L~R
