http://www.perlmonks.org?node_id=46517

redbeard has asked for the wisdom of the Perl Monks concerning the following question: (regular expressions)

I have an XML document with multiple repetitive fields all in a single string, as obtained from a web service using LWP::Simple::get.

I would like to parse out those multiple repetitive values (e.g. <email> and <name>) and put them in their own arrays (e.g. @email and @name).

Example document content:

# Single-quoted heredoc, so that @foo, @bar and @baz are not interpolated
$xml = <<'EOF';
<xml>
<email>toto@foo.com</email><name>Toto</name>
<email>tata@bar.com</email><name>Tata</name>
<email>tutu@baz.com</email><name>Tutu</name>
</xml>
EOF

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: Strip that XML!
by mirod (Canon) on Dec 14, 2000 at 04:46 UTC

    Using XML::Twig:

    use XML::Twig;

    # Single-quoted heredoc, so that @foo, @bar and @baz are not interpolated
    my $xml = <<'EOF';
    <xml>
    <email>toto@foo.com</email><name>Toto</name>
    <email>tata@bar.com</email><name>Tata</name>
    <email>tutu@baz.com</email><name>Tutu</name>
    </xml>
    EOF

    my @email;
    my @name;

    my $twig = XML::Twig->new(
        TwigHandlers => {
            email => sub { push @email, $_[1]->text; },   # $_[1] is the element
            name  => sub { push @name,  $_[1]->text; },
        }
    );

    $twig->parse( $xml );

    print "email: @email\n";
    print "name: @name\n";

Re: Strip that XML!
by mirod (Canon) on Dec 14, 2000 at 12:00 UTC

    Using XML::Parser (basic mode, no style):

    #!/bin/perl -w
    use strict;
    use XML::Parser;

    my @email;
    my @name;
    my $stored_content = '';            # global used to store text

    sub start {
        my( $expat, $gi, %atts ) = @_;
        $stored_content = '';           # reset
    }

    sub char {
        my( $expat, $string ) = @_;
        $stored_content .= $string;     # accumulate
    }

    sub end {
        my( $expat, $gi ) = @_;
        # now we can do some "real" processing with the element content
        push @name,  $stored_content if( $gi eq 'name' );
        push @email, $stored_content if( $gi eq 'email' );
        $stored_content = '';           # reset
    }

    # create the parser
    my $parser = XML::Parser->new(
        Handlers => {
            Start => \&start,   # called for each start tag
            Char  => \&char,    # called for all text (including \n between tags)
            End   => \&end,     # called for each end tag
        }
    );

    # Single-quoted heredoc, so that @foo, @bar and @baz are not interpolated
    # (under "use strict" the interpolating form would not even compile)
    my $xml = <<'EOF';
    <xml>
    <email>toto@foo.com</email><name>Toto</name>
    <email>tata@bar.com</email><name>Tata</name>
    <email>tutu@baz.com</email><name>Tutu</name>
    </xml>
    EOF

    $parser->parse( $xml );

    print "email: @email\n";
    print "name: @name\n";

Re: Strip that XML!
by myocom (Deacon) on Dec 14, 2000 at 03:55 UTC

    I'd look to something like XML::Parser with a quickness. No sense in reinventing the wheel.

    Originally posted as a Categorized Answer.

Re: Strip that XML!
by davorg (Chancellor) on Dec 14, 2000 at 13:57 UTC

    If you're parsing XML, then you should really be using XML::Parser (or one of its subclasses) to do it. Any regex-based solution is bound to fail at some point.
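
    To make the point concrete, here is a minimal sketch of how the obvious regex breaks. The input is hypothetical (it is not from the original question): the same kind of data, but with an attribute on one tag and a CDATA section in another, both of which are still well-formed XML.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: three <email> elements written three different ways.
    my $xml = join '',
        '<xml>',
        '<email type="work">toto@foo.com</email>',    # attribute on the tag
        '<email><![CDATA[tata@bar.com]]></email>',    # CDATA section
        '<email>tutu@baz.com</email>',                # the plain form
        '</xml>';

    # The naive pattern only recognises the exact string "<email>".
    my @email = $xml =~ /<email>(.*?)<\/email>/g;

    # The first element is skipped entirely, and the second comes back
    # with its CDATA markup still attached.
    print scalar(@email), " matched: @email\n";
    ```

    Only two of the three addresses are found, and one of those is not usable as it stands. A real parser treats all three forms the same way.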

    Originally posted as a Categorized Answer.

Re: Strip that XML!
by Ovid (Cardinal) on Dec 14, 2000 at 03:53 UTC
    Without knowing the structure of the XML document, it would be tough to answer the question. However, you may wish to learn about Parse::RecDescent for working with complex data. There is also a humorous tutorial for it.

    You can also check out Email::Find for getting the e-mail. It's not perfect, but the RFC822 specification for e-mail addresses is so broad that it's tough to match accurately.

    Originally posted as a Categorized Answer.

Re: Strip that XML!
by ruzam (Curate) on Feb 20, 2009 at 21:11 UTC

    Not an answer, but in all the answer examples given, the XML source is not proper XML (at least for this purpose). Sure, there are 'email' tags and there are 'name' tags, but there's no 'email/name' container gluing the two together. At best you can only guarantee that the sample can be parsed into an array of emails and an array of names; assuming that there is an association between an email and a name is eventually going to bite you.

    Here's a 'functionally' equivalent source sample:
    my $_='<xml> <email>toto@foo.com</email> <email>tata@bar.com</email> <email>tutu@baz.com</email> <name>Toto</name> <name>Tata</name> <name>Tutu</name> </xml>';
    And so is this:
    my $_='<xml> <email>tata@bar.com</email> <email>toto@foo.com</email> <name>Tutu</name> <email>tutu@baz.com</email> <name>Toto</name> <name>Tata</name> </xml>';

    The examples above are both valid representations of the original source sample and I'm pretty sure XML parsers are within their design rights to re-organize the data as they see fit.

    A better source for example answer code would be this:
    my $_='<xml> <user><email>toto@foo.com</email><name>Toto</name></user> <user><email>tata@bar.com</email><name>Tata</name></user> <user><email>tutu@baz.com</email><name>Tutu</name></user> </xml>';
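
    With a container like that, each record can be read as a unit. The following is only a sketch, assuming XML::Twig is installed; the @users array and its hash keys are names made up for this illustration.

    ```perl
    use strict;
    use warnings;
    use XML::Twig;

    my $xml = '<xml> <user><email>toto@foo.com</email><name>Toto</name></user> <user><email>tata@bar.com</email><name>Tata</name></user> <user><email>tutu@baz.com</email><name>Tutu</name></user> </xml>';

    my @users;    # one hash per <user>, so each email stays with its name

    my $twig = XML::Twig->new(
        twig_handlers => {
            # called once for every complete <user> element
            user => sub {
                my ( $t, $user ) = @_;
                push @users, {
                    email => $user->first_child_text('email'),
                    name  => $user->first_child_text('name'),
                };
            },
        },
    );
    $twig->parse($xml);

    print "$_->{name} <$_->{email}>\n" for @users;
    ```

    The pairing now comes from the document structure rather than from the position of the values in two parallel arrays.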

    Originally posted as a Categorized Answer.

Re: Strip that XML!
by ramrod (Curate) on Feb 20, 2009 at 20:38 UTC
    Here's another easy method:

    # Single-quoted heredoc, so that @foo, @bar and @baz are not interpolated
    $_ = <<'EOF';
    <xml>
    <email>toto@foo.com</email><name>Toto</name>
    <email>tata@bar.com</email><name>Tata</name>
    <email>tutu@baz.com</email><name>Tutu</name>
    </xml>
    EOF

    my @email = /<email>(.*?)<\/email>/g;
    my @name  = /<name>(.*?)<\/name>/g;

    (Note that it leaves XML entities undecoded, as noted in the reply.)
      That doesn't decode entities (it's broken).
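
    A short sketch of what "doesn't decode entities" means in practice. The record is hypothetical, and the comparison assumes XML::Twig is installed; any real XML parser should behave the same way on this input.

    ```perl
    use strict;
    use warnings;
    use XML::Twig;

    # Hypothetical record whose name contains an escaped ampersand.
    my $xml = '<xml><name>Tom &amp; Jerry</name></xml>';

    # The regex returns the raw markup, entity and all.
    my ($raw) = $xml =~ /<name>(.*?)<\/name>/;

    # A parser hands back the decoded text.
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    my $decoded = $twig->root->first_child_text('name');

    print "regex:  $raw\n";        # Tom &amp; Jerry
    print "parser: $decoded\n";    # Tom & Jerry
    ```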
Re: Strip that XML!
by perl_addict (Initiate) on Apr 07, 2011 at 06:27 UTC

    Just a small correction to the Perl snippet above. As far as I understand, we cannot declare $_ with "my", since it is a special global variable; the code will throw a compilation error in that case. We need to declare $_ with "local" for it to work properly.

    I would rather suggest using the following code:

    use strict;

    my $str_xml='<xml> <email>toto@foo.com</email><name>Toto</name> <email>tata@bar.com</email><name>Tata</name> <email>tutu@baz.com</email><name>Tutu</name> </xml>';

    my @email = ($str_xml =~ /<email>(.*?)<\/email>/g);
    my @name = ($str_xml =~ /<name>(.*?)<\/name>/g);

    print "email: @email\n";
    print "name: @name\n";

      FWIW, with a recent enough version of perl (5.10 or later), you can have my $_. (Lexical $_ was later made experimental in 5.18 and removed again in 5.24.)
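
    For the "local" approach, a minimal sketch could look like this. The sub name extract_emails and the sample strings are made up for the illustration; the only point is that local saves the caller's $_ and restores it afterwards.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: match against $_ without clobbering the caller's copy.
    sub extract_emails {
        my ($xml) = @_;
        local $_ = $xml;    # previous value of $_ is restored when the sub returns
        return /<email>(.*?)<\/email>/g;
    }

    $_ = 'caller data';
    my @email = extract_emails(
        '<xml><email>toto@foo.com</email><email>tata@bar.com</email></xml>'
    );

    print "email: @email\n";
    print "\$_ is still: $_\n";
    ```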