PerlMonks  

Parsing XML with XML::Simple

by madbombX (Hermit)
on Dec 18, 2006 at 00:41 UTC (#590364, perlquestion)
madbombX has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have been at this for much longer than should be necessary. I am using XML::Simple, but I have also tried XML::Twig without success. I have a file that looks like the example below. Each tag is used only once. I know it is possible to use a few regexen, but given how I need to use the data later, I would prefer to have it in a hash similar to the one returned by XML::Simple.

<CVS>
$Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
This That <this@that.com>
Desc: Test file
</CVS>
<DATE>2006-12-10</DATE>
<INTRODUCTION>Blah, <b>blah</b>, blah</INTRODUCTION>
<TITLE>Foo</TITLE>
<AUTHOR>Bar</AUTHOR>
...
<ARTICLE>
<p>foo, test</p>
<p>bar</p>
<p>baz</p>
</ARTICLE>

When I have the above text, all I get is what's inside of <CVS>. When I add <XML> tags surrounding the whole file, all I get is the following using Data::Dumper:

$VAR1 = \{
    'CVS' => '<The entire file and this is the only tag in the hash with no other tags even making it in here>'
};

What is the best method to extract the data from this file (either using an XML module or not) and pull what I need into a hash? Thanks.

Update: I forgot to include the code I actually tried. I know the $file is correct and @articles is populated.

foreach my $file (@articles) {
    my $article = XMLin($file, NoAttr => 1);
    use Data::Dumper;
    print "<pre>", Dumper(\$article), "</pre>";
}

Update 2: I think I need to provide a better example of the data. Look at the data example above, which gives a much better picture of what I am dealing with.

Re: Parsing XML with XML::Simple
by brian_d_foy (Abbot) on Dec 18, 2006 at 00:53 UTC

    This seems to work for me. You mentioned wrapping a root element around it, but did you include the <?xml ...?> bit? Even when I remove that, though, it still works. Does it fail with your sample file?

    #!/usr/bin/perl
    use XML::Simple;

    my $xml = do { local $/; <DATA> };
    my $hash = XMLin( $xml );

    use Data::Dumper;
    print Dumper( $hash );

    __END__
    <?xml version="1.0"?>
    <root>
    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    </CVS>
    <DATE>2006-12-10</DATE>
    <INTRODUCTION>Blah</INTRODUCTION>
    <TITLE>Foo</TITLE>
    <AUTHOR>Bar</AUTHOR>
    <ARTICLE>
    foo bar baz
    </ARTICLE>
    </root>

    It looks like the hash does what it should:

    ]$ perl test
    $VAR1 = {
              'INTRODUCTION' => 'Blah',
              'CVS' => ' $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $ ',
              'ARTICLE' => ' foo bar baz ',
              'TITLE' => 'Foo',
              'AUTHOR' => 'Bar',
              'DATE' => '2006-12-10'
            };

    Update: If I move the data out to a file and use XMLin( $ARGV[0], NoAttr => 1 ) it still works.

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
Re: Parsing XML with XML::Simple
by ferreira (Chaplain) on Dec 18, 2006 at 00:54 UTC

    The problem is that you don't have a well-formed XML file. If it were well-formed,
    there would be a single root element that is the ancestor of all the others. Something like this:

    <root>
    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    </CVS>
    <DATE>2006-12-10</DATE>
    ...
    <ARTICLE>
    foo bar baz
    </ARTICLE>
    </root>

    I think if you make this simple correction, XML::Simple will work for you. By default,
    this root element is discarded, so at the first level you'll get a hash with the keys you want:
    CVS, ARTICLE, DATE, etc.

    You could always open your XML files in a typical browser (Firefox, Opera, IE, etc.)
    to see whether they are well-formed or whether an error is reported.
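The root-element fix and the well-formedness check described above can also be tried from Perl instead of a browser. What follows is only a minimal sketch, not code from the thread: `wrap_in_root` is a hypothetical helper, the sample fragment is a cut-down version of the poster's data, and the parsing step is skipped when XML::Simple is not installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap a rootless fragment in a
# single root element so that the parser sees one top-level element.
sub wrap_in_root {
    my ($fragment, $root) = @_;
    $root = 'root' unless defined $root;
    return "<$root>\n$fragment\n</$root>\n";
}

# Cut-down sample data, assumed for illustration.
my $fragment = join "\n",
    '<DATE>2006-12-10</DATE>',
    '<TITLE>Foo</TITLE>',
    '<AUTHOR>Bar</AUTHOR>';

my $xml = wrap_in_root($fragment);

# XML::Simple is not a core module, so load it at run time and fall back
# to printing the wrapped text when it is unavailable.
if (eval { require XML::Simple; 1 }) {
    # XMLin dies on input that is not well-formed, so trap the error.
    my $hash = eval { XML::Simple::XMLin($xml, NoAttr => 1) };
    if ($hash) {
        print "$_ => $hash->{$_}\n" for sort keys %$hash;
    }
    else {
        print "Not well-formed: $@";
    }
}
else {
    print $xml;
}
```

Because the parse is wrapped in `eval`, a file that is still broken after wrapping produces the parser's error message rather than killing the script, which makes it easier to check many files in a loop.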

      You hit the nail on the head. Everything I had worked just fine. The problem is that when I ran the file through Firefox, I noticed that at a few points throughout the articles I have the author emails in the following format:
      First Last <this@that.com>

      That messed everything up and all my original code actually works. Is there a way around that using XML::Simple or XML::Twig so that I don't have to go through EVERY file and remove all instances of that?

        The big issue is that if you have First Last <this@that.com> within your XML, you have bad XML. (It should be First Last &lt;this@that.com&gt;.) It is better to fix these files. I am not sure how you ended up with this, because XML::Simple usually escapes these characters on output:
        $ perl -MXML::Simple -e "print XMLout({ a => 'a <b>' })"
        <opt a="a &lt;b&gt;" />

        It could be the version you're using. The example above used

        $ which_pm XML::Simple
        XML::Simple 2.13 c:/tools/apache/Perl/site/lib/XML/Simple.pm
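If the files do have to be repaired in bulk, one possible approach is to escape only the angle brackets that enclose something shaped like an e-mail address before the text reaches the parser. This is a rough sketch rather than anything posted in the thread: `escape_email_brackets` is a hypothetical helper, and it assumes that no legitimate tag name or attribute in the files contains an `@`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn <user@host> into
# &lt;user@host&gt; while leaving real tags such as <b> or </p> untouched.
# Assumes that a "<...>" run with no whitespace and exactly one "@"
# is always an e-mail address.
sub escape_email_brackets {
    my ($text) = @_;
    $text =~ s/<([^\s<>\@]+\@[^\s<>\@]+)>/&lt;$1&gt;/g;
    return $text;
}

my $line = 'This That <this@that.com> wrote <b>this</b>';
print escape_email_brackets($line), "\n";
# prints: This That &lt;this@that.com&gt; wrote <b>this</b>
```

The escaped text can then be passed to the parser as a string. The same substitution could also be applied to the files themselves with Perl's in-place switches (`-p -i.bak`), which keep a backup copy of each file, but it is worth trying it on copies first.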
Re: Parsing XML with XML::Simple
by GrandFather (Cardinal) on Dec 18, 2006 at 00:54 UTC

    XML::Twig seems to do what you want if you Simplify things a little:

    use strict;
    use warnings;
    use XML::Twig;
    use Data::Dump::Streamer;

    my $twig = XML::Twig->new ();
    $twig->parse (*DATA);
    my $hash = $twig->simplify ();
    Dump ($hash);

    __DATA__
    <XML>
    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    </CVS>
    <DATE>2006-12-10</DATE>
    <INTRODUCTION>Blah</INTRODUCTION>
    <TITLE>Foo</TITLE>
    <AUTHOR>Bar</AUTHOR>
    ...
    <ARTICLE>
    foo bar baz
    </ARTICLE>
    </XML>

    Prints:

    $HASH1 = {
               ARTICLE      => "\nfoo bar baz\n",
               AUTHOR       => 'Bar',
               content      => "\n\$Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp \$\n2006-12-10BlahFo".
                               "oBar\n...\n\nfoo bar baz\n",
               CVS          => "\n\$Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp \$\n",
               DATE         => '2006-12-10',
               INTRODUCTION => 'Blah',
               TITLE        => 'Foo'
             };

    DWIM is Perl's answer to Gödel
