PerlMonks  

Parsing XML with XML::Simple

by madbombX (Hermit)
on Dec 18, 2006 at 00:41 UTC (#590364)
madbombX has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have been at this for much longer than should be necessary. I am using XML::Simple, and I have also tried XML::Twig without success. I have a file that looks like the example below. Each tag is used only once. I know it is possible to use a few regexen, but given how I need to use the data later, I would prefer to have it in a hash similar to the one returned by XML::Simple.

    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    This That <this@that.com>
    Desc: Test file
    </CVS>
    <DATE>2006-12-10</DATE>
    <INTRODUCTION>Blah, <b>blah</b>, blah</INTRODUCTION>
    <TITLE>Foo</TITLE>
    <AUTHOR>Bar</AUTHOR>
    ...
    <ARTICLE>
    <p>foo, test</p>
    <p>bar</p>
    <p>baz</p>
    </ARTICLE>

When I have the above text, all I get is what's inside of <CVS>. When I add <XML> tags surrounding the whole file, all I get is the following using Data::Dumper:

    $VAR1 = \{
              'CVS' => '<The entire file and this is the only tag in the hash with no other tags even making it in here>'
            };

What is the best method to extract the data from this file (with or without an XML module) and pull what I need into a hash? Thanks.

Update: I forgot to include the code I actually tried. I know the $file is correct and @articles is populated.

    foreach my $file (@articles) {
        my $article = XMLin($file, NoAttr => 1);
        use Data::Dumper;
        print "<pre>", Dumper(\$article), "</pre>";
    }

Update 2: I think I needed to provide a better example of the data, so I have updated the data example above. It gives a much better picture of what I am dealing with.

Re: Parsing XML with XML::Simple
by brian_d_foy (Abbot) on Dec 18, 2006 at 00:53 UTC

    This seems to work for me. You mentioned wrapping a root element around it, but did you include the <?xml ...?> bit? Even when I remove that, though, it still works. Does it fail with your sample file?

    #!/usr/bin/perl
    use XML::Simple;

    my $xml = do { local $/; <DATA> };
    my $hash = XMLin( $xml );

    use Data::Dumper;
    print Dumper( $hash );

    __END__
    <?xml version="1.0"?>
    <root>
    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    </CVS>
    <DATE>2006-12-10</DATE>
    <INTRODUCTION>Blah</INTRODUCTION>
    <TITLE>Foo</TITLE>
    <AUTHOR>Bar</AUTHOR>
    <ARTICLE>
    foo bar baz
    </ARTICLE>
    </root>

    It looks like the hash does what it should:

    ]$ perl test
    $VAR1 = {
              'INTRODUCTION' => 'Blah',
              'CVS' => ' $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $ ',
              'ARTICLE' => ' foo bar baz ',
              'TITLE' => 'Foo',
              'AUTHOR' => 'Bar',
              'DATE' => '2006-12-10'
            };

    Update: If I move the data out to a file and use XMLin( $ARGV[0], NoAttr => 1 ) it still works.

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
Re: Parsing XML with XML::Simple
by ferreira (Chaplain) on Dec 18, 2006 at 00:54 UTC

    The problem is that you don't have a well-formed XML file. If it were well-formed,
    there would be a root element, which is the ancestor of every other element. Something like this:

    <root>
    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    </CVS>
    <DATE>2006-12-10</DATE>
    ...
    <ARTICLE>
    foo bar baz
    </ARTICLE>
    </root>

    I think that if you make this simple correction, XML::Simple will work for you. By default,
    the root element is discarded, so at the top level you get a hash with the keys you want:
    CVS, ARTICLE, DATE, etc.

    You could always open your XML files in a typical browser (Firefox, Opera, IE, etc.)
    to see whether they are well-formed or whether an error is reported.

      You hit the nail on the head. Everything I had worked just fine. The problem is that when I ran the file through Firefox, I noticed that at a few points throughout the articles I have author emails in the following format:
      First Last <this@that.com>

      That messed everything up; all my original code actually works. Is there a way around this using XML::Simple or XML::Twig, so that I don't have to go through EVERY file and remove all instances of it?

        The big issue is that if you have First Last <this@that.com> within your XML, you have bad XML. (It should be First Last &lt;this@that.com&gt;.) It is better to fix these files. I am not sure how you ended up with this, because XML::Simple usually escapes these characters on output:

        $ perl -MXML::Simple -e "print XMLout({ a => 'a <b>' })"
        <opt a="a &lt;b&gt;" />

        It could be the version you're using. The example above used

        $ which_pm XML::Simple
        XML::Simple 2.13 c:/tools/apache/Perl/site/lib/XML/Simple.pm
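
For the follow-up question about stray First Last <this@that.com> text, one possible approach is to escape those brackets in memory before the text reaches the parser, rather than editing every file by hand. The sketch below rests on an assumption about what the files contain: it treats any <...> that holds an @ and no whitespace as an address rather than a tag (an XML element name cannot contain @). The name escape_email_brackets is a hypothetical helper, not part of XML::Simple or XML::Twig, and it uses core Perl only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from XML::Simple or XML::Twig):
# turn <user@host> into &lt;user@host&gt; so a parser no longer
# mistakes the address for an element. Real tags are left alone
# because an XML element name cannot contain '@'.
sub escape_email_brackets {
    my ($text) = @_;
    $text =~ s/<([^<>\s\@]+\@[^<>\s\@]+)>/&lt;$1&gt;/g;
    return $text;
}

# Sample line in the same shape as the one in the question.
my $raw = 'This That <this@that.com> wrote <b>this</b>';
print escape_email_brackets($raw), "\n";
# prints: This That &lt;this@that.com&gt; wrote <b>this</b>
```

The cleaned string could then be wrapped in a root element and passed to XMLin as a string, in the same way the __END__ example earlier in the thread passes its slurped data. Fixing the source files remains the more robust option, since this pattern only covers the one kind of breakage described here.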
Re: Parsing XML with XML::Simple
by GrandFather (Sage) on Dec 18, 2006 at 00:54 UTC

    XML::Twig seems to do what you want if you Simplify things a little:

    use strict;
    use warnings;
    use XML::Twig;
    use Data::Dump::Streamer;

    my $twig = XML::Twig->new ();
    $twig->parse (*DATA);
    my $hash = $twig->simplify ();
    Dump ($hash);

    __DATA__
    <XML>
    <CVS>
    $Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp $
    </CVS>
    <DATE>2006-12-10</DATE>
    <INTRODUCTION>Blah</INTRODUCTION>
    <TITLE>Foo</TITLE>
    <AUTHOR>Bar</AUTHOR>
    ...
    <ARTICLE>
    foo bar baz
    </ARTICLE>
    </XML>

    Prints:

    $HASH1 = {
               ARTICLE      => "\nfoo bar baz\n",
               AUTHOR       => 'Bar',
               content      => "\n\$Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp \$\n2006-12-10BlahFo".
                               "oBar\n...\n\nfoo bar baz\n",
               CVS          => "\n\$Id: File_Find.pl,v 1.1 2006-12-17 19:25:03 eric Exp \$\n",
               DATE         => '2006-12-10',
               INTRODUCTION => 'Blah',
               TITLE        => 'Foo'
             };

    DWIM is Perl's answer to Gödel
