http://www.perlmonks.org?node_id=861077

ethrbunny has asked for the wisdom of the Perl Monks concerning the following question:

I have an app that downloads thousands of XML files every night. Many of these have small errors (they come from an encrypted source) that I'm trying to clean up before I parse them. Each file is checked line by line for noise.

If I have n possible tags in a file, each with a different list of attributes, is there a regex that could be used to look for missing attributes? I.e., if I have <cat tail='text' meow='text'/> and <dog tail='text' bark='text'/>, can I find instances of 'cat' that don't have 'meow' without discarding 'dog' entries? (Assume that each line in the file is a single, closed XML statement and that tags aren't nested.)

Re: XML cleanup - regex or ?
by ikegami (Patriarch) on Sep 21, 2010 at 16:23 UTC

    XPath '//cat[not(@meow)]' will identify such nodes.

    XML::LibXML example:

    use strict;
    use warnings;
    use XML::LibXML qw( );

    my $xml = <<'__EOI__';
    <root>
    <cat tail='text' meow='text'/>
    <cat tail='text' meow=''/>
    <cat tail='text'/>
    <dog tail='text' bark='text'/>
    </root>
    __EOI__

    my $doc  = XML::LibXML->new()->parse_string($xml);
    my $root = $doc->documentElement();
    for my $node ($root->findnodes('//cat[not(@meow)]')) {
        $node->setAttribute(meow => 'default');
    }
    print $doc->toString();

    Output:

    <?xml version="1.0"?>
    <root>
    <cat tail="text" meow="text"/>
    <cat tail="text" meow=""/>
    <cat tail="text" meow="default"/>
    <dog tail="text" bark="text"/>
    </root>

    Or if you prefer, $node->parentNode()->removeChild($node); would remove the offending node.

Re: XML cleanup - regex or ?
by Utilitarian (Vicar) on Sep 21, 2010 at 14:44 UTC
    It sounds as though you wish to verify that the XML conforms to a particular DTD specification.

    XML::Rules looks like a good fit for your needs.

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
      The issue is that the file won't parse at all until I remove the noisy entries. I comb through it line by line looking for errors before I pass it to XML::Twig.
        It's not an XML error to be missing a meow attribute on a cat element. Did you misrepresent the problem, or are you actually getting the error when you validate the XML against a schema? There's no reason you can't do that after removing the offending elements.
Re: XML cleanup - regex or ?
by dasgar (Priest) on Sep 21, 2010 at 14:46 UTC

    Here's one approach, if you want to do it by hand.

    • Slurp the file into a variable.
    • Grab all of the 'cat' tags and put them in an array.
      my (@cats) = ($file =~ m/(<cat.+?\/>)/ig);
    • Then find the ones that have the missing attributes.
      foreach my $cat (@cats) {
          if ($cat !~ m/meow=.+?/i) {
              # do action here
          }
      }

    The process should work. However, if you have hundreds of tag/attribute combinations, you probably wouldn't want to hard-code them. Instead, you might prefer to write a subroutine and pass in the tag and attribute combination.
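The subroutine idea might look something like the sketch below. It is not from the original reply: the name `elements_missing_attr` and the sample data are made up for illustration, and it assumes that no attribute value contains a `>` character. Unlike the snippets above it drops the `/i` flag, since XML names are case-sensitive.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <$tag ...> element found in $text
# that does not carry the attribute $attr.
sub elements_missing_attr {
    my ($text, $tag, $attr) = @_;

    # \b stops 'cat' from also matching a tag such as <catalog>.
    my @elements = $text =~ /(<\Q$tag\E\b[^>]*>)/g;

    # Keep only the elements with no "attr=" after some whitespace.
    return grep { !/\s\Q$attr\E\s*=/ } @elements;
}

my $sample = join "\n",
    "<cat tail='text' meow='text'/>",
    "<cat tail='text'/>",
    "<dog tail='text' bark='text'/>";

# One tag/attribute pair per entry; extend the list as needed.
for my $pair (['cat', 'meow'], ['dog', 'bark']) {
    my ($tag, $attr) = @$pair;
    print "$tag is missing $attr: $_\n"
        for elements_missing_attr($sample, $tag, $attr);
}
```

The pairs could just as well come from a configuration file, which keeps the hundreds of combinations out of the code.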

    Hope this helps.

      I have to clean the file line by line. Many of the downloads are 2+ GB in size and I get memory errors if I do too much in RAM.

      I've been considering a 'cascade' of regexes to toss out the noise: something like testing for cat, dog, etc., then looking for the attribute list. Lots of nested ifs. It seems messy, but it might be the only avenue. I was hoping there was a slick regex to do this instead.
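One way to avoid the nested ifs is to build a single pattern from a tag-to-attribute table, using a negative lookahead per tag. The following is only a sketch: `%required`, `$is_noise`, and `keep_line` are illustrative names, it handles one required attribute per tag, and it assumes one element per line with no `>` inside attribute values. A value that itself contains text like ` meow=` would fool it.

```perl
use strict;
use warnings;

# Required attribute per tag (taken from the cat/dog example).
my %required = (cat => 'meow', dog => 'bark');

# One alternative per tag: the tag name, then a negative lookahead that
# fails if the required attribute appears before the closing '>'.
my $alternatives = join '|', map {
    my ($tag, $attr) = (quotemeta($_), quotemeta($required{$_}));
    "$tag\\b(?![^>]*\\s$attr\\s*=)";
} sort keys %required;

# Matches a line whose element is a known tag lacking its attribute.
my $is_noise = qr/^\s*<(?:$alternatives)/;

# True if the line should be kept.
sub keep_line { return $_[0] !~ $is_noise }

for my $line (
    "<cat tail='text' meow='text'/>",
    "<cat tail='text'/>",
    "<dog tail='text' bark='text'/>",
    "<dog tail='text'/>",
) {
    print keep_line($line) ? "keep: " : "drop: ", $line, "\n";
}
```

Because the pattern is compiled once with `qr//`, each line costs a single match regardless of how many tags are in the table, and lines with unknown tags pass through untouched.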

           Many of the downloads are 2+ GB in size and I get memory errors if I do too much in RAM.

        Well, that's a constraint that you didn't share initially. Had I been aware of that I would not have proposed slurping the file(s) into memory.

        Now that I have a better understanding of the constraints, I would probably do something like the untested code below. For each file that needs 'cleaning', run the script with perl -i.bak, which edits the file in place and first saves a backup copy with the .bak file extension. (Without the .bak, Perl just overwrites the file with no backup.)

        Basically, the code below checks a file line by line against each tag/attribute pair specified. If an attribute is missing for a tag, that line is 'deleted' from the file. This might not be exactly what you want to do, but it should give you a framework to use for your own 'noise' handling operations.
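The script itself is not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of the approach described, not the poster's code. It sets `$^I` inside the program, which has the same effect as `-i.bak` on the command line; `%required`, `line_is_noise`, and `clean_file` are hypothetical names, and it assumes one element per line with no `>` inside attribute values.

```perl
use strict;
use warnings;

# Required attributes per tag; a tag may list more than one.
my %required = (cat => ['meow'], dog => ['bark']);

# True if the line holds a known tag that lacks a required attribute.
sub line_is_noise {
    my ($line) = @_;
    my ($tag, $attrs) = $line =~ /^\s*<(\w+)\b([^>]*?)\/?>/ or return 0;
    my $need = $required{$tag} or return 0;    # unknown tags are kept
    for my $attr (@$need) {
        return 1 unless $attrs =~ /(?:^|\s)\Q$attr\E\s*=/;
    }
    return 0;
}

# Rewrite $path in place, one line at a time, keeping "$path.bak".
sub clean_file {
    my ($path) = @_;
    local ($^I, @ARGV) = ('.bak', $path);      # same effect as perl -i.bak
    while (my $line = <>) {
        print $line unless line_is_noise($line);   # goes to the new file
    }
    return;
}

# Clean every file named on the command line.
my @files = @ARGV;
clean_file($_) for @files;
```

Only one line is held in memory at a time, so file size should not matter, though each run needs enough disk space for the backup copy.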

Re: XML cleanup - regex or ?
by murugu (Curate) on Sep 22, 2010 at 09:08 UTC

    I don't know whether I understood the question correctly. As you mentioned that each line in the file is a single XML statement, I used XML::Twig to check the elements and attributes by processing the file line by line.

    The code below prints the line number at which it finds a discrepancy. You can tweak it to accommodate the changes you need.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    # Required attribute for each element of interest.
    my %elem_att = qw(cat meow dog bark);
    my $reg = join '|', keys %elem_att;

    while (<DATA>) {
        # Skip lines that don't open one of the listed elements;
        # \b keeps e.g. <catalog> from matching 'cat'.
        next unless (m/<(?:$reg)\b/);
        my $line     = $_;
        my $line_num = $.;
        my $elt      = XML::Twig::Elt->parse($line);
        my $element  = $elt->name;
        my $att      = $elem_att{$element};
        unless ($elt->att_exists($att)) {
            print "Attribute $att is not found at line number $line_num\n";
            next;
        }
    }
    __DATA__
    <root>
    <a/>
    <b/>
    <cat tail='text' meow='text'/>
    <cat tail='text'/>
    <cat tail='text'/>
    <dog tail='text' bark='text'/>
    <dog tail='text'/>
    </root>

    Regards,
    Murugesan Kandasamy
    use perl for(;;);