http://www.perlmonks.org?node_id=254628

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Below is a snippet of code that I wrote to simply insert XML data where appropriate. I've used XML::Parser, and found it to complex for what I needed.

Any suggestions on code improvement?

#!/usr/bin/perl use warnings; use strict; open (READ, "test.xml") || die "ERROR: $!\n"; my @array = <READ>; close READ; open (WRITE, ">new.xml") || die "ERROR: $!\n"; foreach (@array) { if ($_ =~ /\<\!\-\- Testing XML \-\-\>/){ print WRITE '<bar>', "\n"; print WRITE '<name> TEST </name>', "\n"; print WRITE '<type> Foo </type>', "\n"; print WRITE '<!-- PDP Status -->', "\n"; print WRITE '<unknown_sec> 0 </unknown_sec>', "\n"; print WRITE '</bar>', "\n\n"; } if ($_ =~ /\<\/ Test Tag\>/) { print WRITE '<bar><value> TEST </value>'; print WRITE '</bar>', "\n"; } if ($_ =~ /\<\/v\>\<\/row\>\n$/) { $_ =~ s/\<\/row\>\n$//; $_ .= '<v> UnKnown </v></row>' . "\n"; print WRITE $_; } else { print WRITE $_; } } close WRITE;
Thanks

2003-05-01 edit ybiC: retitle from "Code efficiency ?"

Replies are listed 'Best First'.
Re: Code efficiency ?
by broquaint (Abbot) on May 01, 2003 at 14:04 UTC
    Any suggestions on code improvement
    Yes, use XML::Simple :) As you can guess from the name XML::Simple is a module with a very simple interface for manipulating XML - perl data structures. If you need more info on data structures in perl see perldsc|perldsc and perlref for info on references.
    HTH

    _________
    broquaint



      Hear, hear! I've recently jumped into XML and agree, XML::Simple does indeed appear to be the way to go for something like this.

      Or, perhaps, XML::Simpler is what you need. ;P
      -----------------------
      You are what you think.

Re: Code efficiency ?
by hardburn (Abbot) on May 01, 2003 at 14:03 UTC

    I'm about to repeat a statement that a lot of people around here have heard time and time again: Don't use regular expressions for parsing HTML or XML. They might work in a few simple cases, but you must use a real parser for anything in the least bit complex.

    There are many other modules besides XML::Parser to do the job. I've never used any of them, but I'm sure there will be suggestions below for just about all of them :)

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: Code efficiency ?
by BrowserUk (Patriarch) on May 01, 2003 at 14:43 UTC

    I won't go into the (for me, vexed) question of the rights and wrongs of using regexes this way, but if you are going to do it this way, you can certinly make your code a lot more readable.

    #!/usr/bin/perl use warnings; use strict; open (READ, "test.xml") || die "ERROR: $!\n"; my @array = <READ>; close READ; open (WRITE, ">new.xml") || die "ERROR: $!\n"; foreach (@array) { if ($_ =~ m'<!-- Testing XML -->'){ print WRITE "<bar>\n", "<name> TEST </name>\n", "<type> Foo </type>\n", "<!-- PDP Status -->\n", "<unknown_sec> 0 </unknown_sec>\n", "</bar>\n\n"; } if ($_ =~ m'</Test Tag>' ) { print WRITE "<bar><value> TEST </value></bar>\n"; } if ($_ =~ m[\Q</v></row>\E\n$] ) { $_ =~ s[\Q</row>\E\n$][<v> UnKnown </v></row>\n]; } print WRITE $_; } close WRITE;

    I beleive that the above is equivalent to your posted code, but it is untested and I may have introduced errors, but a picture is worth a thousand words.

    The first thing I would change are the multiple print statements for a single print statement.

    Then there is little to be gained by using single quoted strings for part of you output if you need to use a double quoted string to add the newlines. The compiler will probably make an better job of optimising this than you will:) Using a single print statement is probably slightly more efficient that multiple calls, but the main benefit is readability (IMO:).

    Another change I would make is to avoid having to escape characters in regexes where it isn't needed. Most of the characters you were escaping simply didn't need to be escaped, but where the regex doesn't contain any meta-characters, using single quotes as an alternative delimiter (eg. m'') avoids interpolation makes things a lot cleaner.

    It would possibly be more efficient (given that was your question) to use index in these cases anyway.

    Where the regex does contain some meta-characters and some which you would need to escape to prevent them being read as such (not the case here I think, but it serves as an example), then using \Q and \E around the bits you want escaped is often cleaner and more readable than escaping each character individually.

    The final change was to remove the duplicated print WRITE $_; statement and the redundant else clause. You could also reduce that to a slightly simpler print WRITE; as $_ is the default, but whether that clarifies or obfuscates is an open question.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      Thanks for the advice BrowserUK,

      I know I should be using an XML parser but I still think its overkill for this simple XML parse.

(jeffa) Re: Code efficiency ?
by jeffa (Bishop) on May 01, 2003 at 14:55 UTC
    Instead of printing out the XML 'by hand', use something like XML::Writer:
    use strict; use warnings; use IO::File; use XML::Writer; my $output = IO::File->new('>output.xml'); my $writer = XML::Writer->new(OUTPUT => $output); $writer->xmlDecl('UTF-8'); $writer->doctype('xml'); $writer->startTag('xml'); $writer->startTag('bar'); $writer->startTag('name'); $writer->characters(' TEST '); $writer->endTag('name'); $writer->comment(' PDP Status '); $writer->startTag('type'); $writer->characters(' Foo '); $writer->endTag('type'); $writer->endTag('bar'); $writer->endTag('xml'); $writer->end();

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Code efficiency ?
by converter (Priest) on May 01, 2003 at 15:26 UTC

    As others have already mentioned, simple regular expressions aren't the best tool for extracting data from SGML documents, but there are other parts of your code that are interesting and could be improved.

    After I wrote this I refreshed the page and noticed that a few others had already done the same, but I think there are a few changes and comments here that aren't covered in the other posts. Sorry if I'm redundant.

    This code compiles but has not been tested.

    #!/usr/bin/perl use warnings; use strict; # if the argument to die ends with newline, die # won't print the line number: # open (READ, "test.xml") || die "ERROR: $!\n"; # you probably want to know which file the error # relates to and the mode (read or write): open READ, "< test.xml" or die "ERROR opening test.xml for reading: $! +"; # NOTE that we use 'or', a lower-precedence logical # operator so that we can leave off the parentheses # around the arguments to open() my @array = <READ>; # close can fail. if it does, you probably want to know # about it: close READ or die "ERROR closing READ filehandle: $!"; # see above comments # open (WRITE, ">new.xml") || die "ERROR: $!\n"; open WRITE, "> new.xml" or die "ERROR creating new.xml: $!"; # make WRITE the default filehandle select WRITE; foreach (@array) { # '<' and '!' are not meta-characters so there's no # need to escape them: if (/<!-- Testing XML -->/) { # use a here-document (see the perlop man page) print <<_EOF_; <bar> <name> TEST </name> <type> Foo </type> <!-- PDP Status --> <unknown_sec> 0 </unknown_sec> </bar> _EOF_ } # in /PATTERN/, the '/' character is your delimiter. # if the pattern delimiter occurs in the pattern, # instead of escaping the delimiter, use a different # delimiter with the explicit m operator. here we # use '!' as our delimiter: if (m!</ Test Tag>!) { print "<bar><value> TEST </value></bar>\n"; } # \n$ : this is redundant. the $ anchor matches an optional newlin +e if (m!</v></row>$!) { # note the use of '@' instead of '/' as the delimiter: s@</row>$@@; $_ .= "<v> UnKnown </v></row>\n"; print; # prints $_ by default } else { print; } } close WRITE or die "ERROR closing WRITE filehandle: $!";
Re: Code efficiency ?
by Improv (Pilgrim) on May 01, 2003 at 15:18 UTC
    #!/usr/bin/perl -w use strict; open(READ, "test.xml") || die "Error opening test.xml: $!\n"; open(WRITE, ">new.xml") || die "Error opening new.xml: $!\n"; while(<READ>) { if( $_ eq q{<!-- Testing XML -->}) { print WRITE <<"EWRITE"; <bar> <name> TEST </name> <type> Foo </type> <!-- PDP Status --> <unknown_sec> 0 </unknown_sec> </bar> EWRITE } if($_ eq "</ Test Tag>/) { print WRITE <<ETT; <bar><value> TEST </value> </bar> ETT } if($_ eq "</v></row>\n") { s/\<\/row\>\n$//; $_ .= "<v> UnKnown </v></row>\n"; } print WRITE; } close(READ); close(WRITE);