comment on

Speedy XML-like parser (not 100% compliant). It takes a funky approach and matches against substrs instead of gathering in the match, this made it over 20x faster in my test case.

This is going into a module that needs to be very light weight, which more or less precludes dependencies. XML::Parser::Lite isn't available by itself, and it along with XML::SAX::PurePerl have pretty hefty interfaces.

Fixed issues

<foo/>

<foo bar="1" baz="=" quux="don't"/>

<foo><bar/>quux</foo>

Last Modified: Sat May 11 04:15:44 UTC 2002

UPDATE: Please see XML::RSSLite for further updates.

=pod

=head1 SYNOPSIS

parse(\%parsedTree, \$parseThis, 'topTag', $comments);

=head1 DESCRIPTION

=over

=item parsedTree - required

reference to hash to store the parsed document within

=item parseThis  - required

reference to scalar containg the document to parse

=item topTag     - optional

tag to consider the root node, leaving this undefined is not recommend
+ed.

=item comments   - optional

=over

=item false will remove contents from

=item true will not remove comments

=item array reference is true, comments are stored here

=back

=back

=head1 CAVEATS

Not a conforming parser, it does not handle the following

=over

=item <foo bar=">">

=item <foo><bar> <bar></bar> <bar></bar> </bar></foo>

=item <![CDATA[ ]]>

=item PI

=back

It's non-validating, without a DTD the following cannot be addressed

=over

=item entities

=item namespace

=back

=cut

sub parseXML{
  my($hash, $xml, $tag, $comments) = @_;
  my($begin, $end, @comments);
  local $_;

  #Kill comments
  while( ($begin =  index(${$xml}, '<!--')) > -1 &&
    ${$xml} =~ m%<!--.*?--(>)%sg ){
    my $str = substr(${$xml}, $begin, pos(${$xml})-$begin, '');
    
    #Save them if requested
    do{ unshift @comments, [$begin, substr($str, 4, length($str)-7)] }
      if $comments;
  }


  #Find topTag and set pos to start matching from there
  ${$xml} =~ /<$tag(?:>|\s)/g;
  ($begin, $end) = (0, pos(${$xml})||0);

  #Match either <foo></foo> or <bar />, optional attributes, stash tag
+ name
  while( ${$xml} =~ m%<([^\s>]+)(?:\s+[^>]*?)?(?:/|>.*?</\1)>%sg ){   
+  

    #Save the tag name, we'll need it
    $tag = $1 || $2;

    #Save the new beginning and end
    ($begin, $end) = ($end, pos(${$xml}));

    #Get the bit we just matched.
    my $str = substr(${$xml}, $begin, $end-$begin);
    
    #Extract the actual attributes and contents of the tag
#   $str =~ m%<$tag\s*([^>]*?)?>(.*?)</$tag>%s ||
    $str =~ s%^.*?<$tag\s*([^>]*?)?>(.*?)</$tag>%<$tag>$2</$tag>%s ||
      $str =~ m%<$tag\s*([^>]*?)?\s*/>%;
    my($attr, $content) = ($1, $2);

    #Did we get attributes? clean them up and chuck them in a hash.
    if( $attr ){
      ($_, $attr) = ($attr, {});
      $attr->{$1} = $3 while m/([^\s=]+)\s*=\s*(['"])(.*?)\2/g;
    }

    #Recurse if contents has more tags, replace contents with referenc
+e we get
    if( $content && index($content, '<') > -1 ){
      parseXML($content={}, \$str, $tag, 0);
      #Was there any data in the contents? We should extract that...
#      if( $str =~ />[^><\s]+</ || $str =~ />(?:[^><\s]+\s+)+</ ){
      if( $str =~ />[^><]+</ ){
    #The odd RE above \S+\s+ shortcircuits unnecessary entry

    my $length = length($str);
    my $taglen = length($tag)+2;
    $str= substr($str, $taglen, $length-1-2*$taglen);

    #Clean whitespace between tags
    #$str =~ s%(?<=>)?\s*(?=<)%%g; #XXX ~same speed, wacko warning
    #$str =~ s%(>?)\s*<%$1<%g;

#    $str =~ s%<$_\s*(?:[^>]*?)?(?:/|>.*?</$_)>%%sg for keys %{$conten
+t};
    my $qr = qr{@{[join('|', keys %{$content})]}};
    $str =~ s%<($qr)\s*(?:[^>]*?)?(?:/|>.*?</\1)>%%sg;

    $content->{'<>'} = $str;#XXX if $str;
      }
    }

    my($inhash);
    if( ref($content) ){
      #We have attributes? Then we should save them.
      $inhash = $attr || {};

      #Contents too? Save them as well.
      if( $content ){
    for( keys %{$content} ){
      $inhash->{$_} = exists($inhash->{$_})   ?
        (ref($inhash->{$_})  eq 'ARRAY'       ?
         [@{$inhash->{$_}}, $content->{$_}]   :
         [  $inhash->{$_},  $content->{$_}] ) : $content->{$_};
    }
    
      }
    }
    else{
      #Otherwise save our content
      $inhash = $content;
    }
    
    $hash->{$tag} = exists($hash->{$tag}) ?
      (ref($hash->{$tag})  eq 'ARRAY'     ?
    [@{$hash->{$tag}}, $inhash]       :
    [  $hash->{$tag},  $inhash]  )    : $inhash;
  }

  if( $comments ){
    #Restore comments if requested
    substr(${$xml}, $_->[0], 0, '<!--'.$_->[1].'-->') for @comments;

    #Expose comments if requested
    do{ push(@$comments, $_->[1]) for @comments } if ref($comments) eq
+ 'ARRAY';
  }
}
[download]

In reply to (YA) Perl XML-like parser by belg4mit

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


There's more than one way to do things
	PerlMonks