PerlMonks  

Re: Parse XML with Perl regex

by rowdog (Curate)
on Jul 07, 2010 at 23:44 UTC (#848574)


in reply to Parse XML with Perl regex

You're pretty close. In scalar context a regex match only tells you whether it succeeded (true or false). In list context it returns the list of captured groups.

my ($info_name) = $line =~ /\<info_name\>(\S+)\<\/info_name\>/i;
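To make the context difference concrete, here is a minimal sketch. The sample line is made up for illustration; only the pattern comes from the line above (with the unneeded backslashes before < and > dropped).

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original XML file.
my $line = '<info_name>FZGA34177.b1</info_name>';

# Scalar context: the match only reports success (1) or failure.
my $matched = $line =~ /<info_name>(\S+)<\/info_name>/;

# List context: the parentheses on the left make the match
# return its captured groups instead.
my ($info_name) = $line =~ /<info_name>(\S+)<\/info_name>/;

print "scalar: $matched\n";
print "list:   $info_name\n";
```

Without the parentheses around $info_name the assignment would be in scalar context and the variable would just hold 1.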

And now for some notes...

  • DON'T DO THAT! Using regexes on XML is fragile.
  • Use something like XML::LibXML
  • I see fasta in there so you may like Perl and Bioinformatics
  • use strict;
  • use warnings;
  • XML element names are case-sensitive, so you should match the exact case rather than ignore case in your regex.
  • Your example XML has 3 copies of the same structure so you will end up with one unique key in your hash.
  • Unless this is the beginning of nested t_volumes, you missed the / in </t_volume>
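The point about ending up with one unique key can be seen in a small sketch. The records below are hypothetical stand-ins for the three identical structures in the example XML.

```perl
use strict;
use warnings;

# Hypothetical records: three copies of the same name/size pair.
my @records = (
    [ 'FZGA34177.b1', 35000 ],
    [ 'FZGA34177.b1', 35000 ],
    [ 'FZGA34177.b1', 35000 ],
);

my %results;
for my $record (@records) {
    my ( $name, $size ) = @$record;
    $results{$name} = $size;    # a repeated key overwrites the earlier value
}

# keys in scalar context gives the number of keys in the hash
my $count = keys %results;
print "$count unique key(s)\n";
```

If the real data can repeat a name with different sizes, only the last size seen survives, so a hash keyed on the name alone may not be what you want.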

For my example, I decided to rely on the fact that the interesting tags do not contain other tags. If that changes, my code breaks. I also rely on the order of the tags as shown in the example XML, which is generally a dumb assumption since whatever writes the XML is free to emit the elements in a different order.

#!/usr/bin/perl
use strict;
use warnings;

my @files = glob('./*.xml');
my %results;

foreach my $xmlname (@files) {
    open my $fh, '<', $xmlname or die "$xmlname: $!";
    while ( my $line = <$fh> ) {
        my ($name) = $line =~ /\<info_name\>([^<]+)\<\/info_name\>/
            or next;
        while ( my $l = <$fh> ) {
            $l =~ /\<it_size\>([^<]+)\<\/it_size\>/ or next;
            $results{$name} = $1;
            last;
        }
    }
}

print map { "$_ => $results{$_}\n" } keys %results;

jth@reina:~/tmp$ perl 848551.pl
FZGA34177.b1 => 35000

And finally, my XML::LibXML alternative which does not rely on tag ordering or the content of the tag.

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my @files = glob('./*.xml');
my %results;

foreach my $xmlname (@files) {
    my $dom = XML::LibXML->load_xml(
        location => $xmlname,
        recover  => 1,    # no </t_volume> in example
    ) or die $!;
    foreach my $node ( $dom->findnodes('//info') ) {
        $results{ $node->find('info_name') } = $node->find('it_size');
    }
}

print map { "$_ => $results{$_}\n" } keys %results;

jth@reina:~/tmp$ perl 848551.pl
./848551.xml:55: parser error : Premature end of data in tag t_volume line 54
^
./848551.xml:55: parser error : Premature end of data in tag t_volume line 2
^
FZGA34177.b1 => 35000
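To see that the XPath approach really does not care about tag order, here is a self-contained sketch that parses an inline string instead of files. The XML is a made-up, well-formed sample (the second record and its values are invented), and the sketch uses findvalue, which returns the text as a plain string rather than a node list.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document; the second <info> lists its
# children in the opposite order on purpose.
my $xml = <<'XML';
<t_volume>
  <info>
    <info_name>FZGA34177.b1</info_name>
    <it_size>35000</it_size>
  </info>
  <info>
    <it_size>42000</it_size>
    <info_name>EXAMPLE.b2</info_name>
  </info>
</t_volume>
XML

my $dom = XML::LibXML->load_xml( string => $xml );

my %results;
foreach my $node ( $dom->findnodes('//info') ) {
    # Each lookup is relative to the current <info> node, so the
    # position of the child elements does not matter.
    $results{ $node->findvalue('info_name') } = $node->findvalue('it_size');
}

print map { "$_ => $results{$_}\n" } sort keys %results;
```

Both records come out paired correctly even though the children are swapped, which is exactly where the line-by-line regex version would go wrong.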

