This regex will work as well: /molecule_idref="([^"]+)/
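As a rough sketch of how that regex might be used: with the /g modifier in list context it returns every captured id, even when a tag is broken over two lines. The sample input in $text is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, made up for this illustration.
my $text = <<'END';
<ComplexComponent1 molecule_idref="1"/>
<ComplexComponent4
molecule_idref="4"/>
END

# In list context, a /g match returns all captured groups at once.
my @ids = $text =~ /molecule_idref="([^"]+)/g;

print "@ids\n";
```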
The regex matches everything between the double quotes after molecule_idref=. If you are sure your id contains only digits, you are indeed better off matching digits (\d+), as was already suggested.

Note that when parsing XML files there is no guarantee that white-space, line endings, and so on will be where you expect them to be. Reading such files line by line, or relying on your "start of line" anchors, can therefore cause subtle errors. What would you have done if your tags did not start at the beginning of the line, or if a tag was broken over several lines?

Consider using an XML parser such as XML::Simple, which will turn your XML into a nice Perl data structure. For example:

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;
my $xml;
{
    local $/;    # undef $/ means slurp mode ('' would be paragraph mode)
    $xml = <DATA>;
}
my $xs = XML::Simple->new();
my $ref = $xs->XMLin($xml);
print Dumper($ref);
__DATA__
<xml><ComplexComponent1 molecule_idref="1"/>
<ComplexComponent2 molecule_idref="2"/><ComplexComponent3 molecule_idref="3"/>
<ComplexComponent4
molecule_idref="4"/><ComplexComponent5
molecule_idref="5"/>
</xml>
This will turn the mess in the __DATA__ section into:

$VAR1 = {
'ComplexComponent3' => {'molecule_idref' => '3'},
'ComplexComponent5' => {'molecule_idref' => '5'},
'ComplexComponent1' => {'molecule_idref' => '1'},
'ComplexComponent4' => {'molecule_idref' => '4'},
'ComplexComponent2' => {'molecule_idref' => '2'}
};
a nice hash-of-hashes, which you can access in any "Perlish" way.
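A minimal sketch of such access, assuming $ref holds the structure shown in the dump. The hash literal below simply repeats the Dumper output so that the snippet runs on its own; in the real script $ref would come from XMLin.

```perl
use strict;
use warnings;

# Assumption: this mirrors the structure printed by Dumper($ref).
my $ref = {
    'ComplexComponent1' => { 'molecule_idref' => '1' },
    'ComplexComponent2' => { 'molecule_idref' => '2' },
    'ComplexComponent3' => { 'molecule_idref' => '3' },
    'ComplexComponent4' => { 'molecule_idref' => '4' },
    'ComplexComponent5' => { 'molecule_idref' => '5' },
};

# Direct lookup of one attribute.
my $third = $ref->{ComplexComponent3}{molecule_idref};
print "ComplexComponent3 refers to molecule $third\n";

# Walk all components in name order and collect their ids.
my @ids;
for my $name (sort keys %$ref) {
    my $id = $ref->{$name}{molecule_idref};
    push @ids, $id;
    print "$name => $id\n";
}
```

Hash keys come back in no particular order, which is why the dump lists the components shuffled; the sort is only there to get a stable listing.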
CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James