extract ids

snape has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: extract ids by ww (Archbishop) on Sep 16, 2009 at 04:57 UTC
Your regex is close; very close. However, the character class `[A-Za-z]` matches ONLY ONE character. Since you want to match the many characters before the `\smolecule`, you need a quantifier after the character class, thusly: `if ($line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/) {` where the "+" says "match one or more members of the class." (Note that even though the "+" makes the regex "greedy," there's no harm here, because you specify a whitespace character next.) I've also changed your capture to specify one or more digits. `.*` matches ZERO or more of anything, which isn't what you've specified.
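To see the effect of the quantifier, here is a minimal sketch that runs the pattern with and without the "+". The helper names `extract_id` and `extract_id_no_quantifier` and the sample lines are made up for illustration; they are not from the original question. Note also that `[A-Za-z]+` will not match a tag name containing digits, so a class such as `\w+` would be needed for those.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the id from a line holding exactly one
# opening tag, or undef when the line does not match.
sub extract_id {
    my ($line) = @_;
    return $line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/ ? $1 : undef;
}

# Same pattern without the quantifier: the class matches one letter only.
sub extract_id_no_quantifier {
    my ($line) = @_;
    return $line =~ /^<[A-Za-z]\smolecule_idref="(\d+)">$/ ? $1 : undef;
}

for my $line ('<Component molecule_idref="42">', '<C molecule_idref="7">') {
    my $with    = extract_id($line);
    my $without = extract_id_no_quantifier($line);
    print "$line\n";
    print "  with +:    ", (defined $with    ? $with    : 'no match'), "\n";
    print "  without +: ", (defined $without ? $without : 'no match'), "\n";
}
```

Without the quantifier only the one-letter tag name matches, which is why the longer tag name fails.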
Re: extract ids by CountZero (Bishop) on Sep 16, 2009 at 06:24 UTC
This regex will work as well: `/molecule_idref="([^"]+)/`

This will match everything between the double quotes after `molecule_idref=`. If you are sure your id only contains numbers, then you had indeed better check for digits (`\d+`), as was already suggested.

Note that when parsing XML files, there is no guarantee that white-space, end-of-line characters and the like will be where you expect them to be, so reading such files line by line, or relying on "start of line" anchors, may cause subtle errors. What would you have done if your tags did not start at the beginning of the line, or if a tag was broken over several lines?

Consider using an XML parser, such as XML::Simple, which will turn your XML into a nice Perl data structure. For example:

    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    # Slurp the whole __DATA__ section into one string.
    my $xml;
    {
        local $/;    # undef record separator: read everything at once
        $xml = <DATA>;
    }

    my $xs  = XML::Simple->new();
    my $ref = $xs->XMLin($xml);
    print Dumper($ref);

    __DATA__
    <xml><ComplexComponent1 molecule_idref="1"/>
    <ComplexComponent2 molecule_idref="2"/><ComplexComponent3 molecule_idref="3"/>
    <ComplexComponent4 molecule_idref="4"/><ComplexComponent5 molecule_idref="5"/>
    </xml>

This will turn the mess in the `__DATA__` section into:

    $VAR1 = {
        'ComplexComponent3' => {'molecule_idref' => '3'},
        'ComplexComponent5' => {'molecule_idref' => '5'},
        'ComplexComponent1' => {'molecule_idref' => '1'},
        'ComplexComponent4' => {'molecule_idref' => '4'},
        'ComplexComponent2' => {'molecule_idref' => '2'}
    };

a nice hash-of-hashes which you can access in any "Perlish" way.

CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
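As a rough illustration of what accessing the hash-of-hashes might look like, the sketch below starts from a literal copy of the Dumper output above, so it runs without XML::Simple installed. The helper name `ids_from` is hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Literal copy of the structure shown in the Dumper output above.
my $ref = {
    ComplexComponent3 => { molecule_idref => '3' },
    ComplexComponent5 => { molecule_idref => '5' },
    ComplexComponent1 => { molecule_idref => '1' },
    ComplexComponent4 => { molecule_idref => '4' },
    ComplexComponent2 => { molecule_idref => '2' },
};

# Hypothetical helper: collect every molecule_idref, sorted numerically.
sub ids_from {
    my ($data) = @_;
    my @ids = map  { $_->{molecule_idref} }
              grep { ref($_) eq 'HASH' && exists $_->{molecule_idref} }
              values %$data;
    return sort { $a <=> $b } @ids;
}

# All ids, one per line.
print "$_\n" for ids_from($ref);

# Direct lookup of a single element by tag name.
print "ComplexComponent4 has id $ref->{ComplexComponent4}{molecule_idref}\n";
```

Hash order is not preserved, which is why the ids are sorted explicitly before printing.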
Re^2: extract ids by snape (Pilgrim) on Sep 23, 2009 at 20:48 UTC
Hi, thanks a lot for answering my questions. I would like to know how `/molecule_idref="([^"]+)/` will match everything between the quotes. I didn't understand why `"([^"]+)` will match anything between the quotes. Thanks.
Re^3: extract ids by CountZero (Bishop) on Sep 24, 2009 at 18:16 UTC
`[^"]` in a regex means: any single character BUT a double quote. `"([^"]+)` therefore means: start with a double quote, then capture one or more characters that are not a double quote, and end the capture. In other words, the capture starts after the first double quote and ends just before the next double quote.

CountZero
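A small sketch may make the difference concrete. It compares the negated class with a greedy `.+` on a tag that carries a second attribute. The sample tag and the helper names `capture_negated` and `capture_greedy` are invented for this illustration.

```perl
use strict;
use warnings;

# Negated class: the capture stops just before the next double quote.
sub capture_negated {
    my ($text) = @_;
    my ($id) = $text =~ /molecule_idref="([^"]+)/;
    return $id;
}

# Greedy dot, for contrast: .+ runs to the end of the string and then
# backtracks only as far as the LAST double quote.
sub capture_greedy {
    my ($text) = @_;
    my ($id) = $text =~ /molecule_idref="(.+)"/;
    return $id;
}

my $tag = '<foo molecule_idref="789" other="74">';

print "negated class: ", capture_negated($tag), "\n";   # 789
print "greedy dot:    ", capture_greedy($tag),  "\n";   # 789" other="74
```

Because `[^"]` can never match a double quote, no closing quote is needed in the pattern to stop the capture.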
Re: extract ids by Jenda (Abbot) on Sep 17, 2009 at 14:33 UTC
As I just wrote elsewhere, regexps are not a good tool for parsing XML. There may be comments, `<![CDATA[ ... ]]>` sections, escaped data, newlines and other whitespace in unexpected places, ...

    use strict;
    use XML::Parser;

    my $parser = XML::Parser->new(
        Handlers => {
            # Called once per opening tag, with the tag's attributes.
            Start => sub {
                my ($expat, $tag, %attr) = @_;
                print $attr{'molecule_idref'}, "\n"
                    if %attr and exists $attr{'molecule_idref'};
            }
        }
    );

    $parser->parse(\*DATA);

    __DATA__
    <root>
     <foo molecule_idref="123">fgdfg</foo>
     <bar molecule_idref="456">fgd
      <foo other="74" molecule_idref="789">fgdfg</foo>
     fg</bar>
     some text about how to write things containing molecule_idref="666".
     <baz molecule_idref="987"/>
    </root>

Jenda
Enoch was right!* Enjoy the last years of Rome.
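To illustrate the kind of mistake a regexp can make here, the sketch below runs the regex suggested elsewhere in this thread over the same sample document. The helper name `regex_ids` is made up for the example. The regex also reports the "666" that appears in plain text, whereas a parser's Start handler only ever sees real attributes.

```perl
use strict;
use warnings;

# The sample document from the post above.
my $xml = <<'XML';
<root>
 <foo molecule_idref="123">fgdfg</foo>
 <bar molecule_idref="456">fgd
  <foo other="74" molecule_idref="789">fgdfg</foo>
 fg</bar>
 some text about how to write things containing molecule_idref="666".
 <baz molecule_idref="987"/>
</root>
XML

# Hypothetical helper: the regex approach applied to the whole document.
sub regex_ids {
    my ($text) = @_;
    my @ids = $text =~ /molecule_idref="([^"]+)/g;
    return @ids;
}

my @ids = regex_ids($xml);
print "@ids\n";    # 123 456 789 666 987
```

The "666" is a false positive: it sits in character data, not in a tag.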
Re: extract ids by ccn (Vicar) on Sep 16, 2009 at 07:10 UTC
What about this? `perl -lne 'print for /molecule_idref="([^"]+)/g' xmlfile` I've used the 'g' modifier to catch ids in case more than one occurs on a line.
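For anyone unfamiliar with the switches: `-n` wraps the code in a loop over the input lines, `-l` strips the trailing newline and adds one back on every `print`, and `-e` supplies the code. The sketch below shows the role of the 'g' modifier on a single line. The helper names `all_ids` and `first_id` and the sample line are invented for illustration.

```perl
use strict;
use warnings;

# With /g in list context, the match returns every capture on the line.
sub all_ids {
    my ($line) = @_;
    my @ids = $line =~ /molecule_idref="([^"]+)/g;
    return @ids;
}

# Without /g, only the first match on the line is captured.
sub first_id {
    my ($line) = @_;
    my ($id) = $line =~ /molecule_idref="([^"]+)/;
    return $id;
}

my $line = '<a molecule_idref="1"/><b molecule_idref="2"/>';

print join(',', all_ids($line)), "\n";   # 1,2
print first_id($line), "\n";             # 1
```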
Re: extract ids by umasuresh (Hermit) on Jan 30, 2010 at 23:06 UTC
A must-read book for getting a good grip on regexps is Mastering Regular Expressions by Friedl: http://oreilly.com/catalog/9781565922570