PerlMonks  

extract ids

by snape (Pilgrim)
on Sep 16, 2009 at 04:31 UTC (#795516=perlquestion)
snape has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Thanks a lot for answering my doubts. I would like to know how
/molecule_idref="([^"]+)/
will match everything between the quotes. I didn't understand why "([^"]+) will match anything between the quotes. Thanks.

Re: extract ids
by ww (Bishop) on Sep 16, 2009 at 04:57 UTC
    Your regex is close; very close.

    However, the character class [A-Za-z] matches ONLY ONE character.

    Since you want to match the many characters before the \smolecule, you need a quantifier after the char class, thusly:

    if ($line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/) {

    where the "+" says "match one or more members of the class." (Note that even though the "+" makes the regex "greedy," there's no harm here, because you specify a whitespace character next.)

    I've also changed your capture to specify one or more digits. .* matches ZERO or more of anything, which isn't what you've specified.
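
    To make the difference concrete, here is a minimal sketch. The input line (tag name and id) is made up for illustration, and the pattern without the "+" is my guess at the original attempt, based on the description above.

```perl
use strict;
use warnings;

# Hypothetical input line; the tag name and id are invented for illustration.
my $line = '<ComplexComponent molecule_idref="200486">';

# Without a quantifier, [A-Za-z] consumes exactly one letter, so the \s
# that follows cannot match and the whole pattern fails.
my $no_quantifier =
    $line =~ /^<[A-Za-z]\smolecule_idref="(\d+)">$/ ? $1 : undef;

# With "+", the class consumes the whole tag name up to the whitespace.
my $with_quantifier =
    $line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/ ? $1 : undef;

print defined $no_quantifier ? "matched $no_quantifier\n" : "no match\n";
print "matched $with_quantifier\n" if defined $with_quantifier;
```

    The first pattern fails and the second captures 200486.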

Re: extract ids
by CountZero (Bishop) on Sep 16, 2009 at 06:24 UTC
    This regex will work as well:
    /molecule_idref="([^"]+)/
    This will match everything between the double quotes after molecule_idref=. If you are sure your id contains only numbers, then you had indeed better check for "digits" (\d+), as was already suggested.

    Note that when parsing XML files, there is no guarantee that whitespace, line endings, and so on will be where you expect them to be, so reading such files line by line, or relying on "start of line" anchors, may cause subtle errors. What would you have done if your tags did not start at the beginning of the line, or if a tag was broken over several lines?
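
    A small sketch of how that can go wrong. The XML fragment is hypothetical: one tag is indented and one is split over two lines, which is legal XML but defeats an anchored, line-by-line match.

```perl
use strict;
use warnings;

# Hypothetical fragment: the second tag is indented, the third is split
# across two lines.
my $xml = <<'END';
<Component molecule_idref="1">
   <Component molecule_idref="2">
<Component
molecule_idref="3">
END

# Anchored match, one line at a time: only the first tag qualifies.
my @anchored;
for my $line (split /\n/, $xml) {
    push @anchored, $1 if $line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/;
}

# Unanchored match over the whole string finds all three ids.
my @unanchored = $xml =~ /molecule_idref="([^"]+)/g;

print "anchored, line by line: @anchored\n";
print "unanchored, whole string: @unanchored\n";
```

    The anchored version reports only id 1, while the unanchored one reports 1 2 3. The unanchored regex is still no substitute for a real parser; it merely survives this particular layout.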

    Consider using an XML-parser, such as XML::Simple which will turn your XML into a nice Perl-datastructure.

    For example:

    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    # Slurp the whole __DATA__ section into one string.
    my $xml;
    {
        local $/;
        $xml = <DATA>;
    }

    my $xs  = XML::Simple->new();
    my $ref = $xs->XMLin($xml);
    print Dumper($ref);

    __DATA__
    <xml><ComplexComponent1 molecule_idref="1"/>
    <ComplexComponent2 molecule_idref="2"/><ComplexComponent3 molecule_idref="3"/>
    <ComplexComponent4 molecule_idref="4"/><ComplexComponent5 molecule_idref="5"/>
    </xml>

    Will turn the mess in the __DATA__ section into:
    $VAR1 = {
              'ComplexComponent3' => { 'molecule_idref' => '3' },
              'ComplexComponent5' => { 'molecule_idref' => '5' },
              'ComplexComponent1' => { 'molecule_idref' => '1' },
              'ComplexComponent4' => { 'molecule_idref' => '4' },
              'ComplexComponent2' => { 'molecule_idref' => '2' }
            };

    a nice hash-of-hashes which you can access in any "Perlish"-way.
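
    As an illustration of what "accessing it" might look like, here is a sketch that starts from the dumped structure written out by hand (so it runs without XML::Simple installed) and pulls the ids out of it.

```perl
use strict;
use warnings;

# The structure from the Dumper output, written out by hand.
my $ref = {
    'ComplexComponent3' => { 'molecule_idref' => '3' },
    'ComplexComponent5' => { 'molecule_idref' => '5' },
    'ComplexComponent1' => { 'molecule_idref' => '1' },
    'ComplexComponent4' => { 'molecule_idref' => '4' },
    'ComplexComponent2' => { 'molecule_idref' => '2' },
};

# Walk the outer hash by tag name and look up the attribute in the inner hash.
for my $tag (sort keys %$ref) {
    print "$tag => $ref->{$tag}{molecule_idref}\n";
}

# Or collect just the ids, sorted numerically.
my @ids = sort { $a <=> $b } map { $_->{molecule_idref} } values %$ref;
print "ids: @ids\n";
```

    Hash keys come back in no particular order, hence the explicit sort in both places.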

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Hi, Thanks a lot for answering my doubts. I would like to know how
      /molecule_idref="([^"]+)/
      will match everything between the quotes. I didn't understand why "([^"]+) will match anything between the quotes. Thanks.
        [^"]
        in a regex means: everything BUT a double quote.

        "([^"]+)
        therefore means: start with a double quote, then capture everything but a double quote and end the capture. In other words, the capture starts after the first double quote and ends just before the next double quote.
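
        A short sketch may help here. The tag below is made up for illustration; the second pattern, using a greedy .+, is only there for contrast and is not something suggested in this thread.

```perl
use strict;
use warnings;

# Hypothetical tag with a second attribute after the one we want.
my $tag = '<ComplexComponent1 molecule_idref="abc-42" other="x">';

# [^"]+ keeps consuming characters until it runs into the next double
# quote, so the capture is exactly the attribute value.
my ($negated) = $tag =~ /molecule_idref="([^"]+)/;

# For contrast: a greedy .+ followed by a quote runs on to the LAST
# double quote in the string and captures too much.
my ($greedy) = $tag =~ /molecule_idref="(.+)"/;

print "negated class: $negated\n";
print "greedy dot:    $greedy\n";
```

        The negated class captures abc-42, while the greedy version captures abc-42" other="x.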

        CountZero


Re: extract ids
by ccn (Vicar) on Sep 16, 2009 at 07:10 UTC
    What about this?
    perl -lne 'print for /molecule_idref="([^"]+)/g' xmlfile

    I've used the 'g' modifier to catch the ids in case more than one occurs on a line.
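
    The same idea written out as a small script, for anyone who prefers that to a one-liner. The sample line is hypothetical. In list context, a match with /g returns every capture on the line; without /g you get only the first.

```perl
use strict;
use warnings;

# Hypothetical line carrying two tags.
my $line = '<a molecule_idref="1"/><b molecule_idref="2"/>';

# List context with /g: one element per match.
my @all = $line =~ /molecule_idref="([^"]+)/g;

# List context without /g: only the first match's capture.
my ($first) = $line =~ /molecule_idref="([^"]+)/;

print "with /g:    @all\n";
print "without /g: $first\n";
```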

Re: extract ids
by Jenda (Abbot) on Sep 17, 2009 at 14:33 UTC

    As I just wrote elsewhere, regexps are not a good tool for parsing XML. There may be comments, <![CDATA[ ... ]]> sections, escaped data, newlines and other whitespace in unexpected places, ...

    use strict;
    use XML::Parser;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $tag, %attr) = @_;
                print $attr{'molecule_idref'}, "\n"
                    if %attr and exists $attr{'molecule_idref'};
            },
        },
    );

    $parser->parse(\*DATA);

    __DATA__
    <root>
    <foo molecule_idref="123">fgdfg</foo>
    <bar molecule_idref="456">fgd
    <foo other="74" molecule_idref="789">fgdfg</foo>
    fg</bar>
    some text about how to write things containing molecule_idref="666".
    <baz molecule_idref="987"/>
    </root>
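
    For comparison, here is a sketch of what the plain regex from the earlier replies does with the same sample data. It needs no modules, and the point is only to show the false positive.

```perl
use strict;
use warnings;

# The sample document from the __DATA__ section above.
my $xml = <<'END';
<root>
<foo molecule_idref="123">fgdfg</foo>
<bar molecule_idref="456">fgd
<foo other="74" molecule_idref="789">fgdfg</foo>
fg</bar>
some text about how to write things containing molecule_idref="666".
<baz molecule_idref="987"/>
</root>
END

# The regex cannot tell an attribute from ordinary text that happens to
# look like one, so it also picks up the 666 in the sentence.
my @ids = $xml =~ /molecule_idref="([^"]+)/g;
print "regex found: @ids\n";
```

    The regex reports 123 456 789 666 987. The 666 sits in character data, not in a start tag, so a Start handler is never called for it.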

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: extract ids
by umasuresh (Hermit) on Jan 30, 2010 at 23:06 UTC
