PerlMonks  

Problem with Regex

by ŞuRvīvőr (Novice)
on Aug 01, 2011 at 09:55 UTC ( [id://917791]=perlquestion )

ŞuRvīvőr has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I'm using Perl 5.8 and I'm having a problem with my regex. The regex is supposed to return the data of a certain tag (between the opening and closing of the tag) in an XML file. Example:
<page>
  <title>Some Title</title>
  <id>1149707</id>
  <revision>
    <id>4220</id>
    <timestamp>2011-04-02T16:47:40Z</timestamp>
    <contributor>
      <username>some User Name</username>
      <id>268</id>
    </contributor>
    <minor />
    <text xml:space="preserve">some Text ... ......... ..................... ..................... .....................</text>
  </revision>
</page>

I tried to do a function that takes the text and the tag as an input and returns the data of that tag, but it returns nothing.

sub My_Regex {
    my($raw_text, $tag) = @_;
    $raw_text =~ /<$tag>{1}(.*)(<\/$tag>){1}/;
    $output = $1;
    print "OUTPUT: $output\n";
}

Can someone help me with that? Note that I don't use any XML parsers or a higher version of Perl.

Replies are listed 'Best First'.
Re: Problem with Regex
by moritz (Cardinal) on Aug 01, 2011 at 10:24 UTC

    A solution using a pure-Perl XML parser:

    use strict;
    use warnings;
    use Mojo::DOM;

    my $dom = Mojo::DOM->new()->parse(do { local $/; <DATA> });

    # here you can use any other tag than 'title' too:
    print $dom->at('title')->text;

    __DATA__
    <page>
      <title>Some Title</title>
      <id>1149707</id>
      <revision>
        <id>4220</id>
        <timestamp>2011-04-02T16:47:40Z</timestamp>
        <contributor>
          <username>some User Name</username>
          <id>268</id>
        </contributor>
        <minor />
        <text xml:space="preserve">some Text ... ......... ..................... ..................... .....................</text>
      </revision>
    </page>
    <page>
      <title>Some other Title</title>
      <id>11497077</id>
      <revision>
        <id>42290</id>
        <timestamp>2011-05-02T16:47:40Z</timestamp>
        <contributor>
          <username>some Other User Name</username>
          <id>2688</id>
        </contributor>
        <minor />
        <text xml:space="preserve">some Other Text ... ......... ......... ..................... ..................... .....................</text>
      </revision>
    </page>

    See Mojo::DOM for other extraction methods it offers.

      Thanks a lot for your help. I used the XML::DOM XML parser, which is easy to use, but I'm still having a problem finding matches in a multi-line string. For example, I want to look for a category in a multi-line text:

      [category: SOME CATEGORY]

      How do I do that?

        Actually, I want to find all categories in a given text.
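One possible approach to the question above, as a minimal sketch. It assumes every category is written exactly as `[category: NAME]` and that NAME never contains a closing bracket. The sub name `find_categories` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every NAME found in "[category: NAME]" markers.
# Assumption: a NAME never contains ']' (it may contain a line break).
sub find_categories {
    my ($text) = @_;
    # In list context, /g returns all captures at once. The class [^\]] also
    # matches newlines, so no /s modifier is needed here.
    my @found = $text =~ /\[category:\s*([^\]]*?)\s*\]/gi;
    return @found;
}

# Made-up sample text spanning several lines.
my $sample = "intro text\n[category: Perl]\nmore text\n[category: XML Parsing]\n";
print "$_\n" for find_categories($sample);
```

Because the match runs against the whole string, it does not matter on which line a marker appears.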
Re: Problem with Regex
by JavaFan (Canon) on Aug 01, 2011 at 09:59 UTC
    A dot on its own doesn't match a newline. Use /(?s:.)/ instead.

    knowing that, I don't use any XML parsers
    Maybe you should, otherwise, you'll come back with more questions.
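To illustrate the point about the dot and newlines, here is a small sketch on a made-up two-line string. It compares a plain dot, the `(?s:.)` group, and the `/s` modifier. Note that the `*` quantifier stays in place: only the dot itself is swapped.

```perl
use strict;
use warnings;

# Made-up sample: the title content contains a newline.
my $text = "<title>Some\nTitle</title>";

# Plain dot: cannot cross the newline, so the match fails.
my ($plain) = $text =~ /<title>(.*)<\/title>/;

# (?s:.) matches any character, newline included; the * is kept outside it.
my ($inline) = $text =~ /<title>((?s:.)*)<\/title>/;

# The /s modifier has the same effect on every dot in the pattern.
my ($flag) = $text =~ /<title>(.*)<\/title>/s;

print defined $plain ? "plain: matched\n" : "plain: no match\n";
print "inline: $inline\n";
print "flag:   $flag\n";
```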
      I tried (?s:.) instead of .* and it's still not working. The code has become:
      sub My_Regex { my($raw_text, $tag) = @_; $raw_text =~ /<$tag>{1}(?s:.)(<\/$tag>){1}/; my $output = $1; print "OUTPUT: $output\n"; }
      and it still shows nothing.
         $raw_text =~ /<$tag>{1}(.*)(<\/$tag>){1}/s;

        Try this. The 's' at the end modifies '.' to also match line endings and is guaranteed to work with 5.8.

        By the way, your pattern only works if this tag occurs only once in your string. Otherwise, using (.*?) instead of (.*) will at least allow you to find the first occurrence. The difference is that .* tries to find the longest possible string while .*? will get the shortest.
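The difference between `.*` and `.*?` can be seen in a short sketch. The sample string is made up and contains two `<id>` elements on purpose.

```perl
use strict;
use warnings;

# Two <id> elements, to show where each quantifier stops.
my $xml = "<id>1149707</id>\n<revision>\n<id>4220</id>\n</revision>";

my ($greedy) = $xml =~ /<id>(.*)<\/id>/s;     # runs on to the LAST </id>
my ($lazy)   = $xml =~ /<id>(.*?)<\/id>/s;    # stops at the FIRST </id>

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Here the greedy capture swallows the first closing tag and everything up to the second one, while the lazy capture holds only `1149707`.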

        Please follow my instructions. I said to replace the dot with (?s:.), I did not say to replace (.*) with (?s:.).
      I tried to use an XML parser but I couldn't install it. It needs some C parsers to be installed, and I couldn't do that.
Re: Problem with Regex
by jwkrahn (Abbot) on Aug 01, 2011 at 10:20 UTC

    I tried to do a function that takes the text and the tag as an input and returns the data of that tag, but it returns nothing.

    sub My_Regex { my($raw_text, $tag) = @_; $raw_text =~ /<$tag>{1}(.*)(<\/$tag>){1}/; my $output = $1; print "OUTPUT: $output\n"; }

    But it does return something: it returns the value of the last statement in the sub, which is print, and print returns true or false.

    But anyway, you probably want something like this:

    sub My_Regex { my ( $raw_text, $tag ) = @_; return $raw_text =~ /<$tag>(.*?)<\/$tag>/s ? $1 : ''; }
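A hedged variation on the sub above, not part of the original reply. The opening `<text xml:space="preserve">` tag in the sample carries an attribute, so a literal `<$tag>` would not match it. This sketch, under the hypothetical name `tag_content`, allows attributes on the opening tag and wraps the tag name in `\Q...\E` so it is taken literally. It is still not a parser: nested elements of the same name and self-closing tags are not handled.

```perl
use strict;
use warnings;

# Hypothetical variant: tolerates attributes on the opening tag and quotes
# the tag name so any regex metacharacters in it are matched literally.
sub tag_content {
    my ( $raw_text, $tag ) = @_;
    return $raw_text =~ /<\Q$tag\E(?:\s[^>]*)?>(.*?)<\/\Q$tag\E>/s ? $1 : '';
}

# Made-up sample modelled on the XML in the question.
my $page = qq{<page>\n<title>Some Title</title>\n}
         . qq{<text xml:space="preserve">some\nText</text>\n</page>};

print tag_content( $page, 'title' ), "\n";
print tag_content( $page, 'text' ),  "\n";
```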
Re: Problem with Regex
by Logicus (Initiate) on Aug 02, 2011 at 13:07 UTC
    #!/usr/bin/perl
    use Modern::Perl;

    $_ = qq@
    <page>
      <title>Some Title</title>
      <id>1149707</id>
      <revision>
        <id>4220</id>
        <timestamp>2011-04-02T16:47:40Z</timestamp>
        <contributor>
          <username>some User Name</username>
          <id>268</id>
        </contributor>
        <minor />
        <text xml:space="preserve">some Text ... ......... ..................... ..................... .....................</text>
      </revision>
    </page>
    @;

    s@.*?<text.*?>(.*?)</text>.*@$1@s;
    say;
    Tested.
Re: Problem with Regex
by delirium (Chaplain) on Aug 02, 2011 at 16:31 UTC
    I've found range operators pretty helpful when looking for text between XML tags. A variation on something like this might work:
    perl -e '$tag = "text"; while(<>) {print if(/<$tag/ .. m!</$tag!)}' file.txt
    Output:
    <text xml:space="preserve">some Text ... ......... ..................... ..................... .....................</text>
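The one-liner above can also be written as a small script. This is only a sketch: the sub name `lines_between` is hypothetical, and the input is read from an in-memory filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical sub: returns the lines from the one containing the opening tag
# through the one containing the closing tag, reading line by line.
sub lines_between {
    my ( $fh, $tag ) = @_;
    my @lines;
    while ( my $line = <$fh> ) {
        # In scalar context, .. is false until the left pattern matches, then
        # stays true up to and including the line matching the right pattern.
        push @lines, $line
            if $line =~ /<\Q$tag\E\b/ .. $line =~ m!</\Q$tag\E>!;
    }
    return @lines;
}

# Made-up sample, opened as an in-memory file (available since Perl 5.8).
my $xml = "<page>\n<title>Some Title</title>\n"
        . "<text xml:space=\"preserve\">some\nText</text>\n</page>\n";
open my $fh, '<', \$xml or die "Cannot open in-memory file: $!";
print lines_between( $fh, 'text' );
close $fh;
```

The range operator keeps its state between calls, so this sketch assumes each input actually contains the closing tag. It also returns whole lines, including anything else that shares a line with the tags.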
Re: Problem with Regex
by juampatronics (Initiate) on Aug 02, 2011 at 09:52 UTC

    Hi there,

    I guess this is what you need:

    my $str = join("", <$file>);
    print $1 while $str =~ /<$tagname>(.*?)<\/$tagname>/msg;

    I think you are missing two things:
    1. Multiline matching (for example, the contents of "revision" span several lines). See the regexp flags in the code above.
    2. You need to slurp the contents of your file first. Otherwise you'll get empty results when you try to get the contents of elements like "revision".
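The two points above can be sketched as follows. The helper names `slurp` and `all_tag_contents` and the file name in the comment are hypothetical. The capture is non-greedy so that each element is returned separately when the tag occurs more than once.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undefined record separator: one read returns everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: in list context, returns the content of every
# <$tagname>...</$tagname> pair found in $str.
sub all_tag_contents {
    my ( $str, $tagname ) = @_;
    return $str =~ /<\Q$tagname\E>(.*?)<\/\Q$tagname\E>/sg;
}

# With a real file this would be: my $str = slurp('pages.xml');
my $str = "<page>\n<id>1</id>\n<revision>\n<id>2</id>\n</revision>\n</page>\n";
print "$_\n" for all_tag_contents( $str, 'id' );
```

Reading line by line instead would hand the regex one line at a time, so an element such as "revision" that spans several lines could never match.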

    In any case, IMHO the best thing to do is always to use standard, well-known libraries. Try XML::XPath. XPath is actually not that hard (the brief tutorial at http://www.w3schools.com/ may suffice).

    Hope this helps anyway, Juampa

Node Type: perlquestion [id://917791]
Approved by Corion
Front-paged by Corion