PerlMonks  

Help with RegEx

by mr_p (Scribe)
on Jul 01, 2010 at 15:40 UTC ( [id://847535]=perlquestion )

mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I am trying to pull the href out of these lines, and I don't know why the regex is not returning the characters before and after "stylesheet".

#!/usr/bin/perl -w
use strict;

my @ss = getItemsFromFile();
foreach my $s (@ss) {
    print "$s\n";
}

sub getItemsFromFile {
    local $/ = undef;
    my ($line_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n \n<link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\">\n \n<link href=\"/css/perl.css\" rel=\"stylesheet\">\n";
    my @allItems = ();
    while ($line_in =~ m{<(.*?)stylesheet(.*?)>}gis) {
        push(@allItems, $1);
    }
    return @allItems;
}

Replies are listed 'Best First'.
Re: Help with RegEx
by kennethk (Abbot) on Jul 01, 2010 at 15:59 UTC
    The regular expression .*? can be translated to English as "match 0 or more of any character in a non-greedy fashion". That last bit is your problem - the shortest string of arbitrary characters that can be matched following stylesheet is, of course, "". You may mean to have something closer to

    while ($line_in =~ m{<^(.*?)stylesheet(.*?)$>}gis)

    which anchors your regular expression at the start and end of your string. In this context, you probably don't want non-greedy matching, so you could also use

    while ($line_in =~ m{<(.*)stylesheet(.*)>}gis)

    See perlre and perlretut for more info.
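    A quick way to compare the two quantifiers is to run both patterns over the sample input and count the matches. This is only a sketch: the heredoc is a tidied copy of the string from the question, and the names @lazy and @greedy are made up for the illustration. Because /s lets . match newlines, the greedy form can run across line breaks, so the counts differ.

```perl
use strict;
use warnings;

# Sample input from the question, written as a heredoc for readability.
my $line_in = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

# In list context with /g, a match returns every capture from every match,
# so two capture groups give two list elements per match.
my @lazy   = $line_in =~ m{<(.*?)stylesheet(.*?)>}gis;
my @greedy = $line_in =~ m{<(.*)stylesheet(.*)>}gis;

printf "non-greedy: %d match(es)\n", @lazy / 2;     # non-greedy: 3 match(es)
printf "greedy:     %d match(es)\n", @greedy / 2;   # greedy:     1 match(es)
```

    With the greedy form, the single match starts at the first < and ends at the last >, so $1 swallows almost the whole input.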

      Thanks so much.

      Is there a way to get the href value in a single step, or do I need a second regex for it?

      The results are weird... It strips off everything after "stylesheet" in this line: '<?xml-stylesheet href="perl1.css" type="text/css"?>'

        Maybe you want to use a proper HTML parser, like HTML::Parser instead?
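        For what it's worth, here is a rough sketch of how that could look with HTML::Parser (a CPAN module, not part of core Perl). It assumes you only care about <link> tags whose rel attribute is "stylesheet"; the variable names are invented for the example. Note that the <?xml-stylesheet ...?> line is a processing instruction rather than a start tag, so this handler never sees it. Picking that up would need a separate process_h handler.

```perl
use strict;
use warnings;
use HTML::Parser;    # from CPAN

my $html = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

my @hrefs;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag with the tag name and a hash ref of attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'link';
            return unless defined $attr->{rel} && lc $attr->{rel} eq 'stylesheet';
            push @hrefs, $attr->{href} if defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @hrefs;
```

        The filtering by attribute happens inside the handler, so selecting links by rel (or any other attribute) is just an ordinary hash lookup.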

Re: Help with RegEx
by furry_marmot (Pilgrim) on Jul 01, 2010 at 21:58 UTC

    No offense, but you don't even have the basics. Your dept won't worry about the speed of Perl so much as your proficiency with it. You might want to start with Learning Perl. Then you might want to take a look at Mastering Regular Expressions, though you could save a few bucks and start with perlrequick and perlretut first.

    To help you understand what's going on, let's reformat what you've got so we can see what we should search for:

    my $line_in = <<EOT;
    <?xml-stylesheet href="perl1.css" type="text/css"?>
    <link href="//www.perl.org/css/perl1.css" rel="stylesheet">
    <link href="/css/perl.css" rel="stylesheet">
    EOT
    You say you want to pull out the hrefs, but you use this pattern:
    my @ss = $line_in =~ m{<(.*?)stylesheet(.*?)>}gis;
    which says: find an opening angle bracket, then save 0 or more chars as $1 (that's what the parens do) until you reach "stylesheet". Skip "stylesheet", then save 0 or more chars as $2 until you reach a closing angle bracket. The /s modifier lets . match line breaks as well. Here is what you'll get:
    $1              $2
    |----|          |--------------------------------|
    <?xml-stylesheet href="perl1.css" type="text/css"?>

     $1                                                      $2
     |--------------------------------------------|          |
    <link href="//www.perl.org/css/perl1.css" rel="stylesheet">

     $1                                       $2
     |-----------------------------|          |
    <link href="/css/perl.css" rel="stylesheet">
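    If you'd rather have Perl show you the captures than take a diagram on trust, something like the following should do. It is a minimal sketch: @pairs is a name invented here, and the square brackets in the output are only there to make leading spaces visible.

```perl
use strict;
use warnings;

my $line_in = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

# Walk through the matches one at a time and record both capture groups.
my @pairs;
while ($line_in =~ m{<(.*?)stylesheet(.*?)>}gis) {
    push @pairs, [$1, $2];
    printf "\$1=[%s] \$2=[%s]\n", $1, $2;
}
# $1=[?xml-] $2=[ href="perl1.css" type="text/css"?]
# $1=[link href="//www.perl.org/css/perl1.css" rel="] $2=["]
# $1=[link href="/css/perl.css" rel="] $2=["]
```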
    If you want to capture the hrefs, try matching them instead:
    my @ss = $line_in =~ m/(href="[^"]+")/gi;
    print "$_\n" for @ss;
    #
    # href="perl1.css"
    # href="//www.perl.org/css/perl1.css"
    # href="/css/perl.css"
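    If only the value is wanted, without the href="..." wrapper, moving the parentheses inside the quotes gets it in the same single step. A small sketch, with @urls as an illustrative name:

```perl
use strict;
use warnings;

my $line_in = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

# The parentheses wrap only the text between the quotes,
# so each captured item is the bare attribute value.
my @urls = $line_in =~ m/href="([^"]+)"/gi;
print "$_\n" for @urls;
# perl1.css
# //www.perl.org/css/perl1.css
# /css/perl.css
```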
    If you're looking for an href within a tag that contains the word stylesheet, where the word stylesheet may appear before or after the href...well, that's a little more complicated. Here it is, but you'll have to figure out how it works on your own.
    my @ss = $line_in =~ m/<(?=[^>]*stylesheet)[^>]*?(href="[^">]+")/gis;
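    For anyone who wants a head start, here is one way the lookahead idea can be spelled out with /x comments. Treat it as a sketch under a couple of assumptions: the <a> line is a hypothetical extra added to the sample data to show that tags without "stylesheet" are skipped, and the part between the < and the href uses [^>]*? rather than .* so that the match cannot run past the closing bracket of the tag it started in.

```perl
use strict;
use warnings;

# The <a> line is a hypothetical addition to the thread's sample data,
# included to show that tags without "stylesheet" are skipped.
my $line_in = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<a href="/index.html">home</a>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

my @ss = $line_in =~ m{
    <                        # opening angle bracket of a tag
    (?= [^>]* stylesheet )   # lookahead: the word occurs before this tag closes
    [^>]*?                   # skip ahead lazily, never past the closing bracket
    ( href=" [^">]+ " )      # capture the whole href attribute
}gix;

print "$_\n" for @ss;
# href="perl1.css"
# href="//www.perl.org/css/perl1.css"
# href="/css/perl.css"
```

    The lookahead checks the tag without consuming anything, which is why it does not matter whether "stylesheet" comes before or after the href.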
    --marmot
Re: Help with RegEx
by ikegami (Patriarch) on Jul 01, 2010 at 19:47 UTC
      I tried to use this, but I am trying to find a link based on an attribute, and it does not support that.
        The entire purpose of the module is to do exactly that, so saying it's not supported makes absolutely no sense.
