PerlMonks  

Help with RegEx

by mr_p (Scribe)
on Jul 01, 2010 at 15:40 UTC (#847535=perlquestion)
mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I am trying to pull the href out of these lines, and I don't know why the regex is not returning the characters before and after stylesheet.

    #!/usr/bin/perl -w
    use strict;

    my @ss = getItemsFromFile();
    foreach my $s (@ss) {
        print "$s\n";
    }

    sub getItemsFromFile {
        local $/=undef;
        my ($line_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n \n<link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\">\n \n<link href=\"/css/perl.css\" rel=\"stylesheet\">\n";
        my @allItems=();
        while ($line_in =~ m{<(.*?)stylesheet(.*?)>}gis) {
            push (@allItems, $1);
        }
        return @allItems;
    }

Re: Help with RegEx
by kennethk (Monsignor) on Jul 01, 2010 at 15:59 UTC
    The regular expression .*? can be translated to English as "match 0 or more of any character in a non-greedy fashion". That last bit matters here: each .*? stops at the first point where the rest of the pattern can match, so the group following stylesheet ends at the very next ">". Note also that your loop only pushes $1, so whatever the second group captures is thrown away. You may mean to have something closer to

    while ($line_in =~ m{^<(.*?)stylesheet(.*?)>$}gim)

    which anchors your regular expression at the start and end of each line (the /m modifier makes ^ and $ match at line boundaries, and dropping /s keeps . from running past a newline). In this context, you probably don't want non-greedy matching, so you could also use

    while ($line_in =~ m{<(.*)stylesheet(.*)>}gi)

    See perlre and perlretut for more info.
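To make the greedy versus non-greedy distinction concrete, here is a minimal sketch. The one-line input string is a made-up example rather than the poster's data; it only serves to show how far each quantifier reaches when two tags share a line.

```perl
use strict;
use warnings;

# Hypothetical one-line input holding two tags (not from the original post).
my $text = '<link href="a.css" rel="stylesheet"><link href="b.css" rel="stylesheet">';

# Non-greedy: each .*? stops as early as the rest of the pattern allows,
# so the match stays inside the first tag.
my ($lazy_before, $lazy_after) = $text =~ m{<(.*?)stylesheet(.*?)>}is;

# Greedy: each .* grabs as much as it can and backs off only as far as
# needed, so the first group runs on to the last "stylesheet" in the string.
my ($greedy_before, $greedy_after) = $text =~ m{<(.*)stylesheet(.*)>}is;

print "non-greedy before: $lazy_before\n";
print "non-greedy after:  $lazy_after\n";
print "greedy before:     $greedy_before\n";
print "greedy after:      $greedy_after\n";
```

With greedy quantifiers the first group swallows the whole first tag and most of the second, which is why the choice between the two forms depends on whether each tag sits on its own line.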

      Thanks so much.

      Is there a way to get the href value in one step, or do I have to run a second regex for it?


      The results are weird: it strips off everything after stylesheet in this line: '<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>'

        Maybe you want to use a proper HTML parser, like HTML::Parser instead?
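For readers who want to see what the parser route might look like, here is a rough sketch using HTML::Parser from CPAN (it is not a core module, so it has to be installed). The sample markup is the data from the question; the variable names and the choice to filter on rel="stylesheet" are assumptions made for illustration, not something any poster in this thread wrote.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Sample markup taken from the question.
my $html = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

my @hrefs;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called once per start tag with the tag name and a hash of attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'link';
            return unless defined $attr->{rel}
                       && lc($attr->{rel}) eq 'stylesheet';
            push @hrefs, $attr->{href} if defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @hrefs;
```

This collects only the two link tags. The xml-stylesheet line is a processing instruction rather than a start tag, so the parser reports it through a different event and it would need its own handler (process_h) if it is wanted too.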

Re: Help with RegEx
by ikegami (Pope) on Jul 01, 2010 at 19:47 UTC
      I tried to use this, but I am trying to find a link based on an attribute, and it does not support that.
        The entire purpose of the module is to do exactly that, so saying it's not supported makes absolutely no sense.
Re: Help with RegEx
by furry_marmot (Pilgrim) on Jul 01, 2010 at 21:58 UTC

    No offense, but you don't even have the basics. Your dept won't worry about the speed of Perl so much as your proficiency with it. You might want to start with Learning Perl. Then you might want to take a look at Mastering Regular Expressions, though you could save a few bucks and start with perlrequick and perlretut first.

    To help you understand what's going on, let's reformat what you've got so we can see what we should search for:

        my $line_in = <<EOT;
        <?xml-stylesheet href="perl1.css" type="text/css"?>
        <link href="//www.perl.org/css/perl1.css" rel="stylesheet">
        <link href="/css/perl.css" rel="stylesheet">
        EOT

    You say you want to pull out the hrefs, but you use this pattern:
    my @ss = $line_in =~ m{<(.*?)stylesheet(.*?)>}gis;
    which says to find an angle bracket and save 0 or more chars as $1 (that's what the parens do) until you find "stylesheet". Skip "stylesheet" and then save 0 or more chars as $2 until you find a closing angle bracket. The /s modifier lets . match line breaks as well. Here is what you'll get:
          $1                            $2
         |---|          |--------------------------------|
        <?xml-stylesheet href="perl1.css" type="text/css"?>

                               $1                                $2
         |--------------------------------------------|          |
        <link href="//www.perl.org/css/perl1.css" rel="stylesheet">

                       $1                         $2
         |-----------------------------|          |
        <link href="/css/perl.css" rel="stylesheet">

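The captures described above can be checked directly. This is a small sketch that runs the pattern from the question over the reformatted data and keeps both groups for every match; the @pairs array and the bracketed print format are illustrative choices, not part of the original code.

```perl
use strict;
use warnings;

my $line_in = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

# Keep both capture groups for each match instead of only $1.
my @pairs;
while ($line_in =~ m{<(.*?)stylesheet(.*?)>}gis) {
    push @pairs, [$1, $2];
}

# Brackets make leading spaces and one-character captures visible.
for my $pair (@pairs) {
    my ($before, $after) = @$pair;
    print "before: [$before]\n";
    print "after:  [$after]\n";
}
```

For the two link tags the second group is just the closing quote, because "stylesheet" is the last thing inside the tag.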
    If you want to capture the hrefs, try matching them instead:
        my @ss = $line_in =~ m/(href="[^"]+")/gi;
        print "$_\n" for @ss;
        #
        # href="perl1.css"
        # href="//www.perl.org/css/perl1.css"
        # href="/css/perl.css"

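If only the value is wanted, without the href="..." wrapper, the parentheses can move inside the quotes. A minimal sketch of that variation, assuming every value is double-quoted as in the sample data:

```perl
use strict;
use warnings;

my $line_in = <<'EOT';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
EOT

# In list context, m//g returns the capture group's contents for every match,
# so one statement yields all the values.
my @hrefs = $line_in =~ m/href\s*=\s*"([^"]+)"/gi;

print "$_\n" for @hrefs;
```

This answers the "one step" question for simple input, though it matches any href in the text, not only those in stylesheet tags.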
    If you're looking for an href within a tag that contains the word stylesheet, where the word stylesheet may appear before or after the href...well, that's a little more complicated. Here it is, but you'll have to figure out how it works on your own.
    my @ss = $line_in =~ m/<(?=[^>]*stylesheet).*(href="[^">]+")/gis;
    --marmot
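For anyone working through the lookahead idea, here is an annotated sketch of a closely related pattern. It is a variation, not the pattern posted above: the greedy .* is swapped for [^>]*? so that the skip stays inside one tag, and the capture holds only the value. The anchor tag in the sample data is a hypothetical addition to show a tag being rejected.

```perl
use strict;
use warnings;

# The thread's three tags plus one hypothetical tag without "stylesheet",
# all on a single line with no line breaks.
my $line_in = '<?xml-stylesheet href="perl1.css" type="text/css"?>'
            . '<a href="index.html">home</a>'
            . '<link href="//www.perl.org/css/perl1.css" rel="stylesheet">'
            . '<link href="/css/perl.css" rel="stylesheet">';

my $stylesheet_href = qr{
    <                        # opening angle bracket of a tag
    (?= [^>]* stylesheet )   # lookahead: the word occurs before the tag closes
    [^>]*?                   # skip ahead lazily without leaving the current tag
    href \s* = \s*           # attribute name, optional spaces around the equals sign
    " ( [^">]+ ) "           # capture the double-quoted value
}xi;

my @hrefs = $line_in =~ m/$stylesheet_href/g;
print "$_\n" for @hrefs;
```

The lookahead checks the tag without consuming anything, so the href can sit before or after the word stylesheet. Because nothing in the pattern can cross a ">", it behaves the same whether or not the input has line breaks.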
      This is my code. I loved your one-liner, but I can't use it because it doesn't work if there are no line breaks.
          #!/usr/bin/perl -w
          use strict;

          my @ss = getItemsFromFile();
          foreach my $s (@ss) {
              print "$s\n";
          }

          sub getItemsFromFile {
              local $/=undef;
              my ($file_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?><link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\"><link href=\"/css/perl.css\" rel=\"stylesheet\"> <?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n\n";
              my @allItems=();
              while ( $file_in =~ m{(.*stylesheet.*)}igs ) {
                  my $line = $1;
                  $line =~ s/>\s+\</></igs;
                  $line =~ s/></>\n</igs;
                  while ( $line =~ m/.*[<] *(.*stylesheet.*) *[>]/ig ) {
                      my $line1 = $1;
                      if ( $line1 =~ m/(href *= *['"])([^'"]+)['"]/ ) {
                          push(@allItems, $2);
                      }
                  }
              }
              return @allItems;
          }

        Maybe it is now time for you to look at one of the proven and tested methods of using an HTML parser, like HTML::HeadParser or HTML::Parser? They all know about how to parse tags.

            my ($line_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?><link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\"><link href=\"/css/perl.css\" rel=\"stylesheet\"> <?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n\n";
            my @ss = $line_in =~ /<(?=[^>]*stylesheet).*?href *= *"*([^">]+)"/gis;
            print "$_\n" for @ss;

        Prints:

            perl1.css
            //www.perl.org/css/perl1.css
            /css/perl.css
            perl1.css

        I was tempted not to post that, since all you'd do is copy it into your own code. But maybe you can yet learn.

        But you seriously need to read a book.
            # I added print lines so you can see your own handiwork. And
            # comments you can learn from. Nothing else is changed.

            #!/usr/bin/perl -w
            use strict;

            my @ss = getItemsFromFile();
            print "-----\n";
            foreach my $s (@ss) {
                print "$s\n";
            }

            sub getItemsFromFile {
                local $/=undef;
                my ($file_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?><link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\"><link href=\"/css/perl.css\" rel=\"stylesheet\"> <?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n\n";

                # Print the data so we can compare as we go along.
                print "0: $file_in";

                my @allItems=();

                # WHILE you are able to match everything before and after "stylesheet"...
                # which will match the entire file...
                # which means this will only work once...
                # which means you don't need a loop here.
                while ( $file_in =~ m{(.*stylesheet.*)}igs ) {

                    # Copy the entire file to $line.
                    my $line = $1;
                    print "1: $line";

                    # Remove whitespace between tags, including newlines.
                    $line =~ s/>\s+\</></igs;

                    # NOW add newlines instead of doing so on the previous line.
                    $line =~ s/></>\n</igs;
                    # Why not $line =~ s/>\s+\</>\n</igs ?

                    # .* is greedy. That means it matches as much as it can and starts
                    # working backward. If you don't include the /s modifier, it will stop
                    # at the newline. So all you're doing here is putting each tag on its
                    # own line so you can look for "stylesheet" one line (tag) at a time,
                    # instead of searching on the whole string. There are much less
                    # convoluted ways to do this.
                    print "2: $line";

                    while ( $line =~ m/.*[<] *(.*stylesheet.*) *[>]/ig )
                    # By the way, you don't need to set up a character class for one
                    # character. And the way you've set it up
                    #
                    #     while ( $line =~ /stylesheet/ig ) { }
                    #
                    # would work just fine.
                    {
                        # Carve off one tag...
                        my $line1 = $1;
                        print "3: $line1\n";

                        # The code below, or something close to it, could match on the
                        # whole file, saving you the steps you've done so far. It works
                        # here because there can only be one tag in each instance of $line1.
                        if ( $line1 =~ m/(href *= *['"])([^'"]+)['"]/ ) {
                            push(@allItems, $2);
                        }
                    }
                }
                return @allItems;
            }

        Prints:

            0:
            <?xml-stylesheet href="perl1.css" type="text/css"?><link href="//www.perl.org/css/perl1.css" rel="stylesheet"><link href="/css/perl.css" rel="stylesheet"> <?xml-stylesheet href="perl1.css" type="text/css"?>

            1:
            <?xml-stylesheet href="perl1.css" type="text/css"?><link href="//www.perl.org/css/perl1.css" rel="stylesheet"><link href="/css/perl.css" rel="stylesheet"> <?xml-stylesheet href="perl1.css" type="text/css"?>

            2:
            <?xml-stylesheet href="perl1.css" type="text/css"?>
            <link href="//www.perl.org/css/perl1.css" rel="stylesheet">
            <link href="/css/perl.css" rel="stylesheet">
            <?xml-stylesheet href="perl1.css" type="text/css"?>

            3: ?xml-stylesheet href="perl1.css" type="text/css"?
            3: link href="//www.perl.org/css/perl1.css" rel="stylesheet"
            3: link href="/css/perl.css" rel="stylesheet"
            3: ?xml-stylesheet href="perl1.css" type="text/css"?
            -----
            perl1.css
            //www.perl.org/css/perl1.css
            /css/perl.css
            perl1.css

        You made the comment that my original code doesn't work if there are no line breaks. But your approach is to start with a string filled with line breaks, remove them, and then add them back in again, leaving your lines looking identical to the way I reformatted your data (see 2: above).
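The point about line breaks can be checked with a short sketch that runs the single-statement pattern from this reply over the same tags, once joined with newlines and once with nothing between them. The @tags array and the variable names are assumptions made for the comparison.

```perl
use strict;
use warnings;

my @tags = (
    '<?xml-stylesheet href="perl1.css" type="text/css"?>',
    '<link href="//www.perl.org/css/perl1.css" rel="stylesheet">',
    '<link href="/css/perl.css" rel="stylesheet">',
);

# Same content, with and without line breaks between the tags.
my $with_breaks    = join("\n", @tags) . "\n";
my $without_breaks = join('', @tags);

# Pattern copied from the single-statement example earlier in this reply.
my $re = qr/<(?=[^>]*stylesheet).*?href *= *"*([^">]+)"/is;

my @from_lines  = $with_breaks    =~ /$re/g;
my @from_single = $without_breaks =~ /$re/g;

print "with line breaks:    @from_lines\n";
print "without line breaks: @from_single\n";
print "same result\n" if "@from_lines" eq "@from_single";
```

Both inputs give the same three values, so no step that removes and re-adds newlines is needed. One caveat worth keeping in mind: the .*? in this pattern can run past a ">", so on input where a stylesheet tag has no href of its own it could pick up the href of a later tag.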

        Your code works, but only by accident. It's not a generalizable solution. The advantage of using pre-written modules is that they've been tested, they work, they'll save you time, and you can always crack open their code and see how they work. The fact that they either take time to load or run more slowly than some custom code you wrote has to be weighed against the time you take trying to work out one-off solutions.

        --marmot

Node Type: perlquestion [id://847535]
Approved by kennethk