http://www.perlmonks.org?node_id=847730



Re^3: Help with RegEx
by Corion (Patriarch) on Jul 02, 2010 at 14:55 UTC

    Maybe it is now time for you to look at one of the proven and tested approaches: an HTML parser such as HTML::HeadParser or HTML::Parser? They all know how to parse tags.
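    As a rough illustration of the parser route, here is a minimal sketch using HTML::Parser's version 3 handler interface. The sample markup and the @hrefs array are assumptions made up for this example, and the small regex inside the processing-instruction handler is only one possible way to pull out the href. It should collect the two stylesheet URLs and skip the icon link.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Sample markup, assumed purely for illustration.
my $html = <<'HTML';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="/css/perl.css" rel="stylesheet"><link href="/favicon.ico" rel="icon">
HTML

my @hrefs;    # hypothetical collector for the stylesheet URLs

my $parser = HTML::Parser->new(
    api_version => 3,

    # Called for every start tag with the tag name and a hash of attributes.
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'link';
            return unless lc($attr->{rel} // '') eq 'stylesheet';
            push @hrefs, $attr->{href} if defined $attr->{href};
        },
        'tagname, attr',
    ],

    # Called for processing instructions such as <?xml-stylesheet ...?>.
    process_h => [
        sub {
            my ($pi) = @_;
            push @hrefs, $1
                if $pi =~ /xml-stylesheet\b.*?\bhref\s*=\s*["']([^"']+)["']/s;
        },
        'token0',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @hrefs;
```

    The parser decides where each tag starts and ends, so the code only has to say which tags and attributes it cares about.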

      My code works. I was just wondering if it can be combined into one line. Actually, I am thinking of trying what you're suggesting and timing them both.
Re^3: Help with RegEx
by furry_marmot (Pilgrim) on Jul 03, 2010 at 04:18 UTC
        my ($line_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?><link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\"><link href=\"/css/perl.css\" rel=\"stylesheet\"> <?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n\n";

        my @ss = $line_in =~ /<(?=[^>]*stylesheet).*?href *= *"*([^">]+)"/gis;
        print "$_\n" for @ss;

    Prints:

        perl1.css
        //www.perl.org/css/perl1.css
        /css/perl.css
        perl1.css

    I was tempted not to post that, since all you'd do is copy it into your own code. But maybe you can yet learn.
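    For anyone who would rather learn from the one-liner than copy it, here is a sketch of the same pattern written with the /x modifier so each piece can carry a comment. The shortened input string and the $re variable are assumptions for this example; the pattern itself is unchanged apart from writing the literal spaces as [ ], which /x requires.

```perl
use strict;
use warnings;

# Shortened sample input, assumed for illustration.
my $line_in = q{<?xml-stylesheet href="perl1.css" type="text/css"?>}
            . q{<link href="/css/perl.css" rel="stylesheet">};

my $re = qr{
    <                        # opening angle bracket of a tag
    (?= [^>]* stylesheet )   # lookahead: "stylesheet" appears before the tag closes
    .*?                      # skip as little as possible ...
    href [ ]* = [ ]*         # ... up to href, with optional spaces around the equals sign
    "*                       # optional opening double quote
    ( [^">]+ )               # capture up to the next quote or end of tag
    "                        # closing double quote
}xis;

# In list context with /g, the match returns every captured URL.
my @ss = $line_in =~ /$re/g;
print "$_\n" for @ss;
```

    The lookahead is what lets a single pass work: it checks that the tag mentions "stylesheet" without consuming anything, so the rest of the pattern can still find href whether it comes before or after that word.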

    But you seriously need to read a book.
        # I added print lines so you can see your own handiwork. And
        # comments you can learn from. Nothing else is changed.

        #!/usr/bin/perl -w
        use strict;

        my @ss = getItemsFromFile();
        print "-----\n";
        foreach my $s (@ss) {
            print "$s\n";
        }

        sub getItemsFromFile {
            local $/=undef;
            my ($file_in) = "\n<?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?><link href=\"//www.perl.org/css/perl1.css\" rel=\"stylesheet\"><link href=\"/css/perl.css\" rel=\"stylesheet\"> <?xml-stylesheet href=\"perl1.css\" type=\"text/css\"?>\n\n";

            # Print the data so we can compare as we go along.
            print "0: $file_in";

            my @allItems=();

            # WHILE you are able to match everything before and after "stylesheet"...
            #     which will match the entire file...
            #     which means this will only work once...
            #     which means you don't need a loop here.
            while ( $file_in =~ m{(.*stylesheet.*)}igs ) {
                # Copy the entire file to $line.
                my $line = $1;
                print "1: $line";

                # Remove whitespace between tags, including newlines.
                $line =~ s/>\s+\</></igs;

                # NOW add newlines instead of doing so on the previous line.
                $line =~ s/></>\n</igs;
                # Why not $line =~ s/>\s*\</>\n</igs ?
                # (Note \s*, not \s+, so tags with nothing between them are split too.)

                # .* is greedy. That means it matches as much as it can and starts
                # working backward. If you don't include the /s modifier, it will stop
                # at the newline. So all you're doing here is putting each tag on its
                # own line so you can look for "stylesheet" one line (tag) at a time,
                # instead of searching on the whole string. There are much less
                # convoluted ways to do this.
                print "2: $line";

                while ( $line =~ m/.*[<] *(.*stylesheet.*) *[>]/ig )
                # By the way, you don't need to set up a character class for one
                # character. And the way you've set it up
                #
                #     while ( $line =~ /stylesheet/ig ) { }
                #
                # would work just fine.
                {
                    # Carve off one tag...
                    my $line1 = $1;
                    print "3: $line1\n";

                    # The code below, or something close to it, could match on the
                    # whole file, saving you the steps you've done so far. It works
                    # here because there can only be one tag in each instance of $line1.
                    if ( $line1 =~ m/(href *= *['"])([^'"]+)['"]/ ) {
                        push(@allItems, $2);
                    }
                }
            }
            return @allItems;
        }

    Prints:

        0:
        <?xml-stylesheet href="perl1.css" type="text/css"?><link href="//www.perl.org/css/perl1.css" rel="stylesheet"><link href="/css/perl.css" rel="stylesheet"> <?xml-stylesheet href="perl1.css" type="text/css"?>

        1:
        <?xml-stylesheet href="perl1.css" type="text/css"?><link href="//www.perl.org/css/perl1.css" rel="stylesheet"><link href="/css/perl.css" rel="stylesheet"> <?xml-stylesheet href="perl1.css" type="text/css"?>

        2:
        <?xml-stylesheet href="perl1.css" type="text/css"?>
        <link href="//www.perl.org/css/perl1.css" rel="stylesheet">
        <link href="/css/perl.css" rel="stylesheet">
        <?xml-stylesheet href="perl1.css" type="text/css"?>

        3: ?xml-stylesheet href="perl1.css" type="text/css"?
        3: link href="//www.perl.org/css/perl1.css" rel="stylesheet"
        3: link href="/css/perl.css" rel="stylesheet"
        3: ?xml-stylesheet href="perl1.css" type="text/css"?
        -----
        perl1.css
        //www.perl.org/css/perl1.css
        /css/perl.css
        perl1.css

    You made the comment that my original code doesn't work if there are no line breaks. But your approach is to start with a string filled with line breaks, remove them, and then add them back in again, leaving your lines looking identical to the way I reformatted your data (see 2: above).
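    To make the point about removing and re-adding line breaks concrete, here is a small sketch (the sample string and variable names are made up for illustration). It compares the two-pass approach with a single substitution. The single pass needs \s* rather than \s+, because some tags have no whitespace between them at all.

```perl
use strict;
use warnings;

# Sample data, assumed for illustration: some tags touch, some are
# separated by spaces and newlines.
my $data = qq{\n<link href="a.css" rel="stylesheet"><link href="b.css" rel="stylesheet">  \n <link href="c.css" rel="stylesheet">\n\n};

# Two passes: squeeze out whitespace between tags, then put newlines back.
(my $two_step = $data) =~ s/>\s+</></g;
$two_step =~ s/></>\n</g;

# One pass: \s* also matches tags with nothing between them.
(my $one_step = $data) =~ s/>\s*</>\n</g;

print $one_step eq $two_step ? "same result\n" : "different\n";
```

    Either way, the result is one tag per line, which is the same shape the data has at step 2: in the output above.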

    Your code works, but only by accident. It's not a generalizable solution. The advantage of using pre-written modules is that they've been tested, they work, they'll save you time, and you can always crack open their code and see how they work. The fact that they either take time to load or run more slowly than some custom code you wrote has to be weighed against the time you take trying to work out one-off solutions.

    --marmot