PerlMonks  

Grabbing part of an HTML page

by kingdean (Novice)
on Mar 28, 2004 at 21:58 UTC ( #340439=perlquestion )

kingdean has asked for the wisdom of the Perl Monks concerning the following question:

I would love your advice on this problem. I have a small script that isn't working. What I want it to do is to pull data from .html pages on my website. I want to put a start tag and end tag in the other .html files so I know what data I want copied to my new .html file. Here is the code.

#!/usr/bin/perl
$start_pattern = '<!-- start section-->';
$end_pattern = '<!-- end section -->';
@files_to_look_in = ("/path/to/files1.html", "/path/to/files2.html");
$write_line = 0;
while(@files_to_look_in) {
    open(HTM_FILE, <$_);
    while(<HTM_FILE>) {
        if($_=~/$start_pattern/i) { $write_line = 1; }
        if($_=~/$end_pattern/i) { $write_line = 0; }
        if($write_line =~ '1') { print "$_\n"; }
    }
}

Any help would be greatly appreciated. I have tested this, and it doesn't work. I do not get any errors; nothing shows up.

Thanks,
Dean

Update: Thank you. I got it working. The line of code that prints the HTML content-type header was missing: print "Content-type: text/html\n\n";

20040329 Edit by castaway: Changed title from 'Perl Monks - what am I doing wrong?'

Replies are listed 'Best First'.
Re: Grabbing part of an HTML page
by Happy-the-monk (Canon) on Mar 28, 2004 at 22:08 UTC

    To fix the problem change this line:
    -    while(@files_to_look_in)
    +    foreach ( @files_to_look_in )

    Edit:
    Instead of writing   if($_=~/$start_pattern/i)   you can simply do   if ( m/$start_pattern/i )   omitting   $_.

    Edit 2:
    I think you also need to change this:
    -     open(HTM_FILE, <$_);
    +     open( HTM_FILE, "< $_" );
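The two changes above can be put together as in the following sketch. It is not the code anyone in the thread posted: the helper name `marked_lines`, the lexical filehandle, the three-argument `open`, and the `\Q...\E` quoting are assumptions added for illustration. The flag logic is the same as in the original script, so the start-marker line is kept and the end-marker line is dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: collects the lines between the
# start and end markers, using the same flag logic as the original script.
sub marked_lines {
    my ($start, $end, @files) = @_;
    my @kept;
    foreach my $file (@files) {            # visits each file exactly once
        my $fh;
        unless (open($fh, '<', $file)) {   # three-argument open, checked
            warn "Can't open $file: $!\n";
            next;
        }
        my $write_line = 0;
        while (my $line = <$fh>) {
            # \Q...\E treats the markers as literal text, not as regexes
            $write_line = 1 if $line =~ /\Q$start\E/i;
            $write_line = 0 if $line =~ /\Q$end\E/i;
            push @kept, $line if $write_line;
        }
        close $fh;
    }
    return @kept;
}

# Demo on an in-memory "file" (a reference to a string) instead of a real path.
my $sample = "before\n<!-- start section-->\nblah\n<!-- end section -->\nafter\n";
my @lines  = marked_lines('<!-- start section-->', '<!-- end section -->', \$sample);
print @lines;
```

With real files you would pass the paths from @files_to_look_in instead of the string reference. Note that the lines are printed without an extra "\n", because they still carry their own newline.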

      Tried that. It didn't work. :( Thanks Dean

        Tried that. It didn't work.

        I tried that too, before posting here. And it worked fine.
        But I also made sure the strings you were looking for exist in the test data.

        my test data:

        <!-- start section--> blah <!-- end section -->

        the test script in readmore:

Re: Grabbing part of an HTML page
by pbeckingham (Parson) on Mar 28, 2004 at 22:24 UTC

    Try this, but beware of:

    • lines that contain both the start pattern and the end pattern
    • lines that contain characters before the start pattern or after the end pattern, neither of which is handled here
    • start and end patterns that may span more than one line
    • errors opening files - you might want to check for them
    • the \n in the print command is redundant, as the original \n was not chomped off

    #! /usr/bin/perl -w
    use strict;

    my $start_pattern = '<!-- start section-->';
    my $end_pattern = '<!-- end section -->';
    my @files_to_look_in = ('/path/to/files1.html', '/path/to/files2.html');
    my $write_line = 0;

    foreach (@files_to_look_in) {
        open HTM_FILE, "<$_";
        while (<HTM_FILE>) {
            $write_line = 1 if /$start_pattern/i;
            $write_line = 0 if /$end_pattern/i;
            print $_, "\n" if $write_line;
        }
        close HTM_FILE;
    }
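Some of the caveats listed above can be handled while still reading line by line. The sketch below is one possible approach, not pbeckingham's code: the helper name `extract_between` is hypothetical, and it assumes each marker sits entirely on one line (markers that span lines are still not handled). It drops text before the start marker and after the end marker, and copes with both markers on the same line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the markers, read from an
# already opened filehandle. Markers split across lines are NOT handled.
sub extract_between {
    my ($start, $end, $fh) = @_;
    my $inside = 0;
    my $out    = '';
    while (defined(my $line = <$fh>)) {
        while (length $line) {
            if (!$inside) {
                last unless $line =~ /\Q$start\E/i;   # no start marker: skip the rest
                $line   = substr($line, $+[0]);       # keep what follows the marker
                $inside = 1;
            }
            elsif ($line =~ /\Q$end\E/i) {
                $out   .= substr($line, 0, $-[0]);    # text before the end marker
                $line   = substr($line, $+[0]);       # continue after the end marker
                $inside = 0;
            }
            else {
                $out .= $line;                        # whole remainder is inside
                last;
            }
        }
    }
    return $out;
}

# Demo on an in-memory filehandle; both markers share one line here.
my $sample = "junk <!-- start section--> blah <!-- end section --> more junk\n";
open(my $fh, '<', \$sample) or die "Can't open in-memory file: $!";
print extract_between('<!-- start section-->', '<!-- end section -->', $fh), "\n";
close $fh;
```

The arrays @- and @+ hold the start and end offsets of the last successful match, which is what lets the code cut the line around each marker.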

      pbeckingham, Thanks for your reply. I tried this and it still doesn't work. I do not get an error, but just nothing gets printed. One question, if I wanted to use a URL path such as "http://www.mysite.com/" instead of the local path /usr/etc could I do this? Thanks Dean

        kingdean, I updated my code to include my @files..., and it works on my test files. That's what I get for adding in use strict at the last moment, and not using it from the start, as we all should!

        Perhaps you could show us your test files?

        If you use a URL instead of a local path, it will not work - the open function does not get web pages. You would use LWP::Simple or similar to read the page for you, but that is something you can easily find elsewhere on this site.
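As a rough sketch of the LWP::Simple route mentioned above: the `get` function from LWP::Simple returns the page content, or undef on failure. The helper names `fetch_page` and `extract_section` are hypothetical, LWP::Simple has to be installed from CPAN, and the URL in the comment is only the placeholder from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: fetches a page over HTTP with LWP::Simple.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;                  # loaded only when a fetch is requested
    my $html = LWP::Simple::get($url);    # undef if the request failed
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Hypothetical helper: returns the text between the markers, or undef.
sub extract_section {
    my ($html, $start, $end) = @_;
    return $html =~ /\Q$start\E(.*?)\Q$end\E/is ? $1 : undef;
}

# Usage (needs network access, so it is left commented out):
# my $html = fetch_page('http://www.mysite.com/');
# my $part = extract_section($html, '<!-- start section-->', '<!-- end section -->');
# print $part if defined $part;
```

Keeping the fetch and the extraction in separate subs means the same extraction code works on a local file that has been read into a string.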

Re: Grabbing part of an HTML page
by cLive ;-) (Prior) on Mar 28, 2004 at 22:27 UTC
    Why not just use one regular expression?

    #!/usr/bin/perl
    use strict;

    my $start_pattern = '<!-- start section-->';
    my $end_pattern = '<!-- end section -->';
    my @files_to_look_in = ("/path/to/files1.html", "/path/to/files2.html");

    for(@files_to_look_in) {
        local $/;
        open(HTM_FILE, <$_) || die "Can't open file: $!";
        my $file = <HTM_FILE>;
        if ($file =~ /$start_pattern(.*)$end_pattern/s) {
            print "$1";
        }
    }

    Note: If I hadn't localized $/ (the input record separator), I would have needed to add a /s modifier to the regular expression to match on the whole file.

    cLive ;-)

    update: oops, not paying attention "while" is now "for". 2) made a boo-boo - see below...

      For the single regex to work, you would need to add the "s" modifier at the end, so that your ".*" doesn't stop matching at the first line-break character. Also, depending on the nature of the OP's data, it may need to be a non-greedy match:

      # ...
      if ( $file =~ /$start_pattern(.*?)$end_pattern/s ) {
          print $1;
      }
      # ...

      update: Forgot to mention: if the OP's data happens to contain more than one "start ... end" sequence within the same file, this would have to be structured as a loop -- something like:

      #
      while ( $file =~ /$start_pattern(.*?)$end_pattern/gs ) {
          print "$1\n";
      }
      #
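To make the effect of these modifiers concrete, here is a small self-contained sketch. The helper name `sections` and the sample text are made up for illustration. It shows three things: the /g loop finding every section, the match failing without /s because "." will not cross a newline, and a greedy .* swallowing everything from the first start marker to the last end marker. Undefining $/ only changes how much a single read returns; it has no effect on what "." matches.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every section found between the two markers.
sub sections {
    my ($text, $start, $end) = @_;
    my @found;
    # /g finds every match, /s lets "." match newlines, and .*? stops at
    # the nearest end marker instead of the last one.
    while ($text =~ /\Q$start\E(.*?)\Q$end\E/gis) {
        push @found, $1;
    }
    return @found;
}

my $start = '<!-- start section-->';
my $end   = '<!-- end section -->';
my $page  = "intro\n$start\nfirst\n$end\nmiddle\n$start\nsecond\n$end\noutro\n";

my @found = sections($page, $start, $end);
print scalar(@found), " sections found\n";

# Without /s the dot cannot cross the newline after the start marker.
print "no match without /s\n" unless $page =~ /\Q$start\E(.*?)\Q$end\E/i;

# A greedy .* runs from the first start marker to the last end marker.
my ($greedy) = $page =~ /\Q$start\E(.*)\Q$end\E/is;
print "greedy capture spans both sections\n" if $greedy =~ /middle/;
```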

        What does the /s modifier do if $/ is undefined? :)

        cLive ;-)

      You have written another infinite loop - the while (@files_to_look_in) is always true, and no part of the code shifts or pops values from the array.

Re: Grabbing part of an HTML page
by TomDLux (Vicar) on Mar 29, 2004 at 02:09 UTC

    You are re-inventing the range operator. Instead of

    if($_=~/$start_pattern/i) { $write_line = 1; }
    if($_=~/$end_pattern/i) { $write_line = 0; }
    if($write_line =~ '1') { print "$_\n"; }

    You can simply write:

    if(/$start_pattern/i .. /$end_pattern/i) { print "$_\n"; }

    The if condition is false until the first pattern matches, and remains true until the second pattern matches. Funny things happen if both are true on the same line.
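The "funny things" can be seen in a short sketch. The sub names `two_dot` and `three_dot` and the sample lines are invented for illustration. With the two-dot form, Perl tests the end pattern on the same line that matched the start pattern, so a line containing both markers opens and closes the range at once. With the three-dot form, the end pattern is first tested on the following line. In both forms the marker lines themselves are part of the range, unlike the flag version in the question, which skips the end-marker line.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the lines selected by the two-dot range operator.
sub two_dot {
    my ($start, $end, @lines) = @_;
    my @kept;
    for (@lines) {
        push @kept, $_ if /\Q$start\E/i .. /\Q$end\E/i;
    }
    return @kept;
}

# Same, but with the three-dot form, which does not test the end pattern
# on the line where the range was opened.
sub three_dot {
    my ($start, $end, @lines) = @_;
    my @kept;
    for (@lines) {
        push @kept, $_ if /\Q$start\E/i ... /\Q$end\E/i;
    }
    return @kept;
}

my ($s, $e) = ('<!-- start section-->', '<!-- end section -->');
my @lines = ("a\n", "$s x $e\n", "b\n", "$e\n", "c\n");

print "two-dot keeps:   ", scalar(two_dot($s, $e, @lines)), " line(s)\n";
print "three-dot keeps: ", scalar(three_dot($s, $e, @lines)), " line(s)\n";
```

One thing to keep in mind with this sketch: each range operator remembers its state between calls, so input that opens a range without closing it would affect the next call to the same sub.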

    P.S. In your version, you should use numerical comparison, '==', rather than regular expression, '=~'. Regex testing is powerful, but expensive ( slow ), while simple comparison is very fast, but only capable of testing equality.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Grabbing part of an HTML page
by jZed (Prior) on Mar 28, 2004 at 23:33 UTC
    open(HTM_FILE, <$_);

    Maybe $! would tell you something.

Re: Grabbing part of an HTML page
by pbeckingham (Parson) on Mar 29, 2004 at 01:33 UTC

    Now I think about it, it appears that perhaps your starting and ending patterns may be at fault - they are inconsistently formed. Note the space between the word 'section' and the '-->'.

    <!-- start section--> <!-- end section -->

      I even tried this with no <-- or -->. I tested it with an html file that just has start section blah end section And it still didn't work. :(
        You have answered back to a number of replies now, saying you've tried their ideas and it didn't work. Maybe it would be worthwhile replying to your own root node with the code as it stands now, so folks can see whether you've followed their ideas as intended.

        Have you changed your outer loop and open statement yet? Let me suggest that it should now look like this:

        for my $file ( @files_to_look_in ) {
            open( HTML_FILE, "<$file" ) or die "Can't open $file for input: $!";
            ...

        If you don't have the "or die ...", then the failure might be a matter of not opening the file -- Perl won't generate an error on an open statement that fails, unless you explicitly test to see if it succeeds, and tell it what to say and do when it fails.
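A small sketch of that point: `open` returns false on failure and puts the reason in $!, but nothing is reported unless the code checks. The helper name `try_open` and the path are hypothetical. The helper returns the error text instead of dying, so the caller can decide what to do.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (filehandle, undef) on success or
# (undef, error message) on failure, instead of failing silently.
sub try_open {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or return (undef, "Can't open $path for input: $!");
    return ($fh, undef);
}

# The path below is assumed not to exist; it is only an illustration.
my ($fh, $err) = try_open('/no/such/dir/files1.html');
if (defined $fh) {
    print "opened\n";
    close $fh;
}
else {
    print "$err\n";
}
```

In a short script, the plain `open(...) or die "...: $!"` form shown above does the same job in one line.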
