Re: Grabbing part of an HTML page
by Happy-the-monk (Canon) on Mar 28, 2004 at 22:08 UTC
|
To fix the problem change this line:
- while(@files_to_look_in)
+ foreach ( @files_to_look_in )
Edit:
Instead of writing if($_=~/$start_pattern/i) you can simply write if ( m/$start_pattern/i ), omitting $_, because the match operator tests $_ by default.
Edit 2:
I think you also need to change this:
- open(HTM_FILE, <$_);
+ open( HTM_FILE, "< $_" );
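Why the first change matters, as a minimal sketch (the file names below are made up for illustration): while (@files_to_look_in) only tests whether the array is non-empty. It never sets $_ and never removes an element, so the body would run forever with $_ unset. foreach sets $_ to each element in turn.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file names, for illustration only.
my @files_to_look_in = ('one.html', 'two.html');

# A while loop over an array only terminates if the body consumes the array.
my @queue = @files_to_look_in;
my @seen_by_while;
while (@queue) {
    my $file = shift @queue;    # remove one element per pass
    push @seen_by_while, $file;
}

# foreach aliases $_ to each element and stops at the end of the list.
my @seen_by_foreach;
foreach (@files_to_look_in) {
    push @seen_by_foreach, $_;
}

print "while + shift: @seen_by_while\n";
print "foreach:       @seen_by_foreach\n";
```

Both loops visit the same two names; the foreach form just needs no bookkeeping.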
| [reply] |
|
Tried that. It didn't work. :(
Thanks
Dean
| [reply] |
|
Tried that. It didn't work.
I tried that too, before posting here. And it worked fine.
But I also made sure the strings you were looking for exist in the test data.
my test data:
<!-- start section-->
blah
<!-- end section -->
the test script in readmore:
| [reply] [d/l] [select] |
Re: Grabbing part of an HTML page
by pbeckingham (Parson) on Mar 28, 2004 at 22:24 UTC
|
Try this, but beware of:
- lines that contain both the start pattern and the end pattern
- lines that contain characters before the start pattern or after the end pattern; neither case is handled here
- start and end patterns that may span more than one line
- errors opening files - you might want to check for them
- the \n in the print command is redundant, as the original \n was not chomped off
#! /usr/bin/perl -w
use strict;
my $start_pattern = '<!-- start section-->';
my $end_pattern = '<!-- end section -->';
my @files_to_look_in = ('/path/to/files1.html', '/path/to/files2.html');
my $write_line = 0;
foreach (@files_to_look_in)
{
    open HTM_FILE, "<$_";
    while (<HTM_FILE>)
    {
        $write_line = 1 if /$start_pattern/i;
        $write_line = 0 if /$end_pattern/i;
        print $_, "\n" if $write_line;
    }
    close HTM_FILE;
}
| [reply] [d/l] [select] |
|
pbeckingham,
Thanks for your reply. I tried this and it still doesn't work. I do not get an error, but just nothing gets printed.
One question, if I wanted to use a URL path such as "http://www.mysite.com/" instead of the local path /usr/etc could I do this?
Thanks
Dean
| [reply] |
|
kingdean, I updated my code to include my @files..., and it works on my test files. That's what I get for adding in use strict at the last moment, and not using it from the start, as we all should!
Perhaps you could show us your test files?
If you use a URL instead of a local path, it will not work - the open function does not get web pages. You would use LWP::Simple or similar to read the page for you, but that is something you can easily find elsewhere on this site.
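A rough sketch of the LWP::Simple route, under some assumptions: extract_section is a hypothetical helper name, the URL is taken from the command line, and LWP::Simple (a CPAN module, not part of core Perl) is installed. Its get function returns the page content, or undef on failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $start_pattern = '<!-- start section-->';
my $end_pattern   = '<!-- end section -->';

# Hypothetical helper: returns the text between the two markers,
# or undef if the markers are not both present.
sub extract_section {
    my ($html) = @_;
    return $html =~ /\Q$start_pattern\E(.*?)\Q$end_pattern\E/is ? $1 : undef;
}

# Usage: perl script.pl http://www.mysite.com/
if ( my $url = shift @ARGV ) {
    require LWP::Simple;                  # loaded only when a URL is given
    my $html = LWP::Simple::get($url);    # undef if the fetch fails
    die "Couldn't fetch $url\n" unless defined $html;

    my $section = extract_section($html);
    print defined $section ? $section : "No section found\n";
}
```

The \Q...\E quoting keeps any regex metacharacters in the markers from being interpreted as part of the pattern.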
| [reply] [d/l] [select] |
Re: Grabbing part of an HTML page
by cLive ;-) (Prior) on Mar 28, 2004 at 22:27 UTC
|
Why not just use one regular expression?
#!/usr/bin/perl
use strict;
my $start_pattern = '<!-- start section-->';
my $end_pattern = '<!-- end section -->';
my @files_to_look_in = ("/path/to/files1.html", "/path/to/files2.html");
for (@files_to_look_in) {
    local $/;    # no record separator, so <HTM_FILE> reads the whole file
    open(HTM_FILE, "<$_") || die "Can't open file: $!";
    my $file = <HTM_FILE>;
    close HTM_FILE;
    if ($file =~ /$start_pattern(.*)$end_pattern/s) {
        print "$1";
    }
}
Note: If I hadn't localized $/ (the input record separator), I would have needed to add a /s modifier to the regular expression to match on the whole file.
cLive ;-)
update: 1) oops, wasn't paying attention: "while" is now "for". 2) made a boo-boo - see below...
| [reply] [d/l] |
|
For the single regex to work, you would need to add the "s" modifier at the end, so that your ".*" doesn't stop matching at the first line-break character. Also, depending on the nature of the OP's data, it may need to be a non-greedy match:
# ...
if ( $file =~ /$start_pattern(.*?)$end_pattern/s ) {
print $1;
}
# ...
update: Forgot to mention: if the OP's data happens to contain more than one "start ... end" sequence within the same file, this would have to be structured as a loop -- something like:
#
while ( $file =~ /$start_pattern(.*?)$end_pattern/gs ) {
    print "$1\n";
}
}
#
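To make the loop concrete, here is a small self-contained sketch that runs it against an in-memory string instead of a file. The sample text is invented; \Q...\E is added as a precaution so the markers are matched literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $start_pattern = '<!-- start section-->';
my $end_pattern   = '<!-- end section -->';

# Invented sample data containing two marked sections.
my $file = join '',
    "header\n",
    "<!-- start section-->first\n<!-- end section -->\n",
    "filler\n",
    "<!-- start section-->second\n<!-- end section -->\n";

# /g resumes each match where the previous one ended,
# /s lets "." match newlines, and ".*?" stops at the nearest end marker.
my @sections;
while ( $file =~ /\Q$start_pattern\E(.*?)\Q$end_pattern\E/gs ) {
    push @sections, $1;
}

print scalar(@sections), " sections found\n";
print "[$_]\n" for @sections;
```

With a greedy ".*" the same data would yield a single match running from the first start marker to the last end marker.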
| [reply] [d/l] [select] |
Re: Grabbing part of an HTML page
by TomDLux (Vicar) on Mar 29, 2004 at 02:09 UTC
|
if($_=~/$start_pattern/i)
{
$write_line = 1;
}
if($_=~/$end_pattern/i)
{
$write_line = 0;
}
if($write_line =~ '1')
{
print "$_\n";
}
You can simply write:
if(/$start_pattern/i .. /$end_pattern/i)
{
print "$_\n";
}
The if condition is false until the first pattern matches, and remains true until the second pattern matches. Funny things happen if both are true on the same line.
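A small sketch of the range (flip-flop) operator at work, using invented in-memory lines rather than a file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $start_pattern = '<!-- start section-->';
my $end_pattern   = '<!-- end section -->';

# Invented sample lines standing in for the contents of an HTML file.
my @lines = (
    "before\n",
    "<!-- start section-->\n",
    "blah\n",
    "<!-- end section -->\n",
    "after\n",
);

my @kept;
for (@lines) {
    # True from the line matching the start pattern through the line
    # matching the end pattern, both marker lines included.
    if ( /\Q$start_pattern\E/i .. /\Q$end_pattern\E/i ) {
        push @kept, $_;
    }
}

print @kept;    # lines already end in "\n", so no extra newline is added
```

On the "both on the same line" case: with two dots the end pattern is tested on the very line that matched the start pattern, so the range opens and closes on that one line. The three-dot form (...) waits until the next line before testing the end pattern.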
P.S. In your version, you should use numerical comparison, '==', rather than regular expression, '=~'. Regex testing is powerful, but expensive ( slow ), while simple comparison is very fast, but only capable of testing equality.
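Speed aside, the two tests do not mean the same thing. A quick sketch with a few made-up flag values shows that =~ '1' is true for any value that merely contains a 1:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up values, to compare the two tests side by side.
my @values = ( 0, 1, 10, 21 );

my @regex_hits   = grep { $_ =~ '1' } @values;    # contains the character 1
my @numeric_hits = grep { $_ == 1 } @values;      # is numerically equal to 1

print "=~ '1' is true for: @regex_hits\n";
print "== 1   is true for: @numeric_hits\n";
```

For a flag that is only ever 0 or 1 both tests agree, which is why the original code appeared to work.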
--
TTTATCGGTCGTTATATAGATGTTTGCA
| [reply] [d/l] [select] |
Re: Grabbing part of an HTML page
by jZed (Prior) on Mar 28, 2004 at 23:33 UTC
|
| [reply] |
Re: Grabbing part of an HTML page
by pbeckingham (Parson) on Mar 29, 2004 at 01:33 UTC
|
Now that I think about it, your starting and ending patterns may be at fault: they are inconsistently formed. Note the space between the word 'section' and the '-->', which is present in the end pattern but missing from the start pattern.
<!-- start section-->
<!-- end section -->
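One way to sidestep this kind of mismatch, sketched here as a suggestion rather than as anything from the original script, is to allow optional whitespace inside the comment markers:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \s* tolerates zero or more whitespace characters next to the comment delimiters.
my $start_re = qr/<!--\s*start section\s*-->/i;
my $end_re   = qr/<!--\s*end section\s*-->/i;

# Invented variants of the start marker, differing only in spacing.
my @candidates = (
    '<!-- start section-->',
    '<!-- start section -->',
    '<!--start section-->',
);

my @matched = grep { $_ =~ $start_re } @candidates;
print scalar(@matched), " of ", scalar(@candidates), " start markers matched\n";
```

The same $start_re and $end_re can be dropped into any of the loops in this thread in place of the literal strings.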
| [reply] [d/l] |
|
I even tried this with no <!-- or -->.
I tested it with an html file that just has
start section
blah
end section
And it still didn't work. :(
| [reply] |
|
for my $file ( @files_to_look_in ) {
    open( HTML_FILE, "<$file" ) or die "Can't open $file for input: $!";
    ...
If you don't have the "or die ...", then the failure might be a matter of not opening the file -- Perl won't generate an error on an open statement that fails, unless you explicitly test to see if it succeeds, and tell it what to say and do when it fails.
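A slightly fuller sketch of checked opens, assuming you would rather skip an unreadable file with a warning than stop the whole run. The three-argument open and the lexical filehandle are my own substitutions, not part of the original script, and the paths are the placeholders used earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder paths from earlier in the thread; substitute real files.
my @files_to_look_in = ( '/path/to/files1.html', '/path/to/files2.html' );

my ( $opened, $failed ) = ( 0, 0 );

for my $file (@files_to_look_in) {
    my $fh;
    # open returns false on failure and puts the reason in $!
    unless ( open $fh, '<', $file ) {
        warn "Can't open $file for input: $!\n";
        $failed++;
        next;
    }
    $opened++;
    # ... read from $fh here ...
    close $fh;
}

print "opened $opened file(s), failed on $failed\n";
```

If every file lands in the failed count, the problem is the paths rather than the patterns.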