Re: Extract and modify IMG SRC tags in an HTML document.
by chromatic (Archbishop) on Apr 27, 2000 at 02:26 UTC
|
If you're willing to invest a few minutes in learning about modules (HTML::Parser and HTML::TokeParser come to mind -- follow the link to CPAN) rather than a few minutes banging your head against the wall figuring out how to catch corner cases with regular expressions, it will pay off greatly.
If all of the HTML is very similar and not too tricky, you can use a regexp like:
$line =~ s!(<img src=")[^"]+([^>]*">)!$1$newimage$2!gi;
to do your substitution. Be warned, using regular expressions on HTML is very tricky, unless you're dealing with extremely consistent HTML.
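To make "extremely consistent" a bit more concrete, here is a small sketch that runs that substitution (with the =~ binding operator) over a few sample tags. The sample lines and the value of $newimage are made up for illustration; the point is only to show which shapes of tag the pattern catches and which it silently skips.

```perl
use strict;
use warnings;

my $newimage = 'new.gif';   # hypothetical replacement image name

# Made-up sample tags, from well-behaved to awkward.
my @lines = (
    '<img src="old.gif" alt="logo">',    # consistent markup: matches
    '<IMG SRC="old.gif">',               # upper case is fine because of /i
    '<img border=0 src="old.gif">',      # src is not the first attribute: no match
    '<img src="old.gif" width=20>',      # tag does not end in a quote: no match
);

for my $line (@lines) {
    # s/// returns a true value only when it replaced something
    my $changed = ($line =~ s!(<img src=")[^"]+([^>]*">)!$1$newimage$2!gi);
    printf "%-8s %s\n", ($changed ? 'changed' : 'skipped'), $line;
}
```

The last two cases are perfectly ordinary HTML, which is why a parser module is the safer route once the markup stops being uniform.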
Re: Extract and modify IMG SRC tags in an HTML document.
by plaid (Chaplain) on Apr 27, 2000 at 02:08 UTC
|
A couple things about turnstep's answer.. on a minor issue,
it will only catch capital SRC, which might not catch them
all. A more important point, though, is that IMG tags
aren't the only ones with SRC attributes.. FRAME and
JAVASCRIPT come to mind. For a nasty one, I'd try something
like this:
$html =~ s/(<\s*img\s+.*src\s*=\s*)(")?.*?(?(2)")([\s>])/$1"newimage.jpg"$3/sig;
To go through this in parts.. The first group of
parentheses is catching the beginning of the tag, with
optional whitespace checking, followed by a bunch of junk
(the src attribute doesn't necessarily have to follow the
img, e.g. <img border=0 src="img.gif">). This
matches up to the src= part.
Next, a quote is matched if there is one, and if there is
a quote, the match is taken up to the closing quote.
The match ends with either whitespace or a
tag close. The $1 match is everything up to the name of
the image, which is being preserved. Then, your new image
is subbed in, and the original image name is disregarded.
The i flag is needed to catch src and SRC (and sRc, etc.),
and the s flag in case the image tag is broken up on to
multiple lines.
This is a pretty difficult regular expression (which went
through moderate testing..), but if
you're up to reading through the perlre man pages, you
should be able to understand it all. Let me know if there
are any questions about it.
|
What does (?(2)") do?
(I'd dig up the regex book but it's at work.)
nice regex, btw.
It seems to me that the last pair of parentheses is unnecessary.
(It might be instructive to use the /x modifier and comment it... this is definitely the most complex regex I have ever tried to understand and comments could make this regex a good reference.)
-- anonymous monkey
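For what it's worth, here is one way the regex might look with the /x modifier and comments. This is a sketch, not plaid's original: the "bunch of junk" part is written as [^>]*? instead of .* on the assumption that the match should stay inside a single tag (with the s flag, a greedy .* can run on to a src in a later tag), and the sample HTML is invented for the test.

```perl
use strict;
use warnings;

# Invented sample: one quoted src, one unquoted src in a multi-line tag.
my $html = <<'HTML';
<p><img border=0 src="a.gif"> and
<IMG
  SRC=b.gif ALT="x"></p>
HTML

$html =~ s{
    (                  # group 1: everything up to the image name
        < \s* img \s+  #   the tag opening, with optional whitespace
        [^>]*?         #   other attributes, never leaving this tag
        src \s* = \s*  #   the src attribute and its equals sign
    )
    (")?               # group 2: an opening quote, if there is one
    .*?                # the old image name, which is thrown away
    (?(2)")            # conditional: only if group 2 matched, require a closing quote
    ( [\s>] )          # group 3: the whitespace or > that ends the value
}{$1"newimage.jpg"$3}sigx;

print $html;
```

So (?(2)") is a conditional: it demands a closing quote only when the optional opening quote in group 2 actually matched. The last pair of parentheses is needed as written, because the character they capture is consumed by the match and has to be put back with $3; a lookahead such as (?=[\s>]) would be one way to drop that group.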
Re: Extract and modify IMG SRC tags in an HTML document.
by toadi (Chaplain) on Apr 27, 2000 at 11:40 UTC
|
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
$url = "http://www.sn.no/"; # for instance
$ua = new LWP::UserAgent;
# Set up a callback that collect image links
my @imgs = ();
sub callback {
my($tag, %attr) = @_;
return if $tag ne 'img'; # we only look closer at <img ...>
push(@imgs, values %attr);
}
# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
sub {$p->parse($_[0])});
# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;
# Print them out
print join("\n", @imgs), "\n";
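Now it can't be that hard to figure out how to swap each src for another one:
foreach $img (@imgs) {
$img = $newinput;
}
Keep in mind that this only changes the collected list; the new values still have to be written back into the HTML.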
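A rough sketch of that write-back step, using only core Perl. The $html string, the @imgs contents, and $newinput are hypothetical stand-ins for the fetched page, the extracted links, and the new image; the substitution assumes double-quoted src values that appear in the page exactly as they do in the list.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the fetched page and the extracted links.
my $html     = '<img src="me.gif"> <img src="logo.gif">';
my @imgs     = ('me.gif', 'logo.gif');
my $newinput = 'new.gif';

my @old = @imgs;            # remember the original names
foreach my $img (@imgs) {   # $img is an alias, so this rewrites @imgs in place
    $img = $newinput;
}
# @imgs now holds the new name twice, but $html has not changed yet.

# Write the change back: replace each old name inside a quoted src attribute.
foreach my $old (@old) {
    $html =~ s/(\bsrc\s*=\s*")\Q$old\E"/$1$newinput"/gi;
}
print "$html\n";
```

The \Q...\E around the old name keeps characters like the dot in "me.gif" from being treated as regex metacharacters.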
My opinions may have changed,
but not the fact that I am right
Re: Extract and modify IMG SRC tags in an HTML document.
by turnstep (Parson) on Apr 27, 2000 at 01:44 UTC
|
Searching and replacing HTML can be tricky. For example, what
about HTML like this:
<H1>Hello World</H1>
<IMG
HEIGHT="20" WIDTH="20"
SRC="me.gif"
ALT="My picture!"
>
However, if you simply want to replace their picture
with yours, use a regexp:
s/SRC="[^"]*"/SRC="$mypicture"/gi;
but you'll also want to remove and/or replace any
WIDTH, HEIGHT, and ALT attributes as well,
which gets real complicated real quick. Consider
using a module to parse the html, or write your own little
subroutine to parse each instance of the IMG tag...
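A little subroutine along those lines might look like the sketch below. The name fix_img and the value of $mypicture are made up, and it assumes double-quoted attribute values with no ">" inside them; anything messier is a job for a real parser.

```perl
use strict;
use warnings;

my $mypicture = 'you.gif';   # hypothetical replacement image

# Rewrite one IMG tag: swap SRC, drop WIDTH and HEIGHT.
sub fix_img {
    my ($tag) = @_;
    $tag =~ s/\bSRC\s*=\s*"[^"]*"/SRC="$mypicture"/i;
    $tag =~ s/\s+(?:WIDTH|HEIGHT)\s*=\s*"[^"]*"//gi;
    return $tag;
}

my $html = <<'HTML';
<H1>Hello World</H1>
<IMG
HEIGHT="20" WIDTH="20"
SRC="me.gif"
ALT="My picture!"
>
HTML

# [^>]* also matches newlines, so a tag spread over several lines is fine.
$html =~ s/(<IMG\b[^>]*>)/fix_img($1)/gei;
print $html;
```

Handing each whole tag to a subroutine keeps the per-attribute regexes simple, since they only ever see the text of one tag.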
RE: Extract and modify IMG SRC tags in an HTML document.
by johncoswell (Acolyte) on Apr 27, 2000 at 18:09 UTC
|
Sorry, it ate my submission, and I'm still new here...
Here's how I would do it:
1. Read the whole HTML file into a variable:
open(FILE, "filename") or die "Can't open filename: $!";
read FILE, $file, 100000;
close FILE;
(I've seen few HTML docs that are over 100000 bytes in size; raise the limit if yours are.)
2. Split the $file by "<IMG":
@lines = split(/<IMG/i, $file);
3. Shift out the first line of @lines (it doesn't have an <IMG> tag in it, so we don't need to change it) and begin to create the new HTML file:
$newfile = shift @lines;
4. For each line in @lines:
Split the line at the first ">"
Replace the "SRC=" element with the new "SRC=" element, assuming that the new graphic is looked up in %newurls by the old graphic's URL
Put back the "<IMG" that split removed
foreach $line (@lines) {
$pos = index($line,'>');
$tag = substr($line,0,$pos+1);
$restofline = substr($line,$pos+1);
# %newurls maps each old URL to its new one; URLs without an entry are left alone
$tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
# split() threw away the "<IMG" delimiter, so restore it here
$newfile .= "<IMG" . $tag . $restofline . "\n";
}
5. Do whatever with the $newfile:
print $newfile;
Complete code:
open(FILE, "filename") or die "Can't open filename: $!";
read FILE, $file, 100000;
close FILE;
@lines = split(/<IMG/i, $file);
$newfile = shift @lines;
foreach $line (@lines) {
$pos = index($line,'>');
$tag = substr($line,0,$pos+1);
$restofline = substr($line,$pos+1);
$tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
$newfile .= "<IMG" . $tag . $restofline . "\n";
}
print $newfile;
|
This is a very old thread, but it's really important for me...
I need to modify the href and src attributes for a proxy,
but the problem is that some hrefs look like
href="/page.html"
href="../page.html"
href="http://url"
href=url
...
src="/abc.jpg"....
What I am doing is downloading the page source using Lynx.
Then I have to rewrite all the links so that, for example, www.yahoo.com becomes
www.abc.com/cgi-bin/proxy.pl?http://www.yahoo.com
Clicking on any link on that page will then download the other page's source and process it the same way.
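One possible approach, sketched here with the URI module from CPAN (not core Perl) to resolve relative links against the page's own address. The $base and $proxy values and the proxify helper are assumptions made up for the example, and the regex only handles href= and src= values that are double-quoted or unquoted.

```perl
use strict;
use warnings;
use URI;

# Hypothetical values: the page being proxied and the proxy script's address.
my $base  = 'http://www.yahoo.com/dir/index.html';
my $proxy = 'http://www.abc.com/cgi-bin/proxy.pl?';

# Build the rewritten attribute: absolute URL, routed through the proxy.
sub proxify {
    my ($attr, $link) = @_;
    my $abs = URI->new_abs($link, $base)->as_string;
    return qq{$attr="$proxy$abs"};
}

my $html = <<'HTML';
<a href="/page.html">one</a>
<a href="../page.html">two</a>
<a href="http://url">three</a>
<a href=url>four</a>
<img src="/abc.jpg">
HTML

# Group 2 holds a double-quoted value, group 3 an unquoted one.
$html =~ s{\b(href|src)\s*=\s*(?:"([^"]*)"|([^\s>"]+))}
          {proxify($1, defined $2 ? $2 : $3)}gie;
print $html;
```

URI->new_abs takes care of the "/page.html", "../page.html", and bare "url" forms, so the proxy script always receives a complete address.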
|
I tried this code snippet; however, it's not working in my case. Could I get any help changing the src and adding an onclick event in an image tag?