PerlMonks  

RE: Extract and modify IMG SRC tags in an HTML document.

by johncoswell (Acolyte)
on Apr 27, 2000 at 18:09 UTC ( [id://9403]=note )


in reply to Extract and modify IMG SRC tags in an HTML document.

Sorry, it ate my submission, and I'm still new here...
Here's how I would do it:

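1.  Read the whole HTML file into a variable:

open(FILE, '<', 'filename') or die "Can't open filename: $!";
read(FILE, $file, 100000);    # reads at most 100000 bytes
close(FILE);

(I've seen few HTML docs that are over 100000 bytes in size; anything larger would be cut off at that point)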
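
If the fixed byte limit is a concern, one common alternative is to slurp the file by undefining the input record separator. This is only a minimal sketch; the slurp_file name is made up for illustration and is not part of the steps in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: return the full contents of a file as one string,
# with no upper limit on the size.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    local $/;                 # no record separator: <$fh> reads everything
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Usage (the file name is a placeholder):
# my $file = slurp_file('filename');
```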

2.  Split $file on "<IMG":

@lines = split(/<IMG/i, $file);    # /i also matches <img

3.  Shift off the first element of @lines (it holds the text before the first <IMG tag, so it needs no changes) and use it to start the new HTML file:

$newfile = shift @lines;
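
To see what steps 2 and 3 leave you with, here is a small sketch on a made-up string (the sample HTML and the variable names are assumptions for illustration). Note that split consumes the "<IMG" itself, so each remaining chunk starts right after it.

```perl
use strict;
use warnings;

# Sample input, invented for this illustration.
my $html   = '<P>intro</P><IMG SRC="a.gif"> text <img src="b.gif">end';

my @chunks = split(/<IMG/i, $html);   # the "<IMG" separator is dropped
my $head   = shift @chunks;           # everything before the first tag

print "head:  [$head]\n";
print "chunk: [$_]\n" for @chunks;
```

Each chunk begins with the tag's attributes (for example ` SRC="a.gif">`), followed by whatever text came before the next image tag.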

4.  For each remaining element of @lines:
    Split it at the first ">"
    Replace the SRC= value with the new one by looking the old URL up in %newurls, assuming that the new graphic is based on the old graphic's URL

foreach $line (@lines) {
  $pos = index($line,'>');
  $tag = substr($line,0,$pos+1);
  $restofline = substr($line,$pos+1);
  # %newurls maps each old URL to its new URL; unknown URLs are left alone
  $tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
  # split() removed the "<IMG", so put it back
  $newfile .= '<IMG' . $tag . $restofline;
}

5.  Do whatever you like with $newfile:

print $newfile;
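
The five steps can also be wrapped into a single subroutine. The following is a sketch under a few assumptions: the rewrite_img_src name and the sample data are invented here, the URL map is passed in as a hash reference, the "<IMG" that split removes is put back, and a URL with no entry in the map is kept as it was.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the SRC of every <IMG> tag found in $html.
# $map is a hash reference of old URL => new URL.
sub rewrite_img_src {
    my ($html, $map) = @_;
    my @chunks = split(/<IMG/i, $html);
    my $out    = shift @chunks;
    $out = '' unless defined $out;        # empty input gives an empty list

    foreach my $chunk (@chunks) {
        my $pos  = index($chunk, '>');
        my $tag  = substr($chunk, 0, $pos + 1);
        my $rest = substr($chunk, $pos + 1);
        # Keep the old URL when the map has no entry for it.
        $tag =~ s/SRC="(.*?)"/'SRC="' . (exists $map->{$1} ? $map->{$1} : $1) . '"'/gie;
        $out .= '<IMG' . $tag . $rest;    # restore the text split() consumed
    }
    return $out;
}

# Made-up sample data.
my %newurls = ('a.gif' => 'new/a.png');
print rewrite_img_src('<P><IMG SRC="a.gif" ALT="x"> and <IMG SRC="b.gif"></P>', \%newurls), "\n";
```

One side effect of this approach is that tag and attribute names come out in upper case (<IMG, SRC=) whatever the input used, which is harmless in HTML.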

Complete code:

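open(FILE, '<', 'filename') or die "Can't open filename: $!";
read(FILE, $file, 100000);    # reads at most 100000 bytes
close(FILE);

@lines = split(/<IMG/i, $file);    # /i also matches <img
$newfile = shift @lines;

foreach $line (@lines) {
  $pos = index($line,'>');
  $tag = substr($line,0,$pos+1);
  $restofline = substr($line,$pos+1);
  # keep the old URL when %newurls has no entry for it
  $tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
  $newfile .= '<IMG' . $tag . $restofline;    # split() removed the "<IMG"
}

print $newfile;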

Replies are listed 'Best First'.
Re^2: Extract and modify IMG SRC tags in an HTML document.
by Anonymous Monk on Nov 21, 2008 at 08:45 UTC
    This is a very old thread, but it's really important to me. I need to modify the href and src attributes so that they go through a proxy. The problem is that the links come in several forms: href="/page.html", href="../page.html", href="http://url", href=url, src="/abc.jpg", and so on. I download the page source using Lynx, then I have to rewrite every link; for example, www.yahoo.com becomes www.abc.com/cgi-bin/proxy.pl?http://www.yahoo.com. Clicking any link on that page then downloads the next page's source, which is processed the same way.
      I tried this code snippet, but it's not working in my case. Is there any help I could get to change the src and add an onclick event in an image tag?
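
      One possible direction, as a rough sketch only: make each link absolute against the URL of the page it came from, then prefix it with the proxy address. Everything named here (absolutize, proxify, $PROXY, the sample page URL) is made up for illustration, it uses only core Perl, and it is regex-based, so it will also touch href= or src= text that appears outside real tags. If the CPAN URI module is available, its new_abs method is a more robust way to resolve relative links. Adding an onclick attribute is not covered.

```perl
use strict;
use warnings;

# Assumed proxy prefix, copied from the example in the question.
my $PROXY = 'http://www.abc.com/cgi-bin/proxy.pl?';

# Hypothetical helper: make $link absolute relative to the page URL $base.
# Simplified: only http(s) bases are understood.
sub absolutize {
    my ($link, $base) = @_;
    return $link if $link =~ m{^[a-z][a-z0-9+.-]*:}i;        # http://url, mailto:, ...
    my ($scheme, $host, $bpath) = $base =~ m{^(https?)://([^/?#]+)([^?#]*)}i
        or return $link;                                      # base not understood
    $bpath = '/' if $bpath eq '';
    return "${scheme}:$link"        if $link =~ m{^//};       # //host/path
    return "${scheme}://$host$link" if $link =~ m{^/};        # /page.html
    my ($path, $suffix) = $link =~ m{^([^?#]*)(.*)$}s;        # set ?query and #fragment aside
    return "${scheme}://$host$bpath$suffix" if $path eq '';   # "?x=1" or "#top" only
    (my $dir = $bpath) =~ s{[^/]*$}{};                        # drop the file name
    my @segs  = split m{/}, $dir . $path, -1;
    my $slash = (@segs && $segs[-1] =~ /^(?:\.\.?)?$/) ? '/' : '';
    my @parts;
    foreach my $seg (@segs) {
        next if $seg eq '' || $seg eq '.';
        if ($seg eq '..') { pop @parts; next; }               # ../ climbs one directory
        push @parts, $seg;
    }
    my $abs = '/' . join('/', @parts);
    $abs .= $slash if @parts;
    return "${scheme}://$host$abs$suffix";
}

# Hypothetical helper: route every href= and src= value through the proxy.
# Values may be double-quoted, single-quoted, or bare (href=url).
sub proxify {
    my ($html, $base) = @_;
    $html =~ s{\b(href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))}
              {
                  my $attr = $1;
                  my $url  = defined($2) ? $2 : defined($3) ? $3 : $4;
                  my $abs  = absolutize($url, $base);
                  # Leave mailto:, javascript: and similar links unproxied.
                  $abs = $PROXY . $abs if $abs =~ m{^https?://}i;
                  $attr . '="' . $abs . '"';
              }gie;
    return $html;
}

# Made-up page URL and markup covering the link forms from the question.
my $page = 'http://www.yahoo.com/dir/index.html';
print proxify('<a href="../page.html">up</a> <a href=next.html>next</a> <img src="/abc.jpg">', $page), "\n";
```

      The proxy script itself would need to pass the URL of the page it just fetched as the base, so that relative links on that page resolve correctly.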
