PerlMonks  

RE: Extract and modify IMG SRC tags in an HTML document.

by johncoswell (Acolyte)
on Apr 27, 2000 at 18:09 UTC (#9403)


in reply to Extract and modify IMG SRC tags in an HTML document.

Sorry, it ate my submission, and I'm still new here...

Here's how I would do it:

1.  Read the whole HTML file into a variable:

open(FILE, "<", "filename") or die "Cannot open filename: $!";
read(FILE, $file, -s FILE);
close(FILE);

(-s FILE gives the file's size in bytes, so the whole document is read no matter how large it is)

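2.  Split $file in front of every "<IMG" (the lookahead keeps the "<IMG" text at the start of each chunk instead of discarding it, and /i also catches lowercase tags):

@lines = split(/(?=<IMG\b)/i,$file);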
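
To see what the split step produces, here is a small sketch run on a made-up HTML string (the sample markup and variable names are assumptions for illustration). It compares a plain split on "<IMG", which consumes the delimiter and misses lowercase tags, with a lookahead split, which consumes nothing.

```perl
use strict;
use warnings;

# Hypothetical sample markup, for illustration only.
my $html = '<p>Logo: <IMG SRC="a.gif"> and <img src="b.gif"></p>';

# Plain split: the "<IMG" delimiter is consumed, and "<img" is not matched.
my @plain = split(/<IMG/, $html);

# Lookahead split: nothing is consumed, and /i also matches "<img".
my @kept = split(/(?=<IMG\b)/i, $html);

print scalar(@plain), " chunks: ", join(' | ', @plain), "\n";
print scalar(@kept),  " chunks: ", join(' | ', @kept),  "\n";
```

Because the lookahead split removes nothing, joining its chunks back together gives the original string, which is a handy sanity check before any rewriting is done.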

3.  Shift the first chunk off @lines (it holds everything before the first <IMG> tag, so it needs no changes) and use it to start the new HTML file:

$newfile = shift @lines;

4.  For each line in @lines:
    Split the line at the first ">", which ends the <IMG> tag
    Replace the old SRC= value with the new one, looked up in a %newurls hash that you have filled in beforehand to map each old URL to its new URL (a URL with no entry is left unchanged)

foreach $line (@lines) {
  $pos = index($line,'>');
  $tag = substr($line,0,$pos+1);
  $restofline = substr($line,$pos+1);
  $tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
  $newfile .= $tag . $restofline;
}
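
The loop above can be tried on an in-memory string before pointing it at a real file. This is a minimal sketch under stated assumptions: the contents of %newurls, the sample markup, and the function name rewrite_img_src are all hypothetical. It splits with a lookahead so every chunk keeps its "<IMG" text, and it leaves any URL without an entry in the map untouched.

```perl
use strict;
use warnings;

# Hypothetical mapping from old image URLs to new ones.
my %newurls = ('old/logo.gif' => 'new/logo.png');

# Rewrite the SRC attribute of every <IMG> tag in $html.
# URLs with no entry in the map are left unchanged.
sub rewrite_img_src {
    my ($html, $map) = @_;
    my $out = '';
    foreach my $chunk (split(/(?=<IMG\b)/i, $html)) {
        if ($chunk =~ /^<IMG\b/i) {
            # The first ">" closes the <IMG> tag itself.
            my $pos  = index($chunk, '>');
            my $tag  = substr($chunk, 0, $pos + 1);
            my $rest = substr($chunk, $pos + 1);
            $tag =~ s/(SRC=")(.*?)(")/$1 . (exists $map->{$2} ? $map->{$2} : $2) . $3/gie;
            $out .= $tag . $rest;
        } else {
            # Text before the first <IMG> tag passes through as-is.
            $out .= $chunk;
        }
    }
    return $out;
}

my $html = '<p><IMG SRC="old/logo.gif" ALT="Logo"> <img src="other.gif"></p>';
print rewrite_img_src($html, \%newurls), "\n";
```

Like the original, this only handles double-quoted SRC values; unquoted or single-quoted attributes would need a proper HTML parser.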

5.  Do whatever you need with $newfile, for example print it:

print $newfile;

Complete code:

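open(FILE, "<", "filename") or die "Cannot open filename: $!";
read(FILE, $file, -s FILE);   # -s FILE is the file size, so the whole file is read
close(FILE);

@lines = split(/(?=<IMG\b)/i,$file);   # each later chunk starts with its own <IMG tag
$newfile = shift @lines;

foreach $line (@lines) {
  $pos = index($line,'>');
  $tag = substr($line,0,$pos+1);
  $restofline = substr($line,$pos+1);
  # Swap in the new URL; keep the old one if %newurls has no entry for it
  $tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
  $newfile .= $tag . $restofline;
}

print $newfile;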


Re^2: Extract and modify IMG SRC tags in an HTML document.
by Anonymous Monk on Nov 21, 2008 at 08:45 UTC
    This is a very old thread, but it's really important for me. I need to modify the hrefs and the src attributes for a proxy, but the problem is that the links come in several forms: href="/page.html", href="../page.html", href="http://url", href=url, src="/abc.jpg", and so on. I am downloading the page source using Lynx. Then I have to rewrite every link so that, for example, www.yahoo.com becomes www.abc.com/cgi-bin/proxy.pl?http://www.yahoo.com; clicking any link on that page then downloads the next page's source, which is processed the same way.
      I tried this code snippet, but it's not working in my case. Is there any help I could get to change the src and add an onclick event in an image tag?
