
RE: Extract and modify IMG SRC tags in an HTML document.

by johncoswell (Acolyte)
on Apr 27, 2000 at 18:09 UTC

in reply to Extract and modify IMG SRC tags in an HTML document.

Sorry, it ate my submission, and I'm still new here...

Here's how I would do it:

1.  Read the whole HTML file into a variable:

open(FILE, "filename") or die "Can't open filename: $!";
read(FILE, $file, 100000);
close(FILE);

(I've seen few HTML docs that are over 100000 bytes in size; anything past that limit is silently cut off)

2.  Split $file on "<IMG" (the /i also catches lowercase "<img"):

@lines = split(/<IMG/i, $file);

3.  Shift off the first element of @lines (it's the text before the first <IMG> tag, so there's nothing to change in it) and use it to start the new HTML file:

$newfile = shift @lines;

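4.  For each chunk in @lines:
    Split the chunk at the first ">"
    Replace the "SRC=" element with the new "SRC=" element, assuming that the new graphic is based on the old graphic's URL
    Put back the "<IMG" that split removed

foreach $line (@lines) {
  $pos = index($line,'>');
  $tag = substr($line,0,$pos+1);        # the rest of the <IMG ...> tag
  $restofline = substr($line,$pos+1);   # everything after the tag
  # %newurls maps each old SRC to its replacement
  $tag =~ s/SRC="(.*?)"/SRC="$newurls{$1}"/gi;
  # no extra "\n" needed: split left the original line breaks alone
  $newfile .= '<IMG' . $tag . $restofline;
}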

5.  Do whatever you want with $newfile:

print $newfile;
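
The steps above can also be wrapped in a sub. This is only a sketch: rewrite_img_src and its $map argument are names made up for illustration, not anything from the original question. It assumes SRC values are double-quoted and that no attribute value contains a literal ">". It differs from the steps in two small ways: the matched "<IMG" text is kept exactly as written, and a URL with no entry in the map is left alone instead of being blanked.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the SRC of every <IMG> tag in $html using
# the old-URL => new-URL pairs in %$map. URLs missing from the map are kept.
sub rewrite_img_src {
    my ($html, $map) = @_;

    # Capturing parentheses make split() return the matched "<IMG" text as
    # well, so the pieces alternate: text, "<img", rest, "<IMG", rest, ...
    my @pieces = split /(<IMG\b)/i, $html;
    my $out = shift @pieces;
    $out = '' unless defined $out;

    while (@pieces) {
        my $open = shift @pieces;           # "<IMG" as it appeared in the file
        my $rest = shift @pieces;           # everything up to the next "<IMG"
        $rest = '' unless defined $rest;

        my $pos = index($rest, '>');
        if ($pos < 0) {                     # unterminated tag: copy it through
            $out .= $open . $rest;
            next;
        }
        my $tag  = substr($rest, 0, $pos + 1);
        my $tail = substr($rest, $pos + 1);

        # /e evaluates the replacement as code, so unmapped URLs can fall
        # back to the original value.
        $tag =~ s/(\bSRC\s*=\s*")(.*?)"/$1 . (exists $map->{$2} ? $map->{$2} : $2) . '"'/gie;

        $out .= $open . $tag . $tail;
    }
    return $out;
}
```

Calling it would look something like print rewrite_img_src($file, \%newurls);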

Complete code:

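open(FILE, "filename") or die "Can't open filename: $!";
read(FILE, $file, 100000);
close(FILE);

@lines = split(/<IMG/i, $file);
$newfile = shift @lines;

foreach $line (@lines) {
  $pos = index($line,'>');
  $tag = substr($line,0,$pos+1);
  $restofline = substr($line,$pos+1);
  # %newurls maps each old SRC to its replacement
  $tag =~ s/SRC="(.*?)"/SRC="$newurls{$1}"/gi;
  # put back the "<IMG" that split removed
  $newfile .= '<IMG' . $tag . $restofline;
}

print $newfile;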
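
If the 100000-byte cap is a worry, one common alternative is to undefine the input record separator $/ so a single read returns the whole file. A small sketch follows; slurp_file is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entire contents of $path as one string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    local $/;                  # undef record separator: <$fh> reads to EOF
    my $content = <$fh>;
    close($fh);
    return defined $content ? $content : '';
}
```

With that in place, $file = slurp_file("filename"); would stand in for the open/read/close lines.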

Re^2: Extract and modify IMG SRC tags in an HTML document.
by Anonymous Monk on Nov 21, 2008 at 08:45 UTC
    This is a very old thread, but it's really important for me... I need to modify the href and src attributes for a proxy. The problem is that the hrefs come in several forms: href="/page.html", href="../page.html", href="http://url", href=url ... src="/abc.jpg".... I am downloading the page source using Lynx. Then I have to rewrite all the links, as in the example, so that clicking any link on that page downloads the other page's source and processes it the same way.
      I tried this code snippet, but it's not working in my case. Could I get any help changing the src and adding an onclick event to an image tag?
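
For the onclick part, a regex in the same spirit as the snippet above might be enough. This is a rough sketch, not a robust HTML rewriter: add_onclick and the go() handler are made-up names, and it assumes the handler text contains no double quotes and that no attribute value contains a literal ">". For the relative hrefs, the CPAN URI module's new_abs method resolves a link against a base URL, which may be a better fit than pattern matching.

```perl
use strict;
use warnings;

# Hypothetical helper: append onclick="$handler" to every <img> tag,
# leaving the existing attributes (and any trailing "/") in place.
sub add_onclick {
    my ($html, $handler) = @_;
    # $1 = the tag up to its last attribute, $2 = optional space, "/" and ">"
    $html =~ s{(<img\b[^>]*?)(\s*/?>)}{$1 onclick="$handler"$2}gi;
    return $html;
}
```

A call such as add_onclick($newfile, 'go()') would return the page with the attribute added to each image tag.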
