PerlMonks  

Extract and modify IMG SRC tags in an HTML document.

by jmpvm (Novice)
on Apr 27, 2000 at 01:36 UTC (#9331=perlquestion)
jmpvm has asked for the wisdom of the Perl Monks concerning the following question:

newbie question:

I am writing a Perl script that loads an entire HTML document into an array as lines. What I want to do is search through each line of the array for "img src=" tags and pull the image URL out. I then want to insert a NEW image URL in its place. There MAY be multiple "img src" tags per line.

I was thinking of using INDEX to find the beginning of the img src tag, but then how do I find the end? There MUST be a better way than using INDEX. Anyone?

Thanks.

Re: Extract and modify IMG SRC tags in an HTML document.
by turnstep (Parson) on Apr 27, 2000 at 01:44 UTC

    Searching and replacing HTML can be tricky. For example, what about HTML like this:

    <H1>Hello World</H1> <IMG HEIGHT="20" WIDTH="20" SRC="me.gif" ALT="My picture!" >
    However, if you simply want to replace their picture with yours, use a regexp:
    s/SRC="[^"]*"/SRC="$mypicture"/gi;
    but you'll also want to remove and/or replace any WIDTH, HEIGHT, and ALT attributes as well.

    ....which get real complicated real quick. Consider using a module to parse the html, or write your own little subroutine to parse each instance of the IMG tag...
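
A minimal sketch of the "little subroutine" idea, assuming quoted SRC values and tags without a ">" inside an attribute value. The helper name replace_img_src and the file name you.gif are made up for illustration. It first isolates each IMG tag, then edits only inside that tag, so SRC attributes on other tags are left alone:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the SRC of every IMG tag and drop the
# WIDTH and HEIGHT attributes, which described the old picture.
sub replace_img_src {
    my ($html, $new_src) = @_;
    $html =~ s{(<img\b[^>]*>)}{
        my $tag = $1;    # copy first: the inner substitutions reset the capture
        $tag =~ s/\bsrc\s*=\s*"[^"]*"/SRC="$new_src"/i;
        $tag =~ s/\s+(?:width|height)\s*=\s*(?:"[^"]*"|[^\s>]+)//gi;
        $tag;
    }gie;
    return $html;
}

my $html = '<H1>Hello World</H1> <IMG HEIGHT="20" WIDTH="20" SRC="me.gif" ALT="My picture!" >';
print replace_img_src($html, 'you.gif'), "\n";
```

The attribute order inside the tag does not matter here, which is the case the one-line regexp has trouble with once you also want to touch WIDTH and HEIGHT.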

Re: Extract and modify IMG SRC tags in an HTML document.
by plaid (Chaplain) on Apr 27, 2000 at 02:08 UTC
    A couple of things about turnstep's answer. On a minor issue, it will only catch capital SRC, which might not catch them all. A more important point, though, is that IMG tags aren't the only ones with SRC attributes: FRAME and SCRIPT come to mind. For a nasty one, I'd try something like this:
    $html =~ s/(<\s*img\s+.*src\s*=\s*)(")?.*?(?(2)")([\s>])/$1"newimage.jpg"$3/sig;
    To go through this in parts.. The first group of parentheses is catching the beginning of the tag, with optional whitespace checking, followed by a bunch of junk (the src attribute doesn't necessarily have to follow the img, e.g. <img border=0 src="img.gif">). This matches up to the src= part. Next, a quote is matched if there is one, and if there is a quote, the match is taken up to the closing quote. The match ends with either whitespace or a tag close. The $1 match is everything up to the name of the image, which is being preserved. Then, your new image is subbed in, and the original image name is disregarded. The i flag is needed to catch src and SRC (and sRc, etc.), and the s flag in case the image tag is broken up on to multiple lines. This is a pretty difficult regular expression (which went through moderate testing..), but if you're up to reading through the perlre man pages, you should be able to understand it all. Let me know if there are any questions about it.
      What does:
      (?(2)")
      
      do? (I'd dig up the regex book but it's at work.) Nice regex, btw. It seems to me that the last pair of parentheses is unnecessary. (It might be instructive to use the /x modifier and comment it... this is definitely the most complex regex I have ever tried to understand and comments could make this regex a good reference.) -- anonymous monkey
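
For reference, (?(2)") is a conditional: if group 2 (the optional opening quote) took part in the match, a closing quote is required at that point; otherwise nothing is required. Here is a commented /x sketch of the regex. It is a variation rather than plaid's exact pattern: the greedy .* inside the first group is swapped for [^>]*? on the assumption that, with /s and several images in one string, the match should not run from the first <img to the last src=. The sample HTML is made up. The last group is kept because the replacement puts back the character it consumed.

```perl
use strict;
use warnings;

my $html = qq{<p><img border=0 src="old.gif"> and <IMG SRC=other.gif ALT="x"></p>};

$html =~ s{
    ( < \s* img \s+      # group 1: start of the tag
      [^>]*?             # other attributes, without leaving the tag
      src \s* = \s* )    # the attribute name and the equals sign
    (")?                 # group 2: optional opening quote
    .*?                  # the old image name, as short as possible
    (?(2)")              # if group 2 matched, require the closing quote
    ( [\s>] )            # group 3: whitespace or the end of the tag
}{$1"newimage.jpg"$3}xsig;

print "$html\n";
```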
Re: Extract and modify IMG SRC tags in an HTML document.
by chromatic (Archbishop) on Apr 27, 2000 at 02:26 UTC
    If you're willing to invest a few minutes in learning about modules (HTML::Parser and HTML::TokeParser come to mind -- follow the link to CPAN) rather than a few minutes banging your head against the wall figuring out how to catch corner cases with regular expressions, it will pay off greatly.

    If all of the HTML is very similar and not too tricky, you can use a regexp like: $line =~ s!(<img src=")[^"]+([^>]*">)!$1$newimage$2!gi; to do your substitution. Be warned, using regular expressions on HTML is very tricky, unless you're dealing with extremely consistent HTML.
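
A rough sketch of the module route, assuming the HTML-Parser distribution from CPAN is installed. The helper name swap_img_src and the file names are invented for illustration. It walks the document token by token with HTML::TokeParser, copies everything through unchanged, and rebuilds only the IMG start tags:

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Hypothetical helper: return a copy of the HTML with every IMG SRC replaced.
sub swap_img_src {
    my ($html, $new_src) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot set up the parser";
    my $out = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' and $token->[1] eq 'img') {
            # Start tag: [ "S", tag, attribute hash, attribute order, text ]
            my ($attr, $attrseq) = @{$token}[2, 3];
            $attr->{src} = $new_src if exists $attr->{src};
            $out .= '<img'
                  . join('', map { qq{ $_="$attr->{$_}"} } @$attrseq)
                  . '>';
        }
        elsif ($type eq 'T') {
            $out .= $token->[1];     # text token: the text is the second field
        }
        else {
            $out .= $token->[-1];    # other tokens carry their original text last
        }
    }
    return $out;
}

print swap_img_src('<p>Hi <IMG SRC="me.gif" ALT="Me"> there</p>', 'you.gif'), "\n";
```

Rebuilt tags come out with lowercase names and double-quoted values, and attribute values are emitted as the parser decoded them, so a production version would want to re-escape them.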

Re: Extract and modify IMG SRC tags in an HTML document.
by toadi (Chaplain) on Apr 27, 2000 at 11:40 UTC
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;

    $url = "http://www.sn.no/";  # for instance
    $ua = new LWP::UserAgent;

    # Set up a callback that collects image links
    my @imgs = ();
    sub callback {
        my($tag, %attr) = @_;
        return if $tag ne 'img';  # we only look closer at <img ...>
        push(@imgs, values %attr);
    }

    # Make the parser.  Unfortunately, we don't know the base yet
    # (it might be different from $url)
    $p = HTML::LinkExtor->new(\&callback);

    # Request document and parse it as it arrives
    $res = $ua->request(HTTP::Request->new(GET => $url),
                        sub {$p->parse($_[0])});

    # Expand all image URLs to absolute ones
    my $base = $res->base;
    @imgs = map { $_ = url($_, $base)->abs; } @imgs;

    # Print them out
    print join("\n", @imgs), "\n";
    Now it can't be that hard to figure out how to change the src with another one.
    'cos:
    foreach $img (@imgs) { $img = $newinput; }
    My opinions may have changed, but not the fact that I am right
RE: Extract and modify IMG SRC tags in an HTML document.
by johncoswell (Acolyte) on Apr 27, 2000 at 18:09 UTC
    Sorry, it ate my submission, and I'm still new here...
    Here's how I would do it:
    
    1.  Read in the whole HTML file into a variable:
    
    open FILE,"filename" or die "Cannot open filename: $!";
    read FILE,$file,100000;
    close FILE;
    
    (I've seen few HTML docs that are over 100000 bytes in size)
    
    2.  Split the $file by "<IMG":
    
    @lines = split(/<IMG/i,$file);
    
    3.  Shift out the first line of @lines (it doesn't have an <IMG> tag in it, so we don't need it) and begin to create the new HTML file
    
    $newfile = shift @lines;
    
    4.  For each line in @lines:
        Split the line at the first ">"
        Replace the "SRC=" element with the new "SRC=" element, assuming that the new graphic is based on the old graphic's URL
    
    foreach $line (@lines) {
      $pos = index($line,'>');
      $tag = substr($line,0,$pos+1);
      $restofline = substr($line,$pos+1);
      $tag =~ s/SRC\=\"(.*?)\"/SRC\=\"$newurls{$1}\"/gi;
      # split removed the "<IMG" delimiter, so put it back
      $newfile .= "<IMG" . $tag . $restofline;
    }
    
    5.  Do whatever with the $newfile:
    
    print $newfile;
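
The %newurls hash is never filled in above. One way it might look, as a sketch only: the hash contents and the helper name remap_src are invented, and the fallback to the old URL for unmapped images is an added assumption, not part of the steps above.

```perl
use strict;
use warnings;

# Hypothetical mapping from old image URLs to new ones.
my %newurls = (
    'me.gif'   => 'you.gif',
    'logo.gif' => 'images/logo.png',
);

# Hypothetical helper: remap the SRC inside one tag.
sub remap_src {
    my ($tag) = @_;
    # Keep the old URL when the map has no entry for it.
    $tag =~ s/SRC="(.*?)"/'SRC="' . (exists $newurls{$1} ? $newurls{$1} : $1) . '"'/gie;
    return $tag;
}

print remap_src('<IMG SRC="me.gif" ALT="x">'), "\n";
print remap_src('<IMG SRC="other.gif">'), "\n";
```

Without the exists check, an image that is missing from the hash would end up with an empty SRC.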
    
    Complete code:
    
    open FILE,"filename" or die "Cannot open filename: $!";
    read FILE,$file,100000;
    close FILE;
    
    @lines = split(/<IMG/i,$file);
    $newfile = shift @lines;
    
    foreach $line (@lines) {
      $pos = index($line,'>');
      $tag = substr($line,0,$pos+1);
      $restofline = substr($line,$pos+1);
      $tag =~ s/SRC\=\"(.*?)\"/SRC\=\"$newurls{$1}\"/gi;
      # split removed the "<IMG" delimiter, so put it back
      $newfile .= "<IMG" . $tag . $restofline;
    }

      This is a very old thread, but it is really important for me. I need to modify the href and src attributes for a proxy. The problem is that the links come in several forms: href="/page.html", href="../page.html", href="http://url", href=url, src="/abc.jpg", and so on. I download the page source using Lynx, then I have to rewrite every link so that it goes through the proxy; for example, www.yahoo.com becomes www.abc.com/cgi-bin/proxy.pl?http://www.yahoo.com. Clicking any link on that page then downloads the next page's source, which is processed the same way.
        I tried this code snippet, but it is not working in my case. Is there any help I could get in order to change the src and add an onclick event in an image tag?
