http://www.perlmonks.org?node_id=134076
Category: Web stuff
Author/Contact Info Briac Pilpré
Description:

As strange as it seems, I couldn't find here a code that cleanly modifies HREF attributes in A starting tags in a HTML page.

So here's one that I whipped up quickly to answer a question on fr.comp.lang.perl

It surely could be easily improved to include other links (<script>, <img src="...">, etc.), but you get the idea...

The only (slight) caveats is that the 'a' starting tag is always lowercased and the order of the attributes are lost. But that should not matter at all.
Also, this code won't print 'empty' attributes correctly (though I can't think right now of any empty attributes that are legal with 'a')


To use this script, you have to modify the $new_link variable, and then call the script with the URL of the page to be modified. Every <a href="..."> will have the $new_link added at the start of the href, and the old URL will be properly escaped.

It is probably useless as is, but with a minimum of tweaking, you can easily do what you want.
Actually, it might be a good thing to turn this little script into a module where you would only have to do the URL munging, without worrying about the whole parsing stuff...

#!/usr/bin/perl -w
use strict;
use LWP::Simple qw(get);
use HTML::Parser;
use URI::Escape;

my $new_link = "http://www.baz.com/cgi-bin/doubleclick.cgi?url=";

my $url = $ARGV[0]
    or die "usage: $0 http://www.foo.com/bar.html\n";

my $file     = get($url)
    or die "Cannot get the page '$url'\n";

my $parser = HTML::Parser->new(
    default_h => [ sub { print shift }, 'text' ],
    start_h   => [ \&modify_link, 'tagname, attr, text' ],
)->parse($file);

sub modify_link {
    my ( $tagname, $attr, $text ) = @_;
    print $text and return if $tagname ne 'a';
    $attr->{href} = $new_link . uri_escape( $attr->{href} );
    print '<a', ( map { qq' $_="$attr->{$_}"' } keys %$attr ), '>';
}

__END__
Replies are listed 'Best First'.
(crazyinsomniac) Re: HTML Link Modifier
by crazyinsomniac (Prior) on Dec 24, 2001 at 07:53 UTC
    Here is my take on this. It is pretty similar, except the tag attributes ( src=... alt=...)are printed in the order they appear in the original text (you could also achieve this, if you used 'attrseq'). I however, use HTML::TokeParse and URI, and this one only modifies absolute urls (like links to other http|ftp sites, unlike yours, which will turn a 'mailto:foo@bar.baz.com' into $new_url.'mailto:foo@bar.baz.com'), but can easily be extended to modify any kind of URIs (that is Uniform Resource Identifier, not just Locator's ;D). If you were goning to make this more modular, i'd sugguest not doing it this way (extend HTML::Parser instead, but for a script, this is the perfect strategy, IMHO). Here goes:
    #!/usr/bin/perl -w use strict; use LWP::Simple qw(get); use HTML::TokeParser; use URI; my $new_link = "http://www.baz.com/cgi-bin/doubleclick.cgi?url="; my $url = $ARGV[0] or die "usage: ". __FILE__ ." http://www.foo.com/ba +r.html\n"; my $file = get($url) or die "Cannot get the page '$url'\n"; my $P = HTML::TokeParser->new(\$file); while (my $T = $P->get_token() ) { # 0 1 2 3 4 if($$T[0] eq "S") # ["S", $tag, $attr, $attrseq, $text] { if($$T[1] eq 'img') { &handle_link($T, 'src'); } elsif($$T[1] eq 'a') { &handle_link($T, 'href'); } elsif($$T[1] eq 'script') { &handle_link($T, 'src'); } else # nothing we wanna change { print $$T[4]; } } elsif($$T[0] =~ /^(?:E|PI)$/ ) # end tag | process instruction { print $$T[2]; } elsif($$T[0] =~ /^(?:T|C|D)$/ ) # text | comment | declaration { print $$T[1]; } } # endof while (my $T = $P->get_token) sub handle_link { my ( $T, $A) = @_; my $URI = new URI( $$T[2]->{$A} ); # 0 1 2 3 4 # ["S", $tag, $attr, $attrseq, $text] my $scheme = $URI->scheme(); if( $scheme and $scheme =~ /^(http|ftp)$/i ) # only if its not rel +ative url (yes scheme) { $$T[2]->{$A} = $new_link . $URI->path(); # path is escaped } print "<$$T[1]", ( map { qq| $_="$$T[2]->{$_}"| } @{$$T[3]} ) , '>'; }

     
    ___crazyinsomniac_______________________________________
    Disclaimer: Don't blame. It came from inside the void

    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: HTML Link Modifier
by dmitri (Priest) on Dec 28, 2001 at 00:46 UTC
    You can also look at Apache::ProxyRewrite.