Re: Batch remove URLs

Link rot, what a pain. Here's something very quickly thrown together based upon your expanded criteria.

Create a new directory and copy your htm files into it.
Install the required modules by running the following from the command prompt: cpanm Mojolicious Path::Tiny
Save the code below to the same directory and run it from there, since it works on the current directory.

This code reads each htm file in the directory, parses it with Mojo::DOM, finds all links, and checks each URL with Mojo::UserAgent. If a link looks dead, the parent HTML element is removed. Each file is saved once all of its links have been checked.
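One caveat about the check itself: the result method only dies on connection-level failures (host not found, connection refused, timeout). A server that answers with a 404 still counts as a completed request, so the eval won't catch it. Here is a minimal sketch of a stricter check, assuming you also want 4xx/5xx answers treated as dead. The classify helper is a name I made up for illustration, and bear in mind that some live servers reject HEAD requests, so treat the result as a hint rather than proof.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

# Hypothetical helper, not part of the script in this post.
# Takes a Mojo::Message::Response, or undef if the request died.
sub classify {
    my ($res) = @_;

    # no response at all: DNS failure, refused connection, timeout
    return 'unreachable' unless defined $res;

    # the server answered, but with a client or server error status
    my $code = $res->code // 0;
    return 'http_error' if $code >= 400;

    return 'ok';
}

my $ua = Mojo::UserAgent->new->max_redirects(5);

# URLs to check are taken from the command line
for my $url (@ARGV) {
    my $res = eval { $ua->head($url)->result };
    say "$url: ", classify($res);
}
```

In the script you could then treat anything other than 'ok' as dead, or only 'unreachable' if you'd rather play it safe.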

Example HTML:

<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archnive.org</a></li>
<li><a href="http://sitedoesnotexist9999.net">fakesite</a></li>
</ul>
</body>
</html>

Perl code:

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use Mojo::DOM;
use Mojo::UserAgent;

# get current directory
my $dir = Path::Tiny->cwd;

# for each html file
for ( $dir->children( qr/\.htm$/ ) ){

  # read the contents into a variable
  my $html = path( $_->basename )->slurp;

  # get the dom
  my $dom = Mojo::DOM->new( $html );

  # find all links that have an href (skips named anchors such as <a name="top">)
  for( $dom->find('a[href]')->each ){
    
    # get target href
    my $url = $_->attr('href');
    say "checking link $url";
    
    # use Mojo::UserAgent to check if link is alive
    my $ua  = Mojo::UserAgent->new;
    my $res;
    eval { $res = $ua->max_redirects(5)->head( $url )->result };

    # if an error is thrown
    if ( $@ ){
      warn "$url seems dead, removing parent";
      $_->parent->remove;
    } 

    # play nice
    sleep(10);
    
  }
  # save file  
  path( $_->basename )->spew($dom->content);
}

Example HTML after running program:

<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archnive.org</a></li>

</ul>
</body>
</html>

Since I don't have an example of what you're actually using, and placeholders like [404 Not Found] rarely make sense to keep around, I removed the dead links entirely. However, calling the replace method on the link, rather than remove on its parent, does exactly what you asked for:

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use Mojo::DOM;
use Mojo::UserAgent;

# get current directory
my $dir = Path::Tiny->cwd;

# for each html file
for ( $dir->children( qr/\.htm$/ ) ){

  # read the contents into a variable
  my $html = path( $_->basename )->slurp;

  # get the dom
  my $dom = Mojo::DOM->new( $html );

  # find all links that have an href (skips named anchors such as <a name="top">)
  for( $dom->find('a[href]')->each ){
    
    # get target href
    my $url = $_->attr('href');
    say "checking link $url";
    
    # use Mojo::UserAgent to check if link is alive
    my $ua  = Mojo::UserAgent->new;
    my $res;
    eval { $res = $ua->max_redirects(5)->head( $url )->result };

    # if an error is thrown
    if ( $@ ){
      warn "$url seems dead, updating link";
      $_->replace('[404 Not Found]');
      
    } 

    # play nice
    sleep(10);
    
  }
  
  # save file
  path( $_->basename )->spew($dom->content);
}

Which outputs:

<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archnive.org</a></li>
<li>[404 Not Found]</li>
</ul>
</body>
</html>
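One thing to watch for: the script hands every href it finds to the user agent. Values such as mailto: addresses, relative paths and in-page #fragments can't be meaningfully checked with an HTTP HEAD request, so they may well end up flagged as dead. Here is a rough sketch of a pre-filter, assuming you only want to test absolute http/https URLs. The name is_checkable is made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

# Hypothetical helper: true only for absolute http:// or https:// URLs
sub is_checkable {
    my ($href) = @_;
    return 0 unless defined $href;
    return $href =~ m{\Ahttps?://}i ? 1 : 0;
}

# a few sample href values, including a missing one
my @samples = ( 'http://perlmonks.org', 'mailto:me@example.com', 'page2.htm', '#top', undef );

for my $href (@samples) {
    my $label = $href // '(no href)';
    say "$label => ", is_checkable($href) ? 'check' : 'skip';
}
```

Inside the link loop you would then put something like next unless is_checkable($url); before making the request.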

There's a 10 second sleep between requests so you don't batter the servers you're checking. There is room for optimisation, for example skipping URLs that occur more than once per file, or keeping a list of URLs already tested as working, but I'll leave that as an exercise for you.
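If you want a head start on the caching idea, here is a minimal sketch, assuming a simple in-memory hash is enough for one run. The names check_once and %seen are made up, and the checker below is a stand-in so the example runs without touching the network. In the real script it would wrap the HEAD request.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

# URL => 1 (alive) or 0 (dead); declare it outside the file loop
# so results are shared across all files in the run
my %seen;

# Hypothetical wrapper: $check is any code ref that takes a URL
# and returns true if the link is alive
sub check_once {
    my ( $url, $check ) = @_;
    $seen{$url} = $check->($url) ? 1 : 0 unless exists $seen{$url};
    return $seen{$url};
}

# stand-in checker that counts how often it is really called
my $calls      = 0;
my $fake_check = sub { my ($url) = @_; $calls++; return $url !~ /doesnotexist/ };

for my $url (qw( http://perlmonks.org http://perlmonks.org http://sitedoesnotexist9999.net )) {
    say "$url: ", check_once( $url, $fake_check ) ? 'alive' : 'dead';
}

say "real checks made: $calls";    # prints "real checks made: 2"
```

With something like this in place, the sleep only needs to happen when a URL is actually requested, not on every cached hit.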

Update: small addition.

