PerlMonks  

Batch remove URLs

by bobafifi (Beadle)
on Oct 27, 2017 at 05:01 UTC

bobafifi has asked for the wisdom of the Perl Monks concerning the following question:

A friend of mine made her original website using FrontPage Express 2.0 circa 2000 before moving to Blogger. Today, she's got some 300 URLs from that old site that are returning "404 Not Found". If possible, I'd like to batch remove them (there are about 100 .htm pages) and replace with a simple 404 Not Found notice. I know how to remove individual URLs from the pages using a find/replace one liner, but doing them all in one pass has so far eluded me. Any advice about where to look for more info about how best to go about this would be greatly appreciated. Thanks!

UPDATE 10/27/17
Thanks everybody for all the help and good ideas! Turns out the problems I was having with running the one-liner were with malformed URLs and had nothing to do with the code. Once that was fixed, I just stacked the 300 one-liners on top of each other and it ran perfectly. Like so:
find . -type f -name "*.htm" -print | xargs perl -i -pe 's/http:\/\/example1.com/[404 Not Found]/g'
find . -type f -name "*.htm" -print | xargs perl -i -pe 's/http:\/\/example2.com/[404 Not Found]/g'
find . -type f -name "*.htm" -print | xargs perl -i -pe 's/http:\/\/example3.com/[404 Not Found]/g'
etc.

Replies are listed 'Best First'.
Re: Batch remove URLs
by marto (Archbishop) on Oct 27, 2017 at 11:31 UTC

    Link rot, what a pain. Here's something very quickly thrown together based upon your expanded criteria.

    • Create a new directory, copy your htm files in there.
    • Install the required modules by running the following from the command prompt: cpanm Mojolicious Path::Tiny.

    • Download the code below to the same location. Run the code.

    This code reads the content of each htm file in a directory, parses it with Mojo::DOM, finds all links, and checks each URL with Mojo::UserAgent; if a link looks dead, it removes the parent HTML element, saving the file afterwards.

    Example HTML:

    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <ul>
    <li><a href="http://perlmonks.org">perlmonks</a></li>
    <li><a href="http://archive.org">archnive.org</a></li>
    <li><a href="http://sitedoesnotexist9999.net">fakesite</a></li>
    </ul>
    </body>
    </html>

    Perl code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';

    use Path::Tiny;
    use Mojo::DOM;
    use Mojo::UserAgent;

    # get current directory
    my $dir = Path::Tiny->cwd;

    # for each html file
    for ( $dir->children( qr/\.htm$/ ) ){
        # read the contents into a variable
        my $html = path( $_->basename )->slurp;
        # get the dom
        my $dom = Mojo::DOM->new( $html );
        # find all links
        for ( $dom->find('a')->each ){
            # get target href
            my $url = $_->attr('href');
            say "checking link $url";
            # use Mojo::UserAgent to check if link is alive
            my $ua = Mojo::UserAgent->new;
            my $res;
            eval { $res = $ua->max_redirects(5)->head( $url )->result };
            # if an error is thrown
            if ( $@ ){
                warn "$url seems dead, removing parent";
                $_->parent->remove;
            }
            # play nice
            sleep(10);
        }
        # save file
        path( $_->basename )->spew( $dom->content );
    }

    Example HTML after running program:

    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <ul>
    <li><a href="http://perlmonks.org">perlmonks</a></li>
    <li><a href="http://archive.org">archnive.org</a></li>
    </ul>
    </body>
    </html>

    Since I don't have an example of what you're actually using, and things like [404 Not Found] don't often make sense to keep around, I removed them. However, simply using the replace method rather than remove on the parent does exactly what you want:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';

    use Path::Tiny;
    use Mojo::DOM;
    use Mojo::UserAgent;

    # get current directory
    my $dir = Path::Tiny->cwd;

    # for each html file
    for ( $dir->children( qr/\.htm$/ ) ){
        # read the contents into a variable
        my $html = path( $_->basename )->slurp;
        # get the dom
        my $dom = Mojo::DOM->new( $html );
        # find all links
        for ( $dom->find('a')->each ){
            # get target href
            my $url = $_->attr('href');
            say "checking link $url";
            # use Mojo::UserAgent to check if link is alive
            my $ua = Mojo::UserAgent->new;
            my $res;
            eval { $res = $ua->max_redirects(5)->head( $url )->result };
            # if an error is thrown
            if ( $@ ){
                warn "$url seems dead, updating link";
                $_->replace('[404 Not Found]');
            }
            # play nice
            sleep(10);
        }
        # save file
        path( $_->basename )->spew( $dom->content );
    }

    Which outputs:

    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <ul>
    <li><a href="http://perlmonks.org">perlmonks</a></li>
    <li><a href="http://archive.org">archnive.org</a></li>
    <li>[404 Not Found]</li>
    </ul>
    </body>
    </html>

    There's a 10-second sleep in there; don't batter servers with requests. There is room for optimisation, for example handling the same URL occurring more than once per file, or keeping a list of URLs already tested and found working, but I'll leave that as an exercise for you.

    Update: small addition.

Re: Batch remove "404 Not Found" URLs
by kcott (Bishop) on Oct 27, 2017 at 05:39 UTC

    G'day bobafifi,

    "I know how to remove individual URLs from the pages using a find/replace one liner, but doing them all in one pass has so far eluded me."

    If you'd posted the part that you know, we could suggest how to extend that. Here's an example one-liner to change multiple lines in multiple files:

    $ cat ABC
    A old A
    B old B
    C old C
    $ cat DEF
    D old D
    E old E
    F old F
    $ perl -pi -e 's/old/new/' ABC DEF
    $ cat ABC
    A new A
    B new B
    C new C
    $ cat DEF
    D new D
    E new E
    F new F

    See perlrun for information on the -i and -p switches that I used.

    — Ken

      Thanks Ken!
      Here's what I've been using:
      find . -type f -name "*.htm" -print|xargs perl -i -pe 's/http:\/\/example\.com\/[404 Not Found]/g'

      I'm afraid I haven't described what I'm trying to accomplish very well, sorry.
      1.) I have a list of 300 URLs
      2.) I have a folder on my desktop with 100 .htm pages
      3.) I want to run that list against those 100 pages and remove URLs
      4.) This will leave the <a href tags in place with the text [404 Not Found] (instead of the URL - for example, <a href="[404 Not Found]">[404 Not Found]</a>).

      My plan then (since some of her links have descriptive text and others just the bare link text) was/is to render those dummy tags in the HTML inactive with another find/replace, leaving just <a>[404 Not Found]</a> to display either 404 Not Found or the link's descriptive text in the browser.

      Thanks again Ken - I'll check out the perlrun link

        Assuming that you just want to get the job done and are not pursuing this as an academic exercise, I would abandon the one-liner approach. It can be done that way, but the more you throw into it the messier it gets. Here's one plan:

        1. Store your 300 URLs in a file, one per line (if you haven't already done so). You can then slurp this into an array at the start of your script.
        2. Loop over the files with a simple glob
        3. Inside that loop over all the URLs
        4. Inside the inner loop, call a subroutine with the filename and the URL to replace

        You can now test the inner subroutine in isolation on a test file to your heart's content to get it perfectly right without destroying the initial content. Consider quotemeta for the search terms. If you get stuck with that approach, come back with specific questions, ideally as an SSCCE. Good luck.

        "Here's the what I've been using ... 's/s/http://example.com/[404 Not Found]/g'"

        I doubt it. That won't even compile:

        $ perl -MO=Deparse -e 's/s/http://example.com/[404 Not Found]/g'
        Bareword found where operator expected at -e line 1, near "404 Not"
                (Missing operator before Not?)
        syntax error at -e line 1, near "404 Not Found"
        -e had compilation errors.

        Even assuming the initial "s/s/" was a typo and should have been just "s/", it still doesn't compile:

        $ perl -MO=Deparse -e 's/http://example.com/[404 Not Found]/g'
        Bareword found where operator expected at -e line 1, near "404 Not"
                (Missing operator before Not?)
        Regexp modifiers "/a" and "/l" are mutually exclusive at -e line 1, at end of line
        syntax error at -e line 1, near "404 Not Found"
        -e had compilation errors.

        Perhaps you meant something closer to this:

        $ perl -MO=Deparse -e 's{http://example.com}{[404 Not Found]}g'
        s[http://example.com][[404 Not Found]]g;
        -e syntax OK

        You really need to copy and paste verbatim code. Typing by hand, or making guesses, is extremely error-prone; we can only respond to what you posted (not something different, that was maybe intended, but not actually written). Unfortunately, when one such problem is found, it raises the question of whether other parts are not true representations of the real code, data, output, and so on.

        While you probably could still do this with a one-liner, it's getting a bit complicated for that, and I'd recommend a script. For a simple text substitution, a regex is probably fine; if it's actually more complex than your post suggests, you should find an alternative tool (see "Parsing HTML/XML with Regular Expressions" for a whole raft of options).

        You talk about doing this in two passes; that seems wasteful to me and one pass is easy anyway. You say you want to end up with "<a>[404 Not Found]</a>"; use whatever you want but, in the code below, I've used "<span class="bad-url">[404 Not Found]</span>": that will render as plain text as it is, but allows you to highlight it with CSS if you so desire.

        In the code below I've used Inline::Files purely for demonstration purposes. I'm assuming you're familiar with open. You can presumably get your list of HTML files with "*.htm" on the command line (the find and xargs seems overkill to me, but maybe you have a reason); using glob, within your script, is another option; there's also readdir; and there are many modules you could also use. I've also assumed that your "list of 300 URLs" is also in a file somewhere; however, it's far from clear if that's actually the case.

        In the code below, the technique I'm demonstrating involves creating a hash from your list of URLs once, then substituting links which match one of those URLs. Do note that your post suggests that the href value is the same as the <a> tag content: my code reflects that; modify if necessary.

        #!/usr/bin/env perl -l
        use strict;
        use warnings;

        use Inline::Files;

        my %bad_url;

        while (<URLLIST>) {
            chomp;
            ++$bad_url{$_};
        }

        my $re = qr{(?x:
            (          # capture entire element to \$1
                <a     # match start of 'a' start tag
                \s+    # match whitespace after element name
                href=" # match start of href attribute
                (      # capture href value to \$2
                    [^"]+  # match anything that isn't a "
                )      # end \$2 capture
                "      # match closing "
                \s*    # match optional whitespace
                >      # match end of 'a' start tag
                \s*    # match optional whitespace
                \g2    # match href value (captured in \$2)
                \s*    # match optional whitespace
                </a>   # match 'a' end tag
            )          # end \$1 capture
        )};

        my $replace = '<span class="bad-url">[404 Not Found]</span>';

        for my $fh (\*HTM1, \*HTM2) {
            my $html = do { local $/; <$fh> };
            print '*** ORIGINAL ***';
            print $html;
            $html =~ s/$re/exists $bad_url{$2} ? $replace : $1/eg;
            print '*** MODIFIED ***';
            print $html;
        }

        __URLLIST__
        http://bad1.com/
        http://bad2.com/
        http://bad3.com/
        http://bad4.com/
        __HTM1__
        <h1>HTM1</h1>
        <a href="http://bad1.com/">http://bad1.com/</a>
        <a href="http://good.com/">http://good.com/</a>
        <a href="http://bad2.com/">http://bad2.com/</a>
        __HTM2__
        <h1>HTM2</h1>
        <a href="http://good.com/">http://good.com/</a>
        <a href="http://bad2.com/"> http://bad2.com/ </a>
        <a href="http://good.com/"> http://good.com/ </a>
        <a href="http://bad3.com/" >http://bad3.com/</a>
        <a href="http://bad4.com/">http://bad3.com/</a>
        <a href="http://bad4.com/">http://bad4.com/</a>

        Output:

        *** ORIGINAL ***
        <h1>HTM1</h1>
        <a href="http://bad1.com/">http://bad1.com/</a>
        <a href="http://good.com/">http://good.com/</a>
        <a href="http://bad2.com/">http://bad2.com/</a>
        *** MODIFIED ***
        <h1>HTM1</h1>
        <span class="bad-url">[404 Not Found]</span>
        <a href="http://good.com/">http://good.com/</a>
        <span class="bad-url">[404 Not Found]</span>
        *** ORIGINAL ***
        <h1>HTM2</h1>
        <a href="http://good.com/">http://good.com/</a>
        <a href="http://bad2.com/"> http://bad2.com/ </a>
        <a href="http://good.com/"> http://good.com/ </a>
        <a href="http://bad3.com/" >http://bad3.com/</a>
        <a href="http://bad4.com/">http://bad3.com/</a>
        <a href="http://bad4.com/">http://bad4.com/</a>
        *** MODIFIED ***
        <h1>HTM2</h1>
        <a href="http://good.com/">http://good.com/</a>
        <span class="bad-url">[404 Not Found]</span>
        <a href="http://good.com/"> http://good.com/ </a>
        <span class="bad-url">[404 Not Found]</span>
        <a href="http://bad4.com/">http://bad3.com/</a>
        <span class="bad-url">[404 Not Found]</span>

        — Ken

Re: Batch remove URLs
by haukex (Chancellor) on Oct 27, 2017 at 07:12 UTC
    using a find/replace one liner

    Please see Parsing HTML/XML with Regular Expressions, I think a module like Mojo::DOM will be much more reliable.

    As for batch processing, I'd suggest a module like File::Find::Rule or perhaps Path::Class's recurse method (although the find method you showed here works too, if your script only processes one file at a time).

    For that list of 300 URLs, you could perhaps build a regex out of them which you could then use to search (after having safely parsed the file with one of the modules mentioned in the link above ;-) ). If you need advanced handling of URLs, use the URI module.

Re: Batch remove URLs
by marto (Archbishop) on Oct 29, 2017 at 17:04 UTC

    "Once that was fixed, I just stacked the 300 one-liners on top of each other.."

    Yikes

    "..and it ran perfectly"

    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <ul>
    <li><a href="http://perlmonks.org">perlmonks</a></li>
    <li><a href="http://archive.org">archnive.org</a></li>
    <li><a href="http://example1.com">test</a></li>
    </ul>
    </body>
    </html>

    Becomes:

    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <ul>
    <li><a href="http://perlmonks.org">perlmonks</a></li>
    <li><a href="http://archive.org">archnive.org</a></li>
    <li><a href="[404 Not Found]">test</a></li>
    </ul>
    </body>
    </html>

    So when 'test' is clicked it will return something along the lines of:

    The requested URL /[404 Not Found] was not found on this server

    Rather than replacing the link with the text (<li>[404 Not Found]</li>), you've swapped one problem for another rather than fixing it.

       > Rather than replacing the link with the text (<li>[404 Not Found]</li>), you've swapped one problem for another rather than fixing it.

      That was just Step 1... (please see http://www.perlmonks.org/?node_id=1202118)

      Step 2: Find/Replace <a href="[404 Not Found]"> with <a> (which renders the links inactive, does it not?).

Node Type: perlquestion [id://1202112]
Approved by kcott
Front-paged by Corion