Solving possible missing links

by talexb (Canon)
on Jul 16, 2019 at 13:59 UTC ( #11102923=monkdiscuss )

This node was up for moderation recently, and the cause was bitrot (out of date links).

Would it be useful to go through the node database, and do a check for external links to see if they can be updated in the same way?

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Replies are listed 'Best First'.
Re: Solving possible missing links
by Corion (Pope) on Jul 16, 2019 at 18:33 UTC

    Going through the links pointing to external sites is also interesting for scanning for spam links, so this is indeed desirable.

    I currently lack the time to do it myself, and database access is somewhat scarce, but the relevant DB schema is (roughly):

    create table node (
        node_id       integer not null unique primary key,
        type_nodetype integer not null references node,
        author_user   integer not null references node,
        lastedit      timestamp
    );
    create table user (
        user_id integer not null references node
    );
    create table note (
        note_id integer not null unique primary key references node(node_id),
        doctext text not null default ''
    );

    And Real, Working SQL to query these tables is (also at Replies with outbound links, but that's for gods only to access):

    select node_id, doctext
      from node
      left join document on node_id = document_id
     where lastedit > date_sub( current_date(), interval 10 day )
       and type_nodetype = 11 -- note
       and doctext like '%http://%'
     order by lastedit desc

    This SQL should be refined to also catch https:// links, and then some Perl code needs to be written to verify that the text is an actual link.

    Test cases for text with links would be for example:

    <p>It's right [https://cpants.cpanauthors.org/release/GWHAYWOOD/sendmail-pmilter-1.20_01|here].</p>
    ---
    <a href="http://www.groklaw.net">Groklaw</a>
    ---
    [href://http://www.perlmonks.org/?node=Tutorials|Monastery Tutorials]

    Negative test cases would be:

    <P><A>http://matrix.cpantesters.org/?dist=sendmail-pmilter%201.20_01</A></P>
    ---
    "http://localhost:3000"
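
    Not working site code, just a rough sketch of that verification step run against the sample doctext above -- the link-markup patterns and the choice of HTTP::Tiny are my assumptions:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTTP::Tiny;

    # One of the positive test cases from above.
    my $doctext = q{<p>It's right [https://cpants.cpanauthors.org/release/GWHAYWOOD/sendmail-pmilter-1.20_01|here].</p>};

    # Pull out URLs that are part of actual link markup: <a href="..."> anchors
    # and PerlMonks-style [url|text] / [href://url|text] links. Bare URLs in
    # plain text (the negative cases above) are deliberately not matched.
    sub extract_link_urls {
        my ($text) = @_;
        my @urls;
        push @urls, $text =~ m{<a\b[^>]*\bhref\s*=\s*"(https?://[^"]+)"}gi;
        push @urls, $text =~ m{\[(?:href://)?(https?://[^|\]\s]+)\s*\|}gi;
        return @urls;
    }

    # HEAD-request each link and report whether it still resolves.
    my $ua = HTTP::Tiny->new( timeout => 10, agent => 'pm-linkcheck/0.1' );
    for my $url ( extract_link_urls($doctext) ) {
        my $res = $ua->head($url);
        printf "%s  %s\n", $res->{success} ? 'ok' : "dead ($res->{status})", $url;
    }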

    Ideally, we will be able to refine this code later to highlight outbound links that are not on the whitelist of Perlmonks links.
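
    A first cut at that highlighting could just compare each link's host against a short allow-list; the list below is only a placeholder, and the use of URI is my assumption:

    use strict;
    use warnings;
    use URI;

    # Placeholder allow-list of hosts we never need to flag (not an official list).
    my %allowed_host = map { $_ => 1 } qw( perlmonks.org www.perlmonks.org );

    # A couple of links lifted from the test cases above.
    my @urls = (
        'http://www.perlmonks.org/?node=Tutorials',
        'http://www.groklaw.net',
    );

    for my $url (@urls) {
        my $host = eval { URI->new($url)->host } // '';
        print "outbound link to review: $url\n" unless $allowed_host{$host};
    }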

      Right -- this is what I had in mind. If there's a way to find and repair dead links, this would be a useful process to run quarterly, perhaps on weekends.

      Just trying to find a general solution to keep the site's content as healthy as possible. I still love going back to read old posts. :)

      Alex / talexb / Toronto

      Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

        There are some solutions in Batch remove URLs; I'm sure it's come up a few times. You could tweak them to flag URLs which aren't accessible, perhaps linking to a couple of the well-known archives if they have an entry.
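
        If one of those archives is the Wayback Machine, a quick lookup could use its availability endpoint; this is only a sketch, and the endpoint's exact JSON shape is my assumption from its public docs:

        use strict;
        use warnings;
        use HTTP::Tiny;
        use JSON::PP qw( decode_json );

        # Ask the Internet Archive whether it holds a snapshot of a (dead) URL.
        # The URL should really be escaped before being put in the query string.
        sub wayback_snapshot {
            my ($url) = @_;
            my $res = HTTP::Tiny->new( timeout => 10 )
                ->get( 'https://archive.org/wayback/available?url=' . $url );
            return unless $res->{success};
            my $closest = decode_json( $res->{content} )->{archived_snapshots}{closest}
                or return;
            return $closest->{available} ? $closest->{url} : ();
        }

        my $snapshot = wayback_snapshot('http://www.groklaw.net');
        print $snapshot ? "archived copy: $snapshot\n" : "no archived copy found\n";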

Re: Solving possible missing links
by jdporter (Canon) on Jul 17, 2019 at 14:19 UTC

    Note that we should only do this for tutorials and such, certainly not for ordinary user writeups (questions, meditations, etc.)
