
Finding dates from web pages

by cormanaz (Chaplain)
on Feb 24, 2020 at 19:46 UTC

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks. I have a bunch of URLs of news articles and need to get the publication dates from them, if available. There is a Python library designed especially for this purpose. I'm wondering if there is a similar Perl module. I've searched around, and the only thing I found was Web::Scraper, which would take quite a bit of rule development to do the job. I'm hoping someone has done that work already.

Re: Finding dates from web pages
by Corion (Pope) on Feb 24, 2020 at 20:22 UTC

    I also know of another Python library, article-date-extractor, which has a set of regular expressions.

    I haven't ported it to Perl though.

Re: Finding dates from web pages
by talexb (Canon) on Feb 25, 2020 at 13:18 UTC

    Can you get what you want just from doing a HEAD request on the web page? That would give you the Last-Modified header, I think. I'm not sure if that's exactly what you want.
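A minimal sketch of the HEAD idea, assuming only core modules (HTTP::Tiny and Time::Piece). The helper names `parse_http_date` and `last_modified` are made up for illustration. Note that not every server sends a Last-Modified header, and when one is present it reflects when the resource last changed, which may not be the publication date.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use Time::Piece;

# Hypothetical helper: parse an HTTP date such as
# "Wed, 21 Oct 2015 07:28:00 GMT" into a Time::Piece object (UTC).
# Returns undef if the value is missing or does not parse.
sub parse_http_date {
    my ($value) = @_;
    return unless defined $value;
    # strptime croaks on malformed input, so trap that with eval.
    my $t = eval { Time::Piece->strptime($value, '%a, %d %b %Y %H:%M:%S GMT') };
    return $t;
}

# Hypothetical helper: issue a HEAD request and return the parsed
# Last-Modified date, or undef if the request fails or the header is absent.
sub last_modified {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 10)->head($url);
    return unless $res->{success};
    # HTTP::Tiny lower-cases header names; repeated headers arrive as an array ref.
    my $value = $res->{headers}{'last-modified'};
    $value = $value->[0] if ref $value eq 'ARRAY';
    return parse_http_date($value);
}
```

Usage would look something like `my $t = last_modified($url); print $t->ymd if $t;`.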

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      That works for some pages that have known meta fields like pubdate or time, but many web pages don't use them. I think the Python library applies some heuristics in such cases.
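A rough sketch of what checking such fields might look like in Perl, with a URL-based fallback for pages that lack them. This is not a port of the Python library: the list of meta keys is an illustrative guess, the function names `date_from_html` and `date_from_url` are invented here, and regex-based HTML matching is a shortcut that a real HTML parser would handle more robustly.

```perl
use strict;
use warnings;

# Illustrative, incomplete list of meta names/properties that sites
# sometimes use for the publication date (an assumption, not a standard).
my @META_KEYS = qw(article:published_time pubdate publishdate date datePublished);

# Return the raw date string from a known <meta> tag or the first
# <time datetime="..."> element, or undef if neither is present.
sub date_from_html {
    my ($html) = @_;
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        my %attr;
        # Collect quoted attribute values, keyed by lower-cased attribute name.
        $attr{lc $1} = $2 while $attrs =~ /([\w:-]+)\s*=\s*["']([^"']*)["']/g;
        my $key = $attr{property} // $attr{name} // $attr{itemprop};
        next unless defined $key && defined $attr{content};
        return $attr{content} if grep { lc $key eq lc $_ } @META_KEYS;
    }
    return $1 if $html =~ /<time\b[^>]*\bdatetime\s*=\s*["']([^"']+)["']/i;
    return;
}

# Heuristic fallback: many news sites embed /YYYY/MM/DD/ in the article URL.
sub date_from_url {
    my ($url) = @_;
    return "$1-$2-$3" if $url =~ m{/((?:19|20)\d\d)/(\d\d)/(\d\d)(?:/|$)};
    return;
}
```

One could try `date_from_html` first and fall back to `date_from_url`, then hand the resulting string to a date parser of choice.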

        "I think the Python library applies some heuristics in such cases"

        You could look at the Python code and implement the same thing in Perl. Let me know if you get stuck.
