Trivial HTML extractor utility
by Dominus (Parson) on Nov 22, 2007 at 06:17 UTC
I have a really useful trivial utility, called linkx, that is basically just a command-line wrapper around HTML::LinkExtor: you give it the name of an HTML file and it extracts and prints all the URLs in the file. I use this all the time:
You can tell this is really old because it uses two-argument open.
The -b base flag interprets all URLs relative to the base URL base and prints out the absolute versions. The -t tag flag restricts the program to printing only URLs that appear in that kind of element, instead of all links. I had totally forgotten that the -t feature was in there. I wonder if it's useful?
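For readers who want to see the idea in runnable form, here is a minimal sketch in Python using only the standard library. It is not the linkx program itself, which is Perl built on HTML::LinkExtor. The names LinkCollector, extract_links, and URL_ATTRS are made up for illustration, the list of URL-bearing attributes is an assumption and certainly incomplete, and the base and only_tag parameters only imitate what the -b and -t flags are described as doing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URLs. This is an assumed, incomplete list;
# HTML::LinkExtor knows about many more tag/attribute pairs.
URL_ATTRS = {"href", "src", "action", "background", "cite", "longdesc"}


class LinkCollector(HTMLParser):
    """Collect URL-valued attributes from start tags (hypothetical helper)."""

    def __init__(self, base=None, only_tag=None):
        super().__init__()
        self.base = base  # imitates -b: resolve links against this URL
        self.only_tag = only_tag.lower() if only_tag else None  # imitates -t
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if self.only_tag and tag != self.only_tag:
            return
        for name, value in attrs:
            if name in URL_ATTRS and value:
                if self.base:
                    value = urljoin(self.base, value)
                self.links.append(value)


def extract_links(html, base=None, only_tag=None):
    """Return the URLs found in html, in document order."""
    collector = LinkCollector(base, only_tag)
    collector.feed(html)
    collector.close()
    return collector.links
```

Calling extract_links(text, base="http://example.org/dir/") would turn a relative link such as pic.png into http://example.org/dir/pic.png, and only_tag="a" would keep only links found in a elements.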
Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:
hoping that the title element would be alone on a line. (GET is a utility that comes with Perl's LWP suite; you give it a URL and it fetches the document and prints it.)
This was a complete failure. Not only were the title elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.
So I wrote the following item, htmlx, which solved the problem:
You give this a tag name, and then it reads HTML from standard input and prints the contents of all the elements with this tag. My Reddit searcher that didn't work became:
which did work. Hooray!
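The same idea can be sketched in Python with the standard library parser. This is an illustration under assumptions, not the htmlx code: TagContents and tag_contents are hypothetical names, and the sketch returns only the text inside each matching element, dropping any nested markup, which may or may not match what htmlx prints. Because it parses the document rather than matching lines, it does not care whether the page has any line breaks.

```python
from html.parser import HTMLParser


class TagContents(HTMLParser):
    """Gather the text inside every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.depth = 0      # how many matching elements are currently open
        self.buffer = []    # text pieces seen inside the open element
        self.results = []   # one string per outermost matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Outermost matching element just closed: emit its text.
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        # Keep text only while inside a matching element.
        if self.depth > 0:
            self.buffer.append(data)


def tag_contents(html, tag):
    """Return the text contents of all elements named tag, in order."""
    parser = TagContents(tag)
    parser.feed(html)
    parser.close()
    return parser.results
```

Feeding this function the text read from standard input and printing each result on its own line would give a filter that can sit at the end of a pipeline, with the tag name taken from the command line.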
I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a GET url | htmlx title pipeline, I'll probably demote it from htmlx to just TITLE, but it's too soon to tell if that's a good idea.
I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!