I have a really useful trivial utility, called linkx, that is basically just a command-line wrapper around HTML::LinkExtor: you give it the name of an HTML file and it extracts and prints all the URLs in the file. I use this all the time:

    #!/usr/bin/perl

    use HTML::LinkExtor;
    use Getopt::Std;

    getopts('b:t:');
    @ARGV = '-' unless @ARGV;
    for my $file (@ARGV) {
      extract($file);
    }

    sub extract {
      my $file = shift;
      unless (open F, "< $file") {
        warn "Couldn't open file $file: $!; skipping\n";
        return;
      }
      my $p = HTML::LinkExtor->new(undef, $opt_b);
      while (read F, my $buf, 8192) {
        $p->parse($buf);
      }
      for my $ln ($p->links) {
        my @ln = @$ln;
        my $tag = shift @ln;
        next if $opt_t && lc($opt_t) ne lc($tag);
        while (@ln) {
          shift @ln;
          my $url = shift @ln;
          print $url, "\n" unless $seen{$url}++;
        }
      }
    }

You can tell this is really old because it uses two-argument open.
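
For comparison, here is a minimal sketch of how the file opening might look with three-argument open. The helper name open_input is hypothetical and not part of linkx. One thing to keep in mind: the two-argument form treats '-' as standard input, which the default `@ARGV = '-'` above relies on, so a three-argument version has to handle that case itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, not part of the original linkx: open a file for
# reading with three-argument open. Unlike the two-argument form, this
# does not treat '-' specially, so standard input is handled by hand.
sub open_input {
    my ($file) = @_;
    return \*STDIN if $file eq '-';
    open(my $fh, '<', $file) or return;
    return $fh;
}

# Demonstration on a throwaway file so the sketch runs without arguments.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} qq{<a href="x.html">x</a>\n};
close $tmp;

if (my $fh = open_input($path)) {
    my $total = 0;
    while (my $n = read($fh, my $buf, 8192)) {
        $total += $n;
    }
    close $fh;
    print "read $total bytes\n";
}
else {
    warn "Couldn't open file $path: $!; skipping\n";
}
```

The lexical filehandle also replaces the global bareword F, so nothing leaks between calls.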

The -b base flag interprets all URLs relative to the base URL base and prints out the absolute versions. The -t tag flag restricts the program to printing only the URLs that appear in that kind of element, instead of all links. I had totally forgotten that the -t feature was in there. I wonder if it's useful?
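
To make the two flags concrete, here is a small sketch that runs the same logic over an in-memory string instead of a file. The helper extract_links and the example.com URLs are made up for illustration; $base and $only_tag stand in for the -b and -t options. It assumes HTML::LinkExtor is installed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper for illustration; not part of linkx itself.
# $base and $only_tag play the roles of the -b and -t options.
sub extract_links {
    my ($html, $base, $only_tag) = @_;
    my $p = HTML::LinkExtor->new(undef, $base);
    $p->parse($html);
    $p->eof;
    my (@urls, %seen);
    for my $link ($p->links) {
        # Each link is [ tag, attr1 => url1, attr2 => url2, ... ]
        my ($tag, @pairs) = @$link;
        next if $only_tag && lc($only_tag) ne lc($tag);
        while (my ($attr, $url) = splice(@pairs, 0, 2)) {
            push @urls, "$url" unless $seen{$url}++;
        }
    }
    return @urls;
}

my $html = '<a href="a.html">A</a><img src="/pic.png"><a href="a.html">again</a>';

# Like "linkx -b http://example.com/dir/ -t a"
print "$_\n" for extract_links($html, 'http://example.com/dir/', 'a');
```

With a base given, the relative a.html is printed once, in absolute form; without the tag filter, the img URL would be listed as well.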

Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:

    for i in `cat reddit`; do
      GET $i | grep -i title
    done

hoping that the title element would be alone on a line. (GET is a utility that comes with Perl's LWP suite; you give it a URL and it fetches the document and prints it.)

This was a complete failure. Not only were the title elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.

So I wrote the following item, htmlx, which solved the problem:

    #!/usr/bin/perl

    use HTML::TreeBuilder;

    my @tags = @ARGV;
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file(\*STDIN);
    my @elements = $tree->find(@tags);
    for (@elements) {
      my $s = $_->as_text;
      $s =~ tr/\n/ /;
      print "$s\n";
    }

You give this a tag name (or several), and then it reads HTML from standard input and prints the text content of every element with that tag. My Reddit searcher that didn't work became:

    for i in `cat reddit`; do
      GET $i | htmlx title
    done

which did work. Hooray!
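
The reason this works where grep did not is that the parser finds elements by structure rather than by line. A rough sketch of the same idea on an in-memory page with no line breaks at all follows; the helper element_text and the sample HTML are invented for illustration, and HTML::TreeBuilder is assumed to be installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper for illustration: return the text of every element
# with one of the given tag names, with newlines flattened to spaces.
sub element_text {
    my ($html, @tags) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;
    my @text = map {
        my $s = $_->as_text;
        $s =~ tr/\n/ /;
        $s;
    } $tree->find(@tags);
    $tree->delete;    # free the tree explicitly once the text is copied out
    return @text;
}

# One long line, as the Reddit pages appeared to be.
my $html = '<html><head><title>An example page</title></head>'
         . '<body><p>no line breaks here</p></body></html>';

print "$_\n" for element_text($html, 'title');
```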

I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a GET url | htmlx title pipeline, I'll probably demote it from htmlx to just TITLE, but it's too soon to tell if that's a good idea.

I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!


In reply to Trivial HTML extractor utility by Dominus
