Twitter recently removed the ability to get search results as an RSS feed, as part of their API update of 11 June 2013.
I found those feeds rather useful, so I made a little screen scraper that reimplements the functionality without needing to authenticate against their API (it just pulls the results out of the web search page). I guess this will keep working for a while longer, at least long enough to switch to StatusNet, identi.ca, or whatever.
It might be of use to some others in the monastery, and it illustrates the power of HTML::TreeBuilder::XPath.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
use Readonly;
use HTML::TreeBuilder::XPath;
use LWP::Simple;
use POSIX qw(strftime);
binmode STDOUT, ':encoding(UTF-8)';
Readonly my $BASEURL => 'https://twitter.com';
Readonly my $USAGE => "$0 <search_term>: make an rss of a twitter search";
die $USAGE unless $#ARGV==0;
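use URI::Escape qw(uri_escape);
my $term = $ARGV[0];
# escape the term so spaces, '#' and '&' survive in the query string
my $content = get("$BASEURL/search?q=" . uri_escape($term) . "&src=typd");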
die "Couldn't get search results" unless defined $content;
my @items;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;    # signal end of input so the tree is finalized
my $tweets = $tree->findnodes( '//li' . class_contains('js-stream-item') );
for my $li (@$tweets) {
    my $tweet = $li->findnodes( './div'
        . class_contains("tweet")
        . '/div'
        . class_contains("content") )->[0];
    next unless $tweet;     # skip items that don't match the expected markup
    my $header = $tweet->findnodes( './div' . class_contains("stream-item-header") )->[0];
    next unless $header;
    my $body = $tweet->findvalue( './p' . class_contains("tweet-text") );
    # a literal "]]>" in the tweet would end the CDATA section early, so split it
    $body =~ s/]]>/]]]]><![CDATA[>/g;
    $body = "<![CDATA[$body]]>";
    my $avatar   = $header->findvalue( './a/img' . class_contains("avatar") . "/\@src" );
    my $fullname = $header->findvalue( './a/strong' . class_contains("fullname") );
    my $username = '@' . $header->findvalue( './a/span' . class_contains("username") . '/b' );
    my $uri = $BASEURL . $header->findvalue( './small'
        . class_contains("time")
        . '/a'
        . class_contains("tweet-timestamp")
        . '/@href'
    );
    my $timestamp = $header->findvalue( './small'
        . class_contains("time")
        . '/a'
        . class_contains("tweet-timestamp")
        . '/span/@data-time'
    );
    # RSS wants RFC 822 style dates
    my $pub_date = strftime( "%a, %d %b %Y %H:%M:%S %z", localtime($timestamp) );
    push @items, {
        username    => $username,
        fullname    => $fullname,
        link        => $uri,
        guid        => $uri,
        title       => $body,
        description => $body,
        timestamp   => $timestamp,
        pubDate     => $pub_date,
    };
}
$tree->delete;
# now print as an rss feed
print<<ENDHEAD
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:twitter="http://api.twitter.com" version="2.0">
<channel>
<title>Twitter Search / $term </title>
<link>https://twitter.com/search?q=$term</link>
<description>Twitter search for: $term.</description>
<language>en-us</language>
<ttl>40</ttl>
ENDHEAD
;
for (@items) {
print<<ENDITEM
<item>
<title>$_->{username}: $_->{title}</title>
<description>$_->{description}</description>
<pubDate>$_->{pubDate}</pubDate>
<guid>$_->{guid}</guid>
<link>$_->{link}</link>
<twitter:source/>
<twitter:place/>
</item>
ENDITEM
;
}
print<<ENDRSS
</channel>
</rss>
ENDRSS
;
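# build an XPath predicate that matches one whole class name
# inside a space-separated class attribute
sub class_contains {
    my $classname = shift;
    return "[contains(concat(' ',normalize-space(\@class),' '),' $classname ')]";
}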
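
For anyone wondering why class_contains bothers with concat and normalize-space instead of a plain contains(@class, ...): the padding makes the predicate match a whole class name rather than any substring. Here is a minimal sketch of the difference. The HTML fragment and its class names are made up for illustration and are not real Twitter markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Same helper as in the script above: match one whole class token.
sub class_contains {
    my $classname = shift;
    return "[contains(concat(' ',normalize-space(\@class),' '),' $classname ')]";
}

# Hypothetical fragment, invented for this example.
my $html = <<'HTML';
<ul>
<li class="js-stream-item stream-item">first</li>
<li class="stream-item">second</li>
<li class="not-js-stream-item-at-all">third</li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Token match: the class list is padded with spaces, so only a whole
# class name surrounded by spaces can match.
my @exact = map { $_->as_text }
    $tree->findnodes( '//li' . class_contains('js-stream-item') );

# Naive substring match, for comparison.
my @naive = map { $_->as_text }
    $tree->findnodes( q{//li[contains(@class,'js-stream-item')]} );

$tree->delete;

print "token match:     @exact\n";
print "substring match: @naive\n";
```

The token match should pick up only the first item, while the substring version also catches the third one, whose class merely contains the text. That is the kind of false positive the helper is there to avoid.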