Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Duct taping spam-bot protection to a web forum

by aitap (Curate)
on Aug 31, 2014 at 14:24 UTC ( #1099122=CUFP: print w/replies, xml ) Need Help??

There is a free web hosting which offers a third-level domain name and an installation of their proprietary CMS. It is somewhat widely known in the ex-USSR countries. They have "Web 2.0" AJAX interface, a lot of modules for nearly everything, from a simple forum to a web shop, and a primitive read-only API. I happen to be moderating one of such forums. Despite not being popular in terms of human population it has recently gained a lot of popularity among spam-sending robots.

At first they all were making the same mistake of posting messages with titles equal to their nicknames, and so the first version of bothunter.pl was born. It employed link parsing routines of WWW::Mechanize and reproduced a sniffed AJAX request by some black magic of parsing JavaScript source for variables. Needless to say, soon it broke, both because JavaScript source slightly changed and because bots became slightly smarter, so the moderators went back to deleting bots manually.

Yesterday I thought: with PhantomJS, I could mimic the browser and click all these AJAX buttons required to ban a user. As for the spam, maybe it's possible to count unique words in a message and warn if some of them is repeated a lot, and a list of stop-words could help, too... Before I started thinking of ways to automatically build a list of stop words from spam messages I realised that I was reiventing the wheel and searched for spam detection engines.

My first try was Mail::SpamAssassin, because it's written in Perl and I heard a lot of stories about plugging it into other programs. It turned out to be not so easy to make it work with plain text (non-mail) messages, so I searched for alternatives. It is Mail::SpamAssassin, after all. Bogofilter is not written in Perl, but still was easy to plug in my program, thanks to its -T option, and it happily works with plain text without complaining.

Interfacing with the site was not so easy. Banning a spam robot (click-click-tab-tab-"spam robot"-tab-space-tab-space) exploded into a mess of ->clicking xpath-found elements; at one time the site refused to register my click no matter how I tried, so I had to call the corresponding JS function manually; in the other place of program I find myself logged out, and the only way to get back in is to load the page I'm going to delete, load the login page, log in, then load the to-be-deleted page again. Kludges. Ew.

So, here it is: the second version of bot hunter Perl program. I hope it won't break as fast as the first one. Or, at least, will be easier to fix.

#!/usr/bin/perl use warnings; use strict; use WWW::Mechanize::PhantomJS; use IPC::Open2; use Text::Table; use IO::Prompter; use Encode; use Encode::Locale; binmode STDOUT, ":encoding(console_out)"; # login my $login = "login"; my $password = "password"; # backend my $phantomjs = "/path/to/phantomjs"; # url my $index = "http://example.org/"; my $newmsg = "http://example.org/forum/0-0-1-34"; # bogofilter my $bogofilter = "/usr/bin/bogofilter"; # display my $text_trim_length = 35; my $www = WWW::Mechanize::PhantomJS::->new( autodie => 1, launch_exe => $phantomjs, report_js_errors => 1, ); my @threads; auth(); load_messages(); my %actions = ( display => \&print_table, quit => sub { exit }, learn => \&learn_message, rampage => \&remove_spam, reload => \&load_messages, ); print_table(); while (my $mode = prompt 'What to do now?', -menu => [ keys %actions ] +) { $actions{$mode}->(); } sub screenshot { my ($name) = $_[0] || "test.png"; open my $file, ">:raw", $name; print $file $www->content_as_png(); close $file; } sub feed_bogofilter { my ($text, @params) = @_; my $pid = open2(my $read, my $write, $bogofilter, @params) or die +"$bogofilter @params: $!\n"; binmode $write, ":encoding(locale)"; # XXX teach it manually with +the same encoding print $write $text; close $write; defined (my $reply = <$read>) or return; # this happens when we le +arn spam chomp $reply; waitpid $pid, 0; $reply =~ /^([SHU])\s([\d.+-e]+)$/ or return; my ($status, $score) = ($1, $2); return ($status, $score); } sub compute_scores { print "Computing spam scores\n"; @{$_}{"status","score"} = feed_bogofilter("$_->{title}\n$_->{text} +", "-T") for @threads; @threads = sort { $b->{score} <=> $a->{score} } @threads; } sub print_table { compute_scores(); my $table = Text::Table::->new("#", "Status", "Score", "Title", "T +rimmed text"); my $i = 0; $table->load( map { my $text = $_->{text}; $text =~ tr/\n/ /; $text = substr($text, 0, $text_trim_length); [ $i++, @{$_}{"status", "score", "title"}, $text ]; } @threads); print $table; } sub learn_message { my $num = prompt "Message number?", -integer => [0..$#threads]; print '-'x20, "\n", $threads[$num]->{title}, "\n", $threads[$num]->{text}, "\n", '-'x20, "\n", ; my $param = prompt "Spam or ham?", -menu => { "Spam" => "s", "Ham" + => "n" }; feed_bogofilter($threads[$num]->{title}."\n".$threads[$num]->{text +}, "-$param"); } sub remove_spam { # WARNING # This subroutine is full of kludges for (grep { $_->{status} eq "S" } @threads) { print "Loading page $_->{url}\n"; $www->get($_->{url}); # at this point I'm somehow logged out auth(); # another piece of duct tape $www->get($_->{url}); ($www->xpath('//*[@class="banDo"]', one => 1))[-1]->click; # w +e always work with the last message print "Loading ban window\n"; timeout(sub {$www->xpath('//input[@id="a2"]', single => 1)->cl +ick}); # [+] raise ban level $www->xpath('//textarea[@name="reason"]', single => 1)->send_k +eys("spambot"); # ban reason $www->xpath('//input[@id="ever"]', single => 1)->click; # ban +for ever #$www->click({ xpath => q{//*[contains(concat(' ', normalize-s +pace(@class), ' '), ' myBtnCont ')]} }); # does not click print "Sending ban request\n"; $www->eval_in_page(q|_uPostForm('frm982',{type:'POST',url:'htt +p://example.org/index/'});|); print "Loading remove page\n"; my $last_post_bottom = ($www->xpath('//*[@class="postBottom"]' +))[-1]; ($www->xpath('.//a', node => $last_post_bottom))[-2]->click; # + second-from-last link in the postBottom block is "delete thread" print "Sending remove request\n"; $www->xpath('//input[@name="sbm"]', single => 1)->submit; } } sub timeout { my ($sub, $tries) = @_; $tries ||= 5; my $try = 0; until( eval { $sub->() }) { screenshot('fail.png'); no warnings 'once'; $DB::single++; die join ":", (caller(1))[1..3]. " failed to execute after $tr +y tries" if $try++ > $tries; sleep 1; } } sub auth { print "Logging in\n"; $www->get($index)->is_success or die "Failed to get $index"; # aut +odie does not work $www->submit_form(with_fields => {user => $login, password => $pas +sword}); } sub load_messages { print "Listing new messages\n"; $www->get($newmsg); @threads = map { { # they are mostly new threads with one message and no repl +ies # so title might be relevant title => ( eval { $www->xpath( './/a[@class="threadLink"]', node => $_, single => 1, ) } || $www->xpath( './/a[@class="threadPinnedLink"]', node => $_, single => 1, ) )->get_text, url => $www->xpath( './/a[@class="forumLastPostLink"]', node => $_, single => 1, )->get_attribute("href"), } } $www->xpath('//td[@class="threadIcoTd"]/..'); $|++; print "Loading messages"; for (@threads) { $www->get($_->{url}); # it is also possible that it's an appended spam message in an +other thread my $last_message = ($www->xpath('//span[@class="ucoz-forum-pos +t"]', one => 1))[-1]; $_->{text} = $last_message->get_text; print "."; } print "\n"; }
I hope that this was not too thedailywtf-esque, was it?

Replies are listed 'Best First'.
Re: Duct taping spam-bot protection to a web forum
by jacekw (Initiate) on Sep 15, 2014 at 11:45 UTC
    Thanks a lot. I'm so tired of removing all the spam and all the automated posting software.
    --
    Jack W.
    BizDb.co.uk

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://1099122]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2018-08-20 06:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Asked to put a square peg in a round hole, I would:









    Results (190 votes). Check out past polls.

    Notices?