best approach to parse vichan-style imageboard

by lis128 (Novice)
on Mar 17, 2019 at 15:25 UTC (#1231362=perlquestion)

lis128 has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks!

I need to write a simple notifier so that I am always up to date with replies in certain threads.
I am aware of WebService::Vichan, but I have already started on an HTML::TokeParser::Simple approach, which I don't want to abandon yet.
I believe that by using simpler tools I will learn more, and that the whole code will be more efficient than using Super::Duper::Module->do_everything(\$data);

Given the partial document below:

<div class="post reply body-not-empty" id="reply_8735435">
(cut out for visibility)
<p class="body-line ltr ">The first 3 lines were 15% bait power, but then it fell to mere 5% and the last lines are literally 0%, try again in a few days.</p>
</div>
(cut out for visibility)
<div class="post reply body-not-empty" id="reply_8735439">
(cut out for visibility)
<div class="body" >
<p class="body-line ltr "> <a onclick="highlightReply('8735417', event);" href="/b/res/8735417.html#8735417">&gt;&gt;8735417</a> </p>
<p class="body-line ltr quote">&gt;Reddit is a great place for discourse and there are many active subreddits where field professionals regularly answer questions on issues of health, science, engineering, etc</p>
<p class="body-line ltr ">Yeah, as far as content goes, Reddit kicks 8chan's ass. They have some great boards for serious academic discussion.</p>
<p class="body-line empty ">

I want to iterate over the divs whose id looks like "reply_xxx" and, once one is found, descend into it to rip out the whole div with class "body".
Then proceed to the next reply-like div, until EOF.
Simple? Nope :P

The issue I am running into is TokeParser's cursor: an internal state indicator that "knows" where in the document the parser currently is.
Using this:

my $parser = HTML::TokeParser::Simple->new(\$data);
while (my $div = $parser->get_tag('div', '/div')) {
    my $id = $div->get_attr('id');
    next unless (defined $id and $id =~ /reply/);
    # here the cursor is inside the reply tag,
    # so iterate deeper
    while (my $inner_div = $parser->get_tag('div', '/div')) {
        my $inner_class = $inner_div->get_attr('class');
        next unless (defined $inner_class and $inner_class eq 'body');
        #~ # print "div.$id > div.$inner_class \n";
        my $text = $parser->get_text;
        print "$id: '$text' \n";
        #~ # print $id ." ";
    }
}

gives a result where only the first id is matched and the inner while loop iterates over all replies' bodies until EOF.
Obviously that's not what I am after :)
My first thought was to isolate the rest of the HTML document after matching a "reply" id, run the inner while loop until the first closing div, then feed the outer while loop the not-yet-consumed part of the document, and repeat until the actual EOF.

As you can see, that seems inefficient at first glance.

How would Monks handle this task? By "rewinding" the internal cursor using the unget_token method? TokeParser is not a must; I am open to other solutions, but it is welcome.
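
For illustration, here is a minimal single-pass sketch that avoids rewinding altogether: instead of nesting two get_tag loops, it reads every token once and tracks div nesting depth, so it knows when the current reply (and its body) has closed. The sample markup and the reply_bodies helper are hypothetical, and it assumes replies are divs with an id starting with "reply_" that contain a div whose class is exactly "body".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical, shortened sample markup (not the real page).
my $data = <<'HTML';
<div class="post reply" id="reply_1"><div class="body"><p>first body</p></div></div>
<div class="post reply" id="reply_2"><div class="body"><p>second <b>body</b></p></div></div>
HTML

# Returns a list of [id, text] pairs, one per reply that has a div.body.
sub reply_bodies {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my $depth  = 0;    # current div nesting depth
    my ($id, $reply_depth, $body_depth, $text, @found);

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('div')) {
            $depth++;
            my $attr_id = $token->get_attr('id');
            my $class   = $token->get_attr('class');
            if (!defined $id and defined $attr_id and $attr_id =~ /^reply_/) {
                # entered a reply div: remember its id and depth
                ($id, $reply_depth) = ($attr_id, $depth);
            }
            elsif (defined $id and !defined $body_depth
                   and defined $class and $class eq 'body') {
                # entered the body div of the current reply
                ($body_depth, $text) = ($depth, '');
            }
        }
        # the pattern tolerates a leading slash in the end tag's name
        elsif ($token->is_end_tag and lc($token->get_tag) =~ m{^/?div$}) {
            if (defined $body_depth and $depth == $body_depth) {
                push @found, [$id, $text];    # body div closed
                undef $body_depth;
            }
            if (defined $reply_depth and $depth == $reply_depth) {
                undef $id;                    # reply div closed
                undef $reply_depth;
            }
            $depth--;
        }
        elsif ($token->is_text and defined $body_depth) {
            $text .= $token->as_is;           # raw text, entities not decoded
        }
    }
    return @found;
}

printf "%s: '%s'\n", @$_ for reply_bodies($data);
```

Because the depth counter decides where a reply ends, the markup has to be reasonably well balanced; on sloppy HTML a DOM-based tool is likely to be more forgiving.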

Replies are listed 'Best First'.
Re: best approach to parse vichan-style imageboard html (dom/css xpath)
by beech (Parson) on Mar 18, 2019 at 00:02 UTC

    Hi,

    You do not want to use anything with "parser" in the name

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Mojo::DOM;

    my $html = q{
    <div class="post reply body-not-empty" id="reply_8735435">
    (cut out for visibility)
    <p class="body-line ltr ">The first 3 lines were 15% bait power, but then it fell to mere 5% and the last lines are literally 0%, try again in a few days.</p>
    </div>
    (cut out for visibility)
    <div class="post reply body-not-empty" id="reply_8735439">
    (cut out for visibility)
    <div class="body" >
    <p class="body-line ltr "> <a onclick="highlightReply('8735417', event);" href="/b/res/8735417.html#8735417">&gt;&gt;8735417</a> </p>
    <p class="body-line ltr quote">&gt;Reddit is a great place for discourse and there are many active subreddits where field professionals regularly answer questions on issues of health, science, engineering, etc</p>
    <p class="body-line ltr ">Yeah, as far as content goes, Reddit kicks 8chan's ass. They have some great boards for serious academic discussion.</p>
    <p class="body-line empty">
    };

    my $dom = Mojo::DOM->new( $html );
    for my $e ( $dom->find( 'div.reply' )->each ){
        print $e->{id},"\n", $e->text, "\n\n";
    }
    __END__
    reply_8735435
    (cut out for visibility)

    reply_8735439
    (cut out for visibility)
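
    Note that text only returns the text sitting directly inside each reply div, which is why the output above shows just the "(cut out for visibility)" placeholder. A hedged variation that descends into the body div and collects all nested text might look like the sketch below. The sample markup and the reply_bodies helper are hypothetical; at and all_text are regular Mojo::DOM methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical, shortened sample markup (not the real page).
my $html = <<'HTML';
<div class="post reply" id="reply_1"><div class="body"><p>first body</p></div></div>
<div class="post reply" id="reply_2"><p>no body div here</p></div>
<div class="post reply" id="reply_3"><div class="body"><p>third <b>body</b></p></div></div>
HTML

# Returns a hash reference mapping reply id => text of its div.body.
sub reply_bodies {
    my ($markup) = @_;
    my %body;
    # select every div whose id attribute starts with "reply_"
    for my $reply (Mojo::DOM->new($markup)->find('div[id^="reply_"]')->each) {
        # at() returns undef when the reply has no div.body; skip those
        my $body = $reply->at('div.body') or next;
        # all_text gathers text from all descendants, not only direct children
        $body{ $reply->{id} } = $body->all_text;
    }
    return \%body;
}

my $bodies = reply_bodies($html);
print "$_: '$bodies->{$_}'\n" for sort keys %$bodies;
```

    Each reply is handled as its own subtree, so there is no cursor to rewind: the lookup for one reply cannot run into the next one.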

      I must admit that your approach is much clearer and easier to understand than mine.
      Once I am done with the current problem I'll try to read plain HTML using the given method :)
      One thing I don't get is the statement "avoid parsers": why so?
      I was convinced that parsers were meant to extract data from more complex structures.

      On the other hand, I've managed to access chans using the JSON interface (following https://github.com/vichan-devel/vichan-API/), but my question remains valid for HTML-only imageboards.
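
      For the notifier itself, the JSON route could be sketched roughly as follows. This is only an illustration under assumptions: the sample document is made up, and the field names posts, no and com are assumed to follow the 4chan-style layout rather than taken from this thread, so check them against the actual API output. The new_replies helper is hypothetical; JSON::PP ships with Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Made-up sample of a thread document; the field names are assumptions.
my $json = '{"posts":['
         . '{"no":8735417,"com":"original post"},'
         . '{"no":8735435,"com":"first reply"},'
         . '{"no":8735439,"com":"second reply"}]}';

# Returns the posts whose number is greater than the last one already seen.
sub new_replies {
    my ($json_text, $last_seen) = @_;
    my $thread = decode_json($json_text);
    return grep { $_->{no} > $last_seen } @{ $thread->{posts} || [] };
}

# Pretend 8735435 was the newest post seen on the previous run.
for my $post (new_replies($json, 8735435)) {
    print "new reply $post->{no}: $post->{com}\n";
}
```

      Storing the highest post number between runs is enough state for a simple notifier of this kind.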

      "You do not want to use anything with "parser" in the name"

      At least the magic word is in the description of Mojo::DOM:

      "Mojo::DOM is a minimalistic and relaxed HTML/XML DOM parser with CSS selector support. It will even try to interpret broken HTML and XML..."

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Node Type: perlquestion [id://1231362]
Approved by Corion