PerlMonks  

Re^4: Trouble with some of IDDB Public Methods

by Aldebaran (Curate)
on Jan 01, 2021 at 04:45 UTC


in reply to Re^3: Trouble with some of IDDB Public Methods
in thread Trouble with some of IMDB Public Methods

"alternative you should be able to use the mojo based solution from earlier as a starting point to get just what you need. If you have any problems with that just post and I'll take a look."

The OP seems to have found what he wanted, so I thought I might use the opportunity to ask marto (or anyone else who can bake from scratch with mojo) to explore further the script he posted in Re^5: polishing up a json fetching script for weather data. The aim would be to improve a script that marto himself characterized as suboptimal. I certainly hope that we don't optimize away the comments and break up the logic as opposed to having just a train of arrows that online sources may have, with words whose provenance is unknown, like top in this example:

    # JSON POST (application/json) with TLS certificate authentication
    my $tx = $ua->cert('tls.crt')->key('tls.key')->post('https://example.com' => json => {top => 'secret'});

or json, there's nothing that makes keywords stand out, and where does one go to determine their provenance? How exactly are you going to disambiguate 'json'? The example above comes from the Mojo::UserAgent documentation. I understand that examples are selected for brevity. I would love to see a collection of them from many authors.

It seemed to me that having to hardcode the movie title like this was an area that can be improved.

my $imdburl = 'http://www.imdb.com/search/title?title=Caddyshack';
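One way to avoid hardcoding the title, sketched here on the assumption that the search page takes the title as an ordinary query parameter: let Mojo::URL build the query string, so that spaces in multi-word titles get encoded for you. The variable names are only illustrative and nothing is fetched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::URL;

# Hypothetical title; it could just as well come from @ARGV or a prompt
my $title = 'Virgin River';

# query() escapes the value, so the space is encoded rather than pasted in raw
my $imdburl = Mojo::URL->new('https://www.imdb.com/search/title')->query( title => $title );

say "imdburl is $imdburl";
```

The space should come out as a plus sign in the printed URL, which is the same form the site's own search produces.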

I couldn't get titles with multiple words to work at all. The search replaces spaces with plusses in the url, but interpolation with a lexical variable is just beneath mojo, even if it worked, which it doesn't. What I want is a script that shows me what's at this site from a mojo point of view, and this does so naively:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::URL;
    use Mojo::Util qw(dumper);
    use Mojo::UserAgent;
    use Data::Dump;
    use Log::Log4perl;
    use 5.016;
    use Mojo::DOM;

    my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
    my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

    #Log::Log4perl::init($log_conf3); #debug
    Log::Log4perl::init($log_conf4);    #info
    my $logger = Log::Log4perl->get_logger();
    $logger->info("$0");

    # pretend to be a browser
    my $uaname =
      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
    my $ua = Mojo::UserAgent->new;
    $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
    $ua->transactor->name($uaname);

    my $first_title = 'Virgin+River';
    my $imdburl     = "http://www.imdb.com/search/title?title=$first_title";
    say "imdburl is $imdburl";

    # find search results
    my $dom   = $ua->get($imdburl)->res->dom;
    my @nodes = @$dom;

    # c-style for is good for array output with index
    for ( my $i = 0 ; $i < @nodes ; $i++ ) {
        $logger->info("i is $i ==============");
        $logger->info("$nodes[$i]");
    }
    sleep 2;    #good hygiene
    __END__

What does it show?

    2020/12/31 13:53:39 INFO i is 1 ==============
    2020/12/31 13:53:39 INFO <!DOCTYPE html>
    2020/12/31 13:53:39 INFO i is 2 ==============
    2020/12/31 13:53:39 INFO

First looks right...second is empty...
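As far as I can tell from the Mojo::DOM documentation, dereferencing a DOM object as an array gives the child nodes of the root, which would explain why there are so few entries and why some of them look empty: whitespace between the doctype and the html tag is a text node in its own right. A small sketch on a hand-made page (the markup is invented for illustration and is not IMDb's):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Made-up stand-in for a fetched page
my $html = "<!DOCTYPE html>\n<html><head><title>t</title></head><body><p>x</p></body></html>\n";
my $dom  = Mojo::DOM->new($html);

# Same nodes as @$dom, but with each node's type shown next to its size
my @types;
for my $node ( $dom->child_nodes->each ) {
    push @types, $node->type;
    say sprintf '%-8s %d characters', $node->type, length "$node";
}
```

Logging the type next to each node makes it easier to tell a whitespace-only text node from a genuinely missing one.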

The 3rd contains 61 k of javascript hell. The 4th and ultimate was empty. Javascript isn't meant for human eyes, or let me be specific, I find it illegible, so I used the browser tools to look closer. I realize that I simply don't understand the javascript, and that's not mojo's fault. The browser tools give me this upon inspection and right click inside the search box:

    <input type="text" value="" autocomplete="off" aria-autocomplete="list" aria-controls="react-autowhatever-1" class="imdb-header-search__input GVtrp0cCs2HZCo7E2L5UU react-autosuggest__input" id="suggestion-search" name="q" placeholder="Search IMDb" autocapitalize="none" autocorrect="off"

Then I remembered that you can use mojo to do this instead:

    $ mojo get https://www.imdb.com/ '*' attr id >1.txt
    $ grep search 1.txt
    navSearch-searchState
    suggestion-search-container
    nav-search-form
    navbar-search-category-select
    navbar-search-category-select-contents
    suggestion-search
    suggestion-search-button
    imdbHeader-searchClose
    imdbHeader-searchOpen
    $

Now I thought I was really in hot pursuit. I thought, "aha, I can find this id and post to it." So I went to look up find in Mojo::DOM, and I don't really understand the examples until I can work them myself and see them:

    $ ./1.dom.pl
    ./1.dom.pl
    123
    Test
    123
    a
    b
    b
    a
    a:Test
    b:123
    <p id="a">Test</p><p id="b">123</p><p id="d">789</p><p id="c">456</p>
    $ cat 1.dom.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::URL;
    use Mojo::Util qw(dumper);
    use Mojo::UserAgent;
    use Data::Dump;
    use Log::Log4perl;
    use 5.016;
    use Mojo::DOM;

    my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
    my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

    #Log::Log4perl::init($log_conf3); #debug
    Log::Log4perl::init($log_conf4);    #info
    my $logger = Log::Log4perl->get_logger();
    $logger->info("$0");

    # pretend to be a browser
    my $uaname =
      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
    my $ua = Mojo::UserAgent->new;
    $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
    $ua->transactor->name($uaname);

    ## example from https://docs.mojolicious.org/Mojo/DOM
    #use Mojo::DOM;
    # Parse
    my $dom = Mojo::DOM->new('<div><p id="a">Test</p><p id="b">123</p></div>');

    # Find
    say $dom->at('#b')->text;
    say $dom->find('p')->map('text')->join("\n");
    say $dom->find('[id]')->map( attr => 'id' )->join("\n");

    # Iterate
    $dom->find('p[id]')->reverse->each( sub { say $_->{id} } );

    # Loop
    for my $e ( $dom->find('p[id]')->each ) {
        say $e->{id}, ':', $e->text;
    }

    # Modify
    $dom->find('div p')->last->append('<p id="c">456</p>');
    $dom->at('#c')->prepend( $dom->new_tag( 'p', id => 'd', '789' ) );
    $dom->find(':not(p)')->map('strip');

    # Render
    say "$dom";
    __END__
    $ ./4.dom.pl
    ./4.dom.pl
    <h1>Test</h1>
    bar
    bar
    foo
    baz
    =====
    comment
    doctype
    pi
    text
    root
    tag
    text
    $ cat 4.dom.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::URL;
    use Mojo::Util qw(dumper);
    use Mojo::UserAgent;
    use Data::Dump;
    use Log::Log4perl;
    use 5.016;
    use Mojo::DOM;

    my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
    my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

    #Log::Log4perl::init($log_conf3); #debug
    Log::Log4perl::init($log_conf4);    #info
    my $logger = Log::Log4perl->get_logger();
    $logger->info("$0");

    ## examples from https://docs.mojolicious.org/Mojo/DOM
    my $dom7 = Mojo::DOM->new();
    my $str7 =
      $dom7->parse('<div><h1>Test</h1><h2>123</h2></div>')->at('h2')->previous;
    $logger->info($str7);

    # "bar"
    my $dom8 = Mojo::DOM->new();
    my $str8 = $dom8->parse("<div>foo<p>bar</p>baz</div>")->at('p')->text;
    say "$str8";
    $logger->info($str8);

    # "foo\nbaz\n"
    my $dom9 = Mojo::DOM->new();
    my $str9 = $dom9->parse("<div>foo\n<p>bar</p>baz\n</div>")->at('div')->text;
    $logger->info($str9);
    $logger->info('=====');
    my $dom1 = Mojo::DOM->new();
    my $str1 = $dom1->parse('<!-- Test -->')->child_nodes->first->type;
    $logger->info($str1);

    # "doctype"
    $str1 = $dom1->parse('<!DOCTYPE html>')->child_nodes->first->type;
    $logger->info($str1);

    # "pi"
    $str1 = $dom1->parse('<?xml version="1.0"?>')->child_nodes->first->type;
    $logger->info($str1);
    $str1 =
      $dom1->parse('<title>Test</title>')->at('title')->child_nodes->first->type;
    $logger->info($str1);
    $str1 = $dom1->parse('<p>Test</p>')->type;
    $logger->info($str1);
    $str1 = $dom1->parse('<p>Test</p>')->at('p')->type;
    $logger->info($str1);
    $str1 = $dom1->parse('<p>Test</p>')->at('p')->child_nodes->first->type;
    $logger->info($str1);
    __END__
    $

Finally, I got a usage for find that worked:

    $ ./2.dom.pl
    ./2.dom.pl
    ads_tarnhelm ads_doWithAds ads_monitoring_setup ads_safeframe_setup ads_general_setup IMDbHomepageSiteReactViews imdbHeader nblogin imdbHeader-navDrawerOpen imdbHeader-navDrawerOpen--desktop imdbHeader-navDrawer nav-link-categories-mov nav-link-categories-tvshows nav-link-categories-video nav-link-categories-awards nav-link-categories-celebs nav-link-categories-comm home_img_holder home_img navSearch-searchState suggestion-search-container nav-search-form navbar-search-category-select navbar-search-category-select-contents suggestion-search suggestion-search-button imdbHeader-searchClose imdbHeader-searchOpen ipc-svg-gradient-tv-logo-t ipc-svg-gradient-tv-logo-v ipc-wrap-background-id inline20_wrapper placeholderPattern b a b a b a b a b a b a b a inline40_wrapper placeholderPattern from-your-watchlist fan-picks teconsent ftr__a ftr__c ftr__e ftr__g ftr__i ftr__k ftr__m ftr__o ftr__q ftr__s ftr__u ftr__w ftr__y ftr__A ftr__C ftr__E ftr__G ftr__b ftr__d ftr__f ftr__h ftr__j ftr__l ftr__n ftr__p ftr__r ftr__t ftr__v ftr__x ftr__z ftr__B ftr__D ftr__F ftr__H ipc-svg-gradient-tv-logo-t ipc-svg-gradient-tv-logo-v ipc-svg-gradient-tv-logo-t ipc-svg-gradient-tv-logo-v be
    $ cat 2.dom.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Log::Log4perl;
    use 5.016;
    use Mojo::DOM;
    use Mojo::UserAgent;

    my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
    my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

    #Log::Log4perl::init($log_conf3); #debug
    Log::Log4perl::init($log_conf4);    #info
    my $logger = Log::Log4perl->get_logger();
    $logger->info("$0");

    # represent $0 as browser to server
    my $uaname =
      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
    my $ua = Mojo::UserAgent->new;
    $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
    $ua->transactor->name($uaname);

    ## main page of imdb contains search box
    my $imdburl = "http://www.imdb.com/";

    ## example from https://docs.mojolicious.org/Mojo/DOM
    my $dom = $ua->get($imdburl)->res->dom;

    # say "$dom"; works
    #
    my @ids = $dom->find('[id]')->map( attr => 'id' )->each;
    $logger->info("@ids");
    __END__
    $
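For what it is worth, an id only helps to locate an element. A form submission goes to the form's action URL and uses the input's name as the parameter name. Here is a sketch of reading those off with Mojo::DOM. The fragment is hand-written: the ids and the name q come from the output above, but the action and method values are placeholders I made up, so the real page would need checking.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hand-written fragment; action="/find" and method="get" are assumptions
my $dom = Mojo::DOM->new(<<'HTML');
<form id="nav-search-form" action="/find" method="get">
  <input type="text" id="suggestion-search" name="q" placeholder="Search IMDb">
</form>
HTML

# Locate the input by id, then walk up to the enclosing form
my $input = $dom->at('#suggestion-search');
my $form  = $input->ancestors('form')->first;

say 'parameter name: ', $input->attr('name');
say 'form action:    ', $form->attr('action');
say 'form method:    ', $form->attr('method');
```

With those three values in hand, the request would use the name (here q) as the form key rather than the id.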

Anyways, this was my final push and I seem to come up short:

    $ ./2.1.dom.pl
    ./2.1.dom.pl
    navSearch-searchState suggestion-search-container nav-search-form navbar-search-category-select navbar-search-category-select-contents suggestion-search suggestion-search-button imdbHeader-searchClose imdbHeader-searchOpen
    Can't locate object method "find" via package "Mojo::UserAgent" at ./2.1.dom.pl line 48.
    $ cat 2.1.dom.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Log::Log4perl;
    use 5.016;
    use Mojo::DOM;
    use Mojo::UserAgent;
    use Mojo::URL;
    use Mojo::Util qw(trim);

    my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
    my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

    #Log::Log4perl::init($log_conf3); #debug
    Log::Log4perl::init($log_conf4);    #info
    my $logger = Log::Log4perl->get_logger();
    $logger->info("$0");

    # represent $0 as browser to server
    my $uaname =
      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
    my $ua = Mojo::UserAgent->new;
    $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
    $ua->transactor->name($uaname);

    ## main page of imdb contains search box
    my $imdburl = "http://www.imdb.com/";

    ## example from https://docs.mojolicious.org/Mojo/DOM
    my $dom = $ua->get($imdburl)->res->dom;

    # say "$dom"; works
    #
    my @ids = $dom->find('[id]')->map( attr => 'id' )->each;

    #$logger->info("@ids");
    my @matches = grep { /search/ } @ids;
    $logger->info("@matches");
    my $vid = 'Virgin River';
    $ua->post( $imdburl => form => { 'suggestion-search' => $vid } );

    # assume first match
    my $filmurl = $ua->find('a[href^=/title]')->first->attr('href');

    # extract film id
    my $filmid = Mojo::URL->new($filmurl)->path->parts->[-1];

    # get details of film
    $dom = $ua->get("https://www.imdb.com/title/$filmid/")->res->dom;

    # print film details
    say trim( $dom->at('div.title_wrapper > h1')->text ) . ' ('
      . trim( $dom->at('#titleYear > a')->text ) . ')';

    # print actor/character names
    foreach my $cast ( $dom->find('table.cast_list > tr:not(:first-child)')->each ) {
        say trim ( $cast->at('td:nth-of-type(2) > a')->text ) . ' as '
          . trim( $cast->at('td.character')->all_text );
    }
    __END__
    $
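The error message points at the call to find on $ua: find is a Mojo::DOM method, whereas get and post on the user agent return a transaction, and the DOM hangs off that transaction's response. Below is a sketch of the chain, run against a tiny stand-in application through the user agent's embedded server so that it does not depend on IMDb. The /find route, the q parameter and the tt0000001 id are all invented for the example.

```perl
#!/usr/bin/perl
use Mojolicious::Lite;    # also enables strict, warnings and say
use Mojo::UserAgent;
use Mojo::URL;

app->log->level('fatal');    # keep the stand-in app quiet

# Invented stand-in for a search page: a single result link
get '/find' => sub {
    my $c = shift;
    $c->render( text => '<a href="/title/tt0000001/">Some result</a>' );
};

my $ua = Mojo::UserAgent->new;
$ua->server->app(app);    # relative URLs now go to the stand-in app

# get/post return a transaction; the DOM lives on its response
my $tx  = $ua->get( '/find' => form => { q => 'Virgin River' } );
my $dom = $tx->res->dom;

# find is called on the DOM, not on $ua
my $filmurl = $dom->find('a[href^="/title"]')->first->attr('href');
my $filmid  = Mojo::URL->new($filmurl)->path->parts->[-1];
say "film url is $filmurl, id is $filmid";
```

Keeping the transaction in $tx also leaves room to check the response for errors before touching the DOM.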

These are resources I drew from:

Thanks for comments,

Replies are listed 'Best First'.
Re^5: Trouble with some of IDDB Public Methods
by marto (Cardinal) on Jan 01, 2021 at 09:54 UTC

    "I certainly hope that we don't optimize away the comments and break up the logic as opposed to having just a train of arrows that online sources may have, with words whose provenance is unknown, like top in this example:"

        # JSON POST (application/json) with TLS certificate authentication
        my $tx = $ua->cert('tls.crt')->key('tls.key')->post('https://example.com' => json => {top => 'secret'});

    "or json, there's nothing that makes keywords stand out, and where does one go to determine their provenance? How exactly are you going to disambiguate 'json'?"

    As with the cert attribute, just look at the post documentation. It's just encoding a perl value to JSON, and posting it to an example site with TLS cert auth. Consider the longhand example of just the JSON part:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Mojo::JSON qw(encode_json);
        use feature 'say';

        my $bytes = encode_json { top => 'secret' };
        say $bytes;

    Following the appropriate links in the Mojo docs takes you to the relevant places.
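To see where the json keyword ends up without contacting any server, the request can be built but not sent. This is only a sketch: build_tx is the Mojo::UserAgent method that prepares a transaction, and json here names one of the transactor's content generators. The TLS cert and key from the documentation example are left out, since nothing is connected to.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Build the POST from the documentation example, but do not send it
my $tx = $ua->build_tx( POST => 'https://example.com' => json => { top => 'secret' } );

# The generator has set the Content-Type header and encoded the body
say $tx->req->headers->content_type;
say $tx->req->body;
```

So top is simply a hash key chosen for the example payload, and json selects how that payload gets serialized.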

    "I couldn't get titles with multiple words to work at all. The search replaces spaces with plusses in the url, but interpolation with a lexical variable is just beneath mojo, even if it worked, which it doesn't. What I want is a script that shows me what's at this site from a mojo point of view, and this does so naively:"

    A lazy way (since it's early on New Year's Day) would be to take my example, prompt for a film title and replace spaces with the plus sign. If you want to go down the route of automating forms, as mentioned before, make life easy on yourself and use the browser 'developer tools' to find the data you need for the form fields you care about. This is more effective than grepping in the dark from dumped results.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature 'say';
        use Mojo::URL;
        use Mojo::Util qw(trim);
        use Mojo::UserAgent;

        my $imdburl = 'https://www.imdb.com/search/title?title=';

        # prompt for title, replace spaces with plus signs
        say 'Enter name of film to search for: ';
        my $film = <STDIN>;
        chomp $film;
        $film =~ s/ /+/g;
        $imdburl .= $film;

        # pretend to be a browser
        my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
        my $ua = Mojo::UserAgent->new;
        $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
        $ua->transactor->name( $uaname );

        # find search results
        my $dom = $ua->get( $imdburl )->res->dom;
        #my $dom = $ua->post( $imdburl => form => {title => $film} )->res->dom;

        # assume first match
        my $filmurl = $dom->find('a[href^=/title]')->first->attr('href');

        # extract film id
        my $filmid = Mojo::URL->new( $filmurl )->path->parts->[-1];

        # get details of film
        $dom = $ua->get( "https://www.imdb.com/title/$filmid/" )->res->dom;

        # print film details
        say 'Search Results';
        say trim( $dom->at('div.title_wrapper > h1')->text ) . ' (' . trim( $dom->at('#titleYear > a')->text ) .')';

        # print actor/character names
        foreach my $cast ( $dom->find('table.cast_list > tr:not(:first-child)')->each ){
            say trim ($cast->at('td:nth-of-type(2) > a')->text ) . ' as ' . trim ( $cast->at('td.character')->all_text );
        }

    Outputs:

    This example differs from my original by only a few verbose lines; again it is suboptimal and intended just to get you started. Obviously this is aimed at films, and if you search for a series rather than a film the resulting page has differences that you'd need to cater for. If your intention is to take this further I'd strongly recommend using the browser developer tools. Don't get hung up on how Mojo can dump the page data and all its elements; this is mostly unimportant if you just want to automate an existing interface. Possible improvements include adding code to cater for different types of results (film, TV show), obvious error checking, and perhaps prompting the user to choose among the results rather than assuming the first one is what they mean, e.g. a search for 'Batman' returns "The Batman (2022)" rather than "Batman (1966)".

    Update: added spoiler tag explanation.
