PerlMonks  


I want to chip in where I can on documentation efforts, and there's a lot of opportunity now. I noticed a couple of typos as I was going through WWW::Mechanize::Chrome and asked Corion whether he would mind my getting involved with it. He said sure, so I've been working through the examples and then the links. Almost by necessity, one has to take a step back and make comparisons to WWW::Mechanize. I used it for years, but it has been 4 years since I last did, so anything I once knew about forms and the like is completely out the window. I picked my best candidate from the examples and wrapped it in the logging scheme I use elsewhere, which is at least verbose in other cases. Here's the bash invocation followed by the source:

    $ ./2.quotes.pl Fargo >1.txt
    No matches for "Fargo" were found.
    $ ll 1.txt
    -rw-r--r-- 1 hogan hogan 128858 Apr 12 22:02 1.txt
    $ cat 2.quotes.pl
    #!/usr/bin/perl -w
    use strict;
    use 5.016;
    use WWW::Mechanize;
    use Getopt::Long;
    use Text::Wrap;
    use Log::Log4perl;
    use Data::Dump;

    my $log_conf = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
    Log::Log4perl::init($log_conf);
    my $logger = Log::Log4perl->get_logger();
    #$logger->level('DEBUG');

    my $match  = undef;
    my $random = undef;
    GetOptions(
        "match=s" => \$match,
        "random"  => \$random,
    ) or exit 1;

    my $movie = shift @ARGV or die "Must specify a movie\n";

    my $quotes_page = get_quotes_page($movie);
    my @quotes      = extract_quotes($quotes_page);

    if ($match) {
        $match = quotemeta($match);
        @quotes = grep /$match/i, @quotes;
    }

    if ($random) {
        print $quotes[ rand @quotes ];
    }
    else {
        print join( "\n", @quotes );
    }

    sub get_quotes_page {
        my $movie = shift;

        my $mech = WWW::Mechanize->new;
        $mech->get("https://www.imdb.com/search/name-text/");
        $mech->success or die "Can't get the search page";

        open my $fh, '>', '/home/hogan/Documents/hogan/logs/1.form-log.txt'
          or die "Couldn't open logfile 'form-log.txt': $!";
        $mech->dump_forms($fh);

        my $ret1 = $mech->submit_form(
            form_number => 2,
            fields      => {
                title    => $movie,
                restrict => "Movies only",
            },
        );
        $logger->info("return1 is $ret1");

        # dd $ret1; # yikes
        if ( $ret1->is_success ) {
            $logger->info("Supposedly successful so far");
            print $ret1->decoded_content;
        }
        else {
            print STDERR $ret1->status_line, "\n";
        }

        my @links = $mech->find_all_links( url_regex => qr[^/Title] )
          or die "No matches for \"$movie\" were found.\n";

        # Use the first link
        my ( $url, $title ) = @{ $links[0] };

        warn "Checking $title...\n";

        $mech->get($url);
        my $link = $mech->find_link( text_regex => qr/Memorable Quotes/i )
          or die qq{"$title" has no quotes in IMDB!\n};

        warn "Fetching quotes...\n\n";
        $mech->get( $link->[0] );

        return $mech->content;
    }

    sub extract_quotes {
        my $page = shift;

        # Nibble away at the unwanted HTML at the beginning...
        $page =~ s/.+Memorable Quotes//si;
        $page =~ s/.+?(<a name)/$1/si;

        # ... and the end of the page
        $page =~ s/Browse titles in the movie quotes.+$//si;
        $page =~ s/<p.+$//g;

        # Quotes separated by an <HR> tag
        my @quotes = split( /<hr.+?>/, $page );

        for my $quote (@quotes) {
            my @lines = split( /<br>/, $quote );
            for (@lines) {
                s/<[^>]+>//g;    # Strip HTML tags
                s/\s+/ /g;       # Squash whitespace
                s/^ //;          # Strip leading space
                s/ $//;          # Strip trailing space
                s/&#34;/"/g;     # Replace HTML entity quotes

                # Word-wrap to fit in 72 columns
                $Text::Wrap::columns = 72;
                $_ = wrap( '', ' ', $_ );
            }
            $quote = join( "\n", @lines );
        }

        return @quotes;
    }
    __END__
    $

When we have a look at what forms were available, we have:

    $ cat /home/hogan/Documents/hogan/logs/1.form-log.txt
    GET https://www.imdb.com/find [nav-search-form]
      navbar-search-category-select=<UNDEF> (checkbox) [*<UNDEF>/off|on]
      q= (text)
      <NONAME>=<UNDEF> (submit)
      ref_=nv_sr_sm (hidden readonly)

    POST https://www.imdb.com/search/title-text/
      type=plot (option) [*plot/Plot|quotes/Quotes|trivia/Trivia|goofs/Goofs|crazy_credits/Crazy Credits|location/Filming Locations|soundtracks/Soundtracks|versions/Versions]
      query= (search)
      <NONAME>=<UNDEF> (submit)

    POST https://www.imdb.com/search/name-text/
      type=bio (option) [*bio/Biographies|quotes/Quotes|trivia/Trivia]
      query= (search)
      <NONAME>=<UNDEF> (submit)
    $

The way I count it, we want the 1st one instead of the second, with zero-indexing. Either way, this is about where I lose the handle on it. I have trouble time and again with the output overwhelming the terminal. With the log from Log4perl, I have:
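For what it's worth, here is a minimal offline sketch of how the field names could be checked against a form before submitting. It uses HTML::Form, which WWW::Mechanize relies on for its form handling, and a hand-written stand-in for the title-text form in the dump above; the markup is my own reconstruction, not IMDb's actual page. As I read the WWW::Mechanize docs, form_number counts from 1, so form_number => 2 is the title-text form, and the dump lists its fields as type and query rather than title and restrict.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Hand-written stand-in for the title-text form in the dump
# (an assumption, not IMDb's real markup).
my $html = <<'HTML';
<form method="post" action="/search/title-text/">
  <select name="type">
    <option value="plot" selected>Plot</option>
    <option value="quotes">Quotes</option>
    <option value="trivia">Trivia</option>
  </select>
  <input type="search" name="query">
  <input type="submit">
</form>
HTML

# The second argument is the base URI used to resolve the form's action.
my @forms = HTML::Form->parse( $html, 'https://www.imdb.com/search/name-text/' );
my $form  = $forms[0];

# The field names the script tries to set are not present on this form.
my @missing = grep { !defined $form->find_input($_) } qw(title restrict);

# The field names that the dump does list.
$form->value( type  => 'quotes' );
$form->value( query => 'Jaws' );

# click() only builds the HTTP::Request object; nothing is sent.
my $request = $form->click;

print "missing fields: @missing\n";
print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

In the script itself, the corresponding change would presumably be fields => { type => 'quotes', query => $movie } in the submit_form call. I have not tested that against the live site, and the link regexes further down may need attention as well.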

    2020/04/12 22:20:23 INFO return1 is HTTP::Response=HASH(0x5653c7bfc2e8)
    2020/04/12 22:20:23 INFO Supposedly successful so far

As far as I can tell, decoding this return value gives the whole page as HTML, splashed out onto STDOUT, which leaves me confused and sifting through stuff meant for machines. I've tried other movies from IMDb's top 100 quote movies as the argument:
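One way to keep the page out of the terminal might be to log a one-line summary of the response instead of printing the body. A rough sketch follows; summarize_response is a hypothetical helper of my own, and the response (including its title) is built by hand so that the example runs offline.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: reduce a response to one loggable line.
sub summarize_response {
    my ($res) = @_;
    my $body = $res->decoded_content // '';

    # Pull out the <title> text, if there is one.
    my ($title) = $body =~ m{<title[^>]*>\s*(.*?)\s*</title>}si;

    return sprintf '%s, %d characters, title: %s',
        $res->status_line, length($body), $title // '(none)';
}

# Stand-in response built by hand; the body is made up for illustration.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    '<html><head><title>Find - IMDb</title></head><body>...</body></html>',
);

my $summary = summarize_response($res);
print "$summary\n";
```

In the script this would mean something like $logger->info( summarize_response($ret1) ) in place of print $ret1->decoded_content. If the full page is still wanted for inspection, WWW::Mechanize has a save_content method that writes it to a file.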

    $ ./2.quotes.pl Fargo >1.txt
    No matches for "Fargo" were found.
    $ ./2.quotes.pl Jaws >1.txt
    No matches for "Jaws" were found.

How gratifying it would be to see:

We're gonna need a bigger boat.

Q1) What do I need to do to get this script working?

Q2) What is the relationship between WWW::Mechanize and modules like WWW::Mechanize::Gzip and WWW::Mechanize::Chrome? The former uses this line:

use base qw(WWW::Mechanize);

, while the latter seems to reference its "base" module in the raw source. Does either "inherit" anything from its base?

Update: I botched the second half of this question as I made the comparison. What I meant to ask was:

Q2) ..., while the latter seems to lack reference to its "base" module in the raw source.
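To make Q2 concrete, here is a small sketch of the check I have in mind. The packages are hypothetical stand-ins, not the real modules: one declares a parent and one merely offers the same method names. The isa class method reports whether inheritance is actually in place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a base module.
package My::Mech;
sub new { my ($class) = @_; return bless {}, $class }
sub get { return 'My::Mech::get' }

# Inherits from the base, much as "use base qw(...)" would arrange.
package My::Mech::Sub;
use parent -norequire, 'My::Mech';

# Offers the same method names but declares no parent.
package My::Mech::Lookalike;
sub new { my ($class) = @_; return bless {}, $class }
sub get { return 'My::Mech::Lookalike::get' }

package main;

my %inherits;
for my $class (qw(My::Mech::Sub My::Mech::Lookalike)) {
    $inherits{$class} = $class->isa('My::Mech') ? 1 : 0;
    printf "%-20s isa My::Mech: %s\n", $class,
        $inherits{$class} ? 'yes' : 'no';
}
```

The same test could be run on the real modules after loading them, e.g. WWW::Mechanize::Gzip->isa('WWW::Mechanize'); I have not recorded what they report.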

Q3) If I went to the link and typed in Jaws, would I get 128k worth of html?

    $ ll 1.txt
    -rw-r--r-- 1 hogan hogan 128858 Apr 12 22:02 1.txt

Q4) Doesn't "scraping" connote going after an entire class of files like images, or have I done it here without even trying?

Thanks for your comment,


In reply to running an example script with WWW::Mechanize* module by Aldebaran
