PerlMonks  

How do I pull HTML form option values?

by unklejunky (Initiate)
on Apr 03, 2018 at 16:21 UTC (#1212267=perlquestion)

unklejunky has asked for the wisdom of the Perl Monks concerning the following question:

Autotrader's website has a form called 'searchVehicles'. That form contains a number of HTML `<option>` values for the input fields matching `qr/^(radius|make|model|price-to)$/`. These `<option>` values are what you'd expect, i.e. the entries of a drop-down list. How do I pull the string values of those `<option>` elements? So far I have the following Perl code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use WWW::Mechanize;
    use Data::Dumper;
    use JSON;

    binmode STDOUT, ':encoding(UTF-8)';
    binmode STDIN, ':encoding(UTF-8)';

    my $url = 'https://www.autotrader.co.uk/';
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->agent_alias('Linux Mozilla');

    if ( $mech->status( $mech->get($url) ) == 200 ) {
        $mech->form_name('searchVehicles');
        my @inputs = $mech->find_all_inputs(
            name_regex => qr/^(radius|make|model|price-to)$/,
            type       => 'option',
        );
        print Dumper \@inputs;
    }
My result is like:
    $VAR1 = [
        bless( {
            'class' => 'c-form__select',
            'current' => 0,
            'id' => 'radius',
            'menu' => [
                {
                    'name' => 'Distance (national)',
                    'value' => '',
                    'seen' => 1
                }
            ],
            'type' => 'option',
            'aria-label' => 'Choose a distance from your postcode',
            'name' => 'radius',
            'idx' => 1
        }, 'HTML::Form::ListInput' ),
        bless( { }, 'HTML::Form::ListInput' ),
        bless( { }, 'HTML::Form::ListInput' ),
        bless( { }, 'HTML::Form::ListInput' )
    ];
Note: I have truncated all but the first entry, since the first gives you the idea. The code above returns an array of `HTML::Form::ListInput` objects (blessed hashes). By looping over the `@inputs` array I can access any of the values shown in the Dumper output, but what I am after are the values of the `<option>` entries. For example, if you inspect the source of the website, the `searchVehicles` form has an input field for Distance whose drop-down offers choices from 1 to 200 miles. How do I obtain those values? I want to use them to present the user of the script with valid options for their search.

Replies are listed 'Best First'.
Re: How do I pull HTML form option values?
by marto (Cardinal) on Apr 04, 2018 at 08:09 UTC

    Here's an example of how to do this using Mojo::DOM to get you started. Below I simply copy the HTML you provided into a variable; a better method is discussed after the output:

        use strict;
        use warnings;
        use feature 'say';
        use Mojo::DOM;

        my $html = '<select class="c-form__select" id="radius" name="radius" aria-label="Choose a distance from your postcode"><option value="">Distance (national)</option><option value="1">Within 1 mile</option><option value="5">Within 5 miles</option><option value="10">Within 10 miles</option><option value="15">Within 15 miles</option><option value="20">Within 20 miles</option><option value="25">Within 25 miles</option><option value="30">Within 30 miles</option><option value="35">Within 35 miles</option><option value="40">Within 40 miles</option><option value="45">Within 45 miles</option><option value="50">Within 50 miles</option><option value="55">Within 55 miles</option><option value="60">Within 60 miles</option><option value="70">Within 70 miles</option><option value="80">Within 80 miles</option><option value="90">Within 90 miles</option><option value="100">Within 100 miles</option><option value="200">Within 200 miles</option></select>';

        my $dom = Mojo::DOM->new( $html );

        # find each select
        foreach my $select ( $dom->find('select')->each ){
            say "Found select named $select->{name} with the values/text:";
            # process each option
            foreach my $opt ( $select->find('option')->each ){
                say $opt->{value};
                say $opt->text;
            }
        }

    Produces:

        Found select named radius with the values/text:

        Distance (national)
        1
        Within 1 mile
        5
        Within 5 miles
        10
        Within 10 miles
        15
        Within 15 miles
        20
        Within 20 miles
        25
        Within 25 miles
        30
        Within 30 miles
        35
        Within 35 miles
        40
        Within 40 miles
        45
        Within 45 miles
        50
        Within 50 miles
        55
        Within 55 miles
        60
        Within 60 miles
        70
        Within 70 miles
        80
        Within 80 miles
        90
        Within 90 miles
        100
        Within 100 miles
        200
        Within 200 miles

    If $html contained additional <select> elements, each would be processed as well, e.g.:

        ....
        Within 200 miles
        Found select named derp with the values/text:
        foo
        bar

    You could combine the whole thing with the page get using Mojo::UserAgent like I do in this example, which gets a page, parses the dom for a selector and prints a value.
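
    As a rough sketch of that combination (not the linked example itself), the following fetches a page with Mojo::UserAgent and collects the non-empty option values of one select. The helper name `option_values` and the CSS selector are illustrative assumptions, and the selector can only match if the options are present in the HTML the server returns rather than being added later by JavaScript.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::UserAgent;

# Hypothetical helper: return the non-empty value attributes of every
# <option> inside the <select> with the given name.
sub option_values {
    my ( $dom, $name ) = @_;
    my @values = $dom->find(qq{select[name="$name"] option})
                     ->map( attr => 'value' )
                     ->each;
    return grep { defined($_) && length($_) } @values;
}

# Fetch and parse a live page only when a URL is passed on the command line.
if ( my $url = shift @ARGV ) {
    my $dom = Mojo::UserAgent->new->get($url)->result->dom;
    say for option_values( $dom, 'radius' );
}
```

    Keeping the extraction in a sub that takes a Mojo::DOM object means the same code can be tried against a saved HTML string before pointing it at a live site.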

    A word of warning: I couldn't find any terms of use on the site you linked to, at least not anywhere easily visible when I browsed it on a phone this morning. Many sites list automated scraping as a violation of their terms of use.

    Update: slight rewording for clarity.

Re: How do I pull HTML form option values?
by NetWallah (Canon) on Apr 03, 2018 at 17:48 UTC
    Are you looking for something like this?:
        use strict;
        use warnings;

        my $v = [
            bless( {
                'class' => 'c-form__select',
                'current' => 0,
                'id' => 'radius',
                'menu' => [
                    {
                        'name' => 'Distance (national)',
                        'value' => '1',
                        'seen' => 1
                    }
                ],
                'type' => 'option',
                'aria-label' => 'Choose a distance from your postcode',
                'name' => 'radius',
                'idx' => 1
            }, 'HTML::Form::ListInput' ),
            bless( { }, 'HTML::Form::ListInput' ),
            bless( { }, 'HTML::Form::ListInput' ),
            bless( { }, 'HTML::Form::ListInput' )
        ];

        my $LookForName = 'radius';
        for my $entry (grep {$_->{name} and $_->{name} eq $LookForName} @$v) {
            print "Found $LookForName:\n";
            for my $menuItem (@{ $entry->{menu} }){
                print "\tValue:",$menuItem->{value},"\n";
            }
        }

                    Memory fault   --   brain fried

      Not quite. Here's the form input field for `radius` (Distance):
          <select class="c-form__select" id="radius" name="radius" aria-label="Choose a distance from your postcode"><option value="">Distance (national)</option><option value="1">Within 1 mile</option><option value="5">Within 5 miles</option><option value="10">Within 10 miles</option><option value="15">Within 15 miles</option><option value="20">Within 20 miles</option><option value="25">Within 25 miles</option><option value="30">Within 30 miles</option><option value="35">Within 35 miles</option><option value="40">Within 40 miles</option><option value="45">Within 45 miles</option><option value="50">Within 50 miles</option><option value="55">Within 55 miles</option><option value="60">Within 60 miles</option><option value="70">Within 70 miles</option><option value="80">Within 80 miles</option><option value="90">Within 90 miles</option><option value="100">Within 100 miles</option><option value="200">Within 200 miles</option></select>

      (https://www.autotrader.co.uk/)

      I'm after the `value=` attribute of each `<option>`, i.e. 5, 10, 20, 30 etc., collected into an array which I can then use elsewhere in the code. Perhaps find_all_inputs() wasn't the correct way to capture this?
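
      For comparison, here is a minimal sketch of reading the choices through HTML::Form's own accessors, `possible_values` and `value_names`, instead of walking the object's internal `menu` key. The markup is an abbreviated, hypothetical copy of the form with only three options, and the base URI is a placeholder. It only demonstrates the accessors on static HTML; it says nothing about what the live page actually serves.

```perl
use strict;
use warnings;
use HTML::Form;

# Abbreviated, hypothetical copy of the form markup (three options only).
my $html = <<'HTML';
<form name="searchVehicles" action="/search">
  <select id="radius" name="radius">
    <option value="">Distance (national)</option>
    <option value="1">Within 1 mile</option>
    <option value="5">Within 5 miles</option>
  </select>
</form>
HTML

# parse() returns a list of forms; the second argument is the base URI
# (a placeholder here) used to resolve the form's action.
my ($form) = HTML::Form->parse( $html, 'https://www.example.com/' );
my $input  = $form->find_input('radius');

# possible_values() and value_names() return parallel lists.
my @values = $input->possible_values;
my @names  = $input->value_names;

for my $i ( 0 .. $#values ) {
    next unless length $values[$i];    # skip the empty placeholder option
    print "$values[$i]\t$names[$i]\n";
}
```

      The objects returned by find_all_inputs() are HTML::Form inputs as well, so the same two methods can be called on each element of `@inputs`.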
        From what I can tell, WWW::Mechanize uses HTML::Form, which, in my opinion, does not parse the <select> tag properly.

        The code looks for "multiple":

            if (exists $self->{multiple}) {
                unshift(@{$self->{menu}}, { value => undef, name => "off"});
                $self->{current} = $checked ? 1 : 0;
            }
            else {
                $self->{current} = 0 if $
        which does not seem to be set anywhere.
        As a result, it only picks up the first <option>.

        You will likely have to parse the raw html using HTML::TokeParser or the like, to extract the options.
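
        A minimal sketch of that approach with HTML::TokeParser, run against an abbreviated, hypothetical copy of the markup. It collects every `<option>` in the document, so on a full page you would first have to skip ahead to the `<select>` you care about:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Abbreviated, hypothetical copy of the markup (three options only).
my $html = '<select name="radius"><option value="">Distance (national)</option>'
         . '<option value="1">Within 1 mile</option>'
         . '<option value="5">Within 5 miles</option></select>';

my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

my @options;

# get_tag('option') skips ahead to the next <option> start tag and returns
# an array reference of the form [ $tag, \%attr, \@attrseq, $text ].
while ( my $tag = $parser->get_tag('option') ) {
    my $value = $tag->[1]{value};
    my $label = $parser->get_trimmed_text('/option');
    next unless defined $value && length $value;    # skip the placeholder
    push @options, { value => $value, label => $label };
}

print "$_->{value}\t$_->{label}\n" for @options;
```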

        See fellow un-answered sufferer at stackoverflow.

                        Memory fault   --   brain fried

Re: How do I pull HTML form option values?
by bliako (Prior) on Apr 04, 2018 at 12:20 UTC

    I am adding one more method, which uses HTML::TreeBuilder::XPath to select what you want with an XPath expression. This is mainly for sentimental reasons, because TreeBuilder has helped me out on many occasions, but also because XPath expressions are useful to know and use.

    A word of warning: HTML::TreeBuilder seems to ignore HTML tags unknown to it by default, e.g. HTML5 tags like section. A quick but dirty fix is to instruct it to relax a bit; see Issues using HTML::TreeBuilder::XPath and the HTML5 <section> tag.

    Here is example code tested with your html string:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use HTML::TreeBuilder::XPath;

        my $html = '<select class="c-form__select" id="radius" name="radius" aria-label="Choose a distance from your postcode"><option value="">Distance (national)</option><option value="1">Within 1 mile</option><option value="5">Within 5 miles</option><option value="10">Within 10 miles</option><option value="15">Within 15 miles</option><option value="20">Within 20 miles</option><option value="25">Within 25 miles</option><option value="30">Within 30 miles</option><option value="35">Within 35 miles</option><option value="40">Within 40 miles</option><option value="45">Within 45 miles</option><option value="50">Within 50 miles</option><option value="55">Within 55 miles</option><option value="60">Within 60 miles</option><option value="70">Within 70 miles</option><option value="80">Within 80 miles</option><option value="90">Within 90 miles</option><option value="100">Within 100 miles</option><option value="200">Within 200 miles</option></select>';

        my $TB = HTML::TreeBuilder::XPath->new(
            # if you have html5 with tags unknown to 'HTML::TreeBuilder' then uncomment following
            # ignore_unknown => 0,
        );
        $TB->parse($html) || die 'TB->parse()';
        $TB->eof;    # signal end of input so the tree is completed

        #my $xpath = '//form[@name="searchVehicles"]//select[@name="radius"]/option[@value!=""]/@value';
        my $xpath = '//select[@name="radius"]//option[@value!=""]/@value';
        my @values = $TB->findvalues($xpath);

        print "found ".scalar(@values)." option values:\n";
        foreach my $avalue (@values){
            print "value: '".$avalue."'\n";
        }

    p.s. Personally, I would use a package which copes better with the latest HTML standards. XPath is appealing to me, much as a regex is. However, if what you are building is going to be a long-term, expanding project, you need to be able to maintain your XPaths easily whenever the site's web design changes. In that case Mojo::DOM (which I have not tried) may be the way to go.
