
Passing complex html-tag input over command line to HTML TreeBuilder method look_down() properly

by sadarax (Sexton)
on Apr 18, 2019 at 01:01 UTC (#1232738, perlquestion)

sadarax has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, wise Perl monks. My problem is that I am unable to dynamically pass a set of HTML tags on the command line to HTML::TreeBuilder's look_down() method. I want to invoke the program something like this:
perl ./download.pl --url 'http://www.gocomics.com/9chickweedlane/2019/04/17' --tags div --tags class="comic container js-comic-"
I cannot figure out how to pass the word 'class' and its value "comic container js-comic-" and drop them appropriately into the call to look_down(). Below is a hardcoded example (this works just fine if I use it in the program):
@results = $tree->look_down( _tag => "div", "class" => qr(comic container js-comic-) ); # HARDCODED, should be dynamic
Ideally it would be something dynamic like this (forgive the dumb example):
# PROCESSING THE TAG LIST....
my $first_tag = $self->{ tags(0) };
my $second_tag = $self->{ tags(1) };
if( $second_tag =~ "=" ) {
    # Split apart the key-value pair
    my @words = split /\=/,$second_tag;
}
@results = $tree->look_down( _tag => $first_tag, "$words[0]" => qr("$words[1]") );
Here is the code in action:
### MAIN PROGRAM
sub main {
    use File::Spec;
    use Getopt::Long;

    my $url = undef;
    my @tags;    # not "= undef", which would start the list with one undef element

    GetOptions(
        "tags=s" => \@tags,
        "url=s"  => \$url,
    ) or die("Error in command line arguments. $!\n");

    my $dlobj = DownloadObject->new( $url, \@tags );
    $dlobj->download();
}
### DOWNLOAD OBJECT CLASS
#!/usr/bin/perl -w
use warnings;
use strict;
use feature 'say';    # needed for say() below

package DownloadObject;

# Simple Constructor
sub new {
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self  = {};
    $self->{url} = undef;    # URL to target
    $self->{tags_list_process_order} = undef;    # List of html-tags, in sequential order, to process in order to extract the target content
    $self->{url} = $_[0];
    $self->{tags_list_process_order} = $_[1];
    bless ($self, $class);
    return $self;
}

sub download {
    my $self = shift;
    require LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');
    $ua->timeout(10);
    $ua->env_proxy;

    my $response = $ua->get( $self->{url} );    # Download the content
    if( $response->is_success ) {
        use HTML::TreeBuilder 5 -weak;
        my $tree = HTML::TreeBuilder->new_from_content( $response->content() );    # Put the contents into HTML-Treebuilder
        my @results = ();
        # THIS IS THE PROBLEM LINE. It is HARDCODED, and I want to make it dynamic.
        @results = $tree->look_down( _tag => "div", "class" => qr(comic container js-comic-) );
        foreach(@results) {
            say " Data-Image URL: " . $_->attr('data-image');
            # Gonna do something with result......
        }
    }
}
Any help would be greatly appreciated.

Replies are listed 'Best First'.
Re: Passing complex html-tag input over command line to HTML TreeBuilder method look_down() properly
by bliako (Vicar) on Apr 18, 2019 at 09:55 UTC

    Maybe your command-line logic can be changed so as to differentiate between element type, name, class and id? Right now you have adopted a convention which says: use the same parameter key (--tags) for both type and class, but whenever there is a class you need to throw in a class=... in order to differentiate between them.

    How about --tags type=div --tags class='comic-' --tags id='123'? An alternative would be: --tags-type div --tags-class 'comic-' --tags-id '123'. That means three different command-line options, and you would need to keep adding more whenever you think of another tag property, such as --tags-name.

    Once you have command-line logic that suits you and your users AND ALSO (most importantly) allows for expanding your features in the future without changing the API too much, then perhaps you want to use Getopt::Long's GetOptions() feature of passing a sub to parse complex command-line values. That way you keep command-line parsing inside GetOptions() and don't "pollute" the rest of your code with command-line checks and parsing. For example (assuming you picked the first "cmd-line logic" I proposed):

    use Getopt::Long;

    my $url;
    my %tags = ();
    my %cmd2tags = (
        # list of cmd-line element types => corresponding look_down() criteria key
        'type'  => '_tag',
        'class' => 'class',
        'id'    => 'id',
        'name'  => 'name'
    );

    GetOptions(
        "tags=s" => sub {
            # sub to be called whenever --tags XYZ is detected
            # it expects XYZ to be in the form "K=V" and K must
            # exist as a key in %cmd2tags
            my ($k,$v) = @_;
            if( $v =~ /^(.+?)=(.+?)$/ ){
                my $t=$1; my $n=$2;
                my $c2t = $cmd2tags{$t};
                die "unknown tag name '$t', I know only of ".join(",", keys %cmd2tags)."\n" unless defined $c2t;
                $tags{$c2t} = $n;
            } else {
                die "--tags V: V must be in the form 'key=value' where key can be 'class' or 'type', e.g. ..."
            }
        },
        "url=s" => \$url,
    ) or die("Error in command line arguments. $!\n");
    ...
    # now %tags contains tag-type-name=>value, e.g. '_tag' => 'div' etc.
    print "Tag received: '$_' => '".$tags{$_}."'\n" for(keys %tags);
    @results = $tree->look_down(%tags);
    ...

    Edit: IMO relying on the order of command line parameters is not good form. There are exceptions of course, but I always try to avoid it if I can. In your use-case it is not required. And the code I showed does not require it either.
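
    One possible variation on the code above, sketched under assumptions: Getopt::Long can fill a hash directly when the option spec ends in %, splitting each key=value pair itself. look_down() compares plain string criteria for equality, so this sketch turns attribute values into literal-substring regexes, which is what the original hardcoded call did with qr(...). The @argv array stands in for @ARGV, and the key names in %cmd2tags are illustrative.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Stand-in for @ARGV so the sketch runs without real arguments.
my @argv = ( '--tags', 'type=div', '--tags', 'class=comic container js-comic-' );

# With a hash destination, Getopt::Long splits each value at the first '='.
my %raw;
GetOptionsFromArray( \@argv, 'tags=s%' => \%raw )
    or die "Error in command line arguments\n";

# Illustrative mapping from command-line keys to look_down() criteria keys.
my %cmd2tags = ( type => '_tag', class => 'class', id => 'id', name => 'name' );

my %criteria;
for my $key ( sort keys %raw ) {
    my $crit = $cmd2tags{$key}
        or die "unknown tag key '$key'\n";
    # The element name is compared exactly; other attributes become a
    # literal-substring regex so a partial class value can still match.
    $criteria{$crit} = $crit eq '_tag' ? $raw{$key} : qr/\Q$raw{$key}\E/;
}

print "_tag => $criteria{_tag}\n";
```

    The resulting hash could then be handed over as before with $tree->look_down(%criteria).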

Re: Passing complex html-tag input over command line to HTML TreeBuilder method look_down() properly
by Anonymous Monk on Apr 18, 2019 at 01:33 UTC

    You did not say what you had tried in order to pass the tags, or what the program actually received (print it with Data::Dumper to show exactly). So ...

    Use the Getopt::Long module to receive the "tags" option arguments as a list of strings. Then process the attribute ("class" in your example) and its value yourself (use split( '=' , $attr_value_pair , 2 )).
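
    A minimal sketch of the split() approach suggested here, assuming a bare word names the element and a key=value pair names an attribute. The helper tags_to_criteria() is hypothetical and not part of HTML::TreeBuilder; values are wrapped in qr// with \Q...\E so they match as literal substrings, mirroring the regex in the hardcoded call.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the --tags values into a criteria list
# suitable for $tree->look_down(...).
sub tags_to_criteria {
    my @tags = @_;
    my @criteria;
    for my $tag (@tags) {
        if ( $tag =~ /=/ ) {
            # A limit of 2 keeps any later '=' inside the value.
            my ( $attr, $value ) = split /=/, $tag, 2;
            # \Q...\E quotes regex metacharacters in the value.
            push @criteria, $attr => qr/\Q$value\E/;
        }
        else {
            # A bare word is treated as the element name.
            push @criteria, _tag => $tag;
        }
    }
    return @criteria;
}

my @criteria = tags_to_criteria( 'div', 'class=comic container js-comic-' );
print "$criteria[0] => $criteria[1]\n";
```

    The returned list could then be passed straight through, as in $tree->look_down( tags_to_criteria(@tags) ).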

      Was the first part demonstrating how I invoke the program not clear?
      bash $: perl ./download.pl --url 'http://www.gocomics.com/9chickweedlane/2019/04/17' --tags div --tags class="comic container js-comic-"

        I did see the invocation; what I missed was that you were already using both Getopt::Long and split(), because I did not see what input you were actually working with. So, going by the title of the OP, I answered the wrong (and possibly non-) question of how to pass the tags, the attributes, and the values from the command line to your program.

Re: Passing complex html-tag input over command line to HTML TreeBuilder method look_down() properly
by Anonymous Monk on Apr 18, 2019 at 07:52 UTC
      Thanks for the suggestion. I've looked at them but I'm not really sure how to make them work. Using my browser's devtools, I found the specific element I want to target and did Copy --> Copy Selector. It gave me this:
        body > div.js-amu-container-global > div.gc-page-container > div.gc-container.gc-container--fluid > div > div.layout-2col-content.content-section-padded > div.gc-container > div.comic.container.js-comic-2729847.js-item-init.js-item-share.js-comic-swipe.bg-white.border.rounded > div.comic__wrapper > div.comic__container > div > a > picture
      
      I'm really not sure how to use that. Something like this?
      use HTML::Selector::XPath;
      my $selector = HTML::Selector::XPath->new("div.comic__container");
      $selector->to_xpath;
      I want to parse that piece of HTML code to find the 'data-image' attribute, which I know resides within there very close to that selector point.
        #!/usr/bin/perl --
        use strict;
        use warnings;
        use HTML::TreeBuilder::XPath;
        use HTML::Selector::XPath 'selector_to_xpath';

        Main( @ARGV );

        sub Main {
            my $tree = HTML::TreeBuilder::XPath->new;
            # $tree->parse_file('foo.html');
            $tree->parse_content( DemoHtml() );
            for my $node ( $tree->findnodes( selector_to_xpath( 'div.comic__container' ) ) ) {
                MeImagins( $node );
            }
        }

        sub MeImagins {
            my( $node ) = @_;
            # './/img' is relative to $node; a bare '//img' would search the whole document
            for my $img( $node->findnodes('.//img') ){
                print "\n###",
                    "\n", $img->address(),
                    "\n", $img->attr( 'src' ),
                    "\n", $img->attr( 'alt' ),
                    "\n",
                    ;
            }
        }

        sub DemoHtml {
            return <<'__HTML__';
        <div class="comic__container">
          <div class="comic__image js-comic-swipe-target">
            <div class="swipe-preview swipe-preview__previous js-preview-previous">
              <div class="swipe-preview__group">
                <h5 class="card-subtitle">
                  <date>April 16, 2019</date>
                </h5>
                <div class="swipe-preview__ubadge">
                  <div class="gc-avatar gc-avatar--creator sm"><img srcset="https://assets.gocomics.com/assets/transparent-3eb10792d1f0c7e07e7248273540f1952d9a5a2996f4b5df70ab026cd9f05517.png" data-srcset="https://avatar.amuniversal.com/feature_avatars/ubadge_images/features/cw/small_u-201701251613.png, 72w" class="lazyload" alt="9 Chickweed Lane" src="https://avatar.amuniversal.com/feature_avatars/ubadge_images/features/cw/small_u-201701251613.png"></div>
                </div>
              </div>
            </div>
            <a itemprop="image" class="js-item-comic-link" href="/9chickweedlane/2019/04/17" title="9 Chickweed Lane">
              <picture class="item-comic-image"><img class="lazyload img-fluid" srcset="https://assets.gocomics.com/assets/transparent-3eb10792d1f0c7e07e7248273540f1952d9a5a2996f4b5df70ab026cd9f05517.png" data-srcset="https://assets.amuniversal.com/93d41d70391d01379025005056a9545d 900w" sizes=" (min-width: 992px) 900px, (min-width: 768px) 600px, (min-width: 576px) 300px, 900px" alt="9 Chickweed Lane Comic Strip for April 17, 2019 " src="https://assets.amuniversal.com/93d41d70391d01379025005056a9545d" width="100%"></picture>
            </a>
            <meta itemprop="isFamilyFriendly" content="true">
            <div class="swipe-preview swipe-preview__next js-preview-next">
              <div class="swipe-preview__group">
                <h5 class="card-subtitle">
                  <date>April 18, 2019</date>
                </h5>
                <div class="swipe-preview__ubadge">
                  <div class="gc-avatar gc-avatar--creator sm"><img srcset="https://assets.gocomics.com/assets/transparent-3eb10792d1f0c7e07e7248273540f1952d9a5a2996f4b5df70ab026cd9f05517.png" data-srcset="https://avatar.amuniversal.com/feature_avatars/ubadge_images/features/cw/small_u-201701251613.png, 72w" class="lazyload" alt="9 Chickweed Lane" src="https://avatar.amuniversal.com/feature_avatars/ubadge_images/features/cw/small_u-201701251613.png"></div>
                </div>
              </div>
            </div>
          </div>
          <nav class="gc-calendar-nav" role="group" aria-label="Date Navigation Controls">
            <div class="gc-calendar-nav__previous">
              <a role="button" href="/9chickweedlane/1993/07/12" class="fa btn btn-outline-secondary btn-circle fa fa-backward sm " title=""></a>
              <a role="button" href="/9chickweedlane/2019/04/16" class="fa btn btn-outline-secondary btn-circle fa-caret-left sm js-previous-comic" title=""></a>
            </div>
            <div class="gc-calendar-nav__select">
              <div class="btn btn-outline-secondary gc-calendar-nav__datepicker js-calendar-wrapper" data-date="2019/04/17" data-name="/9chickweedlane/" data-year="2019" data-month="04" data-day="17" data-feature="9chickweedlane" data-ct="" data-start="1993/07/12" data-end="2019/04/19" data-open="2019-04-17">
                <i class="fa fa-calendar xs"></i>
                <input name="startDate" placeholder="April 17, 2019" readonly="readonly" class="cal off calendar-input date js-calendar-input datepicker js-calendar-input-link" type="text">
              </div>
              <a class="btn btn-outline-secondary" alt="Click to View a Random 9 Chickweed Lane Comic Strip!" href="/random/9chickweedlane">Random</a>
            </div>
            <div class="gc-calendar-nav__next">
              <a role="button" href="/9chickweedlane/2019/04/18" class="fa btn btn-outline-secondary btn-circle fa-caret-right sm " title=""></a>
              <a role="button" href="/9chickweedlane/2019/04/19" class="fa btn btn-outline-secondary btn-circle fa-forward sm " title=""></a>
            </div>
          </nav>
        </div>
        __HTML__
        }
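
        For the 'data-image' attribute itself, a rough sketch along the same lines might look like the following. The HTML fragment and its URL are made up for illustration, and placing data-image on the div whose class contains "js-comic-" is an assumption taken from the original post's look_down() call, so the real page may differ.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up fragment for illustration; the real page markup may differ.
my $html = <<'__HTML__';
<div class="comic container js-comic-2729847" data-image="https://example.invalid/strip.png">
  <div class="comic__wrapper"><div class="comic__container"></div></div>
</div>
__HTML__

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Find the divs whose class contains "js-comic-", then read the attribute
# from each matching element.
my @urls = map { $_->attr('data-image') }
    $tree->findnodes('//div[contains(@class, "js-comic-")]');

print "Data-Image URL: $_\n" for @urls;

$tree->delete;
```

        Reading the attribute with attr() after findnodes() keeps the XPath expression itself simple.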

Node Type: perlquestion [id://1232738]
Approved by haukex
Front-paged by haukex