Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

PerlMonks::Mechanized (beta)

by davido (Archbishop)
on Dec 25, 2004 at 21:24 UTC ( #417429=sourcecode: print w/ replies, xml ) Need Help??

Category: PerlMonks.org Related Scripts
Author/Contact Info davido
Description:

It's Christmas day in LA, and here's my present to the Monastery (whether you see it as a diamond or a lump of coal remains to be determined). Merry Christmas all...

This is Yet Another Attempt to gather Monastery automation tools under one roof. The following is a module, PerlMonks::Mechanized, which mechanizes many of the common PM tasks. It makes using the Monastery's XML tickers a trivial task, for example. I intend to polish it, and add features on request. For now, it just has what I have needed myself recently, plus a little more. But as requests come in, if they're well thought-out, I'll add them. I'm also very interested in comments that will help me to improve it. At this point, I don't consider any part of this module stable (not even its interface). But if people like it, I'll stabilize it, properly document it, and continue adding to it/ maintaining it. For now, this is a rough-draft.

It uses $ENV{PMPASS} and $ENV{PMUSER} for login info, or you can supply them as user, passwd to my $monk = PerlMonks::Mechanize->new($user,$passwd);

Here is a simple example...

use strict; use warnings; use PerlMonks::Mechanized; my $monk = PerlMonks::Mechanized->new(); $monk->say( "Hello. This is a PerlMonks::Mechanized test." );

The preceeding code assumes you've set your login env variables. The code only logs you in to the Monastery if the action you're requesting requires a login. Otherwise, you stay logged out. Login is on demand, assuming you've supplied the proper login info. Janitors can also use this for node edits / retitling. I'll be reworking the retitler to take advantage of this module's features.

The code is well commented. Start your reading at the beginning of the PerlMonks::Mechanized package (ie, ignore the Janitor package until you've read PM::Mech first. Then if you're a janitor, go back and read the Janitor class). All public functions are documented at least enough that with the help of Data::Dumper you'll know what they're doing.

Enjoy!


Updates:

  • PM::Mech::Janitor fixes (updated to v0.6):
    • Renamed Janitor class to PerlMonks::Mechanized::Janitor (more sensible)
    • PM::Mech::Janitor bug fixed where W::Mech was misspelled. Woops.
    • PM::Mech::Janitor's version number goes to 0.6.
  • PM::Mechanized fixes (updated to v0.6)
    • Repaired login detection to work with PM::M::Janitor
  • URL fix, per Dietz.
    • PM::M becomes v0.61. Added trailing '/' to URL.




# PerlMonks::Mechanized
# and its helper class 'PM::Mech::Janitor'.
# Start by creating a PerlMonks::Mechanized->new() object.
# If needed, login info can be stored in $ENV{PMPASS}, $ENV{PMUSER}.
# To override default site URL, set $ENV{PMSITE} to full URL.
# Janitor object is returned by the PerlMonks::Mechanized->janitor()
# method.  Don't call Janitor->new() directly.
# Read on, for details.


package PerlMonks::Mechanized::Janitor;

use strict;
use warnings;

use WWW::Mechanize; # Needs a new WWW::Mech object in order to 
                    # implement rollback / commit.

# Janitor class: Returned by janitor() method of PM::Mech.  First 
# call must be to $janitor->fetch().  After that initial hit, no
# server hits will occur until $janitor->commit() is invoked.
# You may spawn multiple janitors, for full cached transactions.
# Each Janitor invokes a new WWW::Mechanize object.


# Returns a Janitor object.

sub new {
    my( $class, $monk, $id ) = @_;
    my $self = {};
    $self->{monk} = $monk;
    $self->{id} = $id;
    $self->{agent} = WWW::Mechanize->new( 
        'autocheck' => 1,
        'agent' => 'PM::Mech::Janitor0.6'
    );
    $self->{site} = $self->{monk}{site};
    $self->{fetched} = 0;
    return bless $self, $class;
}


# If passed the argument of "unconsider", the target node will be
# unconsidered.
# Must fetch() before using any of the other methods in this class.
# No return value.

sub fetch {
    my( $self, $unconsider ) = @_;
    my $uncon_URI = '';
    if(
            defined( $unconsider ) 
        and $unconsider =~ /^Un/i
    ) {
        $uncon_URI =   ";op=consider;"
                     . "$self->{id}=unconsider";
    }
    $self->{agent}->get(
          $self->{site}
        . '?node_id='
        . $self->{id}
        . ';displaytype=editors'
        . $self->{monk}->_login_URI('force')
        . $uncon_URI
    );
    $self->{agent}->success()
        or die "Unable to fetch Janitors view of $self->{id}.\n";
    $self->{fetched} = 1;
}

sub get_title {
    my $self = shift;
    _verify_fetch( $self );
    my $form = $self->{agent}->form_name( 'edit_node' );
    $form or die "Couldn't find 'edit_node' form in get_title.\n";
    return $form->value( 'update_title', 1 );
}

sub set_title {
    my( $self, $title ) = @_;
    _verify_fetch( $self );
    $self->{agent}->field( 'update_title', $title );
}

sub get_author {
    my $self = shift;
    _verify_fetch( $self );
    my $content = $self->{agent}->content();
    my $author = '';
    if( $content =~ 
            m/\s+by\s+<a HREF="\?node_id=\d+">([^<]+)<\/a>/i ) {
        $author = $1;
    } else {
        die   "Couldn't ascertain the author while scraping the "
            . "editor view of ID: $self->{id}.\n";
    }
    return $author;
}

sub get_doctext {
    my $self = shift;
    _verify_fetch( $self );
    my $form = $self->{agent}->form_name( 'edit_node' );
    $form or die "Couldn't find 'edit_node' form in get_doctext.\n";
    return $form->value( 'update_doctext', 1 );
}

sub set_doctext {
    my( $self, $text ) = @_;
    _verify_fetch( $self );
    $self->{agent}->field( 'update_doctext', $text );
}

# Commits the changes made.  The only change that cannot be rolled-
# back is the "unconsider" change.  Sorry 'bout that. ;)

sub commit {
    my $self = shift;
    _verify_fetch( $self );
    $self->{agent}->current_form->value( 'blah', 'update' );
    $self->{agent}->click( 'blah' );
    $self->{agent}->success()
        or die "Couldn't commit changes to $self->{id}.\n";
}

# Private Janitor class function.

sub _verify_fetch {
    my $self = shift;
    die "ID: $self->{id} hasn't been fetched yet.\n"
        unless $self->{fetched};
}

1;


package PerlMonks::Mechanized;

# This is your starting point.  ...Create a new
# PerlMonks::Mechanized object, and have fun with it.
# Logins are not performed unless they are needed for the activity
# you're requesting.  If a login is needed, it will be done 
# automatically if you passed ( user, password ) to new(), or if 
# $ENV{PMUSER} and $ENV{PMPASS} are set.  Logins are automatic, and
# on-demand.  However, once logged in, you stay logged in until
# your PM::Mech object is destroyed.

use strict;
use warnings;

use WWW::Mechanize;
use XML::Simple;

our $SITE = exists( $ENV{PMSITE} ) 
                ? $ENV{PMSITE} 
                : 'http://www.perlmonks.org/';


# Call new() to create PM::Mech object. Call with ( $user, $pass )
# to log in on demand, or set $ENV{} variables for on-demand login.
# If no login info is supplied through new() or $ENV, you can only
# do things that don't require login.  You don't need to explicitly
# log in. If you have supplied the proper info, it will happen when
# needed, transparently.

sub new {
    my $class = shift;
    my $obj = {};
    $obj->{user}   = defined( $_[0] ) 
                     ? $_[0] 
                     : defined( $ENV{PMUSER} )
                       ? $ENV{PMUSER}
                       : '';
    $obj->{passwd} = defined( $_[1] ) 
                     ? $_[1] 
                     : defined( $ENV{PMPASS} )
                       ? $ENV{PMPASS}
                       : '';
    $obj->{logged_in} = 0;
    $obj->{login_phrase} = ( $obj->{user} && $obj->{passwd} ) ?
          ";op=login;user=$obj->{user};"
        . "passwd=$obj->{passwd};expires=+10y"
        : '';
    $obj->{site} = $SITE;
    $obj->{agent} = WWW::Mechanize->new( 
        'autocheck' => 1,
        'agent' => 'PM::Mech0.61'
    );
    bless $obj, $class;
}


# Given a base thread ID, returns a datastructure containing
# The thread's ID's.  Uses the XML ticker, "xml node thread", at
# id://180684.

sub threaded_ids {
    my( $self, $base ) = @_;
        $self->{agent}->get(   $self->{site}
                             . "?node_id=180684;id=$base" );
    $self->{agent}->success()
        or die "Unable to fetch thread ticker for id = $base.\n";
    my $struct = XMLin( 
        $self->{agent}->content(),
        ForceArray => 1,
        KeepRoot   => 1
    );
    return $struct;
}


# Given a base thread ID, returns a flat list of thread ID's.
# Uses the XML ticker, "xml node thread", at id://180684.

sub thread_list {
    my( $self, $base ) = @_;
    my $structref = threaded_ids( $self, $base );
    return [ 
        sort { $a <=> $b } _flatten_thread( $structref )
    ];
}


# Returns a datastructure containing info about a node or list of
# nodes.  Uses the XML ticker, "Node Query XML Generator", 
# at id://37150.  Accepts a node id or list of nodes.

sub node_info {
    my( $self, @ids ) = @_;
    $self->{agent}->get(    
          $self->{site}
        . "?node_id=37150;nodes="
        . join( ',', @ids )
        . ';xmlstyle=flat'
    );
    $self->{agent}->success()
        or die "Unable to fetch node query XML generator.\n";
    return  XMLin( 
                $self->{agent}->content(),
                ForceArray => 1
            )->{node};
}

# Calls get_node_info() with a single ID or list of nodes.
# Returns an array of arrays holding id/title pairs.  Relies on
# "Node Query XML Generator", at id://37150.

sub node_titles {
    my( $self, @ids ) = @_;
    my $info = node_info( $self, @ids );
    return [ 
        map { [ $_->{node_id}, $_->{content} ] } 
            @{ $info } 
    ];
}


# Grab user stats.  Uses the XP XML Ticker (id://16046).
# See the PM FAQ for details about valid args, and their meanings.

sub user_stats {
    my( $self, %params ) = @_;
    my $parameters = '';
    foreach( ( 'for_user', 'showlevels', 
               'for_userid','shownorm', 'showall' ) ) {
        $parameters .=  exists( $params{$_} )
                          ? ";$_=$params{$_}"
                          : '';
    }
    $self->{agent}->get( 
          $self->{site}
        . "?node_id=16046;xmlstyle=flat"
        . $parameters
        . $self->_login_URI()
    );
    $self->{agent}->success()
        or die "Unable to fetch user stats ticker.\n";
    return  XMLin(
                $self->{agent}->content(),
                ForceArray => 1
            );
}




# Reads the New Chatterbox XML Ticker (id://207304) and returns a
# ref to a LoL structure of CB traffic.

sub chatterbox {
    my $self = shift;
    $self->{agent}->get(
          $self->{site}
        . "?node_id=207304"
    );
    $self->{agent}->success()
        or die "Unable to fetch Chatterbox content XML generator.\n";
    return   XMLin(
                $self->{agent}->content(),
                ForceArray => 1
            )->{message};
}



# Talks in the CB.  Messages can't be longer than 250 characters.

sub say {
    my( $self, $message ) = @_;
    if( length( $message ) > 250 ) {
        $message = substr $message, 0, 250;
    }
    $self->{agent}->get( 
          $self->{site}
        . '?' . $self->_login_URI
        . ';node_id=16046;op=message;message='
        . $message
    );
    $self->{agent}->success()
        or die "Unable to talk in the CB.\n";
}

# Returns logged-in user's private messages in a datastructure.
# Uses the Private Message XML Ticker (id://15848).  See 
# node_id=379320 for information on how the parameter fields work.

sub private_message {
    my( $self, %params ) = @_;
    my $parameters = '';
    foreach( ( 'max_recs', 'since_id', 'prior_to', 'archived' ) ) {
        $parameters .= exists( $params{$_} )
                         ? ";$_=$params{$_}"
                         : '';
    }
    $self->{agent}->get( 
          $self->{site}
        . "?node_id=15848;xmlstyle=clean"
        . $parameters
        . $self->_login_URI()
    );
    $self->{agent}->success()
        or die "Unable to fetch Private Message ticker.\n";
    return  XMLin(
                $self->{agent}->content(),
                ForceArray => 1
            )->{message};
}


# Reads the 'Other Users XML Ticker' (id://15851) and returns a ref
# to a list of other users currently logged in to the Monastery.

sub other_users {
    my $self = shift;
    $self->{agent}->get( 
          $self->{site}
        . "?node_id=15851"
    );
    $self->{agent}->success()
        or die "Unable to fetch Other Users XML generator.\n";
    return  XMLin( 
                $self->{agent}->content(),
                ForceArray => 1
            )->{user};
}

# Uses the 'displaytype=xml;xmlstyle=flat' ticker to grab an entire
# single node and any available related info for that node.  The 
# data is plopped into a datastructure that mirrors the original
# XML tags, which in turn, mirror PM database columns.

sub node_content {
    my( $self, $id ) = @_;
    $self->{agent}->get(
          $self->{site}
        . "?node_id=$id;displaytype=xml;xmlstyle=flat"
    );
    $self->{agent}->success()
        or die "Unable to fetch node ID: $id\n";
    return XMLin( $self->{agent}->content() );
}

# Uses the Scratchpad Viewer's XML displaytype (id://108949) to get
# a user's scratchpad.  An attempt will be made to log self in, if
# possible.  If the logged-in user is the same as the user who's pad
# we're retrieving, the private portion will also be retrieved.
# This returns a datastructure.

sub scratchpad {
    my( $self, $pad ) = @_;
    $pad = ( defined $pad ) ? $pad : $self->{user};
    $self->{agent}->get(
          $self->{site}
        . "?node_id=108949;user=$pad;passthrough=1"
        . ';displaytype=xml;xmlstyle=flat'
        . $self->_login_URI()
    );
    $self->{agent}->success()
        or die "Unable to fetch scratchpad for $pad.\n";
    return  XMLin(
                $self->{agent}->content(),
                ForceArray => 1
            );
}

# This sub uses the Newest Nodes XML Generator (id://30175) to get
# a list of newest nodes.  See the PerlMonks FAQ for a description
# of what "types=" options you have.  You may optionally specify
# whether to use xmlstyle=flat (default) or xmlstyle=rss.
# You may specify sinceunixtime=epocseconds, or days=decimal to
# get up to 8 days worth of newest nodes.
# "types" should be passed in as 'types=>[type,type,type]'
# Optional params should be passed as a hashref.
# Currently no validity checking is really done on params passed
# to the method.

sub newest_nodes {
    my( $self, %params ) = @_;
    $self->{agent}->get(
          $self->{site}
        . '?node_id=30175;xmlstyle='
        . (
            exists( $params{xmlstyle} ) 
                ? $params{xmlstyle}
                : 'flat'
          )
        . (
            exists( $params{days} )
                ? ';days=' . $params{days}
                : ''
          )
        . (
            exists( $params{sinceunixtime} )
                ? ';sinceunixtime=' . $params{sinceunixtime}
                : ''
          )
        . (
            exists( $params{types} ) 
                ?   ';types=' 
                  . join( ',', @{$params{types}} )
                : ''
          )
    );
    $self->{agent}->success()
        or die "Unable to fetch newest nodes.\n";
    return  XMLin(
                $self->{agent}->content(),
                ForceArray => 1
            );
}


# This sub fetches the displaytype=editors view of the node
# indicated in $id.  It returns an object of class Janitors with 
# the following methods:
# fetch(), get_title(), set_title(), get_author(), get_doctext(),
# set_doctext(), and commit().  You must always fetch() first, and
# after that, the rest of the methods will have relevancy.

sub janitor {
    my( $self, $id ) = @_;
    return PerlMonks::Mechanized::Janitor->new( $self, $id );
}



# Private class subs.  Please don't use these externally.

# Called by methods that need the user to be logged in.
# If the user is already logged in, this sub returns empty string.
# If user isn't logged in, and it is possible to do so, this sub
# returns a URI suffix to log the user in.

sub _login_URI {
    my( $self, $independant_agent ) = @_;
    my $login = '';
    my $logged_in = $self->{logged_in}; # Save old state.
    my $independant; # Flag for independant agent.
    if( 
        defined( $independant_agent ) 
        and $independant_agent
    ) {
        $self->{logged_in} = 0;
        $independant = 1;
    }
    if(
            $self->{logged_in} == 0
        and $self->{login_phrase}
    ) {
        $login = $self->{login_phrase};
        $self->{logged_in} = 1;
    }
    if( $independant ) {
        # if this is an independant agent, restore 
        # original login flag state.
        $self->{logged_in} = $logged_in;
    }
    return $login;
}


# Used by get_thread_list() to flatten return value from 
# get_thread_ids().

sub _flatten_thread {
    my @nodes;
    foreach my $key ( keys %{$_[0]} ) {
        if ( ref( $_[0]->{$key} ) ) {
            push @nodes, _flatten_thread( $_[0]->{$key} );
        }
        if ( $key =~ m/^\d+$/ ) {
            push @nodes, $key;
        }
    }
    return @nodes;
}




1;



Comment on PerlMonks::Mechanized (beta)
Download Code
Re: PerlMonks::Mechanized (beta)
by Dietz (Curate) on Dec 28, 2004 at 17:45 UTC
    Hi davido!

    First I have to say thanks for your effort!
    I really like this one.

    Just double checked:
    Your module works fine at home under Linux.
    At work (ActiveState v5.8.4 build 810 under WinXP) I had to add a leading slash to the node_id's in your module I was experimenting with to fetch a page.
    Without the leading slash WWW::Mechanize is reporting 'Bad Request'.

    First I thought it was your update to v0.6, but now at home again I can see that v0.5 also didn't have the slash.

    So what am I talking about:
    e.g. changed "?node_id=207304" to "/?node_id=207304"

    Compare:
    http://www.perlmonks.org?node_id=207304
    http://www.perlmonks.org/?node_id=207304

    Although I didn't test it at work, I assume adding a trailing slash to the variable our $SITE instead of adding a leading slash before all node_id's should also do the trick. If the trailing slash in $SITE doesn't work under WinXP I will report tomorrow.

    After adding the slash in your module, this works now for WinXP and for Linux, though I don't know why it is behaving like that. Is it the platform or the Perl version?
    Firefox/Mozilla adds this slash automatically if it isn't provided, IE6 doesn't care:


    Bye,
    Dietz

    P.S.: At work I use Corion's 'trick' (Re: Re: Re: WWW::Mechanize and proxy with username/password) to set our company's proxy server:

    $ENV{HTTP_PROXY} = 'http://username:password@proxyserver:port';

    This works like a charm.

    Update: formatting

      Thanks for your report. It's funny that the missing trailing '/' character wouldn't be a problem for me, and would for you. We're both hitting the same server. But regardless, it was a simple fix. I've annotated the fix among the update log in the root node for this thread.

      You may already have noticed this by reading the module's comments and code, but you can override the default site name (which is now set as "http://www.perlmonks.org/") by setting $ENV{PMSITE} prior to creating a PM::Mech object, or (of course) by setting the PMSITE environment variable. The hard coded site name now has a trailing slash to fix your reported problem. ...and likewise if using the PMSITE environment variable, remember to use that trailing slash. I may incorporate a check for that when I go through and do more to bullet proof this module in the near future.

      Glad you liked it. Aside from the positive rep, I haven't received many comments about it, so I don't know if anyone actually finds it useful. I'm nearly done rewriting the Janitors Thread Retitler to use this module, and it's made it a lot easier, so for me, it's been helpful.


      Dave

Re: PerlMonks::Mechanized (beta)
by demerphq (Chancellor) on Dec 30, 2004 at 13:05 UTC

    Heh, well, this has been a back burner project for me for quite some time. Ive focused more on the tickers, and the various XML feeds, and nothing on the HTML side. But it looks like youve gotten there first and have a nicer package than I did. Instead of pursuing my efforts Ill start looking at patching this. I actually wonder if there is any point in putting this onto Sourceforge or something so that it can be worked on collaboratively more easily.

    Anyway, nice work, I look forward to playing with it later on. :-)

    ---
    demerphq

      Putting it on Sourceforge might not be a bad idea. I'll look into that in the next day or two.

      A few thoughts for things I plan to add or change in this module:

      • Add a Recently Active Threads parser. This will be easier once a RAT XML ticker is available. Unless someone else gets to that first, I'll attempt it eventually.
      • Add a normal displaytype (non-XML) node grabber, possibly combined with an HTML token parser.
      • Add interoperability with some of the Monastery's nodelets.
      • Fix the Janitor class so that it is smart enough to know the difference between different types of nodes. For example, currently if you edit the doctext field of a Code Catacombs type node, you're actually editing the code display segment, not the description segment.
      • This would be a pretty major change, but I may make PerlMonks::Mechanized a proper subclass of WWW::Mechanize, instead of its current 'uses' relationship.

      Dave

        Some quick thoughts:

        • The XML Node thread ticker should provide an option to provide a flat list directly.
        • Ill look into making a RAT ticker. I have a feeling expecting anyone else to do it is kinda like asking them to pull out their own toenails. :-)
        • Most of my efforts have been regarding things like search internal code, and facilitating such tasks as synchronizing the pmdev server with the master. Likewise with correctly using the private message ticker, et all.
        • Im not sure if webscraping and XML parsing code should live in the same place, im not saying it shouldnt Im just wondering why it should. :-)
        • Support for multiple users?
        • XP and level info can be obtained from the User XP XML ticker with the correct option specified.
        • Generally speaking I would prefer that if you have issues that you need to workaround with a given ticker that you let me know and we change the ticker.

        This is stream of consciousness stuff here, so take it all with a grain of salt. Also I may update this as more stuff occurs to me.

        ---
        demerphq

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://417429]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (9)
As of 2014-12-21 19:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (107 votes), past polls