PerlMonks  

help with some parsing

by AI Cowboy (Sexton)
on May 27, 2009 at 20:53 UTC (#766509=perlquestion)
AI Cowboy has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I'm a new guy here, and I seem to have run into a problem. I am trying out new ways for a robot I'm making to learn, and I need to figure out how to parse a web page so that I get only the text I want. For example, from a Wikipedia page I want only the page contents, not the "log in here" or "forgot your password?" things. I am at a loss. Could you help me out?

Re: help with some parsing
by CountZero (Bishop) on May 27, 2009 at 21:07 UTC
    Hail and welcome to the Monastery!

    Did you already take a look at WWW::Robot or WWW::Spyder?

    For parsing HTML pages, HTML::Parser or HTML::Parser::Simple are worth a look.
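
    A minimal sketch of how HTML::Parser might be used to keep only the text of one page element and ignore login links and the like. The helper name extract_text_by_id and the id "content" are made up for illustration; the real id of the content element has to be looked up in the page source. The sketch also assumes reasonably well-formed markup in which every non-void element is explicitly closed.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text inside the element whose id
# attribute equals $wanted_id, skipping <script> and <style> bodies.
sub extract_text_by_id {
    my ($html, $wanted_id) = @_;

    my $depth = 0;    # nesting depth inside the wanted element (0 = outside)
    my $skip  = 0;    # true while inside <script> or <style>
    my $text  = '';

    # Void elements never have an end tag, so they must not change the depth.
    my %void = map { $_ => 1 }
        qw(area base br col embed hr img input link meta source track wbr);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return if $void{$tag};
            if ($depth) {
                $depth++;
                $skip++ if $tag eq 'script' || $tag eq 'style';
            }
            elsif (defined $attr->{id} && $attr->{id} eq $wanted_id) {
                $depth = 1;    # entered the wanted element
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            return if !$depth || $void{$tag};
            $skip-- if $skip && ($tag eq 'script' || $tag eq 'style');
            $depth--;
        }, 'tagname' ],
        text_h => [ sub {
            my ($chunk) = @_;
            $text .= $chunk if $depth && !$skip;
        }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Made-up sample page: a login box followed by the real content.
my $html = join "\n",
    '<html><body>',
    '<div id="login">Log in here <a href="/pw">Forgot your password?</a></div>',
    '<div id="content">',
    '<h1>Robots</h1>',
    '<p>A robot is a <b>machine</b>.</p>',
    '<script>var x = 1;</script>',
    '</div>',
    '</body></html>';

print extract_text_by_id($html, 'content'), "\n";
```

    Counting nesting depth is only one way to do this. For messy real-world pages, a module that builds a full tree of the document may be more forgiving than tracking start and end tags by hand.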

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I have not; however, I will take a look at the HTML parser things. Thanks much!

      AI Cowboy
Re: help with some parsing
by zwon (Monsignor) on May 27, 2009 at 21:45 UTC
      That's an even better thing, and I cannot tell you how happy this makes me! Thank you so much! I will still use the other method for other sites, such as etiquette sites or something, but thanks!
