PerlMonks  

Re^2: Parse::RecDescent for parsing URLs

by artist (Parson)
on Jul 27, 2007 at 17:46 UTC


in reply to Re: Parse::RecDescent for parsing URLs
in thread Parse::RecDescent for parsing URLs

I am looking to extract URL patterns from given sites. For example, http://www.perlmonks.org/index.pl?node_id=629153 is a valid question-answer node, whereas http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads is not. The pattern here is that a URL whose query matches node_id=\d+ is a valid question-answer node. Extracting these kinds of patterns from a given site can help me determine the nature of a link. I would like to do this site-wide, automatically.

Hopefully, I am making sense here.




--Artist


Re^3: Parse::RecDescent for parsing URLs
by ikegami (Pope) on Jul 27, 2007 at 17:53 UTC

    Parse::RecDescent is used to create parsers, yet there already exists a parser for URIs. URI and its extension URI::QueryParam should do the trick.

    Update: Here's an example:

    use URI             qw( );
    use URI::QueryParam qw( );

    foreach (
        'http://www.perlmonks.org/index.pl?node_id=629153',
        'http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads',
    ) {
        my $uri = URI->new($_);

        # query_param returns every value given for the named parameter.
        my @node_ids    = $uri->query_param('node_id');
        my @node_titles = $uri->query_param('node');

        # Both parameters at once, or a repeated parameter, is ambiguous.
        if ( (@node_ids && @node_titles) || @node_ids > 1 || @node_titles > 1 ) {
            warn("$uri: Error: Bad uri\n");
        }

        if (!@node_ids && !@node_titles) {
            warn("$uri: Warning: Unrecognized uri\n");
            next;
        }

        if (@node_ids) {
            print("$uri: By Id ($node_ids[0])\n");
        }

        if (@node_titles) {
            print("$uri: By Title ($node_titles[0])\n");
        }
    }
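
For the site-wide case, where the interesting parameters are not known in advance, one possible approach is to reduce every URL to a coarse "shape" and count how often each shape occurs. The following is only a sketch under assumptions: url_shape is a hypothetical helper (not part of URI), the two character classes are an arbitrary choice, and the URL list is made up for illustration.

```perl
use strict;
use warnings;
use URI             qw( );
use URI::QueryParam qw( );

# Hypothetical helper: replace each query value with a coarse class so that
# URLs differing only in their values collapse onto the same shape.
sub url_shape {
    my ($url) = @_;
    my $uri  = URI->new($url);
    my @keys = $uri->query_param;    # parameter names present in the query

    my @parts;
    foreach my $key ( sort @keys ) {
        my ($value) = $uri->query_param($key);    # first value only

        my $class;
        if    ( !defined $value || $value eq '' ) { $class = '';    }
        elsif ( $value =~ /\A\d+\z/ )              { $class = '\d+'; }
        else                                       { $class = '.+';  }

        push @parts, "$key=$class";
    }

    return $uri->host . $uri->path . '?' . join( '&', @parts );
}

# Made-up sample of links collected from one site.
my %count;
$count{ url_shape($_) }++ foreach (
    'http://www.perlmonks.org/index.pl?node_id=629153',
    'http://www.perlmonks.org/index.pl?node_id=629161',
    'http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads',
);

printf( "%3d  %s\n", $count{$_}, $_ ) foreach sort keys %count;
```

Shapes that occur often with a \d+ value are candidates for "content node" patterns. Deciding which shapes actually are question-answer nodes still needs a human or some further heuristic.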
Re^3: Parse::RecDescent for parsing URLs
by ikegami (Pope) on Jul 27, 2007 at 18:35 UTC

    Or maybe you are trying to extract data from a downloaded HTML page? If so, use an existing HTML parser (such as HTML::TreeBuilder and HTML::Tree) instead of rolling your own.

    I've found XPath to be very useful. HTML::TreeBuilder::XPath allows you to query the HTML document for information. The Firebug extension for Firefox can help you find the paths.
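
As a rough illustration of querying a page with XPath, the sketch below pulls every link target out of a document with HTML::TreeBuilder::XPath and keeps those that look like node_id links. The HTML fragment is made up and merely stands in for a downloaded page; the XPath expression and the filter regex are assumptions chosen for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up HTML fragment standing in for a downloaded page.
my $html = <<'HTML';
<html><body>
  <a href="/index.pl?node_id=629153">Question</a>
  <a href="/index.pl?node=Recently%20Active%20Threads">Recent</a>
  <p>No link here</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Collect the href attribute of every <a> element.
my @hrefs = $tree->findvalues('//a/@href');

# Keep only the links whose query carries a numeric node_id.
my @node_links = grep { /[?&;]node_id=\d+(?:[&;]|\z)/ } @hrefs;

print "$_\n" foreach @node_links;

$tree->delete;    # release the tree once done with it
```

On a real page the XPath expression would usually be narrower (for example, restricted to the element that holds the main content), which is where a browser inspector helps.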

    If PerlMonks is not just an example, I recommend downloading the XML version of pages by adding the displaytype=xml query parameter to the requested URIs. The same advice I gave for HTML applies to XML: use an existing parser. XPath is very useful for XML too.
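
Adding the parameter can be done with the same URI::QueryParam interface used for reading parameters. This is a minimal sketch that only builds the URL. It does not fetch anything and makes no assumption about the layout of the returned XML.

```perl
use strict;
use warnings;
use URI             qw( );
use URI::QueryParam qw( );

my $uri = URI->new('http://www.perlmonks.org/index.pl?node_id=629153');

# Add displaytype=xml while leaving the existing parameters in place.
$uri->query_param( displaytype => 'xml' );

print "$uri\n";
```

The resulting URI can then be handed to whatever HTTP client is in use, and the response to an XML parser that supports XPath.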
