Re: Question regarding web scraping

by Corion (Patriarch)
on Oct 22, 2016 at 14:39 UTC


in reply to Question regarding web scraping

The following is a malformed regular expression:

while ($CONTENT =~ <div class=\"usertext-body may-blank-within md-container \"><div class=\"md\">(.+?)<\/div><\/div><\/form><ul class=\"flat-list buttons\"> //gs )

At the very least, it is missing the opening s/ delimiter.
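
For comparison, a well-formed version might look like the sketch below. It assumes the goal is to capture each comment body in a loop, so it uses a plain match with m{...}gs rather than a substitution. The sample $CONTENT string is made up for the example and real Reddit markup will differ.

```perl
use strict;
use warnings;

# Hypothetical sample input: two comment bodies wrapped in the markup
# that the original pattern was looking for.
my $CONTENT = '';
for my $body ( '<p>first comment</p>', '<p>second comment</p>' ) {
    $CONTENT .= qq{<div class="usertext-body may-blank-within md-container "><div class="md">$body</div></div></form><ul class="flat-list buttons">\n};
}

# m{...} as the delimiter means neither the slashes in the closing tags
# nor the double quotes need escaping.
# /g finds every match in turn, /s lets . match newlines as well.
my @comments;
while ( $CONTENT =~ m{<div class="usertext-body may-blank-within md-container "><div class="md">(.+?)</div></div></form><ul class="flat-list buttons">}gs ) {
    push @comments, $1;
}

print "$_\n" for @comments;
```

A pattern like this still breaks as soon as the site changes its class names or whitespace, which is one reason to prefer a real HTML parser.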

Personally, I suggest that you do the content extraction with HTML::TreeBuilder and XPath or CSS selectors (via HTML::TreeBuilder::XPath and HTML::Selector::XPath).
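
As a rough sketch of the parser-based approach, the following uses HTML::TreeBuilder::XPath to pull the text out of every div carrying the class md. The sample markup is hypothetical and only mimics the class names from the regular expression above. The XPath expression is one possible way to match a class token, not necessarily the best one for the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for the downloaded page.
my $html = <<'HTML';
<html><body>
<div class="usertext-body may-blank-within md-container ">
  <div class="md"><p>first comment</p></div>
</div>
<div class="usertext-body may-blank-within md-container ">
  <div class="md"><p>second comment</p></div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select every div whose class attribute contains the token "md".
# Padding with spaces avoids matching "md-container" by accident.
my @nodes = $tree->findnodes(
    '//div[contains(concat(" ", normalize-space(@class), " "), " md ")]'
);

# as_trimmed_text returns the text content with surrounding whitespace removed.
my @comments = map { $_->as_trimmed_text } @nodes;
print "$_\n" for @comments;

$tree->delete;    # release the parse tree explicitly
```

Unlike the regular expression, this does not depend on the exact order of attributes or on what follows the closing tags.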

Also note that Reddit has an API available, so you may not need to scrape at all and can get the comments in a machine-readable format directly.

There are also many Reddit modules available on CPAN, and it seems that Reddit::Client uses the Reddit API.

Replies are listed 'Best First'.
Re^2: Question regarding web scraping
by Gangabass (Vicar) on Oct 23, 2016 at 05:42 UTC
Re^2: Question regarding web scraping
by Lisa1993 (Acolyte) on Oct 22, 2016 at 15:30 UTC
    Thank you very much! I will look into these alternatives. Thanks again for your suggestions.
