PerlMonks
Re: Scraping a website - Unterminated quoted string
by kennethk (Abbot)
on May 04, 2017 at 21:59 UTC ( [id://1189532] )
In general, parsing HTML from the wild using regular expressions is an exercise in frustration. I'd highly recommend pulling down HTML::Tree.
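Assuming HTML::Tree is installed from CPAN, a minimal sketch of parsing with its HTML::TreeBuilder class might look like this (the markup and the class name "article" are invented for illustration, not taken from the original question):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # part of the HTML::Tree distribution on CPAN

# Invented HTML snippet standing in for a fetched page
my $html = '<html><body><div class="article"><p>Hello, monks.</p></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down matches elements by tag and attribute; the class name is an assumption
my $div  = $tree->look_down( _tag => 'div', class => 'article' );
my $text = $div ? $div->as_text : '';
print "$text\n";          # prints "Hello, monks."

$tree->delete;            # free the parse tree when done
```

look_down walks the tree for you, so malformed quoting inside attributes never becomes your problem the way it does with a regex.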
Also, there's no real reason to shell out to curl. I use LWP::UserAgent, though for a low barrier to entry you may prefer LWP::Simple. With regard to your output, lexical file handles and three-argument open would be better practice (there is nothing wrong with what you are doing per se), so you might replace your open and print with that form.
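A minimal sketch of that replacement, fetching with LWP::UserAgent and writing through a lexical handle with three-argument open (the URL and output filename are placeholders, not from the original thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# URL and output filename are placeholders for illustration
my $url = 'http://example.com/';
my $ua  = LWP::UserAgent->new( timeout => 10 );

my $response = $ua->get($url);
my $content  = $response->is_success ? $response->decoded_content : '';
warn "Fetch failed: ", $response->status_line, "\n" unless $response->is_success;

# Lexical handle plus three-argument open; no shelling out to curl
open my $fh, '>', 'article.txt' or die "Cannot open article.txt: $!";
print $fh $content;
close $fh;    # would also close automatically when $fh goes out of scope
```

Because the mode is its own argument, a filename containing `>` or leading whitespace cannot change what open does.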
The file will close automatically when the handle goes out of scope, three-argument open handles some potential escaping issues that two-argument open cannot, and there is no need to quote the article content before printing it. Lastly, if you are scraping, you may be violating the site's terms of service, so please check on that for the site you are accessing. At the least, you should put a sleep in there to be polite.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
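The polite-delay advice above can be sketched as a loop with a sleep between requests (the page list is hypothetical; check the site's terms of service and robots.txt before running anything like this):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical page list, invented for illustration
my @urls = map { "http://example.com/page/$_" } 1 .. 3;

my $ua      = LWP::UserAgent->new( timeout => 10 );
my $fetched = 0;

for my $url (@urls) {
    my $response = $ua->get($url);
    warn "Skipping $url: ", $response->status_line, "\n"
        unless $response->is_success;
    # ... hand $response->decoded_content to HTML::TreeBuilder here ...
    $fetched++;
    sleep 2;    # pause between requests so the server is not hammered
}
```

Two seconds is an arbitrary choice; the point is simply not to fire requests as fast as the loop can run.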
In Section: Seekers of Perl Wisdom