A regex on the same content fails and works, with conditions
by hacker (Priest) on Oct 22, 2007 at 00:56 UTC
hacker has asked for the wisdom of the Perl Monks concerning the following question:
I'm banging my head against the wall on this one, and I don't understand why I'm getting these results.
I have a script I wrote that grabs an XML feed from a news site, extracts <link>, <pubDate> and <title> from the feed (via XML::Simple), follows the link referenced in the news feed to the original article, and then pulls the content out of the body of the article.
As part of the "final article" body extraction, I'm also trying to pull the author's name out of the HTML content itself, using a fairly simple regex.
While testing this, my regex stopped working, and I tried to debug it by writing the contents of $html to a local file, and examining that file.
What I have looks like this, for the relevant section:
The problem I'm having is that when I read the remote content into $html via res->content and try to extract $author from it, the regex fails to match.
When I write $html to disk, then IMMEDIATELY read that same physical file back from disk into a new scalar ($new_html above), and then run the same exact regex across it, it works fine.
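One way to check whether the two scalars really hold the same data is to compare them directly and dump them with invisible characters escaped. The following is only a minimal sketch of that round-trip comparison, not the script in question: the sample markup, the "\r\n" line endings, and the temp file are assumptions made up for illustration, and it uses core modules only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Data::Dumper;

# Hypothetical stand-in for the fetched page. In the real script $html
# comes from the HTTP response; the markup and "\r\n" endings are assumed.
my $html = "<p class=\"byline\">By Jane Doe</p>\r\n<p>Body text</p>\r\n";

# Write the scalar to a temporary file without any layer translation.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $html;
close $fh or die "close failed: $!";

# Read the same file straight back into a second scalar.
open my $in, '<', $path or die "cannot open $path: $!";
binmode $in;
my $new_html = do { local $/; <$in> };   # slurp the whole file
close $in;

# Compare the two copies: equality, length, and the internal UTF-8 flag.
print "identical: ", ($html eq $new_html ? "yes" : "no"), "\n";
printf "lengths: %d vs %d\n", length $html, length $new_html;
printf "utf8 flag: %d vs %d\n",
    (utf8::is_utf8($html)     ? 1 : 0),
    (utf8::is_utf8($new_html) ? 1 : 0);

# Dump a prefix of each with control characters shown as escapes,
# so stray "\r", tabs, or high bytes become visible.
$Data::Dumper::Useqq = 1;
print Dumper(substr($html, 0, 40), substr($new_html, 0, 40));
```

If the two copies differ in the real script (for example in length or in the UTF-8 flag), that difference is a likely place to look for why the same regex behaves differently on each.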