http://www.perlmonks.org?node_id=646338

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm banging my head against the wall on this one, and I don't understand why I'm getting these results.

I wrote a script that grabs an XML feed from a news site, extracts <link>, <pubDate> and <title> from the feed (via XML::Simple), follows the link referenced in the news feed to the original article, and then pulls the content out of the body of the article.

As part of the "final article" body extraction, I'm also trying to pull the author's name out of the HTML content itself, using a fairly simple regex.

While testing this, my regex stopped working, and I tried to debug it by writing the contents of $html to a local file, and examining that file.

What I have looks like this, for the relevant section:

    my $req = HTTP::Request->new(GET => $link) or die $!;
    my $res = $ua->request($req);
    my $html = $res->content;

    # write_file() comes from File::Slurp
    # $item_id is the article ID extracted from <link>
    write_file($item_id, {binmode => ':raw' }, $html);

    # Original source string looks like:
    # <a href="http://news.example.com/?author=John_Smith">John Smith</a>
    my ($other, $author) = $html =~ /\?author=(.*?)">(.*)<\/a>/;

    # $author is blank, empty here, why?
    print "AUTHOR: $author\n";

    my $new_html = read_file($item_id);
    my ($n_other, $n_author) = $new_html =~ /\?author=(.*?)">(.*)<\/a>/;

    # Now $author contains the right name, "Mike Smith" for
    # example.
    print "AUTHOR: $n_author\n";

The problem is that when I read the remote content into $html via $res->content and try to extract $author from it, the match fails.

When I write $html to disk, then IMMEDIATELY read that same physical file back from disk into a new scalar ($new_html above), and then run the same exact regex across it, it works fine.
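One way to narrow this down might be to compare the two scalars directly instead of by eye. The following is only a diagnostic sketch, not a fix: describe() is a hypothetical helper, and the two strings are stand-ins for $res->content and the result of read_file($item_id), so it runs without network access. It prints each string's length, its internal UTF-8 flag, and an escaped dump so that invisible characters show up.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: summarize a scalar so hidden differences show up.
sub describe {
    my ($label, $str) = @_;
    local $Data::Dumper::Useqq  = 1;   # show control and non-ASCII characters as escapes
    local $Data::Dumper::Terse  = 1;   # no "$VAR1 =" prefix
    local $Data::Dumper::Indent = 0;   # keep the dump on one line
    return sprintf "%s: length=%d utf8_flag=%d dump=%s",
        $label, length($str), (utf8::is_utf8($str) ? 1 : 0), Dumper($str);
}

# Stand-ins (assumptions): in the real script these would be
# $res->content and read_file($item_id).
my $html     = qq{<a href="http://news.example.com/?author=John_Smith">John Smith</a>\n};
my $new_html = $html;

print describe('html',     $html),     "\n";
print describe('new_html', $new_html), "\n";
print "identical: ", ($html eq $new_html ? "yes" : "no"), "\n";

# Run the same regex from the script over both strings.
for my $pair ([html => $html], [new_html => $new_html]) {
    my ($label, $str) = @$pair;
    my ($other, $author) = $str =~ /\?author=(.*?)">(.*)<\/a>/;
    printf "%s author: %s\n", $label, defined $author ? $author : '(no match)';
}
```

If the two descriptions differ in the real script (length, flag, or escaped bytes), that difference would be the place to look; if they are identical, the regex should behave the same on both.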

WHY?!