http://www.perlmonks.org?node_id=677833

whakka has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing multiple web pages that have dates in the text that look like this: "11/17/2006." They look like this in a browser and in the source.

However, when I get the page with a WWW::Mechanize bot and dump its contents, the date changes to: "Fri Nov 17 00:00:00 EST 2006."

This is confusing the heck out of me. Does anyone know what's going on? If not, what's the best way to turn the date back into its original format?

Many thanks.

Update: code added in comments.


Re: Weird date format behavior with WWW::Mechanize
by Fletch (Bishop) on Apr 01, 2008 at 20:58 UTC

    Well, going by the voluminous amount of sample data and code you've given demonstrating the problem I think we can conclusively nail down the root cause:

    Gremlins.

    (See How (Not) To Ask A Question.)

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      I've read that, actually. I really thought this was a straightforward question that could be answered without code, but here you go:
      #!perl -w
      use strict;
      use WWW::Mechanize;          # browser, extends LWP
      use HTTP::Cookies::Mozilla;  # cookie reader for bot

      my $mech = WWW::Mechanize->new();
      $mech->cookie_jar(HTTP::Cookies::Mozilla->new(
          file     => 'cookies.txt',
          autosave => 1,
      )) || die "Couldn't fill cookie jar!\n";

      my $url = "http://www.insor.org/insasoweb/offenderDetails.do?sid=354656.011";
      $mech->get($url);
      print $mech->content;

      You will notice the difference between how the page looks in the browser and what prints, I hope.

        See, now that you've given a concrete example to look at, it's easy to see that the page in question (after you accept their disclaimer thing and get back a session cookie . . .) really does contain the full date text. The page also pulls in a JavaScript file, "common.js", and that file contains a function formatDate which looks like it munges dates.
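One quick way to spot scripts like "common.js" is to list every external script the fetched HTML refers to. The following is only a rough sketch: `script_sources` is a hypothetical helper, the sample HTML is made up for illustration, and a regex is a crude substitute for a real HTML parser. In practice you would pass it `$mech->content`.

```perl
use strict;
use warnings;

# Hypothetical helper: return the src attribute of every <script> tag
# that has one. Regex-based, so it is a sketch rather than a robust parser.
sub script_sources {
    my ($html) = @_;
    my @srcs;
    while ($html =~ /<script\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1/gis) {
        push @srcs, $2;
    }
    return @srcs;
}

# Made-up sample standing in for $mech->content.
my $html = join "\n",
    '<html><head>',
    '<script type="text/javascript" src="common.js"></script>',
    '</head><body>Fri Nov 17 00:00:00 EST 2006</body></html>';

print "$_\n" for script_sources($html);
```

Each file it lists can then be fetched and searched for anything that touches dates.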

        Given this, it's not out of the realm of possibility that something on the page is calling that JavaScript and munging all the dates after the page loads. That would easily explain the difference between what you see in your browser (even if you view source, you're seeing the source after it's been walked over) and what Mechanize is showing, since Mechanize does not run JavaScript. You can confirm this by comparing the output from a third-party client (say, curl, using the JSESSIONID cookie value pulled from your browser), which should match what Mechanize says it is.
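If the goal is simply to get the short form back, the long date string can be converted in Perl without running any JavaScript. This is a minimal sketch, not the site's own formatDate logic: `to_mdy` is a hypothetical helper, and it assumes the long form always looks like "Fri Nov 17 00:00:00 EST 2006" and that the short form is zero-padded MM/DD/YYYY. The time and time zone are ignored.

```perl
use strict;
use warnings;

# Map three-letter English month abbreviations to month numbers.
my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

# Hypothetical helper: turn "Fri Nov 17 00:00:00 EST 2006" into "11/17/2006".
# Returns nothing (undef in scalar context) if the text does not match.
sub to_mdy {
    my ($text) = @_;
    my ($mon, $day, $year) =
        $text =~ /^\w{3} (\w{3}) +(\d{1,2}) \d{2}:\d{2}:\d{2} \w+ (\d{4})$/
        or return;
    my $m = $MONTH{$mon} or return;
    return sprintf '%02d/%02d/%04d', $m, $day, $year;
}

print to_mdy('Fri Nov 17 00:00:00 EST 2006'), "\n";   # prints 11/17/2006
```

To rewrite every such date in a fetched page, the same pattern (without the `^` and `$` anchors) could be used in a global substitution over the page text.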

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

Re: Weird date format behavior with WWW::Mechanize
by Anonymous Monk on Oct 02, 2012 at 14:09 UTC
    Hello Whakka, I ran into a similar issue recently. Were you able to fix yours? If so, please let me know how. Thanks in advance. Venkat.

      Are you aware that this thread is now more than four years old? And obviously, if your problem is similar, you haven't read the old answers. So, please read the old answers. If that does not help, read How (Not) To Ask A Question, then create a new thread.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)