<?xml version="1.0" encoding="windows-1252"?>
<node id="1007958" title="Help with web crawling" created="2012-12-09 02:06:47" updated="2012-12-09 02:06:47">
<type id="115">
perlquestion</type>
<author id="889645">
eversuhoshin</author>
<data>
<field name="doctext">
&lt;p&gt; Dear Monks, &lt;/p&gt;

&lt;p&gt; I need help web crawling. I need to obtain the html code in the web page itself. I have tried WWW::Mechanize and URI to convert it to an absolute URL. But I have failed so far. &lt;/p&gt;

&lt;p&gt; Can someone please help me crawl through or download the html code of the webpage of &lt;/p&gt;

&lt;p&gt; www.sec.gov/Archives/edgar/data/935226/000114420411058092/0001144204-11-058092-index.htm
&lt;/p&gt;

&lt;p&gt; Here is the code trying to crawl the edgar website &lt;/p&gt;

&lt;code&gt;
use strict;
use WWW::Mechanize;
use LWP::Simple;
use URI;

my $url='edgar/data/1750/0001104659-06-059326-index.html';
my $web='www.sec.gov/Archives/'.$url;

my @temp=split(/\//,$url);
chomp($web);
my $rel_url='/'.$temp[2].'/'.$temp[3];
my $base_url='www.sec.gov/Archives/edgar/data';
my $abs_url=URI-&gt;new_abs($rel_url,$base_url);
my $text=get($abs_url) or die $!; 

&lt;/code&gt;

&lt;p&gt; This is the SEC Edgar data base and once I figure out how to crawl through I can do the parsing. I just need the information between the "div class="infoHead"Items div" Thank you so much! &lt;/p&gt;</field>
</data>
</node>
