comment on

Dear Monks,

I need help web crawling. I need to obtain the html code in the web page itself. I have tried WWW::Mechanize and URI to convert it to an absolute URL. But I have failed so far.

Can someone please help me crawl through or download the html code of the webpage of

www.sec.gov/Archives/edgar/data/935226/000114420411058092/0001144204-11-058092-index.htm

Here is the code trying to crawl the edgar website

use strict;
use WWW::Mechanize;
use LWP::Simple;
use URI;

my $url='edgar/data/1750/0001104659-06-059326-index.html';
my $web='www.sec.gov/Archives/'.$url;

my @temp=split(/\//,$url);
chomp($web);
my $rel_url='/'.$temp[2].'/'.$temp[3];
my $base_url='www.sec.gov/Archives/edgar/data';
my $abs_url=URI->new_abs($rel_url,$base_url);
my $text=get($abs_url) or die $!;
[download]

This is the SEC Edgar data base and once I figure out how to crawl through I can do the parsing. I just need the information between the "div class="infoHead"Items div" Thank you so much!

In reply to Help with web crawling by eversuhoshin

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


There's more than one way to do things
	PerlMonks