PerlMonks  

Making an array from a downloaded web page

by malomar66 (Acolyte)
on Jan 14, 2007 at 04:03 UTC (#594602=perlquestion)

malomar66 has asked for the wisdom of the Perl Monks concerning the following question:

Everyone was so helpful with my last question that I'm hoping you can help me with a more complicated task. There is a government site called EDGAR that hosts accounting statements. I want to automate the process of visiting those pages and downloading all the pages for a firm. Perl seems ideally suited for the task. Could anyone point me toward existing code (or help me write code) that could:

1. Get the table from a page like this one into an array.
http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000051143&owner=include&count=40

2. Follow the button at the bottom of the page that leads to the next page (and any further next pages), and put each of those tables in the array as well.

Thanks, Anthony

Replies are listed 'Best First'.
Re: Making an array from a downloaded web page
by moklevat (Priest) on Jan 14, 2007 at 04:55 UTC
    Hi Anthony, is there some reason you must use the web interface? The SEC provides an anonymous ftp interface specifically for downloading filings in bulk.
      The reason for not going to ftp is that, best I can tell, I can only download the data based on what was uploaded on a given day. Everything for every day would have to be filtered through and deleted to get just the files I need (well under 5% of the universe). I don't have the hardware for it.

      Using the website, it seemed like I could put various firm "CIK" codes into a data file and then access and parse the table information. Then I would download just the firm-form combinations I need, possibly renaming the files in the process to make the names more descriptive.

      For those giving me the great links to things that could help, is there any way I could trouble you for some "for dummies" level sample code or even for the site in question? Otherwise I suspect I'm just going to be staring at cpan.org with a very confused look on my face for a very long time to come (it's happened before).

        I am not an expert on SEC filings or the Edgar database, but after a few minutes of poking around the README file and docs directory it looks like everything would still be simpler with ftp unless you are a screen-scraping wizard.

        As I have just learned, each company that files with the SEC has a CIK number. In your example, the CIK number for IBM is 51143. All of IBM's filings live in the directory ftp://ftp.sec.gov/edgar/data/51143. From the explanation in the README, it seems that prior to Edgar 7.0 (starting in the year 2000) all of each company's filings were stored in one directory, but because of problems with documents being overwritten when amendments were submitted, everything is now stored in subdirectories based on the accession number of the documents. This might be why it appears that the information is per-day. Fortunately, it looks like the SEC provides an index of filings for each quarter, by company name or by type of filing, so you don't have to slog through every subdirectory to find the information you need. The only benefit I see from the http interface in your example is that your search focused on filings related to change of ownership. However, I gather that these are related to a known subset of forms (4, 8-K) and that this information is available in the index, so you could subset the information yourself.

        So, if it were me, I would probably move forward in two stages using Net::FTP: 1) grab the indices and subset the records I need, based on company and filing type, to create a list of the files I want, and then 2) grab those files with Net::FTP again.
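
A rough sketch of those two stages, for anyone who wants a starting point. It is untested against the live server and rests on assumptions: that the quarterly index is pipe-delimited with the columns CIK, company name, form type, date filed and filename, and that the filename column is a path usable as-is on the ftp server. The sub names select_filings and fetch_filings, the index file name master.idx and the e-mail address are made up for illustration.

```perl
use strict;
use warnings;
use Net::FTP;

# Stage 1: pick the wanted filings out of an index that has already been
# downloaded and read into an array of lines.
# ASSUMPTION: each data line looks like
#   CIK|Company Name|Form Type|Date Filed|Filename
sub select_filings {
    my ($lines, $cik, @forms) = @_;
    my %want = map { $_ => 1 } @forms;
    my @files;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        my @field = split /\|/, $copy;
        next unless @field == 5;                 # skip banner and separator lines
        next unless $field[0] =~ /^\d+$/;        # skip the column-name line
        next unless $field[0] == $cik;           # numeric compare ignores leading zeros
        next unless $want{ $field[2] };          # keep only the requested form types
        push @files, $field[4];
    }
    return @files;
}

# Stage 2: fetch each selected file over anonymous ftp.
sub fetch_filings {
    my ($host, $email, @paths) = @_;
    my $ftp = Net::FTP->new($host, Passive => 1)
        or die "Cannot connect to $host: $@";
    $ftp->login('anonymous', $email)
        or die "Login failed: ", $ftp->message;
    $ftp->binary;
    for my $path (@paths) {
        (my $local = $path) =~ s{/}{_}g;         # flatten the path into a local file name
        $ftp->get($path, $local)
            or warn "Could not get $path: ", $ftp->message;
    }
    $ftp->quit;
}

# Hypothetical usage:
#   open my $fh, '<', 'master.idx' or die "master.idx: $!";
#   my @lines = <$fh>;
#   my @paths = select_filings(\@lines, 51143, '4', '8-K');
#   fetch_filings('ftp.sec.gov', 'you@example.com', @paths);
```

Keeping the selection step separate from the download step means you can inspect the list of paths before fetching anything.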

Re: Making an array from a downloaded web page
by kyle (Abbot) on Jan 14, 2007 at 09:03 UTC
      And HTML::TableExtract in between (have it parse $mech->content) should make getting the content into tables pretty easy.
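
For the "for dummies" request, here is a minimal sketch of that combination, assuming $mech is a WWW::Mechanize object. It has not been run against the SEC site. The sub names, the page cap and the column headers in the usage comment are hypothetical, and it assumes the next-page control is an ordinary link whose text contains "next". If it is really a form button, find_link will not see it and the loop stops after the first page.

```perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TableExtract;

# Return the rows (array refs of cell text) of every table in $html whose
# header row contains all of the given column headers.
sub table_rows {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new(headers => \@headers);
    $te->parse($html);
    $te->eof;
    my @rows;
    for my $table ($te->tables) {
        push @rows, $table->rows;    # header row itself is not included by default
    }
    return @rows;
}

# Fetch $url, collect the table rows, then keep following a "next" link
# until none is found or $max_pages pages have been read.
sub scrape_all_pages {
    my ($url, $max_pages, @headers) = @_;
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on HTTP errors
    $mech->get($url);
    my @rows;
    for my $page (1 .. $max_pages) {
        push @rows, table_rows($mech->content, @headers);
        # ASSUMPTION: the next-page control is a plain link, not a button.
        my $next = $mech->find_link(text_regex => qr/\bnext\b/i)
            or last;
        $mech->get($next->url_abs);
    }
    return @rows;
}

# Hypothetical usage (check the real column headers on the page first):
#   my @rows = scrape_all_pages($url, 50, 'Description', 'Filing Date');
#   print join("\t", map { defined $_ ? $_ : '' } @$_), "\n" for @rows;
```

The page cap is only a guard against looping forever if the last page still carries a link that matches the pattern.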
Re: Making an array from a downloaded web page
by initself (Monk) on Jan 15, 2007 at 05:39 UTC
Re: Making an array from a downloaded web page
by KevKev (Acolyte) on Jan 15, 2007 at 15:10 UTC
    Also try increasing your count per page to make for fewer pages to fetch/parse. 100 items per page seems to be their maximum. http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000051143&owner=include&count=100
      I planned to go to 100, actually. It will be more efficient. But I will still need to follow that next link to get more data.

Node Type: perlquestion [id://594602]
Approved by ikegami
Front-paged by moklevat