PerlMonks

Making an array from a downloaded web page

by malomar66 (Acolyte)

on Jan 14, 2007 at 04:03 UTC (#594602)
malomar66 has asked for the wisdom of the Perl Monks concerning the following question:

Everyone was so helpful with my last question that I'm hoping you can help me with a more complicated task. There is a government site called EDGAR that hosts accounting statements. I want to automate the process of visiting those pages and downloading all the pages for a firm. Perl seems ideally suited for the task. Could anyone point me to existing code (or help me write code) that could:

1. Get the table from a page like this one into an array.
http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000051143&owner=include&count=40

2. Follow the button at the bottom of the page to the next page (and any further "next" pages) and add those tables to the array as well.

Thanks, Anthony

Re: Making an array from a downloaded web page
by moklevat (Priest) on Jan 14, 2007 at 04:55 UTC
    Hi Anthony, Is there some reason you must use the web interface? The SEC provides an anonymous ftp interface specifically for downloading filings in bulk.
      The reason for not going to ftp is that, best I can tell, I can only download the data based on what was uploaded on a given day. Everything for every day would have to be filtered through and deleted to get just the files I need (well under 5% of the universe). I don't have the hardware for it.

      Using the website, it seemed like I could put various firm "CIK" codes into a data file and then access and parse the table information. Then I would download only the firm-form combinations I need, possibly renaming the files in the process to make the names more descriptive.

      For those giving me the great links to things that could help, is there any way I could trouble you for some "for dummies" level sample code or even for the site in question? Otherwise I suspect I'm just going to be staring at cpan.org with a very confused look on my face for a very long time to come (it's happened before).

        I am not an expert on SEC filings or the Edgar database, but after a few minutes of poking around the README file and docs directory it looks like everything would still be simpler with ftp unless you are a screen-scraping wizard.

        As I have just learned, each company that files with the SEC has a CIK number. In your example, the CIK number for IBM is 51143. All of IBM's filings live in the directory ftp://ftp.sec.gov/edgar/data/51143. From the explanation in the README it seems that prior to Edgar 7.0 (starting in the year 2000) all of each company's filings were stored in one directory, but because of problems with overwriting documents when amendments were submitted, everything is now stored in subdirectories based on the accession number of the documents. This might be why it appears that information is per-day. Fortunately, it looks like the SEC provides an index of filings for each quarter, by company name or by type of filing, so you don't have to slog through every subdirectory to find the information you need. The only benefit I see from the http interface in your example is that your search focused on filings related to change of ownership. However, I gather that these are related to a known subset of forms (4, 8-K) and this information is available in the index, so you could subset the information yourself.

        So, if it were me, I would probably move forward in two stages using Net::FTP: 1) grab the indices and subset the records I need, based on company and filing type, to create a list of the files I want, and then 2) grab those files with Net::FTP again.
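        The two stages described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested recipe for the SEC server: it assumes a quarterly index file whose records are pipe-delimited as "CIK|Company Name|Form Type|Date Filed|Filename", and the sub names filter_index_lines and fetch_filings, the index path argument, and the e-mail address used as the anonymous password are all made up for the example. Only the host name and the CIK come from the thread.

```perl
use strict;
use warnings;
use Net::FTP;

# Stage 1 helper: keep only the index records for one CIK and a set of forms.
# ASSUMPTION: each record is "CIK|Company Name|Form Type|Date Filed|Filename".
sub filter_index_lines {
    my ($cik, $forms, @lines) = @_;
    my @wanted;
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                      # strip the line ending
        my @f = split /\|/, $line;
        next unless @f == 5 && $f[0] =~ /^\d+$/;   # skip headers and rules
        next unless $f[0] == $cik && $forms->{ $f[2] };
        push @wanted, { form => $f[2], date => $f[3], path => $f[4] };
    }
    return @wanted;
}

# Stages 1 and 2 against a live server. Nothing here runs until it is called.
# $index_path is whatever index file the README points you to (hypothetical).
sub fetch_filings {
    my ($cik, $forms, $index_path) = @_;
    my $ftp = Net::FTP->new('ftp.sec.gov', Passive => 1)
        or die "Cannot connect: $@";
    $ftp->login('anonymous', 'you@example.com')
        or die 'Login failed: ', $ftp->message;
    $ftp->binary;

    # Stage 1: download the index and pick out the records we care about.
    $ftp->get($index_path, 'index.tmp')
        or die 'Index download failed: ', $ftp->message;
    open my $in, '<', 'index.tmp' or die "index.tmp: $!";
    my @wanted = filter_index_lines($cik, $forms, <$in>);
    close $in;

    # Stage 2: fetch each filing, renaming it to something more descriptive.
    for my $rec (@wanted) {
        (my $base = $rec->{path}) =~ s{.*/}{};          # file name only
        (my $form = $rec->{form}) =~ s{[^\w-]}{_}g;     # e.g. "10-K/A" is not a safe name
        $ftp->get($rec->{path}, "${form}_$rec->{date}_$base")
            or warn "Skipping $rec->{path}: ", $ftp->message;
    }
    $ftp->quit;
    return @wanted;
}
```

        A call might look like fetch_filings(51143, { '4' => 1, '8-K' => 1 }, $index_path), with $index_path set to the quarterly index you want. Keeping the filtering in its own sub means it can be checked against a few sample lines without touching the network.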

Re: Making an array from a downloaded web page
by kyle (Abbot) on Jan 14, 2007 at 09:03 UTC
      And HTML::TableExtract in between (have it parse $mech->content) should make getting the content into tables pretty easy.
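      For the "for dummies" request, here is a minimal sketch of that combination: WWW::Mechanize fetches the page and HTML::TableExtract pulls the rows out of $mech->content. The column headers 'Form' and 'Filing Date' are guesses at what the listing page uses, and the sub names rows_from_html and fetch_table_rows are made up for the example, so check the headers against the real page before relying on this.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Pull the rows of every table whose header row contains the given headers.
# Returns a list of array refs, one per row, with cells in header order.
sub rows_from_html {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new(headers => \@headers);
    $te->parse($html);
    my @rows;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            # Empty cells come back undefined; trim stray whitespace too.
            my @cells = map { defined $_ ? $_ : '' } @$row;
            s/^\s+|\s+$//g for @cells;
            push @rows, \@cells;
        }
    }
    return @rows;
}

# Fetch one page and return its table rows. Needs network access.
sub fetch_table_rows {
    my ($url, @headers) = @_;
    require WWW::Mechanize;                       # loaded only when fetching
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);
    return rows_from_html($mech->content, @headers);
}
```

      With that in place, something like my @filings = fetch_table_rows($url, 'Form', 'Filing Date') would give an array of rows for the URL in the original question, assuming those header names match.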
Re: Making an array from a downloaded web page
by initself (Monk) on Jan 15, 2007 at 05:39 UTC
Re: Making an array from a downloaded web page
by KevKev (Acolyte) on Jan 15, 2007 at 15:10 UTC
    Also try increasing your count per page to make for fewer pages to fetch/parse. 100 items per page seems to be their maximum. http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000051143&owner=include&count=100
      I planned to go to 100, actually. It will be more efficient. But I will still need to follow that next link to get the rest of the data.
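      One way to handle the remaining pages without clicking the button is to request each page directly and stop when a page comes back short or empty. The sketch below assumes the server accepts a "start" offset next to "count" in the query string; that parameter is a guess and should be confirmed by looking at where the next button actually points. The sub names page_url and collect_all_rows are made up for the example, and the page fetcher is passed in as a code ref so the loop itself does not depend on any particular module.

```perl
use strict;
use warnings;

# Build the listing URL for one page of results.
# ASSUMPTION: "start" is the offset of the first row on the page.
sub page_url {
    my ($cik, $count, $start) = @_;
    return sprintf 'http://www.sec.gov/cgi-bin/browse-edgar'
        . '?action=getcompany&CIK=%010d&owner=include&count=%d&start=%d',
        $cik, $count, $start;
}

# Walk the pages in order and gather every row into one array.
# $fetch_rows is a caller-supplied code ref: URL in, list of rows out.
sub collect_all_rows {
    my ($cik, $count, $fetch_rows, $max_pages) = @_;
    $max_pages ||= 50;                 # safety cap so a bad guess cannot loop forever
    my @all;
    for my $page (0 .. $max_pages - 1) {
        my @rows = $fetch_rows->(page_url($cik, $count, $page * $count));
        last unless @rows;             # empty page: nothing left
        push @all, @rows;
        last if @rows < $count;        # short page: this was the last one
    }
    return @all;
}
```

      The code ref would be whatever fetches a URL and parses its table, for example a WWW::Mechanize plus HTML::TableExtract routine as suggested earlier in the thread. Adding a short sleep between requests inside that routine is a courtesy to the server.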
