in reply to Scraping HTML: orthodoxy and reality
In the meantime: when I open the linked page in my IE browser, select everything with Ctrl-A, copy it with Ctrl-C, and paste it into Notepad, the resulting text is quite comprehensible to my HTML-untrained eye (unlike the raw HTML), e.g.
impse400 (I3C) / 172.17.8.182
hp color LaserJet 4600
Information
<snip much miscellaneous info>
For highest print quality always use genuine Hewlett-Packard supplies.
+ BLACK CARTRIDGE
HP Part Number: HP C9720A
73%
Estimated Pages Remaining: 11025
(Based on historical black page coverage of 2%)
Low Reached: NO
Serial Number: 35860
Pages printed with this supply: 4078
TRANSFER KIT
HP Part Number: HP C9724A
87%
Estimated Pages Remaining: 103856
Etc.
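To make the "hooks" concrete, here is a rough sketch of the kind of regex pass this text allows. It is in Python purely for illustration; the pattern and the names `SUPPLY_RE` and `parse_supplies` are my own assumptions, and it has only been checked against the fragment quoted above, not the full page.

```python
import re

# Fragment of the pasted text quoted above; whitespace on the real page may differ.
SAMPLE = """\
+ BLACK CARTRIDGE
HP Part Number: HP C9720A
73%
Estimated Pages Remaining: 11025
(Based on historical black page coverage of 2%)
Low Reached: NO
Serial Number: 35860
Pages printed with this supply: 4078
TRANSFER KIT
HP Part Number: HP C9724A
87%
Estimated Pages Remaining: 103856
"""

# Assumed layout: an all-caps supply name, then the part number,
# the percentage remaining, and the estimated page count.
# \s+ is used between fields so the pattern works whether the text
# arrives on separate lines or flattened onto one line.
SUPPLY_RE = re.compile(
    r"([A-Z][A-Z ]+?)\s+"
    r"HP Part Number:\s+HP (\S+)\s+"
    r"(\d+)%\s+"
    r"Estimated Pages Remaining:\s+(\d+)"
)


def parse_supplies(text):
    """Return one dict per supply block found in the pasted text."""
    return [
        {
            "name": m.group(1),
            "part": m.group(2),
            "percent": int(m.group(3)),
            "pages_remaining": int(m.group(4)),
        }
        for m in SUPPLY_RE.finditer(text)
    ]


for supply in parse_supplies(SAMPLE):
    print(supply)
```

The pattern leans entirely on the fixed labels ("HP Part Number:", "Estimated Pages Remaining:"), so it will need adjusting if other supplies on the page are laid out differently.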
With my regex sledgehammer it would be straightforward to process this data. Oftentimes, when I look at the "pure text" version of a web page, there aren't nearly as many nice hooks for sorting things out. But this is THIS case, and my question is: is there a tool that emulates this select/copy/paste of a web page, so that producing such text for follow-on regex processing can be automated?
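For what it's worth, the select/copy/paste step can be approximated by stripping the tags and keeping only the visible text. Text-mode browsers can dump a rendered page this way (for example `lynx -dump`), and the same idea fits in a few lines of code. Below is a minimal sketch using Python's standard `html.parser` module. The class `TextExtractor`, the helper `html_to_text`, the tag sets, and the sample HTML are all hypothetical illustrations; the real printer page's markup is not shown in this thread, and a browser's copy/paste may lay the text out differently.

```python
import re
from html.parser import HTMLParser

# Assumed tag groupings for this sketch; a real page may need more.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "table", "h1", "h2", "h3"}
CELL_TAGS = {"td", "th"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, breaking lines at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in CELL_TAGS:
            self.parts.append(" ")  # keep adjacent table cells apart

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the page's visible text, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Invented markup standing in for the printer's status page.
SAMPLE_HTML = """
<html><head><style>td { font-family: arial; }</style></head>
<body>
<table>
<tr><td><b>BLACK CARTRIDGE</b></td><td>73%</td></tr>
<tr><td>Estimated Pages Remaining:</td><td>11025</td></tr>
</table>
</body></html>
"""

text = html_to_text(SAMPLE_HTML)
print(text)

# The plain text can then go straight to the regex stage.
match = re.search(r"Estimated Pages Remaining:\s*(\d+)", text)
pages_remaining = int(match.group(1)) if match else None
print(pages_remaining)
```

Fetching the page itself (for instance with `urllib.request.urlopen`) is left out so the sketch runs without network access.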
Re: Re: Scraping HTML: orthodoxy and reality
by BrowserUk (Pope) on Jul 09, 2003 at 03:35 UTC
by chanio (Priest) on Jul 09, 2003 at 07:17 UTC