Scraping HTML: orthodoxy and reality

by grinder (Bishop)
on Jul 08, 2003 at 06:51 UTC ( [id://272213] = perlmeditation )

Every so often, someone asks a question about extracting HTML. They might have a piece of text like...

<p>This is a paragraph</p> <p>And this is <i>another</i> paragraph</p>

... and they wonder why they get such strange results when they try to attack it with something like /<p>(.*)<\/p>/. The standard Perlmonks reply is to point people to HTML::Parser, HTML::TableExtract or even HTML::TreeBuilder, depending on the context. The main thrust of the argument is usually that regexps lead to fragile code. This is what I term The Correct Answer, but alas, In Real Life, things are never so simple, as a recent experience just showed me. You can, with minor care and effort, get perfect results with regexps, with much better performance.

I used HTML::Parser in the past to build the Perlmonk Snippets Index -- the main reason is that I wanted to walk down all the pages, in case an old node was reaped. (I think there's still a dup or two in there). From that I learnt that it's a real bear to ferry information from one callback to another. I hacked it by using global variables to keep track of state. Later on, someone else told me that The Right Way to use H::P is to subclass it, and extend the internal hash object to track your state that way. Fair enough, but this approach, while theoretically correct, is a non-trivial undertaking for a casual user who just wants to chop up some HTML.
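To make that concrete, here is a minimal sketch of the subclassing approach (not the Snippets Index code: the package name and the title-grabbing task are invented for illustration). It leans on HTML::Parser's version-2 compatible interface, where an object built with no handler arguments calls start/text/end methods on itself, and state rides along in the object hash instead of in globals:

package Snippet::Scraper;     # hypothetical name, for illustration only
use strict;
use base 'HTML::Parser';

# Built with no handler arguments, HTML::Parser falls back to its
# version-2 compatible interface and calls these methods for us.
sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'title') {
        $self->{in_title} = 1;
        $self->{title}    = '';
    }
}

sub text {
    my ($self, $text) = @_;
    $self->{title} .= $text if $self->{in_title};
}

sub end {
    my ($self, $tag) = @_;
    $self->{in_title} = 0 if $tag eq 'title';
}

package main;
my $html = '<html><head><title>Hello, monks</title></head><body>...</body></html>';
my $p = Snippet::Scraper->new;
$p->parse($html);
$p->eof;
print "title: $p->{title}\n";     # prints "title: Hello, monks"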

More recently I used HTML::TreeBuilder to parse some HTML output from a webified Domino database. Because of the way the HTML was structured in this particular case, it was a snap to just look_down('_tag', 'foo') and get exactly what I wanted. It was easy to write, and the code was straightforward.
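For those who haven't used it, a look_down() call is about this simple (a minimal sketch with an invented target, not the Domino code):

use strict;
use HTML::TreeBuilder;

# Invented example: grab the text of every <td> with class="name".
my $html = '<table><tr><td class="name">BLACK CARTRIDGE</td></tr></table>';
my $tree = HTML::TreeBuilder->new_from_content($html);

for my $cell ( $tree->look_down('_tag', 'td', 'class', 'name') ) {
    print $cell->as_text, "\n";
}
$tree->delete;    # free the tree (it contains circular references)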

Then last week, I got tired of keeping an eye on our farm of HP 4600 colour printers, to see how their supplies were lasting (they use four cartridges, C, M, Y & K, and two kits, the transfer and fuser). It turns out that this model has an embedded web server (I could have also done it with SNMP, but that's another story). Point your browser at it, and it will produce a status page that shows you how many pages can be printed out on what's left in the consumables.

So I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information are scattered in different elements, and collecting and collating them made for some pretty ugly code.

After I'd managed to wrestle the data I wanted out of the web page, I set about stepping through the code in the debugger, to better understand the data structures and see what shortcuts I could take by way of method chaining and array slicing in an attempt to tidy up the code. To my surprise, I saw that just building the H::TB object (by calling parse() with the HTML in a scalar) required about a second to execute, and this on some fairly recent high-end hardware.

Until then I wasn't really concerned about performance, because I figured the parse time would be dwarfed by the time it took to get the request's results back from the network. In the master plan, however, I intend to use LWP::ParallelAgent to probe all of the printers in parallel, rather than looping through them one at a time, and factor out much of the waiting. In a perfect world it would be as fast as the single slowest printer.
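A rough sketch of what that parallel fetch might look like, assuming LWP::Parallel::UserAgent from the LWP::Parallel distribution and some hypothetical printer hostnames (this is illustrative, not the code I intend to ship):

use strict;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Hypothetical hostnames; the URL is the status page used elsewhere in this node.
my @urls = map { "http://$_/hp/device/this.LCDispatcher?dispatch=html&cat=0&pos=2" }
           qw( printer1 printer2 printer3 );

my $pua = LWP::Parallel::UserAgent->new;
$pua->register( HTTP::Request->new( GET => $_ ) ) for @urls;

# wait() blocks until every request has finished (or the timeout expires)
# and returns a hash of entry objects.
my $entries = $pua->wait(30);

for my $entry ( values %$entries ) {
    my $res = $entry->response;
    next unless $res && $res->is_success;
    # ... hand $res->content to the extraction code here
}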

Given the less than stellar performance of the code at this point, however, it was clear that the cost of parsing the HTML would consume the bulk of the overall run-time. Maybe I could traverse a partially-fetched page, but at that point the architecture would start to become unwieldy. Madness.

So, after having tried the orthodox approach, I started again and broke the rule about parsing HTML with regexps and wrote the following:

my (@s) = m{
    >                   # close of previous tag
    ([^<]+)             # text (name of part, e.g. q/BLACK CARTRIDGE/)
    <br>
    ([^<]+)             # part number (e.g. q/HP Part Number: HP C9724A/)
    (?:<[^>]+>\s*){4}   # separated by 4 tags
    (\d+)               # percent remaining
  |                     # --or--
    (?:                 # different text values
        (?:
            Pages\sRemaining
          | Low\sReached
          | Serial\sNumber
          | Pages\sprinted\swith\sthis\ssupply
        )
        :
        (?:\s*<[^>]+>){6}\s*    # colon, separated by 6 tags
                                # or just this, within the current element
      | Based\son\shistorical\s\S+\spage\scoverage\sof\s
    )
    (\w+)               # and the value we want
}gx;

A single regexp (albeit with a /g modifier) pulls out all I want. Actually it's not quite perfect, since the resulting array also fills up with a pile of undefs: the unfilled parens on the opposite side of the | alternation to the match. But even so, there's a pattern to the offsets into the array where the defined values are found, so a few magic numbers wrapped up as constants tidy that up.
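Something along these lines, where the constant names and exact offsets are illustrative assumptions rather than the final code (@s is the array produced by the match above; each /g match pushes one slot per capture group, filled or undef, so the interesting values land at fixed offsets):

# Illustrative only: names and offsets are assumptions about this page.
use constant {
    BLACK_NAME    => 0,    # first branch, captures 1..3
    BLACK_PART    => 1,
    BLACK_PERCENT => 2,
    BLACK_PAGES   => 7,    # next match, second branch's lone capture
};

my %black;
@black{qw( name part percent pages_left )} =
    @s[ BLACK_NAME, BLACK_PART, BLACK_PERCENT, BLACK_PAGES ];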

Is the code fragile? Not really. For a start, the HTML has errors in it, such as <td valign= op"> (although H::TB coped just fine with this too).

The generated HTML is stored in the printer's onboard firmware, so unless I upgrade the BIOS, the HTML isn't going to change; it's written in stone, bugs and all. Here's the main point: when the HP 4650 or 4700 model is released, it will probably have completely different HTML anyway, perhaps with style sheets instead of tables. Either way, the HTML will have to be inspected anew, in order to tweak the regexp, or to pull something else out of TreeBuilder's parse tree.

Thus neither approach is maintenance-free. But the regexp approach is far less code, and 17 times faster. Now the extraction cost is negligible compared to the page fetch, as it should be. And as a final bonus, the regexp approach requires no non-core modules. Case closed.

I intend to make this available as a module on CPAN, so about the best I can do will be to put a BUGS section in the POD, asking for contributions of HTML page dumps from newer models, and I'll patch it to take account of the variations.

Now all I need is a suggestion of how to name the module...

_____________________________________________
Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

Replies are listed 'Best First'.
Re: Scraping HTML: orthodoxy and reality
by BrowserUk (Patriarch) on Jul 08, 2003 at 11:13 UTC

    Amen grinder++. I wholeheartedly agree.

    I came to much the same conclusion in Being a heretic and going against the party line, after having only been using perl for a relatively short time. My experiences since have done little to change my mind.

    Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.

    The dictionary definition of parse is

    1. To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
    2. To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
      1. To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
      2. To make sense of; comprehend: I simply couldn't parse what you just said.

    Whilst the dictionary definition of extract is:

    1. To draw or pull out, often with great force or effort: <cite>extract a wisdom tooth; used tweezers to extract the splinter.</cite>
    2. To obtain despite resistance: <cite>extract a promise.</cite>
    3. To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
    4. To remove for separate consideration or publication; excerpt.
      1. To derive or obtain (information, for example) from a source.
      2. To deduce (a principle or doctrine); construe (a meaning).
      3. To derive (pleasure or comfort) from an experience.
    5. Mathematics. To determine or calculate (the root of a number).

    From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction is more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse; otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.

    My final, and strongest, argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then no one would expect me to find and use a module that understood those languages, just to extract the bits I needed.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      Ironically the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.

      Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well until Perl 6 that is...) Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text is going to give much more robust solutions. For instance you avoid different kinds of data being mistaken for each other.

      The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing problem with regular expressions. Then they fail (and may or may not realize this). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be, rather than regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels - if only to avoid the problem of all problems looking like nails for the one hammer (regexps) that you have.

      However the two kinds of problems are different and do overlap. Where they do overlap, it isn't necessarily obvious which is more practical. It isn't even necessarily obvious from the problem specification - sometimes you need to make a guess about how the code will evolve to know that...

        Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it.

        Parsing typically has two phases though: the first is tokenization and the second parse-tree generation (I'm sure there is a better term but I forget what it is). These phases more often than not occur in sync, but they need not. Either way, regexes are perfectly suited to tokenization.

        I learned the most about regexes from writing a regex tokenizer and parser. I learned a lot more from the tokenizer than from the parser tho. :-) Writing regexes to tokenize regexes is a fun head trip. (Incidentally the whole idea was to be able to use regexes to specify and generate random test data.)
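        A toy illustration of regex-driven tokenization (an arithmetic tokenizer invented for this example, not the regex tokenizer mentioned above): each token class is one alternative, and \G plus /gc walks the string token by token.

use strict;

my $input  = '3 + 4*(10 - 2)';
my @tokens;

# Match one token per iteration, anchored where the last match ended.
while ( $input =~ m{ \G \s* (?: (\d+) | ([-+*/()]) ) }gcx ) {
    push @tokens, defined $1 ? [ NUMBER => $1 ] : [ OP => $2 ];
}

print "$_->[0]\t$_->[1]\n" for @tokens;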


        ---
        demerphq

        <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
Re: Scraping HTML: orthodoxy and reality
by PodMaster (Abbot) on Jul 08, 2003 at 08:08 UTC
    After seeing HP200LX:: on CPAN, I suggest you stick it in HP::4600::Status(Scrape)? (or something like HP::Printer::4600, whatever somewhat corresponds to the HP naming convention ;) and suggest to the author of HP200LX:: that he rename his to HP::200:: yada yada.

    As for your notes on html scraping reality, check out YAPE::HTML; it's regex-based.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Scraping HTML: orthodoxy and reality
by gjb (Vicar) on Jul 08, 2003 at 13:37 UTC

    Although I agree that choosing a regexp approach or a context-free grammar approach depends on the problem at hand, I'd like to stress that halley made a very important point:

    Rules are meant to be broken, but you have to understand them before you can break them... safely.

    Although a lot of Monks will know the distinction between a regular language and a context-free language (and I'm sure grinder and BrowserUK do), I'm rather sure that some don't. In the latter case, unfortunately, those Monks simply don't know the rules and have lots of opportunity to mess up.

    I'd like to paraphrase: "a little thinking is a dangerous thing" if the process is not supported by a proper amount of background knowledge.

    It is possible to approximate a context-free grammar with a regular expression; a nice survey article about that has been written by Mark-Jan Nederhof. There are several good books about formal languages, but I'd particularly recommend Sipser's, since it is well written and nice to read.
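    As a toy illustration of that approximation idea (invented here, not taken from the survey): a regular expression can only handle nesting up to a fixed depth, so you approximate the grammar by choosing a maximum depth, e.g. balanced parentheses up to two levels deep.

use strict;

# Depth-bounded approximation of the balanced-parentheses grammar.
my $depth1 = qr/ \( [^()]* \) /x;
my $depth2 = qr/ \( (?: [^()]+ | $depth1 )* \) /x;

for my $s ( '(a (b) c)', '(a (b (c)) d)' ) {
    printf "%-15s %s\n", $s, $s =~ /^$depth2$/ ? 'matches' : 'too deep';
}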

    Conclusion: even if you know the rules, but don't understand them, don't try and break them. More importantly: try and understand the rules you're following.

    Just my 2 cents, -gjb-

Re: Scraping HTML: orthodoxy and reality
by halley (Prior) on Jul 08, 2003 at 13:03 UTC
    I completely agree with the dose of pragmatism, and thanks for showing both your thought process and the perl code. (<AOL>me too!</AOL>) Two quotes which come to mind to shape the discussion of 'orthodoxy and reality':
    • Rules are meant to be broken, but you have to know the rules before you can break them. -- anon
    • "Orthodoxy means not having to think." -- George Orwell

    --
    [ e d @ h a l l e y . c c ]

Re: Scraping HTML: orthodoxy and reality
by mojotoad (Monsignor) on Jul 08, 2003 at 15:22 UTC
    I am of course biased, but...

    So I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information are scattered in different elements, and collecting and collating them made for some pretty ugly code.

    From a logical standpoint, HTML::TableExtract seems to be a perfect choice for this. It might not be a perfect choice for efficiency, for which you seem to have a requirement (though I'm not sure what total run time you were shooting for...how often is this tool going to be run?)

    Could you give an example HTML page and some numbers such as how many of them you are expected to handle and how often? For purposes of discussion, let's say your parallel fetch more or less delivers all of the pages simultaneously.

    (Despite my bias, I am not automatically anti-regexp parsing and I see both sides of that particular scuffle)

    Matt

      Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/.

      There are 6 printers today, and we'll probably be adding another 4 or so in the future.

      As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.

      I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).

      I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.

      update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.

      This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!

#! /usr/bin/perl -w

use strict;
use LWP::UserAgent;

my @cartridge = qw/ name part percent remain coverage low serial printed /;
my @kit       = qw/ name part percent remain /;

for my $host( @ARGV ) {
    my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=html&cat=0&pos=2};
    my $response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ));
    if( !$response->is_success ) {
        warn "$host: couldn't get $url: ", $response->status_line, "\n";
        next;
    }
    $_ = $response->content;
    my (@s) = grep { defined $_ } m{
        (?:
            >       # closing tag
            ([^<]+) # text (name of part, e.g. q/BLACK CARTRIDGE/)
            <br>
            ([^<]+) # part number (e.g. q/HP Part Number: HP C9724A/)
            </font>\s+</td>\s*<td[^>]+><font[^>]+>
            (\d+)   # percent remaining
        )
        |
        (?:
            (?:
                (?:
                    Pages\sRemaining    # different text values
                  | Low\sReached
                  | Serial\sNumber
                  | Pages\sprinted\swith\sthis\ssupply
                )
                : \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font[^>]*>\s*   # separated by this
              | Based\son\shistorical\s\S+\spage\scoverage\sof\s                # or just this, within a <td>
            )
            (\w+)   # and the value we want
        )
    }gx;

    my %res;
    @{$res{K}}{@cartridge} = @s[ 0.. 7];
    @{$res{X}}{@kit}       = @s[ 8..11];
    @{$res{C}}{@cartridge} = @s[12..19];
    @{$res{F}}{@kit}       = @s[20..23];
    @{$res{M}}{@cartridge} = @s[24..31];
    @{$res{Y}}{@cartridge} = @s[32..39];

    print <<END_STATS;
$host
  Xfer $res{X}{percent}% $res{X}{remain}
  Fuse $res{F}{percent}% $res{F}{remain}
  C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} printed=$res{C}{printed}
  M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} printed=$res{M}{printed}
  Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} printed=$res{Y}{printed}
  K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} printed=$res{K}{printed}
END_STATS
}

      _____________________________________________
      Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

        Here's a quick example, just to give you an idea. I apologize for the crufty code.

        This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract because of this reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.

        HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.

        I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.

        Enjoy,
        Matt

#!/usr/bin/perl -w
use strict;

my $depth  = 0;
my $count  = 0;
my $ddepth = 3;

use LWP::Simple;
my $html = get('http://grinder.perlmonk.org/hp4600/');

my %Device;

use Data::Dumper;
use HTML::TableExtract;

my $te = HTML::TableExtract->new;
$te->parse($html);

foreach my $ts ($te->table_states) {
    &process_detail($ts) if ($ts->depth == $ddepth);
    &process_main($ts)   if ($ts->depth == $depth && $ts->count == $count);
}

# Clean up the empty spots
@{$Device{stats}} = grep(defined, @{$Device{stats}});

print Dumper(\%Device);
exit;

sub process_main {
    my $ts = shift;
    my($host, $model) = _scrub(($ts->rows)[1]);
    $Device{host}  = $host;
    $Device{model} = $model;
}

sub process_detail {
    $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_);
}

sub _proc_detail_name {
    my $ts = shift;
    my($name, $part, $pct) = _scrub(($ts->rows)[0]);
    $part =~ s/.*:\s+//;
    $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct };
}

sub _proc_detail_stats {
    my $ts = shift;
    my @stats = map(_scrub($_), $ts->rows);
    my $i = $ts->count - 1;
    @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printed)}
        = (map(_scrub($_), $ts->rows))[1,2,4,6,8];
}

sub _scrub {
    grep(!/^\s*$/s, map(split(/(^M|\n)+/,$_), @{shift()}));
}
Re: Scraping HTML: orthodoxy and reality
by hsmyers (Canon) on Jul 08, 2003 at 13:12 UTC
    In almost any situation, there are those who prefer maxims to thinking---ignore them. With specific regard to the odd web scrape, there are a couple of things that can be a problem.
    • The need to handle nesting.
    • The need to survive arbitrary changes in the source.
    The first can be handled by tight expression bounding or by using something like Text::DelimMatch or Text::Balanced. You manage the second by vigilance---patch it when needed.
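    For instance, a minimal sketch of the Text::Balanced route (an invented example, not hsmyers' code): extract_tagged() returns the text up to the balancing close tag, so a nested tag of the same name doesn't end the match early the way <div>(.*?)</div> would.

use strict;
use Text::Balanced qw( extract_tagged );

my $html = '<div>outer <div>inner</div> still outer</div><p>rest</p>';

# First element is the balanced <div>...</div>, second is what's left.
my ($match, $remainder) = extract_tagged( $html, '<div>', '</div>' );
print "matched:   $match\n";      # <div>outer <div>inner</div> still outer</div>
print "remainder: $remainder\n";  # <p>rest</p>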

    As for those who knee jerk instead of thinking, since many of them have no experience writing parsers (formal or ad hoc), they fail to see that the regex approach is a form of parsing, just without the overhead of dealing with things that don't matter.

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
(jeffa) Re: Scraping HTML: orthodoxy and reality
by jeffa (Bishop) on Jul 08, 2003 at 14:27 UTC
    "Actually it's not quite perfect, since the resulting array also fills up with a pile of undefs"

    Couldn't you just filter the result through grep first?

    my (@s) = grep $_, m{ big regex here }gx;
    Sure it would be better to not have undefs in there in the first place, and you said you know where they will be, but why worry when grep is at your disposal. Just a thought, and thanks for the post. :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Scraping HTML: orthodoxy and reality
by chunlou (Curate) on Jul 08, 2003 at 19:17 UTC

    "Parse" vs "extract" or "regular language" vs "context free," etc. are indeed important distinctions to be made, as pointed out by some monks. Parsing data is a (more or less) mechanical process; extracting info is a human (A.I.) process.

    Suppose you want to extract info by paragraph. Consider the following text fragment:

    ________________________________________

    Look at the table below...

    ho ho ho...



    Could you behold the secret this unfolds?

    A bit more, a bit more, irrelevant thought, a new paragraph...

    ________________________________________

    You might see either two or three paragraphs (if you consider "Look... unfolds?" as one paragraph). Now, let's look at the html of the above text fragment:
    <p>Look at the table below...
    <table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
    Could you behold the secret this unfold?<br><br>
    A bit more, a bit more, irrelevant thought, a new paragraph...</p>

    A parser might only see one paragraph between the <p> and </p> tags. There is a <p></p> pair in the table. Is it a paragraph? A parser might ask.

    Suppose the parser takes into consideration that some people use <br><br> to denote the end of a paragraph. "Look..." and "Could..." might be considered two paragraphs. What about "A bit..."? Or are "Look..." and the table two paragraphs?

    Humans can read semantically; machines mostly syntactically. That's why extracting info is not the same problem as parsing data.

      I applaud your attempt to clarify, though I don't necessarily agree with your example. For me, the difference between parsing something and extracting something is that when I parse something, I am interested in the whole something. When I extract something, I am only interested in a small part of the whole.

      That's basically why using an HTML parser when all I want is an extraction, is so wasteful. I go to all the trouble of analysing (sometimes validating) and capturing the structure of the HTML only to throw all that effort away as soon as I have captured the bit I want.

      It's a bit like carefully dismantling, labelling and boxing an entire stately home, brick by brick, when the only part of value is the Adam's fireplace in the drawing room. If the rest of the place is simply going to be discarded, there really isn't any point in doing the extra work unless some conservation or restoration is intended. In code terms, that means I am going to modify, reconstruct or otherwise utilise the structure I've spent the effort capturing. In a large majority of screen-scraping tasks, the structure is simply discarded.

      The argument for using a parser is that semantic knowledge gained from the structure is useful when used to locate the bits of data required, and that using the structural clues is more reliable (less brittle) when the page is updated than a dumb match by regex. The problem I have found is that when the page changes, the structure is just as likely to change in ways that require the script to be re-worked as is a match-by-context regex.

      For every example one way, there is a counter-example the other, and I would never criticise anyone who chooses to use a parser for this purpose. I just wish that they would give my (and others') informed and cognisant choice to do otherwise the same respect.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


        What you said was logically sound. Apparently, "parse" and "extract" mean something very specific to you, whereas I merely used them in a loosey-goosey way.

        But I do make distinction between "data" and "info." This is data: "1I2N! 2r2U 1E! 1R 1f"; this is info: "FIRE! RUN!".
Re: Scraping HTML: orthodoxy and reality
by ff (Hermit) on Jul 09, 2003 at 03:18 UTC
    Gosh, I wish I even knew the difference between those HTML:: modules and how to put them to work! Given the examples in this thread, soon I'll have more than a hammer to do my scraping. :-)

    In the meantime: when I go to the web page from the link via my IE browser and do a Ctl-A and Ctl-C and then paste the text into a Notepad screen, this particular output is quite comprehensible to my HTML-untrained eye (vs the HTML stuff), e.g.

    impse400 (I3C) / 172.17.8.182
    hp color LaserJet 4600
    Information
    <snip much miscellaneous info>
    For highest print quality always use genuine Hewlett-Packard supplies.

    BLACK CARTRIDGE
    HP Part Number: HP C9720A
    73%
    Estimated Pages Remaining: 11025
    (Based on historical black page coverage of 2%)
    Low Reached: NO
    Serial Number: 35860
    Pages printed with this supply: 4078

    TRANSFER KIT
    HP Part Number: HP C9724A
    87%
    Estimated Pages Remaining: 103856

    Etc.

    With my regex sledgehammer it would be straightforward to process this data. Oftentimes, when I look at the "pure text" version of a web page there aren't nearly as many nice hooks for sorting things out. But this is THIS case, and my question is: might there be a tool which emulates this action of select/copy/paste of a web page to automate the production of such text for follow-on regex processing?

      You could probably automate the C&P from your favorite browser (under Win32 at least) or use one of the console browsers (Lynx etc.) under *nix.

      The question is, what would you have achieved? Not only would you have used a parser (the one built into the browser), but you would also have used its rendering engine, spawned a new process and gone through some form of IPC, whether it's a pipe or the clipboard. And you would still need to apply a regex to the result.

      If you're going to use a parser, then you might as well use one of the many available to you via CPAN and avoid all that additional overhead :)


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


        For sure, this approach is expensive cpu-wise, etc., but if I need a solution that works right away then "module fetches/renders HTML into text", combined with regex processing that at least I know how to do, IS a solution. Sure, per RBFuller, "... if the solution is not beautiful, I know it is wrong" but if those cycles won't be used for anything else, who cares? This bear of little brain would have his program done.

        So, assuming that efficiency doesn't matter, I'm still fishing for something like building the $html object via an LWP 'get' as above and then turning it into text that I can examine with regexen. (However, since this is turning a golden object into lead, I'll do some more digging as you suggest, like re-reading this thread's Data::Dumper/HTML::TableExtract example! :-)
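        Something along these lines might be the "render to text, then regex" tool being fished for; a minimal sketch assuming the HTML::FormatText and HTML::TreeBuilder modules from CPAN:

use strict;
use LWP::Simple qw( get );
use HTML::TreeBuilder;
use HTML::FormatText;

# Fetch the page, parse it into a tree, then render the tree as plain
# text, much like a browser's select-all / copy would give you.
my $html = get('http://grinder.perlmonk.org/hp4600/')
    or die "couldn't fetch the page\n";

my $tree      = HTML::TreeBuilder->new_from_content($html);
my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 );
my $text      = $formatter->format($tree);
$tree->delete;

# $text is now ordinary text, ready for the regex sledgehammer
print $text;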

        Well, /s?he/ had the right idea. One wonders why HP can't just produce simpler HTML, or even provide a port for text output... ok, maybe that's a little silly, but all of this to get a few numbers.
        
        <grumble>Isn't the real problem here the obsession that some designer at HP has with producing beautiful output, good HTML practice be damned? Why do Dreamweaver jockeys have to make my life hard!!! AHHHHH!</grumble>
        
        

        10POKE53280,A:POKE53281,A
        20?"C64 RULES ";
        30A=A+1:IFA=16THENA=0:GOTO10
Re: Scraping HTML: orthodoxy and reality
by John M. Dlugosz (Monsignor) on Jul 10, 2003 at 18:21 UTC
    OK, you convinced me to use regex instead of a parser for my program. This avoids the problem of re-formatting the parse tree to resemble the original input (I can modify the found lines in-place easily), and I can live with "parser" limitations and simply not write goofy stuff in my HTML.

    —John
