PerlMonks  

Re: Re: Scraping HTML: orthodoxy and reality

by tilly (Archbishop)
on Jul 08, 2003 at 16:25 UTC (#272336)


in reply to Re: Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Ironically the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.

Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well, until Perl 6 that is...) Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text it is going to give much more robust solutions. For instance, you avoid different kinds of data being mistaken for each other.
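To make "different kinds of data being mistaken for each other" concrete, here is a minimal sketch in Python (chosen only for illustration; the sample HTML and the names `links_by_regex`, `LinkCollector` and `links_by_parser` are made up for this example). The regex reports anything that looks like an `href`, including one inside a comment and one in plain text, while the standard library's `html.parser` only reports attributes of real `<a>` tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one real link, one commented-out link,
# and one piece of plain text that merely looks like an attribute.
HTML = '''<p>See <a href="/real">the docs</a>.</p>
<!-- <a href="/commented-out">old link</a> -->
<pre>Write href="/in-text" inside the tag.</pre>'''


def links_by_regex(text):
    """Pattern matching: finds every href="..." regardless of context."""
    return re.findall(r'href="([^"]*)"', text)


class LinkCollector(HTMLParser):
    """Parsing: only start tags reach handle_starttag, so comments and
    text content are never confused with markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_by_parser(text):
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


print(links_by_regex(HTML))   # ['/real', '/commented-out', '/in-text']
print(links_by_parser(HTML))  # ['/real']
```

The regex could of course be patched to skip comments, but each patch is a small step toward writing a parser by hand.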

The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing problem with regular expressions. Then they fail (and may or may not realize it). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be done that way, rather than with regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels, if only to avoid the problem of every problem looking like a nail for the one hammer (regexps) that you have.

However the two kinds of problems, while different, do overlap. Where they overlap, it isn't necessarily obvious which approach is more practical. It isn't even necessarily obvious from the problem specification: sometimes you need to make a guess about how the code will evolve to know that...


Re: Re: Re: Scraping HTML: orthodoxy and reality
by demerphq (Chancellor) on Jul 08, 2003 at 17:00 UTC

    Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it.

    Parsing typically has two phases though: the first is tokenization and the second is parse tree generation (I'm sure there is a better term but I forget what it is). These phases more often than not occur in sync, but they need not. Either way, regexes are perfectly suited to tokenization.
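As a rough sketch of the two phases kept separate (in Python, for illustration only; the arithmetic grammar and the names `TOKEN_RE`, `tokenize` and `parse` are assumptions made up for this example, not anything from the thread): a regex does the tokenization, and a small recursive-descent routine, which uses no regexes at all, builds the tree from the token list.

```python
import re

# Phase 1: tokenization. One regex describes every token; leading
# whitespace is skipped. Each token is either a number or an operator.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<OP>[-+*/()]))")


def tokenize(text):
    """Turn a string into a list of (kind, value) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input near position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


# Phase 2: tree building by recursive descent over the token list.
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUM | '(' expr ')'
def parse(tokens):
    """Return a nested tuple (op, left, right), or an int for a leaf."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in (("OP", "+"), ("OP", "-")):
            _, op = advance()
            node = (op, node, term())
        return node

    def term():
        node = factor()
        while peek() in (("OP", "*"), ("OP", "/")):
            _, op = advance()
            node = (op, node, factor())
        return node

    def factor():
        kind, value = peek()
        if kind == "NUM":
            advance()
            return int(value)
        if (kind, value) == ("OP", "("):
            advance()
            node = expr()
            if peek() != ("OP", ")"):
                raise ValueError("expected ')'")
            advance()
            return node
        raise ValueError(f"unexpected token {value!r}")

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


print(parse(tokenize("1 + 2 * (3 - 4)")))
# ('+', 1, ('*', 2, ('-', 3, 4)))
```

The nesting of parentheses is handled entirely by the recursion in `factor`, which is the part a plain regex has no good way to express.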

    I learned the most about regexes from writing a regex tokenizer and parser. I learned a lot more from the tokenizer than from the parser tho. :-) Writing regexes to tokenize regexes is a fun head trip. (Incidentally the whole idea was to be able to use regexes to specify and generate random test data.)


    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
