PerlMonks  

Re: Re: Re: Scraping HTML: orthodoxy and reality

by demerphq (Chancellor)
on Jul 08, 2003 at 17:00 UTC (#272351)


in reply to Re: Re: Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it.

Parsing typically has two phases though: the first is tokenization and the second is parse tree generation (I'm sure there is a better term but I forget what it is). More often than not these phases run in sync, but they need not. Either way, regexes are perfectly suited to tokenization.
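To make the two phases concrete, here is a minimal sketch in Python (chosen only for illustration; the same idea carries over to Perl). The tiny arithmetic language, the token names, and the functions `tokenize` and `parse` are all assumptions made up for this example, not anything from the thread. Regexes do all the work in the first phase; the second phase is plain recursive descent and never touches a regex.

```python
import re

# Hypothetical token set for a tiny arithmetic language (not from the post).
TOKEN_RE = re.compile(r"""
    (?P<NUM>\d+)
  | (?P<OP>[-+*/])
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Phase 1: regexes split the input into (kind, value) tokens."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":          # whitespace is recognised, then dropped
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    """Phase 2: recursive descent turns the token list into nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        kind, value = peek()
        if kind == "NUM":
            advance()
            return int(value)
        if kind == "LPAREN":
            advance()
            node = expr()
            if peek()[0] != "RPAREN":
                raise ValueError("expected ')'")
            advance()
            return node
        raise ValueError(f"unexpected token {value!r}")

    def term():
        node = atom()
        while peek() in (("OP", "*"), ("OP", "/")):
            op = advance()[1]
            node = (op, node, atom())
        return node

    def expr():
        node = term()
        while peek() in (("OP", "+"), ("OP", "-")):
            op = advance()[1]
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree
```

Here the phases run one after the other, but `parse` could just as well pull tokens lazily from `tokenize`, which is the "in sync" arrangement.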

I learned the most about regexes from writing a regex tokenizer and parser. I learned a lot more from the tokenizer than from the parser though. :-) Writing regexes to tokenize regexes is a fun head trip. (Incidentally, the whole idea was to be able to use regexes to specify and generate random test data.)
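For a taste of that head trip, the following is a rough sketch, again in Python and not a reconstruction of the tool described above. It assumes a deliberately tiny regex subset (literals, backslash escapes such as \d and \w, simple bracket classes, and the quantifiers ? * + {m} {m,n}, with no groups, alternation, or anchors). The names `REGEX_TOKEN_RE`, `tokenize_regex`, `random_match`, `ALPHABET`, and `MAX_REPEAT` are hypothetical.

```python
import random
import re
import string

# A regex that tokenizes (a small subset of) regexes. Assumed subset only.
REGEX_TOKEN_RE = re.compile(r"""
    (?P<CLASS>\[(?:[^\]\\]|\\.)+\])
  | (?P<ESCAPE>\\.)
  | (?P<QUANT>[?*+]|\{\d+(?:,\d+)?\})
  | (?P<LITERAL>[^\\\[\]{}()|.^$?*+])
""", re.VERBOSE)

ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "
MAX_REPEAT = 5  # arbitrary cap for the unbounded quantifiers * and +

def tokenize_regex(pattern):
    """Split a pattern into (kind, text) tokens, rejecting unsupported syntax."""
    tokens, pos = [], 0
    while pos < len(pattern):
        m = REGEX_TOKEN_RE.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at {pos}: {pattern[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def quant_bounds(quant):
    """Map a quantifier token to (min, max) repetition counts."""
    if quant == "?":
        return 0, 1
    if quant == "*":
        return 0, MAX_REPEAT
    if quant == "+":
        return 1, MAX_REPEAT
    parts = quant[1:-1].split(",")       # "{m}" or "{m,n}"
    return int(parts[0]), int(parts[-1])

def random_match(pattern, rng=random):
    """Generate a random string intended to match the given pattern."""
    tokens = tokenize_regex(pattern)
    out, i = [], 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "QUANT":
            raise ValueError("quantifier in an unsupported position")
        if kind == "LITERAL":
            pool = [text]
        else:
            # Let the regex engine itself say which characters the atom accepts.
            pool = [c for c in ALPHABET if re.fullmatch(text, c)]
            if not pool:
                raise ValueError(f"no candidate characters for {text!r}")
        lo, hi = 1, 1
        if i + 1 < len(tokens) and tokens[i + 1][0] == "QUANT":
            lo, hi = quant_bounds(tokens[i + 1][1])
            i += 1
        out.append("".join(rng.choice(pool) for _ in range(rng.randint(lo, hi))))
        i += 1
    return "".join(out)
```

A call such as `random_match(r"[A-Z]{2}\d{3}-x+")` should yield strings that the same pattern accepts under `re.fullmatch`, which makes a handy self-check for generated test data.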


---
demerphq

<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

