Re: Search Engine, Web Crawling, Data Mining question

by Cubes (Pilgrim)
on Jul 24, 2001 at 08:35 UTC


in reply to Search Engine, Web Crawling, Data Mining question

I'll warn you now: you're asking a big question here, with an even bigger answer. I'll try to give you a push in the right direction anyway.

To answer your first question, "data mining" works pretty much the way it sounds. Just as mining for gold or diamonds involves sifting through large piles of rocks and saving only the shiny pieces, data mining involves sifting through large stores of information and pulling out the items that are valuable to you. This might mean calculating aggregate statistics, such as the response to a particular type of advertising campaign; or it might mean highlighting individual items, such as selecting potential customers for a new product based on their past purchasing habits.
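To make the sifting idea concrete, here's a toy Perl sketch. The purchase records and the "outdoor gear" criterion are invented for illustration: from a pile of records, keep only the customers whose history suggests they might want a hypothetical new camping product.

    use strict;
    use warnings;

    # Toy data: the pile of rocks to sift through.
    my @purchases = (
        { customer => 'alice', item => 'tent'         },
        { customer => 'bob',   item => 'desk lamp'    },
        { customer => 'carol', item => 'sleeping bag' },
    );

    # The "shiny pieces" we're looking for: outdoor-gear buyers.
    my %outdoor = map { ($_ => 1) } ('tent', 'sleeping bag', 'hiking boots');

    my %prospects = map  { ($_->{customer} => 1) }
                    grep { $outdoor{ $_->{item} } }
                    @purchases;

    print join(', ', sort keys %prospects), "\n";   # alice, carol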

The first decision you have to make is what you are hoping to learn from your mining expedition -- what information will be your nuggets of gold. Next, you have to have a source of raw data. Finally, you have to figure out how you are going to extract the information you want from the data you have available.

It sounds like you've answered the second part already -- you want to use the entire internet as your data source -- and you're ready to move on to part three. Without the first step, though, you're trying to build a skyscraper without a blueprint. Until you can define, clearly and in detail, exactly what you hope to learn from this vast pile of bits and bytes, you won't get very far trying to develop software to do it.

Defining your goal means more than saying "I want to find new customers." It means spelling out exactly what criteria define a potential customer, and *then* figuring out where you can obtain the data you need. If you've decided ahead of time that the internet is your data source, you won't have much luck if your "new customer" criteria turn out to be "names and addresses of households with an annual income above $50,000 and 2 or more children."

OK, so let's assume that you have done all of this thinking ahead of time, and you have a well-defined set of information that you can reasonably expect to retrieve by fishing through random web sites.

Keep in mind that the body of data you are dealing with in this case ("everywhere throughout the internet") is huge. It's beyond huge. And it changes by the minute.

Now we get into the hairy technical details of going out and grabbing a web page. Several Perl modules that will help you do this have already been mentioned; you can find them and their documentation on CPAN. Once you have one web page, the simplest way to get more is to follow the links on it. So you use HTML::Parser, find all the links, go out and grab those pages too, and so on.
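For instance, here's a minimal fetch-and-extract sketch using LWP::UserAgent and HTML::LinkExtor from CPAN (the URL and user-agent string are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    # Fetch one page and return the absolute URLs of the links on it.
    sub extract_links {
        my ($url) = @_;
        my $ua = LWP::UserAgent->new(agent => 'MyCrawler/0.1');
        my $response = $ua->get($url);
        return () unless $response->is_success;

        my @links;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{href} if $tag eq 'a' && $attr{href};
        });
        $parser->parse($response->decoded_content);

        # Resolve relative links against the page's own base URL.
        return map { URI->new_abs($_, $response->base)->as_string } @links;
    }

    print "$_\n" for extract_links('http://example.com/');

(HTML::LinkExtor will resolve relative links itself if you pass a base URL to its constructor; the explicit URI call above just makes that step visible.)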

Wait, that doesn't sound all that hairy. Sounds downright simple, with the availability of such wonderful CPAN modules and the help of the PerlMonks.

Not so fast.

So now you've got your Perl script churning along, grabbing web pages and following links. Now you get into the data mining part. *Now* you have to come up with that piece of magic that analyzes these web pages and stuffs the specific data you want into a database for later reporting. This is where things aren't so simple to answer, because this bit of magic depends entirely on what you're hoping to learn from all of this data. See? I told you part one was important.
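Mechanically, though, that magic usually ends in a database insert. Here's a sketch assuming DBI with SQLite, and pretending the nugget you want is just the page title -- a stand-in for whatever your real criteria would extract:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=mined.db', '', '',
                           { RaiseError => 1 });
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS pages (
            url   TEXT PRIMARY KEY,
            title TEXT
        )
    });

    # Analyze one fetched page and stuff the interesting bit into the
    # database for later reporting. Here "interesting" is just the title;
    # HTML::Parser would be more robust than this crude regex.
    sub store_page {
        my ($url, $html) = @_;
        my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}is;
        $dbh->do('INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)',
                 undef, $url, $title // '');
    }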

Once you have that part figured out, then you have to take care of the details of keeping your web page grabbing script humming along. This means things like keeping track of where you've been and where you're going, so you don't go around in circles while traversing links. It means making intelligent decisions about link-following, especially on sites with dynamic content, so you don't end up with a database full of nothing but slashdot posts. It means timing your connections so you don't have angry mobs knocking on your door because your fat-pipe-connected web-page-grabbing farm just swamped their little web servers by requesting all 10,000 of their catalog pages in the space of thirty seconds. It means having the hardware, network connectivity, and time to mine something more than a tiny little corner of "everywhere throughout the internet." And it means taking care of a thousand other details that existing web search engines (which really are a form of data mining) have been working on for *years*, and, in most cases, still haven't gotten right.
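Here's a sketch of just the first two of those details -- remembering where you've been and pacing your requests -- built on the extract_links() routine above. The 10-second delay is an arbitrary choice; in practice LWP::RobotUA will handle robots.txt and per-host delays for you.

    use strict;
    use warnings;
    use URI;
    use Time::HiRes qw(time sleep);

    my %seen;              # URLs we've already fetched, so we don't loop
    my %last_hit;          # host => time of our last request to it
    my $min_delay = 10;    # seconds between requests to the same host
    my @queue     = ('http://example.com/');

    while (my $url = shift @queue) {
        next if $seen{$url}++;

        my $uri = URI->new($url);
        next unless $uri->scheme && $uri->scheme =~ /^https?$/;

        # Be polite: wait if we hit this host too recently.
        my $host = $uri->host;
        my $wait = $min_delay - (time() - ($last_hit{$host} // 0));
        sleep($wait) if $wait > 0;
        $last_hit{$host} = time();

        push @queue, grep { !$seen{$_} } extract_links($url);
    }

A real crawler would also cap crawl depth and total pages; as written, this one runs until the queue empties, which for "everywhere throughout the internet" is approximately never.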

Whew. So, there's the not-so-quick version of how data mining works. Good luck with it.
