
Re: Machine learning pattern matching...

by BrowserUk (Pope)
on Dec 26, 2012 at 15:24 UTC (#1010396)


in reply to Machine learning pattern matching...

If, as it sounds from your post, you are seeking to code your algorithm to be entirely independent of the pages you intend to parse and of their content, you are on a hiding to nothing.

To demonstrate the difficulty, first construct a short set of data that you might hope to be able to extract. Say:

thingy red 10.00 "new fresh thingy"
dobrey blue 1.99 "antique doobree"
whatsit transparent 16.49/yard indescribable

Now consider all the hundreds of different ways you could wrap that up in html in order to display it.

Then consider how many variations you could contrive of each of those hundreds of ways by adding a little javascript into the mix.

Then consider the effects of adding in images; filler; adverts; links to customer reviews; pagination controls; 'you might also like to consider' and 'other customers also bought' lists; and all the other irrelevances and annoyances that you routinely encounter on websites.

You end up trying to work out how to extract those same 12 pieces of data from hundreds of thousands of different formats, before even considering the possibilities of different languages or deliberate obfuscation to prevent scraping. You could spend months attempting to write such a generic parser, only to be foiled when the sites revamp their pages.

It would be much better to tailor simple front-end scrapers to each of your specific target pages, and only get generic once you have extracted the data you require. That way, when one page format changes, or a new page format needs to be handled, you are only faced with modifying (or writing anew) a small front-end script, not trying to adapt your entire parser to the new format without breaking the existing ones.
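To make that split concrete, here is a minimal sketch in Python (any language would do). Everything in it is invented for illustration: the two page layouts, the site names, and the functions scrape_site_a, scrape_site_b, normalise and extract are all hypothetical. Each front-end knows exactly one layout and does nothing but pull out raw fields; only normalise is generic.

```python
import re

# Hypothetical markup: two imaginary sites showing the same kind of item.
SITE_A_HTML = """
<table>
<tr><td>thingy</td><td>red</td><td>10.00</td><td>new fresh thingy</td></tr>
<tr><td>dobrey</td><td>blue</td><td>1.99</td><td>antique doobree</td></tr>
</table>
"""

SITE_B_HTML = """
<ul>
<li data-name="whatsit" data-colour="transparent" data-price="16.49/yard">indescribable</li>
</ul>
"""

FIELDS = ("name", "colour", "price", "description")


def scrape_site_a(html):
    """Front-end for the assumed table layout: one row per item, four cells."""
    records = []
    for row in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = re.findall(r"<td>(.*?)</td>", row, re.S)
        if len(cells) == len(FIELDS):
            records.append(dict(zip(FIELDS, cells)))
    return records


# Pattern for the assumed list layout: fields in attributes, description as text.
SITE_B_ITEM = re.compile(
    r'<li data-name="(?P<name>[^"]*)" data-colour="(?P<colour>[^"]*)" '
    r'data-price="(?P<price>[^"]*)">(?P<description>.*?)</li>',
    re.S,
)


def scrape_site_b(html):
    """Front-end for the assumed list layout."""
    return [match.groupdict() for match in SITE_B_ITEM.finditer(html)]


def normalise(record):
    """Generic back-end step: split '16.49/yard' into a number and a unit."""
    amount, _, unit = record["price"].partition("/")
    return {**record, "price": float(amount), "unit": unit or "each"}


# One small front-end per page format; adding a site means adding an entry.
SCRAPERS = {"site_a": scrape_site_a, "site_b": scrape_site_b}


def extract(site, html):
    """Run the site-specific front-end, then the shared back-end."""
    return [normalise(record) for record in SCRAPERS[site](html)]


if __name__ == "__main__":
    for item in extract("site_a", SITE_A_HTML) + extract("site_b", SITE_B_HTML):
        print(item)
```

When site A redesigns its page, only scrape_site_a needs rewriting; normalise and everything downstream of it is untouched. The regexes are deliberately tied to one exact layout, which is acceptable here precisely because each front-end is small and disposable.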


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: Machine learning pattern matching...
by cLive ;-) (Parson) on Dec 31, 2012 at 16:39 UTC

    The algorithm needs to be independent of the content, but the content will contain list data.

    Not all result pages will contain all the same info either, hence having a best guess on the actual data meaning too.

      That strengthens my argument that it would be much, much simpler to hand-craft a small routine to extract the required data from each type of page than to try and write a single routine that would attempt to recognise and extract whatever appropriate information exists on any page you give it.

      Indeed, depending upon the variety of possible inputs, I would suggest that the latter is close to impossible.

      And if you did expend the time, manpower and money on getting something working, it would no sooner be working than one or more of the input sources would decide to revamp their site and screw the whole thing up.

