http://www.perlmonks.org?node_id=1010380

cLive ;-) has asked for the wisdom of the Perl Monks concerning the following question:

Awful title, and that's the reason I'm having such a hard time finding the right answer - I'm not quite sure where this falls. Machine learning? Semantic programming? Searching turns up general articles or oblique references, but I'm having a hard time finding anything really meaty.

What I'm looking for is suggestions for an approach to identifying repeating data patterns on a page, so that I can automatically scrape a selection of attributes from a page of listing results - e.g., item name, price, description. The next step would be to identify the semantic meaning of the elements within each repeating pattern, based on regular expressions / string length.

What I'm thinking of is along the lines of an algorithm that looks for repetition in the HTML structure of the page and then examines the repeated elements for the relevant data - they could be table rows, divs, paragraphs, lists - trying to be as generic as possible...
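
Roughly, the kind of thing I have in mind - a first sketch using HTML::TreeBuilder, where the "signature" idea and the repeat threshold of 5 are just guesses on my part:

    use strict;
    use warnings;
    use HTML::TreeBuilder;    # from the HTML-Tree distribution

    # Crude structural "signature" for an element: its own tag plus the
    # tags of its element children, e.g. "div(img,span,span,a)".
    sub signature {
        my ($el) = @_;
        my @kids = grep { ref $_ } $el->content_list;    # element children only
        return $el->tag . '(' . join(',', map { $_->tag } @kids) . ')';
    }

    my $html = do { local $/; <> };    # page source on STDIN
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # A parent with many identically-shaped children is probably a listing.
    for my $parent ($tree->look_down(sub { 1 })) {
        my %count;
        $count{ signature($_) }++ for grep { ref $_ } $parent->content_list;
        while (my ($sig, $n) = each %count) {
            next if $n < 5;    # arbitrary "looks repetitive" threshold
            printf "%2d x %-40s under <%s>\n", $n, $sig, $parent->tag;
        }
    }

    $tree->delete;    # HTML::Element trees are circular; free them explicitly

The next step would then be to dig into each repeated child and try to label its text nodes, which is where the regex / string-length guessing comes in.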

Does anyone have any good article or module recommendations that could help with this? Ideally, I'd want this to work standalone as much as possible, and only flag a minority of cases for human review.

Re: Machine learning pattern matching...
by BrowserUk (Patriarch) on Dec 26, 2012 at 15:24 UTC

    If, as it sounds from your post, you are seeking to code your algorithm to be entirely independent of the pages and the content you intend to parse, you are on a hiding to nothing.

    To demonstrate the difficulty, first construct a short set of data that you might hope to be able to extract. Say:

        thingy    red          10.00       "new fresh thingy"
        dobrey    blue          1.99       "antique doobree"
        whatsit   transparent  16.49/yard  indescribable

    Now consider all the hundreds of different ways you could wrap that up in html in order to display it.

    Then consider how many variations you could contrive of each of those hundreds of ways by adding a little javascript into the mix.

    Then consider the effects of adding in images; filler; adverts; links to customer reviews; pagination controls; 'you might also like to consider' and 'other customers also bought' lists; and all the other irrelevances and annoyances that you routinely encounter on websites.

    You end up trying to work out how to extract those same 12 pieces of data from hundreds of thousands of different formats, before even considering the possibilities of different languages or deliberate obfuscation to prevent scraping. You could spend months attempting to write such a generic parser, only to be foiled when they revamp their websites.

    It would be much better to tailor simple front-end scrapers to each of your specific target pages, and only get generic once you have extracted the data you require. That way, when one page format changes, or a new page format needs to be handled, you are only faced with modifying (or writing anew) a small front-end script, not trying to adapt your entire parser to the new format without breaking the existing ones.
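
    Something along these lines - the hostnames and extractor bodies here are made up; the point is only that each per-site piece stays small and disposable while everything downstream sees one normalised shape:

        use strict;
        use warnings;
        use URI;

        # One small, disposable extractor per site; each returns the same
        # normalised structure: an arrayref of { name, price, description }.
        my %scraper_for = (
            'shop-a.example.com' => \&scrape_shop_a,
            'shop-b.example.com' => \&scrape_shop_b,
        );

        sub scrape_listing {
            my ($url, $html) = @_;
            my $host    = URI->new($url)->host;
            my $scraper = $scraper_for{$host}
                or die "No scraper for $host yet - flag for human review\n";
            return $scraper->($html);
        }

        sub scrape_shop_a {
            my ($html) = @_;
            # ~20 lines of site-specific parsing live here; when the site
            # changes its layout, only this sub needs rewriting.
            return [ { name => 'thingy', price => '10.00', description => 'new fresh thingy' } ];
        }

        sub scrape_shop_b { ... }    # yada-yada stub; dies "Unimplemented" if called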


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      The algorithm needs to be independent of the content, but the content will contain list data.

      Not all result pages will contain all of the same info either, hence the need for a best guess at what each piece of data actually means, too.

        That strengthens my argument that it would be much, much simpler to hand-craft a small routine to extract the required data from each type of page than to try and write a single routine that would attempt to recognise and extract whatever appropriate information exists on any page you give it.

        Indeed, depending upon the variety of possible inputs, I would suggest that the latter is close to impossible.

        And if you did expend the time, manpower and money on getting something working, it would no sooner be working than one or more of the input sources would decide to revamp their site and screw the whole thing up.


Re: Machine learning pattern matching...
by CountZero (Bishop) on Dec 26, 2012 at 16:41 UTC
    So you want the web page you want to scrape to act as some kind of configuration file defining what content you want to retain. I doubt that anyone has already written such a program. I think it is a few levels above the state of the art of AI technology.

    But perhaps you are thinking of something more specific: real estate listings, catalogues, ...

    If you can narrow down the scope of your research, there may be some hope yet.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      Yes, it's going to be user-suggested search results from shopping sites (honoring any robots.txt restrictions, obviously).

      Point is, I won't know what they're going to suggest until they do and, ideally, I'd like to automate additions where possible to minimize manual review.

      I was thinking of grabbing any possible matches on the page and presenting them to the user who adds the link, as a first step, but I wondered what was out there already. Short of looking for patterns in the DOM, I'm not sure what else to do.
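
      For the "best guess at meaning" part, I was picturing something as dumb as this - the price pattern and the length cut-off are pure guesswork, which is exactly why I'd want a human to confirm each guess:

          use strict;
          use warnings;

          # Rough classifier for the text found inside each repeated block.
          # Pure heuristics: a price-ish regex, then a string-length cut-off.
          sub guess_field {
              my ($text) = @_;
              $text =~ s/^\s+|\s+$//g;
              return 'price'       if $text =~ m{^\$?\d[\d,]*(\.\d+)?(\s*/\s*\w+)?$};
              return 'name'        if length($text) <= 40;
              return 'description';
          }

          my @samples = (
              'thingy',
              '10.00',
              '16.49/yard',
              'Reclaimed pine dresser, some wear to the top, 120cm wide.',
          );
          printf "%-60s => %s\n", $_, guess_field($_) for @samples;

      Each guess would be shown next to the raw block when the user submits the link, and only kept once they confirm it.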

Re: Machine learning pattern matching...
by LanX (Saint) on Dec 26, 2012 at 17:14 UTC
    > What I'm thinking of is along the lines of an algorithm that looks for repetition in the HTML structure of the page and then examines the repeated elements for the relevant data - they could be table rows, divs, paragraphs, lists - trying to be as generic as possible...

    Sounds to me like a combination of web mining and cluster analysis! (?)

    I doubt that you can find any ready-to-use modules combining both¹, because this is a core technology for some of the big players in the web business.
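
    To illustrate what I mean by cluster analysis: a toy sketch that groups structural "signatures" (tag-shape strings of the kind you would collect while walking the DOM - the ones below are invented) by a crude Jaccard similarity. A serious version would compare whole subtrees, of course:

        use strict;
        use warnings;

        # Invented signatures standing in for what a DOM walk would collect.
        my @signatures = (
            'div(img,span,span,a)',
            'div(img,span,a)',
            'div(img,span,span,a)',
            'p()',
            'li(a,span)',
            'li(a,span)',
        );

        sub tag_set {
            my ($sig) = @_;
            return { map { $_ => 1 } $sig =~ /(\w+)/g };
        }

        # Jaccard similarity of the two signatures' tag sets.
        sub similarity {
            my ($x, $y) = (tag_set($_[0]), tag_set($_[1]));
            my $common  = grep { $y->{$_} } keys %$x;
            my %union   = (%$x, %$y);
            return $common / scalar keys %union;
        }

        # Greedy single-pass clustering - crude, but enough to show the idea.
        my @clusters;
        SIG: for my $sig (@signatures) {
            for my $cluster (@clusters) {
                if (similarity($sig, $cluster->[0]) >= 0.75) {
                    push @$cluster, $sig;
                    next SIG;
                }
            }
            push @clusters, [$sig];
        }

        printf "%d members, e.g. %s\n", scalar @$_, $_->[0] for @clusters;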

    Cheers Rolf

    ¹) Especially ones as generic as you asked for.

      Not looking for a full solution, but mainly for ideas on what I should be reading up on to build it myself.

      This idea's been floating around in my brain for a while, so I'm giving it some room to see if it grows :)