PerlMonks  

Re^2: Scrape a blog: a statistical approach

by epimenidecretese (Acolyte)
on Apr 13, 2014 at 12:26 UTC (#1082148)


in reply to Re: Scrape a blog: a statistical approach
in thread Scrape a blog: a statistical approach

I did a search and, if I understood correctly, what I am trying to do is remove the boilerplate from some HTML pages. Is that the right term?

So, I have a lot of data (web pages from 2005 to 2014) from the same blog, and I only want the text of each post. That means I also have a lot of examples of what I don't want. In all the HTML files I have, the post content (title, date, text, etc.) should be, more or less, the only thing that changes. So I am trying to figure out how to statistically identify the lines of code that don't change.

So far I have processed all the pages and concatenated them into one big HTML file. It contains lots of identical lines. I want to count how frequently those lines occur, so that I can identify them as boilerplate.
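To make the counting idea concrete, here is a minimal sketch. It is written in Python rather than Perl, and the sample markup, the function names and the `min_count` threshold are all made up for illustration; the real cutoff would have to be tuned against the actual pages.

```python
from collections import Counter

def line_counts(html_text):
    """Count how often each non-empty, whitespace-stripped line occurs."""
    return Counter(
        line.strip() for line in html_text.splitlines() if line.strip()
    )

def frequent_lines(html_text, min_count):
    """Return the set of lines seen at least min_count times."""
    counts = line_counts(html_text)
    return {line for line, n in counts.items() if n >= min_count}

# Hypothetical stand-in for the big concatenated file: three tiny "pages".
combined = """\
<div id="header">My Blog</div>
<p>First post text</p>
<div id="footer">Contact</div>
<div id="header">My Blog</div>
<p>Second post text</p>
<div id="footer">Contact</div>
<div id="header">My Blog</div>
<p>Third post text</p>
<div id="footer">Contact</div>
"""

# Treat a line as boilerplate if it shows up on every one of the 3 pages.
boilerplate = frequent_lines(combined, min_count=3)
kept = [
    line for line in combined.splitlines()
    if line.strip() and line.strip() not in boilerplate
]
```

Here `kept` ends up holding only the three post paragraphs. On real data a lower threshold (say, lines seen on most pages rather than all) is probably safer, since templates change over nine years.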

With the (bad) code I posted I've already been able to strip a lot of code from the original HTML pages. Now I want to clean up the rest. Since I have lots of pages, I thought it would be a good idea to somehow weight the boilerplate lines (maybe with mutual information?).
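As a rough idea of what such a weight could look like, the sketch below scores each line by inverse document frequency. Note that this is a simpler stand-in, not mutual information: a line that appears on every page gets weight 0, and a line that appears on a single page gets the highest weight. It is Python rather than Perl, and the pages and function names are invented for the example.

```python
import math
from collections import Counter

def document_frequency(pages):
    """For each distinct line, count the number of pages that contain it."""
    df = Counter()
    for page in pages:
        # A set, so a line repeated within one page is counted once.
        df.update({line.strip() for line in page.splitlines() if line.strip()})
    return df

def idf_weights(pages):
    """Weight each line by log(N / df); 0.0 means 'on every page'."""
    n = len(pages)
    return {
        line: math.log(n / count)
        for line, count in document_frequency(pages).items()
    }

# Hypothetical pages: a header on all three, a sidebar on two, unique posts.
pages = [
    "<div>Header</div>\n<div>Sidebar 2005</div>\n<p>Post A</p>",
    "<div>Header</div>\n<div>Sidebar 2005</div>\n<p>Post B</p>",
    "<div>Header</div>\n<p>Post C</p>",
]

weights = idf_weights(pages)
```

With these pages the header scores 0, the sidebar lands in between, and each post line scores highest, so sorting by weight gives a ranking from "almost certainly boilerplate" to "almost certainly content" instead of a hard yes/no.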

I'll post some examples of the code and the results I've got so far as soon as I can.

Re^3: Scrape a blog: a statistical approach
by soonix (Abbot) on Apr 13, 2014 at 22:18 UTC
    You could take a diff between consecutive pages instead of counting lines. You'd have to experiment with different modules, e.g. HTML::Diff or Text::Diff, but this approach could also help with style/layout changes.
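A minimal sketch of the diff idea follows, using Python's standard `difflib` as a stand-in for the Perl modules named above (it does not reproduce the API of HTML::Diff or Text::Diff). The two sample pages and the function name are hypothetical.

```python
import difflib

def changed_lines(prev_page, next_page):
    """Return the lines of next_page that the diff marks as new or replaced."""
    prev_lines = prev_page.splitlines()
    next_lines = next_page.splitlines()
    matcher = difflib.SequenceMatcher(a=prev_lines, b=next_lines, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'replace' and 'insert' cover content that differs from prev_page.
        if tag in ("replace", "insert"):
            changed.extend(next_lines[j1:j2])
    return changed

# Two hypothetical consecutive pages sharing the same template.
page_a = "\n".join([
    '<div id="header">My Blog</div>',
    "<h1>Post one</h1>",
    "<p>Text of post one</p>",
    '<div id="footer">Contact</div>',
])
page_b = "\n".join([
    '<div id="header">My Blog</div>',
    "<h1>Post two</h1>",
    "<p>Text of post two</p>",
    '<div id="footer">Contact</div>',
])

post_b = changed_lines(page_a, page_b)
```

Because each page is only compared with its neighbour, a template redesign shows up as one noisy diff at the changeover instead of breaking the counts for the whole archive.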
