by Tanktalus (Canon)
[Lady_Aleena]What is the best/easiest to understand web scraping module?
[choroba]you can imagine WWW::Mechanize as a simple "browser"
[Lady_Aleena]I'm trying to get data out of tables on html pages.
[choroba]HTML::TableExtract then?
[Lady_Aleena]This is a big learning curve.
[Lady_Aleena]Lady_Aleena growls at the lack of info in the doc.
[runrig]LWP::Simple if you just have a web page to 'get', and pass the contents to HTML::TableExtract
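A minimal sketch of runrig's suggestion, assuming LWP::Simple and HTML::TableExtract are both installed. The helper name `table_rows` is hypothetical, and the URL is taken from the command line because none appears in the chat.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TableExtract;

# Hypothetical helper: parse an HTML string and return every row
# (an array reference of cells) from every table found in it.
sub table_rows {
    my ($html) = @_;
    my $te = HTML::TableExtract->new;    # no constraints given
    $te->parse($html);
    return map { $_->rows } $te->tables;
}

# Fetch only when a URL is supplied on the command line.
if (my $url = shift @ARGV) {
    my $html = get($url);                # undef when the fetch fails
    die "Could not fetch $url\n" unless defined $html;
    for my $row (table_rows($html)) {
        # Empty cells can come back undefined, so substitute ''.
        print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Keeping the fetch and the parse separate means the same helper would also work on HTML read from a local file.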
[Lady_Aleena]I have not looked at a new-to-me module in about a year.
[Lady_Aleena]I forgot just how sparse the documentation usually is, and how much I would have to trace out to find out what things do.
[Lady_Aleena]Is it just me or is the code in the synopsis not strict?
[runrig]Synopsis code is often not strict
[choroba]yes, it's not. It's from 2000, anyway
[runrig]it's up to you to put in the 'use strict' and the my's
[Lady_Aleena]I hope HTML::TableExtract installs all its dependencies.
[runrig]its only prereq is HTML::Parser, so it's then up to HTML::Parser to list all its dependencies
[runrig]Assuming you're using cpan or cpanm to install
[Marshall]On ActiveState, HTML::Parser comes with the distribution. Your dist may already have it installed?
[Lady_Aleena]I use cpan.
[Marshall]just installed this thing and it needs HTML-Element-Extended-1.18 also.
[Lady_Aleena]Lady_Aleena head desks.

[Lady_Aleena]Do I have to open the file?
[Lady_Aleena]It doesn't say.
[runrig]That seems to be an optional dependency
[Lady_Aleena]I can't get it to work.
[runrig]If the html is in a file, you pass the file to the 'parse_file' method
[runrig]the file name, that is
[Lady_Aleena]It doesn't die when the path to the file is wrong.
[runrig]I never noticed that...
[Lady_Aleena]no file should mean death.
[runrig]you can <code>$p->parse_file($file) or die "Error parsing file: $!";</code>
[Lady_Aleena]Oh, I had to use the exact path to the file, so now I'm trying to figure out how to get the data.
[runrig]and then, if you call $p->eof to abort parsing, then parse_file will return false anyway, in which case you probably don't want to die.
[Lady_Aleena]This module collects more data than I need.
[runrig]but then, you're using HTML::TableExtract and so probably wouldn't be calling eof() anyway, so never mind
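The "no file should mean death" behaviour can be sketched as below. This assumes HTML::TableExtract is installed and that no handler calls eof() to abort parsing (the case runrig mentions). The helper name `parse_table_file` is hypothetical.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: die early when the file is missing or unreadable,
# then die again if parse_file() itself reports a failure.
sub parse_table_file {
    my ($file) = @_;
    die "Cannot read '$file'\n" unless -f $file && -r $file;

    my $te = HTML::TableExtract->new;
    # parse_file() returns undef when the file cannot be opened,
    # and $! then holds the reason.
    $te->parse_file($file)
        or die "Error parsing '$file': $!\n";
    return $te;
}

# Run against a file name given on the command line, if any.
if (my $file = shift @ARGV) {
    my $te = parse_table_file($file);
    printf "%d table(s) found in %s\n", scalar(() = $te->tables), $file;
}
```

The explicit file test runs before the parser is created, so a wrong path fails loudly instead of quietly yielding no tables.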
[runrig]You can configure it to only return selected columns from selected tables.
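A hedged sketch of what runrig describes, using the `headers` option of HTML::TableExtract. The table and its column names are invented for illustration and are not Lady_Aleena's data.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample table; only the Name and Price columns are wanted.
my $html = <<'HTML';
<table>
<tr><th>Name</th><th>Color</th><th>Price</th></tr>
<tr><td>Apple</td><td>Red</td><td>1.20</td></tr>
<tr><td>Pear</td><td>Green</td><td>0.90</td></tr>
</table>
HTML

# Only tables containing these headers match, and only these
# columns are returned, in the order listed here.
my $te = HTML::TableExtract->new(headers => [qw(Name Price)]);
$te->parse($html);

my @rows;
for my $ts ($te->tables) {
    push @rows, $ts->rows;    # each row is an array reference
}

print join(', ', @$_), "\n" for @rows;
```

Tables without the listed headers are skipped, which is one way to collect less data than the default.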
[runrig]So, anybody up late celebrating/mourning the Brexit?
[Lady_Aleena]I know the stock markets around the world are suffering because of Brexit. (goes back to data diving for the rows)
[Marshall]The pound is at a 30 year low. Maybe time for a holiday in England?
[Lady_Aleena]I can't figure what is being escaped in the returned data on my scratchpad.
[Marshall]These folks may not be so happy once the economic reality sets in. UK was far better off in the EU.
[runrig]should've sold all your pounds for gold...
[RonW]Has Article 50 been invoked, yet?
[Lady_Aleena]Looks like I have a lot of chomping to do with the returned data too.
[Marshall]Article 50 is a next step - this all takes time. The vote was just advisory. Now the implementation must start.
[RonW]I often find s/[\r\n]+$// more useful than chomp
[RonW]I recall Cameron stating he was going to invoke Article 50 before resigning, but have not heard if he actually did that
[Marshall]Cameron says he's staying until ~Oct.
[runrig]I usually just s/\s+$//
[Marshall]whether he lasts that long, remains to be seen.
[RonW]All I heard was he resigned. Didn't hear the details
[Lady_Aleena]The data format returned is confusing
[RonW]Sometimes I want the spaces/tabs but not the line endings
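The three trimming approaches mentioned by RonW and runrig differ as follows. This is a small sketch in core Perl; the sample string is made up.

```perl
use strict;
use warnings;

# Made-up cell text: trailing space, tab, then a CRLF line ending.
my $raw = "Cyrus \t\r\n";

my $chomped = $raw;
chomp $chomped;              # removes only the trailing "\n" (default $/)

my $no_eol = $raw;
$no_eol =~ s/[\r\n]+$//;     # removes CR and LF, keeps the space and tab

my $trimmed = $raw;
$trimmed =~ s/\s+$//;        # removes all trailing whitespace

printf "%-8s [%s]\n", $_->[0], join ' ', map { sprintf '%02x', ord } split //, $_->[1]
    for [chomp => $chomped], [no_eol => $no_eol], [trimmed => $trimmed];
```

The hex dump makes the leftover carriage return after `chomp` visible, which plain printing would hide.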
[Marshall]Calling for this referendum was a HUGE mistake on his part. Should have never even allowed the vote.
[Marshall]Yep, Mr Cameron is now gone. this Oct idea didn't last long!
[Lady_Aleena]WTH?!?!?! Why is the content of the rows being escaped like \'Cyrus&#65533;',?
[Marshall]Lady_Aleena good luck! Been years since I messed with LWP stuff. It can get hairy.
[runrig]LA: Did you read the docs? Are you looping through the tables and/or rows? Or are you just dumping the result of parse_file()?
[Lady_Aleena]$row->[0] returns something which looks like SCALAR(0x9212464)
[runrig]There are no references if you loop through tables() and rows() as it says in the docs.
[Lady_Aleena]runrig, I'm looping through the grids.
[Lady_Aleena]runrig, it looks like I have tables in tables instead of one big table.
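For tables nested inside tables, a sketch of the tables() and rows() loop runrig refers to, printing each table's depth and count so the nesting is visible. The sample HTML is invented, and HTML::TableExtract is assumed to be installed.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample: one outer table with one table nested in a cell.
my $html = <<'HTML';
<table>
<tr><td>outer
  <table><tr><td>inner A</td><td>inner B</td></tr></table>
</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new;    # no constraints given
$te->parse($html);

my @found;
for my $ts ($te->tables) {
    my ($depth, $count) = $ts->coords;    # nesting level, index at that level
    for my $row ($ts->rows) {
        # rows() hands back plain cell values; undefined cells become ''.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @cells;        # trim surrounding whitespace
        push @found, "depth $depth, count $count: " . join(' | ', @cells);
    }
}

print "$_\n" for @found;
```

Once the coordinates of the wanted table are known, `depth` and `count` can be passed to `new` to restrict matching to that table.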
[Lady_Aleena]The documentation is very murky.
[Lady_Aleena]The top of my scratchpad is as far as I've gotten.
[RonW]Aleena, looks like you need to run the strings you get through HTML entity decoding (don't remember how to do that, though)
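One way to do the entity decoding RonW mentions is `decode_entities` from HTML::Entities, which ships with the HTML::Parser distribution. A minimal sketch follows; the first sample string is invented.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Invented sample with named and numeric entities.
my $cell    = 'Tom &amp; Jerry &lt;3 &#233;clair';
my $decoded = decode_entities($cell);

# The string from the chat: &#65533; is U+FFFD, the Unicode
# replacement character.
my $cyrus = decode_entities('Cyrus&#65533;');

binmode STDOUT, ':encoding(UTF-8)';    # avoid "Wide character" warnings
print "$decoded\n";
printf "last code point: U+%04X\n", ord substr($cyrus, -1);
```

A U+FFFD in the data usually suggests the original character was already lost to an encoding problem before extraction, so decoding would not bring it back.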
[Lady_Aleena]I don't think this is going to work. The tables I'm trying to scrape are a mess.

