html into an array

monkeybus has asked for the wisdom of the Perl Monks concerning the following question:

Hello again. So here's the deal, I use LWP::Simple to get a web page, then I use HTML::Strip to remove all the markup tags. I am left with a workable text file with maybe a little too much whitespace.
Now the problems start. If I now read this file into an array and then try to grep the array index, it just doesn't work. I'm guessing it is all that whitespace messing things up.
Is there any hope for me?

#! /usr/bin/perl

#fetch the webpage
my $url = 'http://www.foo';
   

  use LWP::Simple;
  my $content = get $url;
  die "Couldn't get $url" unless defined $content;

print $content;
#strip the tags
use HTML::Strip;

  my $hs = HTML::Strip->new();

  my $clean_text = $hs->parse( $content );
  $hs->eof;
#write to file
$append = 0;
if ($append)
 {
 open(MYOUTFILE, ">clean_text"); #open for write, overwrite
 }
else
 {
 open(MYOUTFILE, ">>clean_text"); #open for write, append
 }


#read the file into an array
open(MYINPUTFILE, "<clean_text"); # open for input
my(@lines) = <MYINPUTFILE>; # read file into list

#so far so good, but check below for the troubles.


my $search = 'foo';
my @where = grep { $lines[$_] eq $search } 0 .. $#lines;


print @where;

will return nothing at all.
The stripped file looks something like below.

464

02

18/03/07
ST / "Turf" / "B+2"

Edit: g0n - code tags


Replies are listed 'Best First'.
Re: html into an array by Joost (Canon) on May 20, 2007 at 13:57 UTC
"If I now read this file into an array and then try to grep the array index, it just doesn't work."

Of course it works, it just doesn't do what you think it should do. Since we don't know what code you've got, what you want it to do, or what it currently does, you're probably not going to get very helpful replies until you provide that information. See also How do I post a question effectively?.

Update: please mark your updates (like I did this one) and use `<code> .. </code>` tags to mark your code. It makes it a lot easier to read (and type). See also writeup formatting tips.

Anyway, I'm assuming you've omitted a few pieces in that code snippet that are actually there in the real code (like the part where you write to "clean_text").

Note that

    my(@lines) = <MYINPUTFILE>; # read file into list

puts the complete lines in @lines. That includes the newline character "\n". Since every line (except possibly the last one) ends with "\n" and $search does not, your match is not going to work. (See also chomp.)

I also note that it's very likely the lines have other whitespace in them. It's probably easier to match using a regular expression:

    # match the string "foo" on "word boundaries"
    my @line_numbers = grep { $lines[$_] =~ /\bfoo\b/ } 0 .. $#lines;

or

    # match lines containing the single word "foo",
    # ignoring whitespace
    my @line_numbers = grep { $lines[$_] =~ /^\s*foo\s*$/ } 0 .. $#lines;

Updated: bug fixed for the grep() expression.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
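To make the newline point concrete, here is a minimal sketch rather than the original poster's program. The `@lines` data is made up to resemble the stripped file, with two `foo` lines added so there is something to find. It compares the raw lines with `eq`, then tries a whitespace-tolerant regex, then repeats the `eq` test after `chomp`.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from "clean_text".
my @lines  = ("464\n", "\n", "  foo  \n", "foo\n", "18/03/07\n");
my $search = 'foo';

# Exact comparison on raw lines: "foo\n" is not equal to "foo".
my @raw = grep { $lines[$_] eq $search } 0 .. $#lines;

# A regex that tolerates surrounding whitespace, including the newline.
my @loose = grep { $lines[$_] =~ /^\s*\Q$search\E\s*$/ } 0 .. $#lines;

# Strip the trailing newlines, then compare exactly again.
chomp @lines;
my @exact = grep { $lines[$_] eq $search } 0 .. $#lines;

print "raw:   @raw\n";
print "loose: @loose\n";
print "exact: @exact\n";
```

The raw comparison finds nothing, the regex finds both `foo` lines, and the post-`chomp` comparison finds only the line with no padding. Which one you want depends on whether padded lines should count as a match.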
Re: html into an array by swampyankee (Parson) on May 20, 2007 at 14:00 UTC
"Is there any hope for me?"

Probably.

What do you mean by "grep the array index"? Do you mean "search the array for the index of an array element containing a specific entry"?

I'm guessing you're doing something like this to read the array:

    open(my $fh, "<", $stripped_file) or die "Could not open $stripped_file because $!\n";
    my @text = <$fh>;
    close($fh);

and aren't getting what you expect. I'd use Data::Dumper to see if what you've read is what you think you've read.

It would be helpful to show us some code and a sample of the stripped file.

emc

Insisting on perfect safety is for people who don't have the balls to live in the real world. -- Mary Shafer, NASA Dryden Flight Research Center
Re: html into an array by davidrw (Prior) on May 20, 2007 at 15:03 UTC
Not sure exactly what your input/output is, but you might find HTML::TableExtract very handy as a potential alternate approach to your problem.
Re^2: html into an array by jZed (Prior) on May 20, 2007 at 16:55 UTC
And if your data is in an HTML table, you can get your array in a single line of AnyData code. AnyData uses both LWP and HTML::TableExtract under the hood:

    use AnyData;
    my $arrayRef = adConvert(
        'HTMLtable',                  # input is an HTML table
        'http://host/path/foo.html',  # input comes from a remote file
        'ARRAY',                      # output is an Array reference
        '',                           # output not sent to file
        {count=>1}                    # HTML::TableExtract flags
    );
Re: html into an array by FunkyMonk (Chancellor) on May 20, 2007 at 13:59 UTC
You need to explain what "it just doesn't work" means exactly. Create a complete but minimal program (including data) that demonstrates your problem. Include the output from the program along with what you expected to see.
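As an illustration of what such a minimal program could look like, here is a sketch that carries its own data and prints what was actually read. The `@lines` contents are invented for the example, and the use of Data::Dumper with `$Data::Dumper::Useqq` is one way to make invisible characters visible, not the only one.

```perl
use strict;
use warnings;
use Data::Dumper;

# Dump strings in double quotes so "\n" and "\t" appear as visible escapes.
$Data::Dumper::Useqq = 1;

# Hypothetical data standing in for lines read from the stripped file.
my @lines  = ("464\n", "\n", "foo\n", "\t18/03/07 \n");
my $search = 'foo';

# The comparison from the question, unchanged.
my @where = grep { $lines[$_] eq $search } 0 .. $#lines;

my $dump = Dumper(\@lines);
print "matches: ", scalar(@where), "\n";
print $dump;
```

The dump shows each element with its trailing newline spelled out, which makes it easy to state both what the program printed and what was expected.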
Re: html into an array by johngg (Canon) on May 20, 2007 at 18:32 UTC
Other Monks have given you advice to help with your problem. I would just like to point out what seems to be some slightly topsy-turvy logic in your code and give some advice regarding opening files. Your code

    $append = 0;
    if ($append)
     {
     open(MYOUTFILE, ">clean_text"); #open for write, overwrite
     }
    else
     {
     open(MYOUTFILE, ">>clean_text"); #open for write, append
     }

looks like you are overwriting if `$append` is true and appending if it is false. Perhaps a little misleading?

The use of the three argument form of open is to be encouraged, as is the use of lexically scoped filehandles. You should also test that the `open` succeeded. Most importantly, putting the lines `use strict;` and `use warnings;` at the top of your scripts will save you a lot of wasted time in the long run, as it will help you spot typos like

    $append = 1;
    ...
    if ( $apped )
    {
        # Append to my file
    }
    else
    {
        # Trample all over my irreplaceable data
    }

Using strictures, three argument open and lexical handles, your piece of code might look like

    my $append = 0;
    my $cleanedFile = q{clean_text};
    my $cleanedFH;    # declared outside the if so it stays in scope afterwards
    if ( $append )
    {
        open $cleanedFH, q{>>}, $cleanedFile
            or die qq{open: $cleanedFile for append: $!\n};
    }
    else
    {
        open $cleanedFH, q{>}, $cleanedFile
            or die qq{open: $cleanedFile for overwrite: $!\n};
    }

I hope these thoughts are of use.

Cheers,

JohnGG
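The posted snippet opens the output file but shows no print to it and no close before the file is read back. A rough sketch of the full round trip follows, under some assumptions: the file name `clean_text.demo` and the `$clean_text` string are stand-ins invented for the example, and choosing the mode string first is just one way to end up with a single `open`.

```perl
use strict;
use warnings;

my $append      = 0;
my $cleanedFile = q{clean_text.demo};    # hypothetical file name for this sketch
my $clean_text  = "464\n\n02\nfoo\n";    # stand-in for the HTML::Strip output

# Pick the mode first so there is one open and one handle in scope.
my $mode = $append ? q{>>} : q{>};
open my $outFH, $mode, $cleanedFile
    or die qq{open: $cleanedFile for write ($mode): $!\n};
print {$outFH} $clean_text;
close $outFH
    or die qq{close: $cleanedFile: $!\n};    # flush to disk before reading back

open my $inFH, q{<}, $cleanedFile
    or die qq{open: $cleanedFile for read: $!\n};
chomp( my @lines = <$inFH> );                # drop the trailing newlines
close $inFH;

my @where = grep { $lines[$_] eq 'foo' } 0 .. $#lines;
print "found at index: @where\n";

unlink $cleanedFile;                         # tidy up the demo file
```

Closing the write handle before reopening the file matters here: output is buffered, so reading too early can see an empty or partial file.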
Re: html into an array by jZed (Prior) on May 20, 2007 at 17:25 UTC
The AnyData module can read a remote file into an array in a single call:

    use AnyData;
    my $arrayRef = adConvert(
        'Text',                       # input is plain text
        'http://host/path/foo.html',  # input comes from a remote file
        'ARRAY',                      # output is an Array reference
        '',                           # output not sent to file
        {eol=>"\n"}                   # define the eol for reading input
    );

If the data is in an HTML table, AnyData can use HTML::TableExtract to get that; see my response to davidrw in this thread.

Update: but, um, duh, that won't let you strip the HTML so, nevermind. If your data is in an HTML table, though, the solution in that response would work.

