PerlMonks  

html into an array

by monkeybus (Acolyte)
on May 20, 2007 at 13:48 UTC ( [id://616437]=perlquestion )

monkeybus has asked for the wisdom of the Perl Monks concerning the following question:

Hello again. So here's the deal, I use LWP::Simple to get a web page, then I use HTML::Strip to remove all the markup tags. I am left with a workable text file with maybe a little too much whitespace.
Now the problems start. If I now read this file into an array and then try to grep the array index, it just doesn't work. I'm guessing it is all that whitespace messing things up.
Is there any hope for me?
    #! /usr/bin/perl

    #fetch the webpage
    my $url = 'http://www.foo';
    use LWP::Simple;
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;
    print $content;

    #strip the tags
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $content );
    $hs->eof;

    #write to file
    $append = 0;
    if ($append)
    {
        open(MYOUTFILE, ">clean_text"); #open for write, overwrite
    }
    else
    {
        open(MYOUTFILE, ">>clean_text"); #open for write, append
    }

    #read the file into an array
    open(MYINPUTFILE, "<clean_text"); # open for input
    my(@lines) = <MYINPUTFILE>; # read file into list

    #so far so good, but check below for the troubles.
    my $search = 'foo';
    my @where = grep { $lines[$_] eq $search } 0 .. $#lines;
    print @where;

will return nothing at all.
The stripped file looks something like below.

    464



      02



  18/03/07
        ST / "Turf" / "B+2"

Edit: g0n - code tags

Replies are listed 'Best First'.
Re: html into an array
by Joost (Canon) on May 20, 2007 at 13:57 UTC
    If I now read this file into an array and then try to grep the array index, it just doesn't work.
    Of course it works, it just doesn't do what you think it should do. Since we don't know what code you've got, what you want it to do, or what it currently does, you're probably not going to get very helpful replies until you provide that information.

    See also How do I post a question effectively?.

    Update: please mark your updates (like I did this one) and use <code> .. </code> tags to mark your code. It makes it a lot easier to read (and type). See also writeup formatting tips.

    Anyway, I'm assuming you've omitted a few pieces in that code snippet that are actually there in the real code (like the part where you write to "clean_text").

    Note that

    my(@lines) = <MYINPUTFILE>; # read file into list
    Puts the complete lines in @lines. That includes the newline character "\n". Since every line (except possibly the last one) ends with "\n" and $search does not, your match is not going to work. (See also chomp)
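
    A minimal sketch of the newline problem, using made-up data in place of the real clean_text file (the three sample lines are an assumption, chosen only to show the effect of chomp):

```perl
use strict;
use warnings;

# Hypothetical stand-in for lines read from the file; each still ends in "\n".
my @lines  = ("464\n", "foo\n", "02\n");
my $search = 'foo';

# Before chomp: "foo\n" is not equal to "foo", so no index is returned.
my @before = grep { $lines[$_] eq $search } 0 .. $#lines;

# chomp strips the trailing newline from every element of the array.
chomp @lines;
my @after = grep { $lines[$_] eq $search } 0 .. $#lines;

print "matches before chomp: ", scalar(@before), "\n";
print "indices after chomp:  @after\n";
```

    This only handles the trailing newline. Lines with leading or trailing spaces still won't compare equal with eq.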

    I also note that it's very likely the lines have other whitespace in them. It's probably easier to match using a regular expression:

        # match the string "foo" on "word boundaries"
        my @line_numbers = grep { $lines[$_] =~ /\bfoo\b/ } 0 .. $#lines;
    or
        # match lines containing the single word "foo",
        # ignoring whitespace
        my @line_numbers = grep { $lines[$_] =~ /^\s*foo\s*$/ } 0 .. $#lines;
    updated: bug fixed for the grep() expression.
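
    To see how the two patterns differ, here is a small sketch run against invented lines that imitate the whitespace in the stripped file (the sample data, including the 'foo' entries, is an assumption, not the poster's real input):

```perl
use strict;
use warnings;

# Hypothetical lines resembling the stripped file, with some "foo" entries added.
my @lines = (
    "    464\n",
    "\n",
    "      foo\n",
    "  18/03/07\n",
    "  a foo b\n",
    "  food\n",
);

# Indices of lines that hold nothing but "foo" and optional whitespace.
my @exact = grep { $lines[$_] =~ /^\s*foo\s*$/ } 0 .. $#lines;

# Indices of lines that contain "foo" as a whole word anywhere.
my @word = grep { $lines[$_] =~ /\bfoo\b/ } 0 .. $#lines;

print "exact: @exact\n";
print "word:  @word\n";
```

    The anchored pattern picks out only the line that is "foo" by itself, while the word-boundary pattern also accepts "a foo b". Neither matches "food".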

Re: html into an array
by swampyankee (Parson) on May 20, 2007 at 14:00 UTC

    "Is there any hope for me?" Probably.

    What do you mean by "grep the array index"? Do you mean "search the array for the index of an array element containing a specific entry?"

    I'm guessing you're doing something like this to read the array:

        open(my $fh, "<", $stripped_file)
            or die "Could not open $stripped_file because $!\n";
        my @text = <$fh>;
        close($fh);
    and aren't getting what you expect. I'd use Data::Dumper to see if what you've read is what you think you've read.
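
    A sketch of that check, assuming a few hand-written lines in place of the real file contents. Setting $Data::Dumper::Useqq makes Dumper print double-quoted strings with escapes such as \n, so stray whitespace and newlines become visible:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for what was read from the stripped file.
my @text = ("    464\n", "\n", "      02\n");

# Print strings double-quoted, with newlines and tabs shown as escapes.
$Data::Dumper::Useqq  = 1;
$Data::Dumper::Indent = 1;

my $dump = Dumper(\@text);
print $dump;
```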

    It would be helpful to show us some code and a sample of the stripped file.

    emc

    Insisting on perfect safety is for people who don't have the balls to live in the real world.

    —Mary Shafer, NASA Dryden Flight Research Center
Re: html into an array
by davidrw (Prior) on May 20, 2007 at 15:03 UTC
    Not sure exactly what your input/output is, but you might find HTML::TableExtract very handy as a potential alternate approach to your problem.
      And if your data is in an HTML table, you can get your array in a single line of AnyData code. AnyData uses both LWP and HTML::TableExtract under the hood:
          use AnyData;
          my $arrayRef = adConvert(
              'HTMLtable',                  # input is an HTML table
              'http://host/path/foo.html',  # input comes from a remote file
              'ARRAY',                      # output is an Array reference
              '',                           # output not sent to file
              {count=>1}                    # HTML::TableExtract flags
          );
Re: html into an array
by FunkyMonk (Chancellor) on May 20, 2007 at 13:59 UTC
    You need to explain what "it just doesn't work" means exactly. Create a complete but minimal program (including data) that demonstrates your problem. Include the output from the program along with what you expected to see.
Re: html into an array
by johngg (Canon) on May 20, 2007 at 18:32 UTC
    Other Monks have given you advice to help with your problem. I would just like to point out what seems to be some slightly topsy-turvy logic in your code and give some advice regarding opening files. Your code

        $append = 0;
        if ($append)
        {
            open(MYOUTFILE, ">clean_text"); #open for write, overwrite
        }
        else
        {
            open(MYOUTFILE, ">>clean_text"); #open for write, append
        }

    looks like you are overwriting if $append is true and appending if it is false. Perhaps a little misleading?

    The use of the three argument form of open is to be encouraged as is the use of lexically scoped filehandles. You should also test that the open succeeded. Most importantly, putting the lines use strict; and use warnings; at the top of your scripts will save you a lot of wasted time in the long run as it will help you spot typos like

        $append = 1;
        ...
        if ( $apped )
        {
            # Append to my file
        }
        else
        {
            # Trample all over my irreplaceable data
        }

    Using strictures, three argument open and lexical handles your piece of code might look like

        my $append = 0;
        my $cleanedFile = q{clean_text};
        if ( $append )
        {
            open my $cleanedFH, q{>>}, $cleanedFile
                or die qq{open: $cleanedFile for append: $!\n};
        }
        else
        {
            open my $cleanedFH, q{>}, $cleanedFile
                or die qq{open: $cleanedFile for overwrite: $!\n};
        }
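
    One thing to watch in the snippet above: a handle declared with my inside the if or else block goes out of scope at the closing brace, so it cannot be used for writing afterwards. A possible variation, offered as a sketch rather than the only way to do it, chooses the mode first and opens once. The $mode variable and the "example line" text are illustrative placeholders:

```perl
use strict;
use warnings;

my $append      = 0;
my $cleanedFile = q{clean_text};

# Choose the mode up front so the handle is declared in the enclosing scope.
my $mode = $append ? q{>>} : q{>};

open my $cleanedFH, $mode, $cleanedFile
    or die qq{open: $cleanedFile with mode $mode: $!\n};

# Placeholder for the stripped text; the handle is still in scope here.
print {$cleanedFH} "example line\n";

close $cleanedFH
    or die qq{close: $cleanedFile: $!\n};
```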

    I hope these thoughts are of use.

    Cheers,

    JohnGG

Re: html into an array
by jZed (Prior) on May 20, 2007 at 17:25 UTC
    The AnyData module can read a remote file into an array in a single call:
        use AnyData;
        my $arrayRef = adConvert(
            'Text',                       # input is plain text
            'http://host/path/foo.html',  # input comes from a remote file
            'ARRAY',                      # output is an Array reference
            '',                           # output not sent to file
            {eol=>"\n"}                   # define the eol for reading input
        );
    If the data is in an HTML table, AnyData can use HTML::TableExtract to get that, see my response to davidrw below.

    update : but, um, duh, that won't let you strip the HTML so, nevermind, but if your data is in an HTML table, the solution below would work.
