Re: HTML parsing OR capturing text from a string within tags

by Popcorn Dave (Abbot)
on Dec 24, 2006 at 06:23 UTC ( #591492 )


in reply to HTML parsing OR capturing text from a string within tags

kevyt,

Might I suggest a different tack than the one you're taking now?

Long ago, I wrote a newspaper headline grabber for a Perl class using LWP::Simple's get function to grab web pages. I found that easier to use since it can return the whole page to a scalar. Then I used HTML::TokeParser to actually divide up the information and based my collection on only the tokens I actually wanted to save.
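
The approach described above can be sketched roughly as follows. This is not the original headline grabber: the helper name `link_texts`, the choice of `<a>` elements, and the sample HTML are assumptions made for illustration. The page is parsed from a string so the sketch runs offline; the commented lines show where LWP::Simple's `get` would supply the page instead.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the text inside every <a> element,
# skipping all other tokens on the page.
sub link_texts {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML string";
    my @texts;
    while ($p->get_tag('a')) {
        # Read text up to the closing </a>, with whitespace collapsed.
        my $text = $p->get_trimmed_text('/a');
        push @texts, $text if length $text;
    }
    return @texts;
}

# With a live page you would fetch the whole document into a scalar first:
#   use LWP::Simple qw(get);
#   my $html = get('http://example.com/');
#   die "fetch failed" unless defined $html;
my $html = '<h1>News</h1>'
         . '<a href="/a">First headline</a> '
         . '<a href="/b">Second headline</a>';

print "$_\n" for link_texts($html);
```

Only the tokens of interest are kept, so the rest of the page never needs to be matched with a regex.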

If you look at Re: HTML::TokeParser help - parsing headlines, there's a quick and dirty token parser that I wrote so that you can see how it splits up an HTML file.

Hope that helps!

Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

If quizzes are quizzical, what are tests?

Replies are listed 'Best First'.
Re^2: HTML parsing OR capturing text from a string within tags
by kevyt (Scribe) on Dec 24, 2006 at 07:09 UTC
    Popcorn Dave, Thanks... I will try that... I just added a lot of prints to Element.pm to see what is going on. I will try your method tomorrow :) Thanks... This is what I have done. The format of Element.pm looks similar to code I used to work with at a former job.

    sub find_by_ktag_name {
        my(@pile) = shift(@_);    # start out the to-do stack for the traverser
        Carp::croak "find_by_created_tag_name can be called only as an object method"
            unless ref $pile[0];
        return() unless @_;
        print "pile is @pile\n";
        my(@tags) = $pile[0]->_fold_case(@_);
        print "tags are @tags\n";
        my(@matching, $this, $this_tag);
        while(@pile) {
            $this_tag = ($this = shift @pile)->{'_tag'};
            print "In while loop. this_tag is $this_tag\n";
            foreach my $t (@tags) {
                print "foreach going through elements of tag. Elements are t and t is $t\n";
                print "next step will check to see if t is eq to this_tag. this_tag is $this_tag\n";
                if($t eq $this_tag) {
                    print "inside of if... t and this_tag are equal.\n";
                    if(wantarray) {
                        print "I am here if wantarray is true. Now push this onto array matching\n";
                        push @matching, $this;
                        print "matching is @matching\n";
                        last;
                    } else {
                        print "wantarray not true, returning this $this\n";
                        return $this;
                    }
                }
            }
            unshift @pile, grep ref($_), @{$this->{'_content'} || next};
        }
        print "returning @matching if wantarray\n";
        return @matching if wantarray;
        return;
    }

    My print statements showed me that there is a library of predefined tags. If I can add my own tags, I think it will work :) I will also try your method. Tackling this is sort of fun. Some output:

    next step will check to see if t is eq to this_tag. this_tag is a
    In while loop. this_tag is a
    next step will check to see if t is eq to this_tag. this_tag is font
    next step will check to see if t is eq to this_tag. this_tag is br
Re^2: HTML parsing OR capturing text from a string within tags
by kevyt (Scribe) on Dec 24, 2006 at 07:31 UTC
    Popcorn Dave, I looked at your code. I don't know how it works yet. Will it allow me to add my own string and remove the text right after it? For example...
    <div\042\... > Person <b> Ran <\div>
    Will it allow me to capture Person Ran? I think this is the file where I can add my own tags :)
    HTML-Tree-3.23/lib/HTML/AsSubs.pm
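
For the question above, here is a minimal sketch of capturing the text inside a `<div>` with HTML::TokeParser, without editing any module source. The helper name `div_texts` and the sample markup (including the `class` attribute) are invented for illustration and are not taken from the original page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the visible text of each <div>, with inner
# tags such as <b> skipped and runs of whitespace collapsed.
sub div_texts {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML string";
    my @texts;
    while ($p->get_tag('div')) {
        # Stops at the first </div>, so nested divs would need extra care.
        push @texts, $p->get_trimmed_text('/div');
    }
    return @texts;
}

# Sample input, assumed for illustration only.
my $html = '<div class="name"> Person <b> Ran </b></div>';
print "$_\n" for div_texts($html);
```

For the sample input this should print `Person Ran`, since the `<b>` tag is skipped and only the text tokens are joined.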
      All that code does is get an HTML page and parse it into tokens. It will spit the whole mess out, so I ran it at the command line, e.g. perl tokeparser.pl > output.txt

      That way you can scan through the file and see how it's tokenizing the information you fed it.
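
A token dump of that kind might look something like the sketch below. It is not the script from the linked node; the helper name `token_lines` and the one-line-per-token output format are assumptions.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: describe every token as "TYPE: value".
# For S (start tag) and E (end tag) tokens the value is the tag name;
# for T (text), C (comment) and D (declaration) tokens it is the text.
sub token_lines {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML string";
    my @lines;
    while (my $token = $p->get_token) {
        my ($type, $value) = @$token;
        push @lines, "$type: $value";
    }
    return @lines;
}

# Sample input, assumed for illustration only.
print "$_\n" for token_lines('<p>Hello <b>world</b></p>');
```

Redirecting the output to a file, as described above, makes it easy to see which tag names and text chunks to look for.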


        Yahoo offers something that I can use. I can send Yahoo a request and Yahoo will send me an XML file, BUT I am getting errors because Yahoo has URLs with &'s in the file. I can either replace all of the & with %26, save the file, and then let XML::Parser do the work, or I can look at the parser code, determine where it parses the file, and make the change there. I have found where it parses the file in Expat.pm :: sub parse. Then it calls ParseString(), but I can't find the sub ParseString.

        http://local.yahooapis.com/LocalSearchService/V2/localSearch?appid=YahooDemo&query=plumbing&zip=22222&format=php&results=10
        Kevin
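
One way to pre-process the response without touching the parser source is sketched below. Note that it swaps in a different replacement from the one proposed above: XML expects a literal ampersand to be written as `&amp;`, whereas `%26` is URL encoding and would change what the URL means. The helper name `escape_bare_ampersands` and the sample `<Url>` element are assumptions, and the regex is a heuristic that does not account for CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical helper: escape ampersands that do not already begin an
# entity or character reference (&name; &#123; &#x1F;).
sub escape_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

# Sample input, assumed for illustration only.
my $raw = '<Url>http://example.com/search?query=plumbing&zip=22222</Url>';
my $clean = escape_bare_ampersands($raw);
print "$clean\n";

# The cleaned string could then be handed to the parser, e.g.:
#   use XML::Parser;
#   my $tree = XML::Parser->new(Style => 'Tree')->parse($clean);
```

When the parsed text is read back, `&amp;` is decoded to `&` again, so the URLs come out unchanged.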
