
Ignoring specific html tags before parsing

by ganeshPerlStarter (Novice)
on Oct 07, 2013 at 02:48 UTC ( [id://1057209]=perlquestion )

ganeshPerlStarter has asked for the wisdom of the Perl Monks concerning the following question:

Dear Friends,

I am learning Perl and trying to use it in my project. I want to extract the text content of HTML files, but I want to ignore tables and images. I used HTML::TreeBuilder::XPath for the HTML parsing, but could not find a way to ignore specific tags. I also thought of grepping out the lines between the opening and closing tags, but that broke some other HTML tags.

How can I ignore such tags in an HTML file and then get the text content of that file?

Thanks in advance for your help and time.

Best Regards,
ganesh

Replies are listed 'Best First'.
Re: Ignoring specific html tags before parsing
by roboticus (Chancellor) on Oct 07, 2013 at 03:05 UTC

    ganeshPerlStarter:

    If you look at HTML::Parser, it has a couple of examples. In fact, the second one is very close to what you want. Since HTML::TreeBuilder builds on top of HTML::Parser, you should be able to tweak the parser to do what you want when you parse the file. Looking at the docs, it appears that the eg/hstrip example in the distribution can be coerced into doing what you're attempting to do.
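A minimal sketch of that idea, assuming the goal is plain text with table content left out. It is not the eg/hstrip script itself; the sample `$html` string and the variable names are made up for illustration. It relies on HTML::Parser's `ignore_elements`, which suppresses an element's start tag, end tag, and everything in between.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input; in practice this would be the fetched page.
my $html = <<'HTML';
<html><body>
<p>Keep this paragraph.</p>
<table><tr><td>Drop this cell.</td></tr></table>
<p>Keep <img src="x.png" alt="pic"> this too.</p>
</body></html>
HTML

my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # "dtext" hands the callback the text with entities already decoded
    text_h => [ sub { $text .= shift }, 'dtext' ],
);

# Suppress these elements together with everything inside them.
# An <img> tag carries no text, so with only a text handler
# registered it contributes nothing and needs no special handling.
$parser->ignore_elements(qw(table script style));

$parser->parse($html);
$parser->eof;

# Tidy up whitespace left behind by the removed markup
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;
print "$text\n";
```

Because only a text handler is registered, no list of wanted tags is needed; the only list is the elements to skip.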

    Disclaimer: I've not done anything significant with HTML::Parser.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Ignoring specific html tags before parsing
by Anonymous Monk on Oct 07, 2013 at 06:19 UTC

    but could not find a way to specifically ignore certain tags.

    Well, XPath lets you select only what you want, so ignoring something is simple: don't select it to begin with.
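One possible reading of this advice, sketched with HTML::TreeBuilder::XPath: select every text node under the body except those that have a table ancestor. The XPath expression and the sample `$html` are assumptions for illustration, not something the poster supplied.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input
my $html = <<'HTML';
<html><body>
<h1>Title</h1>
<p>Keep this paragraph.</p>
<table><tr><td>Drop this cell.</td></tr></table>
<div>Keep this div <img src="x.png" alt="pic"></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# All text nodes in the body that are not inside a table.
# <img> elements have no text nodes, so they drop out on their own.
my @chunks = $tree->findvalues('//body//text()[not(ancestor::table)]');

my $text = join ' ', @chunks;
$text =~ s/\s+/ /g;          # collapse runs of whitespace
$text =~ s/^\s+|\s+$//g;
print "$text\n";

$tree->delete;               # free the tree when done
```

The predicate excludes by ancestor, so the tags that are wanted never have to be listed.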

      >> it's simple to do, simply don't select it to begin with

      In that case we need to list ALL the tags we're interested in. Won't this end up as a long list?

      HTML::Parser has a method ignore_tags() which can be used to ignore tags. I used it as below and tried to get the text, but it returned many nested arrays. I could not figure out how to get at the final extracted text in this "@array":

          my @array;
          my $p = HTML::Parser->new(
              api_version => 3,
              handlers    => { text => [\@array, "text"] },
          );
          $p->ignore_tags(qw(table img));
          $p->parse($page);
          print "Size of array=$#array\n";
          foreach my $aline (@array) {
              print $aline;
          }
          print "\n";

      Meanwhile, I found an alternative, but it seems to be quite a bit slower than what we could have achieved with HTML::Parser:

          my $link = 'somelinek';
          my $page = get($link) or die $!;
          my $stream = HTML::TokeParser->new(\$page);
          my $doparse = 1;    ## 0 means don't parse
          while (my $token = $stream->get_token) {
              if ($token->[0] eq 'S') {
                  if ($token->[1] eq 'table') {
                      $doparse = 0;
                  }
                  elsif ($token->[1] eq 'img') {
                      ;;
                  }
              }
              elsif ($token->[0] eq 'E' and $token->[1] eq 'table') {
                  $doparse = 1;
              }
              elsif ($token->[0] eq 'C') {
                  ;;
              }
              elsif ($token->[0] eq 'T' and $doparse eq 1) {    # text
                  # process the text in $token->[1]
                  # skip: empty lines, "&nbsp;"
                  if (defined ($token->[1])) {
                      $token->[1] =~ s/&nbsp;/ /ig;
                      $token->[1] =~ s/’/'/ig;
                      $token->[1] =~ s/&#14[7-8];/"/ig;
                      $token->[1] =~ s/—//ig;
                      $token->[1] =~ s/&amp;/&/ig;
                      $token->[1] =~ s/-{2,}//ig;
                      print "$token->[1]";
                  }
              }
          }

      This use of TokeParser gives a lot of broken text. Which would be the better way? Thanks
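On the nested arrays: when an HTML::Parser handler is an array reference, each event is pushed onto that array as its own array reference holding the requested arguments, so the text is in element 0 of each entry. Note also that `ignore_tags` suppresses only the tags themselves, and the text inside a table is still reported, whereas `ignore_elements` skips the content as well. A minimal sketch follows; the sample `$page` and the name `@events` are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample page
my $page = '<p>Intro &amp; summary</p>'
         . '<table><tr><td>cell</td></tr></table>'
         . '<p>Outro</p>';

my @events;
my $p = HTML::Parser->new(
    api_version => 3,
    # Each text event is pushed onto @events as [ $decoded_text ]
    handlers => { text => [ \@events, 'dtext' ] },
);

# ignore_elements drops the table *and* its content;
# ignore_tags would only drop the <table> tags themselves.
$p->ignore_elements(qw(table));

$p->parse($page);
$p->eof;

# Dereference each entry to reach the text itself
my $text = join ' ', map { $_->[0] } @events;
print scalar(@events), " text events\n";
print "$text\n";
```

Using "dtext" instead of "text" also returns the text with entities decoded, which may make hand-written entity substitutions unnecessary.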
        What is your actual goal?

        then in that case, we need to list ALL those tags we're interested in.

        Or you could select the ones you want to drop and remove them from the tree ([id://1052072]).
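A minimal sketch of the remove-from-the-tree approach, assuming HTML::TreeBuilder::XPath is already in use as in the original question. The sample `$html` is hypothetical and this is not the code from the linked node.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input
my $html = <<'HTML';
<html><body>
<p>Keep this paragraph.</p>
<table><tr><td>Drop this cell.</td></tr></table>
<p>Keep this too.<img src="x.png" alt="pic"></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Find every unwanted element and delete it, content included
for my $tag (qw(table img)) {
    $_->delete for $tree->findnodes("//$tag");
}

# as_text concatenates whatever text is left in the tree
my $text = $tree->as_text;
print "$text\n";

$tree->delete;    # free the tree when done
```

Here the list holds only the tags to discard, so it stays short however many other tags the page uses.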

Re: Ignoring specific html tags before parsing
by naChoZ (Curate) on Oct 09, 2013 at 23:35 UTC

    Another module worth mentioning is HTML::Scrubber. Very easy to use. I'm familiar with it because of RT's use of it which I've had to customize on occasion.
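A small sketch of HTML::Scrubber, using only its `allow` list and `scrub` method; the sample input is made up. One caveat for this thread's goal: Scrubber removes disallowed tags but keeps the text between them, so the text of table cells would still be present in the result.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Hypothetical sample input
my $html = '<p>Keep <b>this</b><img src="x.png"></p>'
         . '<table><tr><td>cell</td></tr></table>';

# Only the listed tags survive; every other tag is stripped
my $scrubber = HTML::Scrubber->new( allow => [qw(p b i)] );

my $clean = $scrubber->scrub($html);
print "$clean\n";
```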

    --
    Andy
