Ignoring specific html tags before parsing

ganeshPerlStarter has asked for the wisdom of the Perl Monks concerning the following question:

Re: Ignoring specific html tags before parsing by roboticus (Chancellor) on Oct 07, 2013 at 03:05 UTC
ganeshPerlStarter:

If you look at HTML::Parser, it has a couple of examples. In fact, the second one is very close to what you want. Since HTML::TreeBuilder builds on top of HTML::Parser, you should be able to tweak it to do what you want when you parse. Looking at the docs, it appears that the eg/hstrip example in the distribution can be coerced into doing what you're attempting to do.

Disclaimer: I've not done anything significant with HTML::Parser.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re: Ignoring specific html tags before parsing by Anonymous Monk on Oct 07, 2013 at 06:19 UTC
>> but could not find a way to specifically ignore certain tags.

Well, XPath lets you select only what you want, so if you want to ignore something, it's simple to do: simply don't select it to begin with.
Re^2: Ignoring specific html tags before parsing by ganeshPerlStarter (Novice) on Oct 07, 2013 at 06:55 UTC
>> its simple to do, simply don't select it to begin with

Then in that case, we need to list ALL the tags we're interested in. Won't this end up as a long list?

HTML::Parser has a method ignore_tags() which could be used to ignore tags. I used it as below and tried to get the text, but it returned many nested arrays. I could not figure out how to access the final extracted text from this "@array":

    my @array;
    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => { text => [\@array, "text"] },
    );
    $p->ignore_tags(qw(table img));
    $p->parse($page);
    print "Size of array=$#array\n";
    foreach my $aline (@array) {
        print $aline;
    }
    print "\n";

Meanwhile, I found an alternative, but it seems to be quite a bit slower than what we could have achieved with HTML::Parser.

    use LWP::Simple qw(get);    # assumption: get() is LWP::Simple's
    use HTML::TokeParser;

    my $link = 'somelinek';
    my $page = get($link) or die $!;
    my $stream = HTML::TokeParser->new(\$page);
    my $doparse = 1;    ## 0 means don't parse
    while (my $token = $stream->get_token) {
        if ($token->[0] eq 'S') {
            if ($token->[1] eq 'table') {
                $doparse = 0;    # stop collecting text inside <table>
            }
            elsif ($token->[1] eq 'img') {
                ;;    # nothing to do for <img>
            }
        }
        elsif ($token->[0] eq 'E' and $token->[1] eq 'table') {
            $doparse = 1;    # resume after </table>
        }
        elsif ($token->[0] eq 'C') {
            ;;    # skip comments
        }
        elsif ($token->[0] eq 'T' and $doparse == 1) {
            # text: process the text in $token->[1]
            # skip: empty lines, "&nbsp;"
            if (defined ($token->[1])) {
                $token->[1] =~ s/&nbsp;/ /ig;
                $token->[1] =~ s//'/ig;
                $token->[1] =~ s/&#14[7-8];/"/ig;
                $token->[1] =~ s///ig;
                $token->[1] =~ s/&amp;/&/ig;
                $token->[1] =~ s/-{2,}//ig;
                print $token->[1];
            }
        }
    }

The above use of TokeParser gives a lot of broken text. Which would be a better way?

Thanks
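For what it's worth, here is a minimal sketch of how the text could be pulled back out of @array in the HTML::Parser attempt above. When the handler is an array reference, HTML::Parser pushes one array reference per event, holding the argspec values, so each text chunk sits in element 0 of an inner array. Also note that, per the HTML::Parser docs, ignore_tags() only suppresses the start and end tag events, while ignore_elements() suppresses everything in between as well, which is probably what is wanted for table. The sample $page string and the use of "dtext" (decoded text, in place of hand-written entity substitutions) are assumptions for illustration, not from the original post.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input standing in for the fetched $page.
my $page = '<p>Keep &amp; this</p>'
         . '<table><tr><td>drop this</td></tr></table>'
         . '<p>and this<img src="x.png"></p>';

my @array;
my $p = HTML::Parser->new(
    api_version => 3,
    # "dtext" asks for text with entities such as &amp; already decoded.
    handlers    => { text => [ \@array, "dtext" ] },
);

# Suppress <table> ... </table> including everything inside it.
$p->ignore_elements(qw(table));

$p->parse($page);
$p->eof;    # flush any text still buffered by the parser

# Each entry of @array is an array ref holding the argspec values,
# so the text of each chunk is element 0.
my $text = join "", map { $_->[0] } @array;

print "Number of text chunks: ", scalar(@array), "\n";
print "$text\n";
```

The img tag needs no special handling here because it produces no text events of its own.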
Re^3: Ignoring specific html tags before parsing by Anonymous Monk on Oct 07, 2013 at 07:22 UTC
What is your actual goal?
Re^4: Ignoring specific html tags before parsing by Anonymous Monk on Oct 08, 2013 at 01:36 UTC
Re^5: Ignoring specific html tags before parsing by Anonymous Monk on Oct 08, 2013 at 02:21 UTC
Re^3: Ignoring specific html tags before parsing by Anonymous Monk on Oct 07, 2013 at 07:30 UTC
>> then in that case, we need to list ALL those tags we're interested in.

Or you could select the ones you want to ignore and [id://1052072|remove them from the tree].
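A rough sketch of the "remove them from the tree" idea, assuming plain HTML::TreeBuilder rather than an XPath-enabled variant. The sample HTML string is hypothetical; table and img are the tags mentioned earlier in the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; in the thread this would be the fetched page.
my $html = '<p>Keep this</p>'
         . '<table><tr><td>drop this</td></tr></table>'
         . '<p>and this<img src="x.png"></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every unwanted element and cut it (and its children) out of the tree.
$_->delete for $tree->find_by_tag_name('table', 'img');

# as_text() concatenates the text of whatever is left in the tree.
my $text = $tree->as_text;
print "$text\n";

# Free the tree explicitly when done.
$tree->delete;
```

This way only the unwanted tags have to be listed, not every tag of interest.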
Re^4: Ignoring specific html tags before parsing by Anonymous Monk on Oct 07, 2013 at 07:31 UTC
Re: Ignoring specific html tags before parsing by naChoZ (Curate) on Oct 09, 2013 at 23:35 UTC
Another module worth mentioning is HTML::Scrubber. Very easy to use. I'm familiar with it because of RT's use of it, which I've had to customize on occasion.

-- Andy

