How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>?

Replies are listed 'Best First'.
Re: How can I strip away some nested markup code in html by perl, like <SCRIPT> ?⭐ by chromatic (Archbishop) on Apr 03, 2000 at 07:27 UTC
Unless you're dealing with very simple HTML (either generated by a program or written by a beginner), you may find that these regex approaches have limited success. ender's is the best of them, as it is the least greedy. For any non-trivial HTML parsing, look to the CPAN modules HTML::Parser and HTML::TokeParser.
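To make the CPAN suggestion concrete, here is a minimal sketch using HTML::Parser (API version 3). The function name `strip_script` is made up for illustration; the idea is to echo every parsed event's original text back out, while asking the parser to suppress everything belonging to `script` elements. It assumes HTML::Parser is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return $html with all <script> elements removed.
sub strip_script {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Append the original source text of every event to the output.
        default_h   => [ sub { $out .= shift }, 'text' ],
    );

    # Suppress start tag, content, and end tag of script elements.
    $p->ignore_elements('script');

    $p->parse($html);
    $p->eof;    # flush any buffered text
    return $out;
}

print strip_script('<p>a</p><script type="text/javascript">x()</script><p>b</p>'), "\n";
```

Because the parser knows where tags, attributes, and comments begin and end, this should hold up on input where a hand-written regex goes wrong.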
Re: How can I strip away some nested markup code in html by perl, like <SCRIPT> ?⭐ by ender (Novice) on Mar 23, 2000 at 01:04 UTC
If you can get the whole page into one string, you can use: `s/<script>.*?<\/script>//igs;` which will eat everything between the <script> and </script> tags (and the tags themselves).
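Note that the pattern above matches only a bare <script> tag, while script tags in the wild usually carry attributes such as `src` or `type`. A hedged variation that tolerates attributes might look like the sketch below; `strip_script_regex` is an illustrative name, and the pattern still assumes no `>` appears inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: remove <script ...>...</script> blocks with a regex.
sub strip_script_regex {
    my ($html) = @_;
    # \b[^>]* allows attributes on the opening tag;
    # .*? is non-greedy so each block ends at its own closing tag;
    # /s lets . match newlines, /i ignores case, /g removes every block.
    $html =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;
    return $html;
}

print strip_script_regex('a<script src="x.js"></script>b<SCRIPT>y()</SCRIPT>c'), "\n";
```

This is still a regex working on text rather than a parser, so it remains a quick fix for tame input, not a general solution.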
Re: How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>? by Pedro Picasso (Sexton) on Oct 15, 2003 at 14:10 UTC
Let's say you have some HTML like this: `<b>I like</b> <i>squirrels!</i>.` You could use this: `$html =~ s/<[^>]+>([^<]*)<\/[^>]+>/$1/gs;` to turn it into this: `I like squirrels!.` {QandAEditors note: merlyn points out in a followup that the above regexp only works for simple HTML and cannot be counted on with real-life HTML. See the followup for details.}
•Re: Answer: How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>? by merlyn (Sage) on Oct 15, 2003 at 15:02 UTC
Sure, that works for simple HTML, but such a simple regex can fail on real-life HTML. For example: `<!-- > this is still the comment --> and some more text` In that case, "this is still the comment" would be left in the output when it shouldn't be. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.
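The failure can be reproduced with a small sketch. The tag-stripping pattern `<[^>]*>` used here is a generic stand-in chosen for illustration, not necessarily the exact regex from the earlier reply; it treats the first `>` after any `<` as the end of a tag, which is precisely what goes wrong inside a comment.

```perl
use strict;
use warnings;

my $html = '<!-- > this is still the comment --> and some more text';

# Naive approach: strip anything that looks like <...>.
# The match stops at the first '>', in the middle of the comment.
(my $naive = $html) =~ s/<[^>]*>//g;

# Slightly better: remove whole comments first, then the remaining tags.
(my $better = $html) =~ s/<!--.*?-->//gs;
$better =~ s/<[^>]*>//g;

print "naive:  [$naive]\n";
print "better: [$better]\n";
```

Removing comments first handles this particular input, but other constructs (a `>` inside a quoted attribute value, for instance) can trip it up in the same way, which is why a real parser is the safer choice.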
Re: How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>? by songahji (Friar) on May 11, 2005 at 17:58 UTC
If you have lynx (a web browser that works on simple text terminals), then call it: $text_only = `lynx -dump $filename`; Or, if you have Netscape, use its "Save as" option with the type set to "Text". This one works with tables.
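One caveat with the backtick form is that `$filename` passes through the shell, so spaces or shell metacharacters in the name can cause trouble. A minimal sketch that avoids the shell with a list-form pipe open follows; `html_to_text` and its optional command argument are hypothetical names for illustration, and it assumes a Unix-like system with lynx on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a text-dump command on a file and return its output.
# Defaults to "lynx -dump"; a different command can be passed for testing.
sub html_to_text {
    my ($filename, @cmd) = @_;
    @cmd = ('lynx', '-dump') unless @cmd;

    # List-form open: arguments go straight to the program, with no shell.
    open(my $fh, '-|', @cmd, $filename)
        or die "Cannot run $cmd[0]: $!";
    my $text = do { local $/; <$fh> };    # slurp all output
    close($fh)
        or die "$cmd[0] failed (exit status $?)";
    return $text;
}

# Usage (assumes page.html exists):
# my $text_only = html_to_text('page.html');
```

Keep in mind that this removes all markup, not just the <script> blocks, so it only fits when plain text is what you want.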

