
How can I use Perl to strip away some nested HTML markup, like <SCRIPT>?

Contributed by Anonymous Monk on Feb 07, 2000 at 12:15 UTC


Answer: How can I use Perl to strip away some nested HTML markup, like <SCRIPT>?
contributed by chromatic

Unless you're dealing with very simple HTML (either generated by a program or written by a beginner), you will likely find that the regex approaches in the other answers have limited success. ender's is the best of them, because its non-greedy match consumes the least.

For all non-trivial HTML parsing, look to CPAN modules: HTML::Parser and HTML::TokeParser.
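
As a rough illustration of the parser route, here is a minimal sketch that uses HTML::Parser (installed from CPAN, not part of core Perl) to drop <script> elements and pass everything else through unchanged. The sub name strip_script_elements and the sample page are made up for this example; treat it as one possible starting point rather than the canonical way to use the module.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Hypothetical helper: return $html with every <script> element removed.
sub strip_script_elements {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Copy every event through unchanged, using its original source text.
        default_h   => [ sub { $out .= $_[0] }, 'text' ],
    );

    # Suppress the start tag, the end tag, and everything in between.
    $parser->ignore_elements('script');

    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Sample input invented for this sketch.
my $page = '<p>Hi</p><script type="text/javascript">if (a < b) { go(); }</script><p>Bye</p>';
print strip_script_elements($page), "\n";
```

Because the parser tracks where the element starts and ends, attributes on the opening tag and "<" characters inside the script body should not confuse it the way they can confuse a regex.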

Answer: How can I use Perl to strip away some nested HTML markup, like <SCRIPT>?
contributed by ender

If you can get the whole page in one string, then you can use:

s/<script>.*?<\/script>//igs;

This removes everything between the <script> and </script> tags, along with the tags themselves.
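
One limitation worth noting: the pattern above matches only a bare <script> tag, so an opening tag that carries attributes (type, src, and so on) is skipped. The sketch below is a slightly more tolerant variant; the sub name strip_script_blocks and the sample page are invented for illustration. It is still a regex, so it can be fooled by unusual markup such as a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: remove <script ...>...</script> blocks,
# allowing attributes on the opening tag.
sub strip_script_blocks {
    my ($html) = @_;
    # /g = every block, /i = any letter case, /s = let . match newlines
    $html =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;
    return $html;
}

# Sample input invented for this sketch.
my $page = qq{<p>a</p><script type="text/javascript">\nvar x = 1;\n</script><p>b</p><SCRIPT>y()</SCRIPT>};
print strip_script_blocks($page), "\n";
```

The non-greedy .*? is what stops each match at the nearest closing tag, so text between two separate script blocks is kept.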
Answer: How can I use Perl to strip away some nested HTML markup, like <SCRIPT>?
contributed by Pedro Picasso

Let's say you have some html like this:

<b>I like</b> <i>squirrels!</i>.
You could use this:
$html =~ s/<[^>]*>([^<]*)<\/[^>]*>/$1/gs;
To turn it into this:
I like squirrels!.
{QandAEditors note: merlyn points out in a followup that the above regexp only works for simple HTML and cannot be relied upon for real-world HTML. See the followup for details.}
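
To make the limitation concrete, the following sketch runs the same substitution on three small inputs. The sub name strip_simple_tags and the inputs are invented for this illustration, and the cases shown are only examples of how simple patterns break, not a summary of the followup.

```perl
use strict;
use warnings;

# The substitution from the answer above, wrapped in a (hypothetical) sub.
sub strip_simple_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>([^<]*)<\/[^>]*>/$1/gs;
    return $html;
}

# Flat markup: both tag pairs are removed.
print strip_simple_tags('<b>I like</b> <i>squirrels!</i>'), "\n";

# Nested markup: only the inner pair is removed, the outer <b> pair survives.
print strip_simple_tags('<b><i>nested</i></b>'), "\n";

# Unpaired tag: nothing matches, so the <br> is left in place.
print strip_simple_tags('line one<br>line two'), "\n";
```

The pattern needs an opening tag, tag-free text, and a closing tag in that order, so anything nested or unpaired falls outside what a single pass can handle.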
Answer: How can I use Perl to strip away some nested HTML markup, like <SCRIPT>?
contributed by songahji

If you have lynx (a text-mode web browser that runs on simple text terminals), you can call it to render the page as plain text:

$text_only = `lynx -dump $filename`;
OR

If you have Netscape, use its "Save as" option with the type set to "Text". This one works with tables.
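
A caution on the lynx call: backticks hand the whole command string to the shell, so a filename containing spaces or shell metacharacters can break the command or run something unintended. The sketch below shows one way to avoid that with the list form of a piped open; the sub names run_and_capture and html_file_to_text are invented for this example, and it assumes lynx is on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without going through the shell
# and return everything it writes to standard output.
sub run_and_capture {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                       # slurp mode: read all output at once
    my $output = <$fh>;
    close($fh) or die "$cmd[0] exited with status $?";
    return defined $output ? $output : '';
}

# Hypothetical wrapper around the lynx call from the answer above.
# Assumes a lynx binary is installed and found on the PATH.
sub html_file_to_text {
    my ($filename) = @_;
    return run_and_capture('lynx', '-dump', $filename);
}
```

Usage would look like my $text_only = html_file_to_text($filename); and the filename is passed to lynx as a single argument regardless of what characters it contains.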
