http://www.perlmonks.org?node_id=607168

valavanp has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I need to read a file and insert closing tags for any opened tags that are not closed. How should I approach this? Ideas and thoughts will be much appreciated. Thanks, monks, for your valuable suggestions.

Re: read a file and insert closing tags if not present
by GrandFather (Saint) on Mar 29, 2007 at 07:17 UTC

    You are most likely looking for modules like HTML::Tidy, HTML::TreeBuilder or XML::Twig.

    If you show us a small sample of the sort of data you have to deal with and the code you have tried we may be able to give more specific answers.


    DWIM is Perl's answer to Gödel
      Hi GrandFather, this is the code I tried:

      require HTML::TokeParser;
      $p = HTML::TokeParser->new("output.xml") || die "Can't open: $!";
      $p->empty_element_tags(1);
      open(FH, "output.xml");
      print FH $p;
      close FH;

      output.xml:

      <greeting class="simple">Hello, world!

      The above is a sample file in which I tried to insert the closing tag for greeting. My actual file contains 500 lines of tagged text. For example, it has a tag named <to> that is not closed, and I have to insert the closing tag. Thanks for your suggestion.

        HTML::TreeBuilder handles that simple case:

        use strict;
        use warnings;
        use HTML::TreeBuilder;

        my $sgml = <<SGML;
        <greeting class="simple">Hello, world!
        SGML

        my $root = HTML::TreeBuilder->new ();
        $root->ignore_unknown (0);
        $root->parse ($sgml);
        print $root->guts (0)->as_XML ();

        Prints:

        <greeting class="simple">Hello, world!</greeting>

        although I'd not guarantee it will accept everything a real SGML document may contain.


        DWIM is Perl's answer to Gödel

        You can guess sometimes, but there is no way of knowing where the right place for it is.

        In the example <p> foo <p> bar, you can see where the </p>'s should go, because you can't nest p tags. But if you have <span style="rly">Oh, rly<span style="ya">ya, rly, there is no real way of knowing where the </span>'s should go, because they can legally be nested.

        You'll most likely have to write rules for how (and where) to end each tag, so that you don't mess up the nesting of things (like finding your whole document inside an <a href="foo"> or something).
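
        A minimal sketch of that rule-based idea, using only core Perl. The names close_tags and %no_self_nest are hypothetical, made up for this illustration and not taken from any module. It keeps a stack of open tags, applies one rule (a tag listed in %no_self_nest ends when the same tag starts again), and closes whatever is still open at the end of the input. For tags that may legally nest, closing at the end is only a guess, as described above.

```perl
use strict;
use warnings;

# Hypothetical rule table: tags that may not contain themselves.
my %no_self_nest = map { $_ => 1 } qw(p li);

sub close_tags {
    my ($text) = @_;
    my @open;        # stack of currently open tag names
    my $out = '';

    # Emit end tags for everything down to and including $name.
    my $close_through = sub {
        my ($name) = @_;
        while (@open) {
            my $top = pop @open;
            $out .= "</$top>";
            last if $top eq $name;
        }
    };

    # The capturing group makes split return the tags as well as the text.
    for my $tok (split /(<[^<>]+>)/, $text) {
        if ($tok =~ m{^</\s*([^\s>]+)}) {
            # End tag: close any children first; drop it if nothing matches.
            my $name = $1;
            $close_through->($name) if grep { $_ eq $name } @open;
        }
        elsif ($tok =~ m{^<([A-Za-z][^\s/>]*)}) {
            # Start tag: names are compared case-sensitively, as in XML.
            my $name = $1;
            $close_through->($name)
                if $no_self_nest{$name} && grep { $_ eq $name } @open;
            $out .= $tok;
            push @open, $name unless $tok =~ m{/>$};   # skip <empty/> tags
        }
        else {
            $out .= $tok;    # text, comments, declarations
        }
    }

    # Anything still open at end of input is closed, innermost first.
    $out .= "</$_>" for reverse @open;
    return $out;
}

print close_tags('<p> foo <p> bar'), "\n";
# <p> foo </p><p> bar</p>
print close_tags('<greeting class="simple">Hello, world!'), "\n";
# <greeting class="simple">Hello, world!</greeting>
```

        With the span example, both </span>'s end up at the very end of the text, which may or may not be what the author meant. A real document would need more entries in the rule table, or one of the parsers recommended elsewhere in this thread.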

        @_=qw; ask f00li5h to appear and remain for a moment of pretend better than a lifetime;;s;;@_[map hex,split'',B204316D8C2A4516DE];;y/05/os/&print;
Re: read a file and insert closing tags if not present
by f00li5h (Chaplain) on Mar 29, 2007 at 07:01 UTC

    Exactly what type of file is it? I would presume some sort of markup language.

    What are you hoping to get back from this script, and what is your overall goal?

    What code do you have so far? Can we see it, and perhaps offer pointers from there?

    Which modules have you investigated? I hear there are some really good modules for parsing CSV, HTML and all manner of other things.

    @_=qw; ask f00li5h to appear and remain for a moment of pretend better than a lifetime;;s;;@_[map hex,split'',B204316D8C2A4516DE];;y/05/os/&print;
Re: read a file and insert closing tags if not present
by planetscape (Chancellor) on Mar 29, 2007 at 14:34 UTC

    I second GrandFather's recommendation to have a closer look at HTML Tidy. As its documentation states:

    • Missing or mismatched end tags are detected and corrected
    • End tags in the wrong order are corrected

    HTH,

    planetscape
Re: read a file and insert closing tags if not present
by shigetsu (Hermit) on Mar 29, 2007 at 07:01 UTC

    May I ask whether you have any code so far?

    Update: Missed f00li5h's post.

      The file is an SGML file. I don't know how to go about finding the closing tags in that file.
Re: read a file and insert closing tags if not present
by gopalr (Priest) on Mar 29, 2007 at 10:36 UTC

    Hi Valavan,

    my $sgml = <<SGML;
    <html> <greeting class="simple">Hello, world!<head>heading</head> </html>
    SGML

    while ($sgml =~ s#(<)([^/<>\s]+)((?:\s[^/<>]+)?>)([^<>]+)(<[^/<>]+>)#$1$2$3$4$1\/$2>$5#) {}

    print "\n\n";
    print "\nOutput:\n$sgml\n";
    print "\n\n";

    Input:

    <html> <greeting class="simple">Hello, world!<head>heading</head> </html>

    Output:

    <html> <greeting class="simple">Hello, world!</greeting><head>heading</head> </html>