John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

Many years ago, I wrote a small Perl program that scans a hand-written HTML file and generates/updates a table of contents at the top of the file, with links to the various H\d tags.

That was fairly crude: it was line-oriented and required that the header tags and matching names be just so. But it did recognise the table of contents it had generated before and replaced it with a refreshed copy.

I'd like something modern that does this. A proper HTML parser would take any HTML without relying on special formatting conventions or restrictions. The generated table of contents can have fancy dynamic-expanding/collapsing features.

Someone has got to have done this already! Where can I find it?


Re: HTML table-of-contents generator
by gjb (Vicar) on Jul 01, 2003 at 17:56 UTC

    A CPAN search for 'html toc' gives a number of hits, and HTML::GenToc seems promising.

    Hope this helps, -gjb-

      There also exists HTML::Toc.


        The problem is that HTML::Toc generates invalid HTML, wrapping the block-level H1 inside an inline A element:
        <a name=h-1><h1>Header One</h1></a>
        HTML::GenToc puts the H1 and A tags in the correct order, but doesn't seem to understand (at least, it's not in the docs) the preferred approach, which is the one my existing HTML already uses:
        <H1 id="h-1">Header One</H1>
        I'd want any inserted anchors to use this form, and the tool must certainly recognise it in existing markup!


Re: HTML table-of-contents generator
by tcf22 (Priest) on Jul 01, 2003 at 17:56 UTC
    Don't know of any modules that build Table of Contents pages for you, but for parsing the HTML, try HTML::Parser.
      I would need two passes: the first to identify any headers and possibly modify them to add an id, and the second to insert the generated TOC in the correct position (removing the old one).

      HTML::Parser is pretty primitive, but it looks like enough to spot the header tags easily. But what about modifying the HTML? It needs to print out everything it reads, with the same formatting.
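
A minimal sketch of the first pass, assuming HTML::Parser's version 3 API, where the "text" argspec hands each handler the original source text of the event. Echoing that text back should reproduce the input formatting, and the only modification is an id appended to header tags that lack one. The helper name add_heading_ids and the "h-N" numbering scheme are illustrative assumptions, not part of any module mentioned in this thread.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the rewritten HTML plus a list of
# [level, id, heading text] entries for building a TOC later.
sub add_heading_ids {
    my ($html) = @_;
    my $out = '';
    my @toc;
    my $current;    # entry for the heading we are inside, if any
    my $n = 0;      # running heading count, used for generated ids

    my $p = HTML::Parser->new(
        api_version => 3,
        # Comments, declarations, etc.: echo the source text unchanged.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if ($tag =~ /^h([1-6])$/) {
                my $level = $1;
                $n++;
                my $id = $attr->{id};
                if (!defined $id) {
                    # Assumed id scheme; no uniqueness check is done.
                    $id = "h-$n";
                    $text =~ s/>\z/ id="$id">/;
                }
                $current = [ $level, $id, '' ];
                push @toc, $current;
            }
            $out .= $text;
        }, 'tagname, attr, text' ],
        text_h => [ sub {
            my ($text, $dtext) = @_;
            # dtext has entities decoded; collect it as the TOC label.
            $current->[2] .= $dtext if $current;
            $out .= $text;
        }, 'text, dtext' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            undef $current if $tag =~ /^h[1-6]$/;
            $out .= $text;
        }, 'tagname, text' ],
    );
    $p->parse($html);
    $p->eof;
    return ($out, \@toc);
}
```

Calling add_heading_ids($html) in list context gives the new document and the heading list. A real tool would also want to check that a generated id does not clash with one already present elsewhere in the file.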

      I also looked at HTML::TreeBuilder, but it can't reproduce the formatting of the input; it re-generates the text in its own layout.
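
One way to sidestep re-serialisation in the second pass is to splice the generated TOC into the text directly, between a pair of marker comments, so the rest of the file is left byte-for-byte alone. The sketch below uses only core Perl. The marker strings, the function names, and the flat list with one class per heading level are all assumptions made for illustration; the entries are taken to be [level, id, text] triples.

```perl
use strict;
use warnings;

# Assumed marker comments; any unique pair would do.
my $TOC_BEGIN = '<!-- TOC:begin -->';
my $TOC_END   = '<!-- TOC:end -->';

sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Render [level, id, text] entries as a flat list whose items link
# to the id on each heading (the <H1 id="..."> form).
sub render_toc {
    my ($entries) = @_;
    my @lines = ('<ul class="toc">');
    for my $e (@$entries) {
        my ($level, $id, $text) = @$e;
        push @lines, sprintf '%s<li class="toc-h%d"><a href="#%s">%s</a></li>',
            '  ' x $level, $level, escape_html($id), escape_html($text);
    }
    push @lines, '</ul>';
    return join "\n", @lines;
}

# Replace the previously generated block if present; otherwise
# insert a new one right after the opening <body> tag.
sub splice_toc {
    my ($html, $toc) = @_;
    my $block = "$TOC_BEGIN\n$toc\n$TOC_END";
    return $html if $html =~ s/\Q$TOC_BEGIN\E.*?\Q$TOC_END\E/$block/s;
    $html =~ s/(<body\b[^>]*>)/$1\n$block/i
        or die "no existing TOC markers and no <body> tag found\n";
    return $html;
}
```

Because the old block is recognised by its markers, running splice_toc twice with the same TOC should give the same file as running it once, which is the "refresh" behaviour the original script had.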

Re: HTML table-of-contents generator
by chunlou (Curate) on Jul 01, 2003 at 20:41 UTC
    HTML::Toc is probably what you need, but I'll throw in MS Word as an idea. You can open an HTML file in Word and have it generate a TOC via the "Insert/Index and Tables..." menu option. It creates a TOC whose hierarchy follows the header tags. Afterwards, though, Word will totally mess up the HTML, so this is only useful for a quick one-time job or for getting a quick look and feel.