HTML document modification

by rob_au (Abbot)
on May 28, 2004 at 01:32 UTC (#357104)
rob_au has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for some feedback on the best way to approach a particular function which I am seeking to implement. I need to modify some existing HTML documents to add a small portion of HTML immediately before the </body> tag within these documents, but do not want to otherwise modify the layout or structure of the HTML document.

The Apache::Footer example module outlined in Lincoln Stein's Cool Tricks With Perl and Apache (Perl Conference 1998) uses a simple regular expression to insert HTML content immediately before the </body> tag. I am, however, concerned about applying this method across the wide range of HTML documents involved (which may include example HTML within <pre></pre> tags).

The alternative approach which I have considered is using one of the HTML::Parser modules to step through the HTML tokens, but I am concerned that this would modify the layout and structure of the HTML documents.

Any other suggestions or approaches which I should consider?

 

perl -le "print unpack'N', pack'B32', '00000000000000000000001011011101'"

Re: HTML document modification
by davido (Archbishop) on May 28, 2004 at 01:53 UTC
    Couldn't you just use simple substitution, for such a minimal task? I mean I am the first (or one of them) to decry using a regexp for HTML, but this situation may not warrant more.

    my $newstuff = "<p>New HTML here!</p>\n";

    open my $in,  '<', "infile.html"   or die $!;
    open my $out, '>', "tempfile.html" or die $!;

    while ( my $line = <$in> ) {
        next unless $line =~ m!<\s*/body\s*>!i;
        # Insert the new HTML just before the closing body tag.
        $line =~ s!(<\s*/body\s*>)!$newstuff$1!i;
    }
    continue {
        # Runs for every line, including those skipped by "next".
        print $out $line;
    }

    close $out or die $!;
    close $in  or die $!;
    rename "tempfile.html", "infile.html" or die $!;

    ...untested, but it seems about right...


    Dave

      The issue with this approach is that the HTML document may include example HTML, including <body></body> tags, within <pre></pre> tags. The regular expression which BrowserUk has provided appears to be somewhat more robust, although I suspect that I will follow his suggestion to try reading the file backwards for the first </body> tag.

       

      perl -le "print unpack'N', pack'B32', '00000000000000000000001011011110'"

        Uh, that would be invalid HTML, wouldn't it? Any examples within the document would have to be escaped.
Re: HTML document modification
by BrowserUk (Pope) on May 28, 2004 at 02:04 UTC

    Maybe you could read the file backwards and replace only the last occurrence of </body> (which is the first one you meet when reading backwards).

    If your html is well formed that should be fairly foolproof.

    Perhaps, rather than reading backwards line by line, you could read the last couple of hundred bytes and then use the regex

    s[(</body>)(?=(?:\s*</html>)?\s*\Z)][$insert$1]i;

    By only replacing the /body tag if there is only whitespace and an (optional) /html tag between it and the EOF, you'd be pretty certain of correctness assuming reasonably well-formed html. That wouldn't handle comments, but they are (probably) fairly rare at that point in the html?

    If you raised an error in the event that the regex didn't match, any oddities could be fixed up manually.
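    A minimal sketch of the steps above, under some assumptions that are not from the thread: the helper name insert_before_final_body and its $tail_len parameter are made up for illustration, and the document is handled as an in-memory string rather than by seeking to the end of a file. Only the tail is inspected, and the helper dies when no safely placed </body> is found so that the file can be fixed by hand.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): insert $insert before a
# closing </body> only when nothing but whitespace and an optional
# </html> follows it. Returns the new text, or dies if there is no match.
sub insert_before_final_body {
    my ($html, $insert, $tail_len) = @_;
    $tail_len = 200 unless defined $tail_len;   # "a couple of hundred bytes"

    # Split into a head that is never touched and a short tail to inspect.
    my $cut  = length($html) > $tail_len ? length($html) - $tail_len : 0;
    my $head = substr($html, 0, $cut);
    my $tail = substr($html, $cut);

    $tail =~ s[(</body>)(?=(?:\s*</html>)?\s*\z)][$insert$1]i
        or die "no trailing </body> found; fix this file by hand\n";

    return $head . $tail;
}

# Example page with an escaped </body> inside <pre>, which is left alone.
my $page = "<html><body><pre>&lt;/body&gt;</pre>\n<p>Hi</p>\n</body>\n</html>\n";
print insert_before_final_body($page, "<p>Footer</p>\n");
```

    Wrapping the call in eval lets a batch job collect the files that need manual attention instead of stopping at the first one.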


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: HTML document modification
by ryantate (Friar) on May 28, 2004 at 21:43 UTC

    It might be simplest to just use quantifier greediness to your advantage. You wouldn't have to read the file backward or use an awkward lookahead, which requires the presence of a closing HTML tag in already suspect documents.

    The example you link to looks like this:

    s!(</BODY>)!$footer$1!oi;

    If you take the following, similar regex, and run it against the HTML document (as a whole), you should match only the last closing body tag, even if there are multiple closing body tags in the document. *This is untested*:

    s!(.+)(</BODY>)!$1$footer$2!ois;

    The greedy plus sign ("+") and the match-all dot (".") will eat up all text in the document, then backtrack from the end of the file to allow the closing body tag to match.

    This approach has the advantage of being very simple to implement. The disadvantages are that you have to slurp the whole HTML document into memory, and there is likely significant overhead associated with swapping text into and out of the first match ($1, which holds the results of "(.+)") as the regex engine backtracks to the closing body tag.
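    To see the greedy match pick the last closing body tag, here is a small sketch run against an in-memory document. The helper name add_footer_greedy and the sample markup are assumptions made for illustration. The /o flag is left out because the pattern itself interpolates no variables, so the flag has no effect here.

```perl
use strict;
use warnings;

# Hypothetical helper: insert $footer before the LAST </body> in $html.
sub add_footer_greedy {
    my ($html, $footer) = @_;
    # /s lets "." match newlines, so the greedy .+ swallows the whole
    # document and then backtracks to the final </body>.
    $html =~ s!(.+)(</body>)!$1$footer$2!is;
    return $html;
}

# Sample document with a stray </body> inside <pre> before the real one.
my $doc = "<body><pre>&lt;body&gt;</body></pre>\n<p>text</p>\n</body>\n";
print add_footer_greedy($doc, "<hr>\n");
```

    If no closing body tag is present, the substitution fails and the document comes back unchanged, so a caller that cares should compare the result with the input or check the return value of s/// directly.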
