
Converting HTML tags into uppercase using Perl

by steve_g50 (Initiate)
on Nov 29, 2005 at 10:49 UTC (#512568=perlquestion)

steve_g50 has asked for the wisdom of the Perl Monks concerning the following question:

Does anybody have a Perl script that asks the user for an HTML file and, once one is entered, converts all HTML tags in the file to uppercase? I am relatively new to Perl, and would like to see Perl interacting with other files and programs.

Re: Converting HTML tags into uppercase using Perl
by davorg (Chancellor) on Nov 29, 2005 at 11:04 UTC

    It would be really simple to knock up something that did this using HTML::Parser, but it's perhaps worth pointing out that if you are at all interested in XHTML compatibility then valid XHTML tags are all lower case.

    Update: Here's a basic HTML::Parser solution. It can almost certainly be improved and/or simplified.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser;

    my $p = HTML::Parser->new(start_h => [\&start, 'tagname, attr, attrseq'],
                              end_h   => [\&end, 'tagname'],
                              text_h  => [\&text, 'text']);

    $p->parse_file(shift);

    sub start {
      my ($name, $attr, $attrseq) = @_;

      print '<' . uc($name);
      if (keys %$attr) {
        foreach (@$attrseq) {
          print ' ' . uc($_) . '="' . $attr->{$_} . '"';
        }
      }
      print '>';
    }

    sub end {
      print '</' . uc($_[0]) . '>';
    }

    sub text {
      print $_[0];
    }
    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      I've tried this, but I can't get it to register the filename after I've entered it. Any ideas?
      #!/usr/bin/perl
      use warnings;
      use HTML::Parser;

      print("Enter an html file (with either a .html or .htm extension): ");
      $file=<STDIN>;
      my $file = $ARGV[0];
      unless ($file) {
          print ("No filename given\n");
          exit;
      }

      my $new;
      my $p = HTML::Parser->new(
          start_h   => [ \&start_h, 'tagname, text' ],
          end_h     => [ \&end_h, 'tagname, text' ],
          default_h => [ sub { $new .= shift }, 'text' ],
      );
      $p->parse_file($file);

      # Rename the old file
      my $newfile = $file.'.old';
      rename($file, $newfile) or die "Can't rename $file: $!";

      # Write the new text to the old filename
      open my $fh, ">", $file or die "Can't create new file: $!";
      print $fh $new;
      close $fh;

      sub start_h {
          my($tag, $text) = @_;
          my $uc = uc $tag;
          $text =~ s/$tag/$uc/;
          $new .= $text;
      }

      sub end_h {
          my($tag, $text) = @_;
          my $uc = uc $tag;
          $text =~ s/$tag/$uc/;
          $new .= $text;
      }

        $file=<STDIN>;

        my $file = $ARGV[0];

        This looks pretty confused to me. You read the filename from STDIN into a package variable called $file (incidentally, you don't chomp that value so it still has a newline character on the end). You then ignore that value and create a new, lexical, variable also called $file and into that you copy the value of the first command line argument. You don't say how you call the program, but if you don't give it any command line arguments then that will be 'undef'. You then ignore the package variable (which has the correct value - albeit with an extra newline) and continue to use the lexical value which (probably) contains 'undef'.

        So, no, it almost certainly won't do what you want :)

        This is a good example of why you should always have use strict in your programs.

        You probably want to write that code something like this (untested):

        # check to see if you have a command line argument
        my $file = $ARGV[0];

        # if not, or if it's not an HTML file, then prompt for one
        until ($file && ($file =~ /\.html?$/i)) {
          print('Enter an html file (with either a .html or .htm extension): ');
          $file=<STDIN>;
          chomp $file;
        }
        --
        <http://dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg
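The point about use strict can be checked directly. The following is a minimal sketch, not part of the original reply: it compiles the two problem lines inside a string eval (so the failure can be caught and printed) and shows that strict refuses the undeclared $file. The file name "page.html" is made up for illustration.

```perl
use strict;
use warnings;

# Compile the two problem lines in a string eval so the failure can be
# caught and printed instead of aborting this script.
my $compiled = eval q{
    $file = "page.html\n";    # no "my": strict rejects the undeclared variable
    my $file = $ARGV[0];      # a second, unrelated lexical with the same name
    1;
};

print $compiled
    ? "compiled without complaint\n"
    : "strict refused to compile it: $@";
```

Without use strict the same two lines compile silently, which is how the value read from STDIN ended up in a variable that nothing else ever looked at.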

Re: Converting HTML tags into uppercase using Perl
by holli (Abbot) on Nov 29, 2005 at 11:18 UTC
    The following uses HTML::TokeParser and should give you a starting point:
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( "file.html" );

    while ( my $t = $p->get_token )
    {
        #forward comments, text and declarations
        if ( $t->[0] =~ /[CDT]/ )
        {
            print $t->[1];
        }
        #uppercase start tags
        elsif ( $t->[0] =~ /S/ )
        {
            print "<", uc($t->[1]), " ", join (" ", map { uc($_) . '="' . $t->[2]->{$_} . '"' } @{$t->[3]}), ">";
        }
        #uppercase end tag
        elsif ( $t->[0] =~ /E/ )
        {
            print uc($t->[2]);
        }
        #forward processing instruction
        elsif ( $t->[0] =~ /PI/ )
        {
            print $t->[2];
        }
    }


    holli, /regexed monk/
Re: Converting HTML tags into uppercase using Perl
by planetscape (Chancellor) on Nov 30, 2005 at 01:40 UTC

    Also see HTML Tidy's -upper directive.

    HTH,

    planetscape
OT Re: Converting HTML tags into uppercase using Perl
by ww (Archbishop) on Nov 29, 2005 at 12:24 UTC

    ...and, while uppercase tags are allowed under html 4.01, they are NOT allowed in xhtml (not "xml", as first posted. Slap ww upside the head!)... so if this is other than homework, steve_g50 may wish to learn a bit more about .html as well as about perl.

    Update: Grinder is, of course, correct both re xml and re need to provide cites, and Fletch, thanks! Your cite is bang_on.

    Moral (and message to self): ensure caffeine levels are within normal operating range and put brain in gear before typing.

      uppercase tags [...] are NOT allowed in xml

      ww may wish to learn more about XML, or at least be able to quote the specification chapter and verse in order to back up such a claim. I've been doing XML for years (and SGML before that) and I've never heard of such nonsense.

      A start-tag is a Name, and a Name is one or more Letters (more or less, ignoring namespace issues), and a Letter may be drawn from many, many things, including, but not limited to, uppercase and lowercase letters.

      See the section on logical structures in the XML specification for more information.

      Update: my bad, I did ponder how ww could have come up with such an outlandish idea (because his/her advice is spot-on in general), and I failed to make the connection to XHTML. I just wanted to quash the meme before it got any further.

      • another intruder with the mooring in the heart of the Perl

        I think he misspoke and meant "XHTML" rather than XML. While you are correct that XML allows upper-, lower-, and mixed-case tag names, the XHTML spec does specifically require lowercase:

        4.2. Element and attribute names must be in lower case

        XHTML documents must use lower case for all HTML element and attribute names. This difference is necessary because XML is case-sensitive e.g. <li> and <LI> are different tags.

        http://www.w3.org/TR/xhtml1/#h-4.2

Re: Converting HTML tags into uppercase using Perl
by mlh2003 (Scribe) on Nov 30, 2005 at 12:01 UTC
    Please don't post on two separate forums. You will be darting between both and getting more confused about a suitable solution to your problem - particularly if both threads take different approaches. Admittedly there are common ideas in both, but you're best to stick with one and run with that.
    _______
    Code is untested unless explicitly stated
    mlh2003
Re: Converting HTML tags into uppercase using Perl
by Samy_rio (Vicar) on Nov 29, 2005 at 11:03 UTC

    Hi steve_g50, Try this,

    #!/usr/bin/perl -w
    use strict;

    local $/;
    open(INPUT, "input.html") || die("$!");
    open(OUTPUT, ">output.html") || die ("$!");
    my $txt = <INPUT>;
    $txt=~s/<([^> ]*)([^>]*>)/"<".uc($1)."$2"/egsi;   #Updated: only element names except attributes
    print OUTPUT $txt;

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

      This script of course breaks for the case of an embedded ">" sign in the value, and it uppercases all values too, both of which will break the HTML file:

      <html>
      <img src="a_greater_b.gif" alt="a > b" />
      <img src="a_smaller_b.gif" alt="a < b"/>
      </html>

      These niggles are the reason why it is always recommended to avoid parsing HTML with regular expressions.

      Update: Rearranged HTML to be a test case for the second problem as well.

      Thanks Samy_rio, I'll see what I can do with your program. Thank you to everyone else for your help. Keep replying if you can still help with my problem. Merry Christmas to all of you who celebrate it.
      This works very well, thanks. But how do I get the program to ASK for a .htm or .html file, then change the tags in the given file to uppercase, and THEN save the new file using the OPEN function? Thanks again.

        It doesn't work very well for all of the reasons that Corion listed. It will break badly on various (common) types of HTML. Please look at using a solution that uses a real HTML parser.

        --
        <http://dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg
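As a rough sketch of how the parser-based advice could be combined with the "save the new file using open" part of the question: the code below assumes HTML::Parser is installed and uppercases element names only, leaving attributes untouched (the question did not say what should happen to them). The helper names uppercase_tag_names and convert_file are invented for this example and appear nowhere else in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return $html with element names uppercased. Attributes, text, comments
# and declarations are copied through exactly as they appeared.
sub uppercase_tag_names {
    my ($html) = @_;
    my $out = '';

    # 'tagname' is the lowercased element name; 'text' is the original
    # source text of the tag, so only the name at its start is rewritten.
    my $retag = sub {
        my ($tagname, $text) = @_;
        $text =~ s/^(<\/?)\Q$tagname\E/$1\U$tagname/i;
        $out .= $text;
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $retag, 'tagname, text' ],
        end_h       => [ $retag, 'tagname, text' ],
        default_h   => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

# Read $src, convert it, and save the result to $dst using open.
sub convert_file {
    my ($src, $dst) = @_;

    open my $in, '<', $src or die "Can't read $src: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open my $fh, '>', $dst or die "Can't create $dst: $!";
    print $fh uppercase_tag_names($html);
    close $fh or die "Can't write $dst: $!";
}

# Quick check on one of the awkward tags from Corion's test case
print uppercase_tag_names('<img src="a_greater_b.gif" alt="a > b" />'), "\n";
```

With the prompt loop davorg posted earlier in the thread supplying $file, a call such as convert_file($file, "$file.new") would write the converted copy next to the original; the .new suffix is only an assumption of this sketch.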

Re: Converting HTML tags into uppercase using Perl
by kulls (Hermit) on Nov 29, 2005 at 13:46 UTC
    Basically,
    why do you need this? May I know the details? That way we can help solve the problem in a better way.
    -kulls
Re: Converting HTML tags into uppercase using Perl
by inman (Curate) on Nov 29, 2005 at 11:21 UTC
    A valid HTML tag starts with a < followed by the name of the tag. A / character is also allowed following the < to indicate the closing tag. Whitespace can also be used in the tag to separate tokens.

    The code below finds and replaces the tag names into upper case.

    while (<>) {
        s/(<\s*\/?\s*)(\w+)/$1\U$2/g;
        print;
    }

      See, this is why you should never try to parse arbitrary HTML with regular expressions. Your regex doesn't handle a number of very common occurrences. The first thing that springs to mind is tags with attributes - the tag name will be upper-cased, but the attribute names will be left untouched. The original poster was unclear as to what should be done in those circumstances.

      Also can you be sure that every < character in the document starts a tag? What if it was in a CDATA section?

      All in all, I think it's far better to use an HTML parser. They are there to be used, so why not use them?

      --
      <http://dave.org.uk>

      "The first rule of Perl club is you do not talk about Perl club."
      -- Chip Salzenberg

        I figured that this was a homework question anyway and so a reasonable bit of explanation would allow the student to get away with the numerous variations that exist in real HTML. The OP wants to uppercase his tags. He does not mention attributes so I have left it for him to look at.

        A CDATA section is not defined as an HTML tag by the HTML 4 DTD, but a <script> tag is, and it could contain conditional statements (e.g. start < end) that are matched by the regex. Tackling these issues is also something for the guy to look at.
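The <script> case mentioned above is easy to reproduce. This is a small sketch that wraps inman's substitution in a helper (the name upcase_with_regex and the sample script text are made up for the example) and runs it over a script element containing a comparison:

```perl
use strict;
use warnings;

# inman's substitution from this thread, wrapped in a helper for the demo
sub upcase_with_regex {
    my ($html) = @_;
    $html =~ s/(<\s*\/?\s*)(\w+)/$1\U$2/g;
    return $html;
}

my $in  = '<script>if (start < end) { go(); }</script>';
my $out = upcase_with_regex($in);

print "$out\n";
# prints: <SCRIPT>if (start < END) { go(); }</SCRIPT>
```

The tag names are uppercased as intended, but so is the variable after the less-than sign, which changes the meaning of the script.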
