PerlMonks  

Regexps to change HTML tags/attributes

by Tricky (Sexton)
on Aug 27, 2003 at 15:49 UTC (#287071)

Tricky has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I have some simple code to read an HTML file into an array, remove the image and anchor tags, and write the changes back to the source file on my hard drive. So far, so good.

1. Is there a better way to initialise the variable containing the pattern? Should it be a string literal of the tag I want to remove or change? The code is below, for your perusal.

Once I've read the file in, I'd like to check for the presence of the tags and, if they are there, call the subs which remove the tags/attributes. One of the brothers, a little while ago, pointed out that if I tested the pattern variables as they are at the moment, the test would always return 'true', because the values of the pattern variables are non-empty strings. Am I doing this right?

2. How may I go about testing for the presence of tags/attributes without falling into this pitfall?
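A minimal sketch of the pitfall in question 2, not a drop-in fix for the program below. A scalar that holds a pattern is true simply because it holds something, so the test has to be a match against the data. The sample lines and the names `@lines`, `$img_re`, `$always_true` and `@has_img` are made up for illustration; `qr//` is one way to answer question 1, since it compiles the pattern once and keeps the `/i` flag with it.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from the HTML file.
my @lines = (
    qq{<p>No images here</p>\n},
    qq{<p><IMG SRC="cat.png"> a cat</p>\n},
);

# qr// compiles the pattern once and carries the /i flag with it.
my $img_re = qr/<IMG\s+[^>]*>/i;

# Wrong test: this only asks whether the variable holds a true value,
# which it always does, whatever the HTML contains.
my $always_true = $img_re ? 1 : 0;

# Right test: match the pattern against each line of the data.
my @has_img;
push @has_img, ($_ =~ $img_re ? 1 : 0) for @lines;

# Only run the substitution on lines where the tag is actually present.
for my $line (@lines) {
    $line =~ s/$img_re//g if $line =~ $img_re;
}

print "always_true=$always_true has_img=@has_img\n";
print @lines;
```

The explicit pre-test is optional: `s///g` returns the number of substitutions it made (false if none), so its return value can serve as the "was the tag there?" check.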

I'm also looking into how to alter the font size values of in-line styles via the same approach.

Surely, there are better solutions...

Trix

#!/usr/bin/perl
# write mods to HTML file.plx
# Program will read in an html file, remove the img tag and rewrite HTML on E-drive.
# 1. No need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
# 2. Alternative: m/<A\s+HREF=[^>]+>(.*?)<\/A>/ - Will not remove closing tag though - why?
# 3. Why is interpreter flipping-out over an 'undefined variable', when
#    original regexp, m/<A\s+HREF=[^>]+>(.*?)<\/A>/, is known to work. What am I missing?

use warnings;
use diagnostics;
use strict;

# Declare and initialise variables.
my $pattern1 = '<IMG\s+(.*)>';
my $pattern2 = '<A\s+HREF\s*=[^>]+>';
my $pattern3 = '</A>';
my @htmlLines;

# Open HTML test file and read into array.
open INFILE, "E:/Documents and Settings/Richard Lamb/My Documents/HTML/dummy1.html" or die "Sod! Can't open this file.\n";
@htmlLines = <INFILE>;

# Call tag-scrapping subs
scrapImageTag();
scrapAnchorTag();

# Removes image tag elements in array
sub scrapImageTag {
    # iterates through each element (i.e. HTML line) in array
    foreach my $line (@htmlLines) {
        # replace <IMG ...> with nothing.
        $line =~ s/$pattern1//ig;    # case insensitivity and global search for pattern
    }
}

# Removes anchor tag elements in array
sub scrapAnchorTag {
    # iterates through each element (i.e. HTML line) in array
    foreach my $line (@htmlLines) {
        # replace <A HREF ...> with nothing.
        $line =~ s/$pattern2//ig;    # case insensitivity and global search for pattern
        $line =~ s/$pattern3//ig;    # case insensitivity and global search for pattern
    }
}

# Replacing original file with reformatted file!
open (OUTFILE, ">E:/Documents and Settings/Richard Lamb/My Documents/HTML/dummy1.html") or die("Can't rewrite the HTML file.\n");
print (OUTFILE @htmlLines);
close (INFILE);
close (OUTFILE);

Cheers,

T

update (broquaint): shifted <code> tags, added formatting and <readmore> tag

Replies are listed 'Best First'.
Re: Regexps to change HTML tags/attributes
by Ovid (Cardinal) on Aug 27, 2003 at 16:00 UTC

    As a general rule, don't use regular expressions to parse HTML. You typically want a parser. Here's a short example that will remove all anchor tags (beginning and ending) and also change font sizes (though you should really use CSS) and delete the "alt" attribute of images (which you also shouldn't do, but it's here as an example):

    use HTML::TokeParser::Simple 2.1;

    my $parser = HTML::TokeParser::Simple->new($html_file);

    my $html = '';
    while (defined(my $token = $parser->get_token)) {
        next if $token->is_tag('a');    # strip anchor tags
        if ($token->is_start_tag('font')) {
            $token->set_attr('size', 7);
        }
        if ($token->is_tag('img')) {
            $token->delete_attr('alt');
        }
        $html .= $token->as_is;
    }
    open HTML, ">", $new_html_doc or die "Cannot open ($new_html_doc) for writing: $!";
    print HTML $html;
    close HTML;

    As a side note, if you want your HTML "cleaned up" a little bit, prior to the $html .= $token->as_is; line, add:

    $token->rewrite_tag;

    That will preserve and double-quote the values, automatically lowercase the tag name and attribute names (as they properly should be) and preserve an ending forward slash if it's used in a self closing tag:

    # before
    <img SRC=foo.jpg height='13' width=14 ALT="SOME alt Value" />
    # after
    <img src="foo.jpg" height="13" width="14" alt="SOME alt Value" />

    This method is automatically called on tags that have attributes added, changed, or deleted.

    In other words, this is a very common task and HTML::TokeParser::Simple, version 2.1 does all of that for you and then some.

    Cheers,
    Ovid

    New address of my CGI Course.

      Cheers Ovid,
      The TokeParser solution looks inviting, but unfortunately I'm looking into how regexps can be applied to the problem.

      All the best,

      T

      edited by ybiC: format via <br /> and <p> instead of <code>, to eliminate unnecessary lateral scrolling in browser window

Re: Regexps to change HTML tags/attributes
by Abigail-II (Bishop) on Aug 27, 2003 at 16:04 UTC
    There's much wrong with your program. First, if you are going to modify the file line-by-line, it's a total waste to first read in all lines into an array. However, when dealing with HTML, it's wrong to look at individual lines. HTML does not have a concept of lines, and tags can have newlines inside them.
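A small sketch of the point about lines, using a made-up document (`$doc`) in which one tag is split across a newline. The same pattern is applied once per line and once to the whole document as a single string; all variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical document in which one <img> tag is split over two lines.
my $doc = qq{<p>Before\n<img src="a.png"\n     alt="A">\nAfter</p>\n};

my $img_re = qr/<img\b[^>]*>/i;

# Line by line: no single line contains a complete tag, so nothing is removed.
my @by_line = split /^/, $doc;    # split into lines, keeping the newlines
s/$img_re//g for @by_line;
my $line_result = join '', @by_line;

# Whole document as one string: [^>] also matches newlines, so the tag is found.
(my $slurp_result = $doc) =~ s/$img_re//g;

print "line-by-line changed: ", ($line_result  ne $doc ? "yes" : "no"), "\n";
print "whole-string changed: ", ($slurp_result ne $doc ? "yes" : "no"), "\n";
```

To get a real file into one string, localising `$/` to undef before reading the filehandle is the usual idiom. This only removes the line-boundary problem; the pattern itself is still a regex and still has the weaknesses described in the rest of this thread.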

    As for the regexes, the first pattern will not do the right thing if there's another tag on the same line. The second pattern will fail to do the right thing if the anchor has another attribute before "HREF", or if it has an attribute value containing a ">".
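Both failures are easy to reproduce. The sketch below uses the two patterns from the question against two made-up test strings (`$two_tags` and `$other_attr` are hypothetical), plus a bounded variant of the image pattern for comparison.

```perl
use strict;
use warnings;

# The two patterns from the original question.
my $pattern1 = '<IMG\s+(.*)>';
my $pattern2 = '<A\s+HREF\s*=[^>]+>';

# Hypothetical test lines.
my $two_tags   = '<IMG SRC="a.png"><B>keep me</B>';
my $other_attr = '<A CLASS="nav" HREF="x.html">link</A>';

# Greedy .* runs to the LAST '>' on the line, swallowing the <B> element too.
(my $greedy = $two_tags) =~ s/$pattern1//ig;

# A negated character class stops at the first '>' instead.
(my $bounded = $two_tags) =~ s/<IMG\s+[^>]*>//ig;

# HREF is not the first attribute, so the anchor pattern never matches.
(my $anchor = $other_attr) =~ s/$pattern2//ig;

print "greedy:  '$greedy'\n";
print "bounded: '$bounded'\n";
print "anchor:  '$anchor'\n";
```

The bounded version fixes the first problem only. It still stops too early when an attribute value contains a ">", which is the last failure mentioned above.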

    You would be far better off using one of the many HTML parsing modules found on CPAN.

    Abigail

Re: Regexps to change HTML tags/attributes
by Aristotle (Chancellor) on Aug 28, 2003 at 10:59 UTC
    How would you build a regex to change the src here?
    <img alt="></(/>" src="/img_handler?alpha=>0.9;name=fish.png" />

    Even if you managed to get this right, the likelihood is very high that your pattern could be broken easily. Don't just blindly look for strings in your HTML.
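To make that concrete, here is a sketch that runs a typical "match up to the next >" pattern against the tag above, followed by a quote-aware pattern that happens to cope with this particular tag. The quote-aware pattern (`$quote_aware`) is an illustration only, not something proposed in the thread, and it is not a parser: comments, scripts, and malformed quoting can still defeat it.

```perl
use strict;
use warnings;

# The tag from the reply above.
my $tag = q{<img alt="></(/>" src="/img_handler?alpha=>0.9;name=fish.png" />};

# A typical "match up to the next >" pattern stops inside the alt value.
my ($matched) = $tag =~ /(<img\b[^>]*>)/i;

# Illustrative quote-aware pattern: skip over quoted attribute values whole.
my $quote_aware = qr{<img\b(?:[^>"']|"[^"]*"|'[^']*')*>}i;
my ($whole) = $tag =~ /($quote_aware)/;

print "naive match:       $matched\n";
print "quote-aware match: $whole\n";
```

The naive pattern captures only the text up to the first ">", which sits inside the alt attribute, so any substitution built on it would cut the tag in half.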

    Write an actual parser, if you have a lot of time to spare. Otherwise, use one of the existing ones (see Ovid's reply) and get on with your life.

    Makeshifts last the longest.

Regexp and HTML revisited...
by Tricky (Sexton) on Aug 28, 2003 at 11:35 UTC
    Hello Holy Ones,
    In reply to Ovid's and Abigail's comments: the TokeParser module is a great idea; the problem is that my remit is to investigate how regexps can be applied to reformatting HTML pages. I have a regexp for a background colour attribute, though the '#' character causes everything after it to be treated as a comment! E.g.
    /background-color:\s*#([0-9a-fA-F]{6});*/ig
    To solve this problem, I've declared two scalars, $pattern1 and $pattern2, with different hexadecimal colour codes, e.g.
    my $pattern1 = 'background-color: #FAF519;';
    my $pattern2 = 'background-color: #A7D6D5;';
    It's an (overly) simple solution, though I have not worked out how to escape the hash character yet! Any suggestions?

    T

    update (broquaint): fixed formatting

      Tricky wrote:

      In reply to Ovid's and Abigail's comments: the TokeParser module is a great idea; the problem is that my remit is to investigate how regexps can be applied to reformatting HTML pages. I have a regexp for a background colour attribute, though the '#' character causes everything after it to be treated as a comment!

      I'm not sure what you mean by your statement that your "remit is to investigate how regexps can be applied to reformatting HTML pages". If, by that, you mean that someone else has tasked you with this, then they have made a mistake. If someone comes to me and says "Ovid, I need you to deflea my cat. Here, use this shotgun", then I know that person made a mistake that's all too common in business. In short, the mistake is to say "here's a solution, let's see how we can make it fit our problem." That's absolutely the wrong way to go about things.

      Mind you, it's an easy thing to do. I suspect that cyanide kills fleas. Therefore, I might ask a friend "how can I use cyanide to deflea my cat?" When that friend tells me to use flea powder, my first instinct shouldn't be "but I've got all of this cyanide handy, how do I use that?" Instead, a better tactic is to revisit the original problem. How do I remove the fleas from my HTML ... er ... cat? If the proposed solution is better than mine, I should be willing to swallow my pride and go with the best solution. Heck, if all politicians believed that, we'd have a much better country :)

      Just for giggles, let's look at some valid HTML tags:

      <a href="foobar.txt" onclick="javascript:go_boom()">stuph</a>
      <A HREF =foobar.txt ONCLICK='javascript:go_boom()'>stuph</a>
      <A HREF = 'foobar.txt' ONCLICK= 'javascript:go_boom()' > stuph </a >
      <font color="#FAFA519">test</font>
      <font color="FAFA519">test</font>
      <font color="fafa519">test</font>
      <font color=fafa519>test</font>
      <font color='fafa519'>test</font>
      <font color=fafa519 >test</font>

      Do you like all of those font tags? Most browsers will render all of them identically. That's a great example of why most regular expressions will fail. They're tough to write.

      But just to show you that I'm a good sport about how to deflea your cat, here's a link to Tom Christiansen's article, HTML Hacking with Regular Expressions. Enjoy!

      Cheers,
      Ovid


        Cat Bathing As A Martial Art

        Some people say cats never have to be bathed. They say cats lick themselves clean. They say cats have a special enzyme of some sort in their saliva that works like new, improved Wisk - dislodging the dirt where it hides and whisking it away.

        I've spent most of my life believing this folklore. Like most blind believers, I've been able to discount all the facts to the contrary, the kitty odours that lurk in the corners of the garage and dirt smudges that cling to the throw rug by the fireplace.

        The time comes, however, when a man must face reality: when he must look squarely in the face of massive public sentiment to the contrary and announce: "This cat smells like a port-a-potty on a hot day in Juarez."

        When that day arrives at your house, as it has in mine, I have some advice you might consider as you place your feline friend under your arm and head for the bathtub:

        Know that although the cat has the advantage of quickness and lack of concern for human life, you have the advantage of strength. Capitalize on that advantage by selecting the battlefield. Don't try to bathe him in an open area where he can force you to chase him. Pick a very small bathroom. If your bathroom is more than four feet square, I recommend that you get in the tub with the cat and close the sliding-glass doors as if you were about to take a shower. (A simple shower curtain will not do. A berserk cat can shred a three-ply rubber shower curtain quicker than a politician can shift positions.)

        Know that a cat has claws and will not hesitate to remove all the skin from your body. Your advantage here is that you are smart and know how to dress to protect yourself. I recommend canvas overalls tucked into high-top construction boots, a pair of steel-mesh gloves, an army helmet, a hockey face mask, and a long-sleeved flak jacket. Prepare everything in advance. There is no time to go out for a towel when you have a cat digging a hole in your flak jacket. Draw the water. Make sure the bottle of kitty shampoo is inside the glass enclosure. Make sure the towel can be reached, even if you are lying on your back in the water.

        Use the element of surprise. Pick up your cat nonchalantly, as if to simply carry him to his supper dish. (Cats will not usually notice your strange attire. They have little or no interest in fashion as a rule. If he does notice your garb, calmly explain that you are taking part in a product testing experiment for J.C. Penney.)

        Once you are inside the bathroom, speed is essential to survival. In a single liquid motion, shut the bathroom door, step into the tub enclosure, slide the glass door shut, dip the cat in the water and squirt him with shampoo. You have begun one of the wildest 45 seconds of your life.

        Cats have no handles. Add the fact that he now has soapy fur, and the problem is radically compounded. Do not expect to hold on to him for more than two or three seconds at a time. When you have him, however, you must remember to give him another squirt of shampoo and rub like crazy. He'll then spring free and fall back into the water, thereby rinsing himself off. (The national record for cats is three latherings, so don't expect too much.)

        Next, the cat must be dried. Novice cat bathers always assume this part will be the most difficult, for humans generally are worn out at this point and the cat is just getting really determined. In fact, the drying is simple compared to what you have just been through. That's because by now the cat is semipermanently affixed to your right leg. You simply pop the drain plug with your foot, reach for your towel and wait. (Occasionally, however, the cat will end up clinging to the top of your army helmet. If this happens, the best thing you can do is to shake him loose and to encourage him toward your leg.) After all the water is drained from the tub, it is a simple matter to just reach down and dry the cat.

        In a few days the cat will relax enough to be removed from your leg. He will usually have nothing to say for about three weeks and will spend a lot of time sitting with his back to you. He might even become psychoceramic and develop the fixed stare of a plaster figurine.

        You will be tempted to assume he is angry. This isn't usually the case. As a rule he is simply plotting ways to get through your defenses and injure you for life the next time you decide to give him a bath.

        But at least now he smells a lot better.


        From off the web somewhere a couple of years ago, attributed to "Stephen Schulze". I have a shortcut to this on my desktop. Very little gets a place on my desktop, but this always renders me to tears and cheers me up no matter how frustrating my computer/compiler/heating system/... is being.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
        If I understand your problem, I can solve it! Of course, the same can be said for you.

        <font color="FAFA519">test</font> ...
        Do you like all of those font tags? Most browsers will render all of them identically.

        I have never, ever run into a font tag like that in the wild. No tools that I know of will generate such tags. They have to be mis-coded by hand. If the corpus you're dealing with is "safe" (e.g., all generated by working tools), then you probably won't need to worry about these at all.

        Defleaing a cat with old tactical nukes would seem a little daft (as we say in Yorkshire). Regular expressions do seem to be very long-winded and difficult, after looking briefly into parse trees.

        Reinventing nukes and cyanide aside, I'm stuck with this approach for my project, so I'll just have to demonstrate the limitations of regexps for this kind of work. Tom Christiansen's article was informative, and showed me how limited my approach is.

        Still, thanks for the help. Always appreciated.
        T

      To escape '#', just use '\#', that should do.

      Two further remarks:

      1. Since you're using the /i modifier, [0-9a-f] will do.
      2. I don't think you want ';;;;' to match, so I'd replace the ';*' by ';?'.

      Hope this helps, -gjb-

      You don't need to escape the #, just as you don't need to escape it in strings:
      print $1 if 'background-color:#fe34e5' =~ /background-color:\s*#([0-9a-fA-F]{6});*/ig
      yields fe34e5
      By the way: ;* is a no-op in your regex, so you could as well leave it out.
      Hope this helped.
      CombatSquirrel.
      Entropy is the tendency of everything going to hell.
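The two answers above are compatible, and a short sketch shows why. In an ordinary regex '#' is a literal character, so escaping it is optional; under the /x modifier an unescaped '#' starts a comment, so there the escape is needed. Which of these the original program ran into is not clear from the thread. The sample string `$css` is taken from the question; the other variable names are made up.

```perl
use strict;
use warnings;

my $css = 'background-color: #FAF519;';

# Without /x, '#' is an ordinary character and needs no escape.
my ($plain) = $css =~ /background-color:\s*#([0-9a-f]{6});?/i;

# Escaping it anyway is harmless.
my ($escaped) = $css =~ /background-color:\s*\#([0-9a-f]{6});?/i;

# Under /x, an unescaped '#' starts a comment, so here the escape is required.
my ($extended) = $css =~ /
    background-color: \s*
    \# ([0-9a-f]{6})    # six hex digits after the hash
    ;?                  # at most one trailing semicolon
/xi;

print "$plain $escaped $extended\n";
```

All three matches capture the same six hex digits. The sketch uses `;?` and a lowercase-only character class with /i, following the two remarks made earlier in the thread.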
