regex and html tags

by Parham (Friar)
on Oct 18, 2002 at 02:01 UTC ( #206197=perlquestion )
Parham has asked for the wisdom of the Perl Monks concerning the following question:

Someone came to me with a PHP question, and I used Perl to solve it :P. This was what the person wanted:
<tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>
to turn into:
<tag#1>BLA</tag#1><tag#2>BLA</tag#2><tag#3>BLA</tag#3>
meaning the first tag opened had to be the first tag closed. There was a trick to this question though. The person wanted the finishing tag to be the first parameter in the html tag. So if I had < font size=2 > I would have to end it with < /size > and not < /font >. My first instinct was to do a while loop through the tags, find what I needed, do a replace, and continue:
$text = "<tag#1>BLA</tag#2><tag#2>BLA</tag#2><tag#3>BLA</tag#3>";
$number = 1;
while ($text =~ s/<(.+?)#\d>(.+?)<\/(.+?)#\d>/<$1#$number>$2<\/$3#$number>/) {
    $number++;
    last;
}
print "$text\n";
Now I couldn't even answer it, but the process would only work with one loop, thus the "last;" (I guess I didn't need to increment $number then :P). Anyway, that solution partially worked. So I tried again:
$text = "<FONT face=arial>this is <FONT SIZE=2>TWO</FONT>bla<FONT color=red>red</FONT>bla bla</FONT>";
$text =~ s#<(.+?\s(.+?)=.+?)>(.+?)<\/.+?>#<$1>$3<\/$2>#g;
print "$text\n";
which also partially worked. Although the problem is long past, the situation still creeps in my head. This is all old code, but the question has bothered me for a while. The only real reason I'm asking is because I want to gain some valuable experience from it. So I'm just wondering if anyone has a better solution than the two I provided above? Seeking the wisdom of the perlmonks :)

Edit: Added some <code> tags. larsen

Re: regex and html tags
by graff (Chancellor) on Oct 21, 2002 at 03:07 UTC
    This opening part makes sense:

    This was what the person wanted:

    <tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>
    to turn into:
    <tag#1>BLA</tag#1><tag#2>BLA</tag#2><tag#3>BLA</tag#3>

    This is a matter of taking a badly formed html stream and making it well formed. This is sensible, and easily done in cases like your initial example, where there are no nested tags involved in the bad forms. (Your first attempt simply stopped after doing the first tag in the stream, and used a while loop for no purpose). The following would work over a series of non-nested tags:

    s{(<(\w+.*?)>[^<]*?)</.+?>}{$1</$2>}g
    update: I'm using >[^<]*? instead of >.*? so that it won't corrupt streams that include properly nested tags.
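To see what the substitution above does in practice, here is a small sketch that wraps it in a sub and runs it on the flat sample from the question. The sub names fix_flat and fix_flat_strict are made up for this illustration. One caveat I found while tracing it by hand: because .*? inside the opening tag can backtrack past a ">", the original pattern can still swallow two adjacent openers such as "<b><i". The second sub is my own assumed variant (not from the thread) that uses [^<>]*? to keep the capture inside one tag.

```perl
use strict;
use warnings;

# graff's substitution, unchanged: copy the opener's text into the next closer.
sub fix_flat {
    my ($html) = @_;
    $html =~ s{(<(\w+.*?)>[^<]*?)</.+?>}{$1</$2>}g;
    return $html;
}

# Assumed stricter variant (not from the thread): [^<>]*? cannot cross a
# tag boundary, so the capture never spans something like "<b><i".
sub fix_flat_strict {
    my ($html) = @_;
    $html =~ s{(<(\w+[^<>]*?)>[^<]*?)</.+?>}{$1</$2>}g;
    return $html;
}

# Flat sample from the question: every closer gets its opener's name.
print fix_flat('<tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>'), "\n";

# Properly nested input: compare the two variants.
print fix_flat('<b><i>x</i></b>'), "\n";
print fix_flat_strict('<b><i>x</i></b>'), "\n";
```

Both variants copy everything after the "<" into the closer, attributes included, so an opener like <font size=2> would still get a closer of </font size=2>. Capturing only the name would need a tighter group.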

    But working across nested tags would take more code and more care. You'd need to work through the stream tag by tag, pushing each open-tag name onto a stack, and popping the last name off the stack each time you hit a close-tag, to make sure the output was well formed (though it might still have other problems, depending on how bad the input was).
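A minimal sketch of that stack approach, under some loud assumptions: the sub name repair_tags is hypothetical, tags never contain a literal ">" inside an attribute value, there are no void elements like <br> (they would be pushed and never popped), stray closers with nothing open are dropped, and anything still open at the end is closed. Treat it as an illustration of the idea rather than a real html repair tool.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite each close tag to match the most recently
# opened tag that is still on the stack.
sub repair_tags {
    my ($html) = @_;
    my @stack;
    my $out = '';
    # The capturing group makes split return the tags as tokens too.
    for my $tok (split /(<[^>]*>)/, $html) {
        if ($tok =~ m{^<\s*/}) {                 # close tag
            my $name = pop @stack;
            # Assumption: a closer with nothing open is simply dropped.
            $out .= "</$name>" if defined $name;
        }
        elsif ($tok =~ m{^<[!?]}) {              # comment, doctype, PI
            $out .= $tok;
        }
        elsif ($tok =~ m{^<\s*([^\s>/]+)}) {     # open tag
            # Self-closing tags like <br/> are passed through, not pushed.
            push @stack, $1 unless $tok =~ m{/\s*>$};
            $out .= $tok;
        }
        else {                                   # plain text
            $out .= $tok;
        }
    }
    # Assumption: close whatever is still open, innermost first.
    $out .= "</$_>" for reverse @stack;
    return $out;
}

print repair_tags('<b>one<i>two</b>three</b>'), "\n";
# prints <b>one<i>two</i>three</b>
```

The close tag's own name is ignored on purpose: the stack alone decides what gets closed, which is what makes the output well formed even when the input closers are wrong.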

    But other stuff in your post makes little or no sense:

    The person wanted the finishing tag to be the first parameter in the html tag. So if I had < font size=2 > I would have to end it with < /size > and not < /font >. My first instinct was to do a while loop...

    My first instinct would be to say "No, you don't really want that. You're asking to have ill-formed html as the output. What makes you think you want that?"

    Then, looking at your last example, I think I understood the idea; you don't want well-formed html as output. You want a form where a person reading the stream can figure out more easily what the scope is for a given tag in a densely nested html structure. Is that it?

    If so, there are better ways to do this than corrupting the html tags in the odd way your friend suggested. What if the name of the first attribute is the least important information? Why have a "human-readable" form that can't be used reliably as input to a browser?
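For completeness, if the attribute-labelled closers really are wanted as a reading aid, a stack handles the nested case where the one-line regex in the question falls over. This is only a sketch: label_closers is a hypothetical name, the output is deliberately not valid html, and it assumes attribute values contain no ">" and that every opener has a closer.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each closer with the name of the FIRST
# attribute of the opener it closes (or the tag name if it has none).
sub label_closers {
    my ($html) = @_;
    my @stack;
    my $out = '';
    for my $tok (split /(<[^>]*>)/, $html) {
        if ($tok =~ m{^<\s*/}) {                       # close tag
            my $label = pop @stack;
            # Leave an unmatched closer untouched.
            $out .= defined $label ? "</$label>" : $tok;
        }
        elsif ($tok =~ m{^<\s*([^\s>/=]+)(?:\s+([^\s=>/]+))?}) {
            # $1 is the tag name, $2 the first attribute name, if any.
            push @stack, defined $2 ? $2 : $1;
            $out .= $tok;
        }
        else {                                         # plain text
            $out .= $tok;
        }
    }
    return $out;
}

my $text = '<FONT face=arial>this is <FONT SIZE=2>TWO</FONT>bla'
         . '<FONT color=red>red</FONT>bla bla</FONT>';
print label_closers($text), "\n";
```

On the sample from the question this labels the three closers SIZE, color, and face, in that order, which the non-stack regex cannot do because it has no memory of which opener is still pending.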

    For instance, one thing that can aid human readability of html is to simply place the tags and the text content on separate lines; something like this:

    s/>\s*</>\n</g;      # normalize whitespace between adjacent tags
    s/([^\n])</$1\n</g;  # make sure every tag begins a new line
    s/>([^\n])/>\n$1/g;  # make sure every tag is followed by newline
    More code and more care could be used to good effect, e.g. to indent the tag lines to reflect nesting depth, to eliminate new-lines from within long open-tags, etc.
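As a rough illustration of the indentation idea, the sketch below runs the three substitutions and then indents each line by its nesting depth. The name pretty_print and the two-space indent are my own choices, and it assumes well-formed input with no void elements, no multi-line tags, and no "<" or ">" inside text or attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: one tag or text run per line, indented by depth.
sub pretty_print {
    my ($html) = @_;
    for ($html) {
        s/>\s*</>\n</g;      # normalize whitespace between adjacent tags
        s/([^\n])</$1\n</g;  # make sure every tag begins a new line
        s/>([^\n])/>\n$1/g;  # make sure every tag is followed by newline
    }
    my $depth = 0;
    my $out   = '';
    for my $line (split /\n/, $html) {
        # A closer sits at the same depth as its opener.
        $depth-- if $line =~ m{^</} && $depth > 0;
        $out .= ('  ' x $depth) . $line . "\n";
        # An opener (not a closer, comment, or self-closing tag) nests deeper.
        $depth++ if $line =~ m{^<[^/!?]} && $line !~ m{/>$};
    }
    return $out;
}

print pretty_print('<b>one<i>two</i></b>');
# prints:
# <b>
#   one
#   <i>
#     two
#   </i>
# </b>
```

Unlike the attribute-named closers, this leaves the tags themselves untouched, so the result can still be fed to a browser (whitespace-sensitive content such as <pre> aside).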
