http://www.perlmonks.org?node_id=75543

Guildenstern has asked for the wisdom of the Perl Monks concerning the following question:

Well, progress marches on for my conversion of HTML to a proprietary ML. (See this node for more info.)

As it turns out, the source HTML needs quite a bit of doctoring before I can run it through the conversion process with a high certainty of success. One of the cleaning tasks I must perform is some link management. The markup language I am converting to only allows linking within a page, so external links must be edited. This part I can handle through an XSL transform. The real problem lies within the intra-page links.

The HTML defines a rather large number of <a name="foo"> anchors, with <a href="#foo"> links pointing to them. However, many of the anchors are never linked to, and many of the links point to anchors that are never declared. So I parse the HTML and build two hashes, one for the anchors declared and one for the links that target anchors within the document. From there it's a simple task to see where the two hashes meet and count those links and anchors as valid.
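Roughly, the two-hash bookkeeping described above looks like the sketch below. This is a simplified illustration rather than the actual script: the sample text, the hash name %anchors, and the deliberately naive tag patterns are assumptions made for the example (a real run would go through a proper HTML parser).

```perl
use strict;
use warnings;

# Hypothetical sample input; the real document is much larger.
my $intext = <<'HTML';
<a name="intro">Intro</a>
<a name="orphan">Never linked</a>
<a href="#intro">Back to intro</a>
<a href="#missing">Dangling link</a>
HTML

my (%anchors, %links);

# Anchors declared: name => full opening tag.
while ($intext =~ /(<a\s+name="([^"]+)"\s*>)/ig) {
    $anchors{$2} = $1;
}

# Intra-page links: target => full opening tag.
while ($intext =~ /(<a\s+href="#([^"]+)"\s*>)/ig) {
    $links{$2} = $1;
}

# A target is valid when it appears as a key in both hashes.
my @valid    = sort grep {  exists $anchors{$_} } keys %links;
my @unlinked = sort grep { !exists $links{$_}   } keys %anchors;
my @dangling = sort grep { !exists $anchors{$_} } keys %links;

print "valid: @valid\n";
print "unlinked anchors: @unlinked\n";
print "dangling links: @dangling\n";
```

Keeping the full opening tag as the hash value is what lets the later cleanup step search for the exact tag text.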

For unlinked anchors, it makes sense just to remove the declaration, since nothing can link to it given the restriction on linking to other documents. Removing invalid links is a bit tougher, but I worked up this simple regex to handle it:
foreach (keys %links) {
    if ($intext =~ s#$links{$_}([^</]+)</a>#$1#ig) {
        print "Link removed: $_\n";
    }
    else {
        print "Problem removing link: $_\n";
    }
}

Basically, I wanted to be able to preserve the text within the link while removing the tags. %links is a hash that has the link target as the keys and the full tag as the values. e.g. foo => <a href="#foo">. This makes it easy to compare to the defined anchors to determine valid links.


The problem (finally!).

There are two entries in %links that are acting a bit strangely:
use() => <a href="#use()">
use   => <a href="#use">

The problem arises when the above regex is applied to these two entries. The use() entry replaces all instances of use, and the use entry fails to make any replacements. The resulting output is left with all occurrences of <a href="#use()"> instead of replacing them.

What I can't understand is why /<a href="#use()">/ is matching <a href="#use">. Is there something happening due to the parens? Am I just smoking crack? I'm really stuck at this point, and while I could manually fix the missed replacements, that kind of defeats the whole notion of an automated process.
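Here is a minimal reproduction of the symptom, boiled down to one tag. The two-link sample string and the variable names are made up for illustration. When the tag is interpolated as-is, Perl parses the "()" as an empty capture group rather than as literal parentheses; the second substitution quotes the tag with \Q...\E purely for comparison, and is not necessarily the right fix for the real data.

```perl
use strict;
use warnings;

# Hypothetical two-link sample, not taken from the real document.
my $sample = '<a href="#use">plain</a> and <a href="#use()">parens</a>';

# The value stored under the use() key in %links.
my $tag = '<a href="#use()">';

# Interpolated as-is: "()" becomes an empty capture group, so the
# pattern matches the "#use" tag, and $1 now refers to that empty
# group instead of the link text.
(my $raw = $sample) =~ s#$tag([^</]+)</a>#$1#ig;

# Interpolated inside \Q...\E: every non-word character in the tag,
# including the parentheses, is matched literally.
(my $quoted = $sample) =~ s#\Q$tag\E([^</]+)</a>#$1#ig;

print "raw:    $raw\n";
print "quoted: $quoted\n";
```

On this sample the unquoted version removes the "#use" link along with its text and leaves the "#use()" link untouched, while the quoted version unwraps only the "#use()" link and keeps its text.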

Guildenstern
Negated character class uber alles!