
Parsing XML/HTML

by sartzava (Initiate)
on Apr 08, 2005 at 15:25 UTC ( #446041=perlquestion )
sartzava has asked for the wisdom of the Perl Monks concerning the following question:

Okay, I'm new to Perl, so this is probably a simple question...

I am attempting to edit an XML/XHTML document that was generated by a Quark extraction utility. The paragraphs in the document use nested span tags to apply formatting, and I am attempting to fix the issues associated with that.

For instance:

<p><span class="type1"><span class="type2">text </span>italictext<span class="type2"> text</span></span></p>

In that example, the span class="type1" applies an italic style to the entire paragraph. The type2 spans then apply a non-italic style to large sections of the paragraph, leaving only individual words italicized.

Now, instances like this are easy to catch with a regex, but they can also be more involved:

<p><span class="type1"><span class="type2">text </span></span><span class="SmallCaps">text</span><span class="type1"><span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text</span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text.</span></span></p>

Notice that the "SmallCaps" span is added in the middle of the paragraph and that there are multiple instances of the type2 tags.

Of course, I also have to deal with the possibility of the type2 tags being used to apply an italic style, as in this example I found:

<p>text <span class="type2">italictext</span>text<span class="type2">italictext</span>text</p>

What I would like to do is match the opening and closing tags to each other and make adjustments as necessary to remove the extraneous markup. For instance, I want the first instance above to look like this:

<p>text <i>italictext</i> text</p>

I need to know if anyone can help me or direct me to an extremely simple example of/tutorial on the XML::Parser or HTML::Parser module, since I am sure that one of those does what I need to do. Again, I am very new at this, so any help will be greatly appreciated.

Replies are listed 'Best First'.
Re: Parsing XML/HTML
by satchm0h (Beadle) on Apr 08, 2005 at 16:06 UTC
    What you do not want to do is parse XML/XHTML/HTML yourself. There are a number of great Perl modules available to you. Take a look at HTML::Parser or XML::Twig as potential starting places. Getting your document into a meaningful data structure will vastly simplify the process of dealing with nested tags.

    Good Luck!
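
As a rough illustration of the "get it into a data structure" advice, here is a minimal sketch using XML::Twig on the first example from the question. It is only one possible approach, not a tested solution for the whole Quark export. The helper name flatten_spans is made up for this sketch, and it assumes that class="type1" always means italic and that a type2 span nested directly inside it always means non-italic.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not from the thread): unwrap italic "type1" spans
# and return the cleaned-up markup as a string.
sub flatten_spans {
    my ($xml) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once each matching span has been parsed completely.
            'span[@class="type1"]' => sub {
                my ($t, $span) = @_;

                for my $child ($span->children) {
                    if ($child->is_pcdata) {
                        # Bare text directly inside type1 was italic.
                        $child->wrap_in('i');
                    }
                    elsif ($child->tag eq 'span'
                        && ($child->att('class') // '') eq 'type2') {
                        # Assumption: type2 inside type1 is non-italic,
                        # so drop the tag and keep its content in place.
                        $child->erase;
                    }
                }

                # Finally drop the outer span, keeping what is inside it.
                $span->erase;
            },
        },
    );

    $twig->parse($xml);
    return $twig->sprint;
}

print flatten_spans(
    '<p><span class="type1"><span class="type2">text </span>'
  . 'italictext<span class="type2"> text</span></span></p>'
), "\n";
```

The handler runs when a type1 span closes, so all of its children are already available as a tree and no tag matching by regex is needed. The sketch leaves other spans (such as "SmallCaps") alone, and it does not cover the case where type2 itself carries the italic style outside a type1 span; that would need its own handler.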

Re: Parsing XML/HTML
by rg0now (Chaplain) on Apr 08, 2005 at 15:41 UTC
    I recall once I had to do similar things to a messy XHTML document, and XML::XSH came in pretty handy. It is an XML editing shell written in Perl with a number of utility shell commands to ease the work with structured data like XML, HTML, etc. It can do node insertions and deletions, filtering and manipulation of attributes (which might fit your needs here) and similar stuff with a very easy syntax.

    Here is a nice tutorial on how to use XML::XSH from our beloved merlyn...

