http://www.perlmonks.org?node_id=1065468

TheVend has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing XML and need a script that will read my XML file and add the open tag wherever one is missing. Example from the XML below: <promo_code/> has no matching open tag <promo_code>. I need the code to scan for and add the missing open tags. The end result would be <promo_code> <promo_code/>. Example XML below:

<orders>
  <order>
    <alt_number>2222</alt_number>
    <state>processed</state>
    <user_id>1234</user_id>
    <dd_id>54321</dd_id>
    <placed_at/>
    <total_price>449.0</total_price>
    <total_shipping_costs>90.0</total_shipping_costs>
    <discount>0</discount>
    <dd_status_code>HD</dd_status_code>
    <dd_status_desc>ORDER IS ON HOLD UNTIL 11/01/13</dd_status_desc>
    <promo_code/>
    <billing_address>
      <address_id/>
      <dd_customer_id/>
      <address_1>1212 Not Real</address_1>
      <address_2></address_2>
      <city>san mateo</city>
      <title></title>
      <company></company>
      <first_name>Jane</first_name>
      <last_name>Doe</last_name>
      <state_abbr/>
      <formated_telephone/>
      <zipcode/>
      <email>smith@smith.com</email>
      <country_code>USA</country_code>
    </billing_address>
    <transaction>
    </transaction>
    <shipments>
      <shipment_id>21286</shipment_id>
      <shipment>
        <shipping_address>
          <address_id>27696</address_id>
          <dd_customer_id/>
          <address_1>1212 Not Real</address_1>
          <address_2></address_2>
          <city>san jose</city>
          <title>Dr</title>
          <company></company>
          <first_name>Jane</first_name>
          <last_name>Doe</last_name>
          <state_abbr>CA</state_abbr>
          <formated_telephone>555-555-5555 Jane Smith</formated_telephone>
          <zipcode>95125-3344</zipcode>
          <email></email>
          <country_code>USA</country_code>
        </shipping_address>
        <delivery_date>2011-10-01</delivery_date>
        <gift>true</gift>
        <gift_email>smith@smith.com</gift_email>
        <gift_message>Happy Birthday</gift_message>
        <status>N</status>
        <tracking_nr></tracking_nr>
        <sent_at/>
      </shipment>
    </shipments>
    <line_items>
      <line_item>
        <line_item_id>16821</line_item_id>
        <dd_id>IDID</dd_id>
        <name>Keep It Coming</name>
        <quantity>1</quantity>
        <price>359.0</price>
        <shipment_id>21286</shipment_id>
        <shipping_costs>90.0</shipping_costs>
        <tax_amount>0</tax_amount>
      </line_item>
    </line_items>
  </order>
<orders>

Replies are listed 'Best First'.
Re: How can I add missing XML open tags
by Crackers2 (Parson) on Dec 03, 2013 at 18:41 UTC

      You are so right. They are not technically "missing." I use Perl to parse my original XML, and my parser needs both open and close tags (a design from years ago that needs some spit and polish). You have pointed me in the right direction. Now I can build a sub that will save me many hours of reengineering. Thank you so much!

        Let this be a warning to all those who "parse" XML with regexps and homebrewed "parsers". I wonder what the next unexpected but valid XML feature will be that forces you to "fix" your source XML files so that you can "parse" them.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

Re: How can I add missing XML open tags
by toolic (Bishop) on Dec 03, 2013 at 18:40 UTC
    The XML::Twig option empty_tags => 'expand' looks like it does what you want. I shortened your XML and fixed the closing "orders" tag:
    use warnings;
    use strict;
    use XML::Twig;

    my $xml = <<XML;
    <orders>
      <order>
        <alt_number>2222</alt_number>
        <state>processed</state>
        <user_id>1234</user_id>
        <dd_id>54321</dd_id>
        <placed_at/>
        <total_price>449.0</total_price>
        <total_shipping_costs>90.0</total_shipping_costs>
        <discount>0</discount>
        <dd_status_code>HD</dd_status_code>
        <dd_status_desc>ORDER IS ON HOLD UNTIL 11/01/13</dd_status_desc>
        <promo_code/>
      </order>
    </orders>
    XML

    my $twig = XML::Twig->new(
        pretty_print => 'indented',
        empty_tags   => 'expand',
    );
    $twig->parse($xml);
    $twig->print;

    __END__

    Output:

    <orders>
      <order>
        <alt_number>2222</alt_number>
        <state>processed</state>
        <user_id>1234</user_id>
        <dd_id>54321</dd_id>
        <placed_at></placed_at>
        <total_price>449.0</total_price>
        <total_shipping_costs>90.0</total_shipping_costs>
        <discount>0</discount>
        <dd_status_code>HD</dd_status_code>
        <dd_status_desc>ORDER IS ON HOLD UNTIL 11/01/13</dd_status_desc>
        <promo_code></promo_code>
      </order>
    </orders>
      Can XML::Twig be used to create a simple tab delimited CSV file?
Re: How can I add missing XML open tags
by hippo (Bishop) on Dec 03, 2013 at 18:31 UTC

    I suggest trying a canonicalizer such as XML::CanonicalizeXML. Your supplied data is not valid XML to begin with, so you would need to address that problem first, I suspect.

      The only problem with it is the closing tag <orders>, which is missing the all-important slash: </orders>. Other than this single typo, the XML is well-formed. (Whether or not the XML is "valid" is a different matter altogether.)

        The trailing <orders> was a typo; it should be </orders>.
Re: How can I add missing XML open tags
by Jim (Curate) on Dec 03, 2013 at 22:54 UTC

    Personally, I'd just do this:  s{<([^/>]+?)/>}{<$1></$1>}g. Only if this didn't work for some peculiar reason would I bother with a more complicated solution.

    Jim

    UPDATE:  A possibly better alternative:  s{<(\w+)/>}{<$1></$1>}ag. (Notice the /a character set modifier, which is purposeful.)
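
For illustration, here is a minimal sketch of the \w+ substitution applied to a shortened piece of the order data. The sub name expand_empty_tags and the sample string are assumptions made for this example, not part of the original post. The /a modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # the /a modifier was introduced in Perl 5.14

# Hypothetical helper: rewrite every attribute-free self-closing tag,
# so that <foo/> becomes <foo></foo>.
sub expand_empty_tags {
    my ($xml) = @_;
    $xml =~ s{<(\w+)/>}{<$1></$1>}ag;
    return $xml;
}

my $sample = '<order><placed_at/><discount>0</discount><promo_code/></order>';
print expand_empty_tags($sample), "\n";
# prints <order><placed_at></placed_at><discount>0</discount><promo_code></promo_code></order>
```

Because /a restricts \w to ASCII word characters, a tag name containing other characters (a hyphen, a colon, a non-ASCII letter) would be left as it is.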

      That was surprisingly very effective. Thanks!

        You're welcome. But you really shouldn't find it "surprisingly very effective." Perl excels at text transformation. XML is text. Changing all occurrences of <foo/> to <foo></foo> in an XML document is a trivial transformation (unless it isn't due to some rare and unlikely edge case).

        Jim
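
One edge case worth checking before relying on either substitution is a self-closing tag that carries attributes. The sketch below wraps the two patterns from this sub-thread in hypothetical helper subs (expand_any and expand_word are names invented here) and feeds them a made-up element with an attribute, which does not occur in the original poster's data.

```perl
use strict;
use warnings;
use 5.014;    # needed for the /a modifier

# First pattern from the post: anything between < and /> that is not / or >.
sub expand_any {
    my ($s) = @_;
    $s =~ s{<([^/>]+?)/>}{<$1></$1>}g;
    return $s;
}

# Updated pattern: only a bare ASCII word between < and />.
sub expand_word {
    my ($s) = @_;
    $s =~ s{<(\w+)/>}{<$1></$1>}ag;
    return $s;
}

my $with_attr = '<item id="1"/>';
print expand_any($with_attr),  "\n";   # <item id="1"></item id="1">  (no longer well-formed)
print expand_word($with_attr), "\n";   # <item id="1"/>               (left untouched)
```

The first pattern copies the attribute into the closing tag, which breaks the document; the second skips such tags entirely. Neither matters for the sample orders file, where no empty element has attributes.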

Re: How can I add missing XML open tags
by Laurent_R (Canon) on Dec 03, 2013 at 18:44 UTC

    I don't know if there is any module to do that, but if you want to do it manually, you could use a hash whose keys are the tag names (stripped of the < and >). Parse your file and, for each opening tag, increment the hash value for that tag. For each closing tag, strip the </ and > and look the tag up in the hash: if there is no entry for that tag or its value is zero, add an opening tag; if the value is greater than zero, decrement it.

    EDIT: well it seems there is a module, I had not seen the previous answers when I started to post mine. Forget the above and use the module.
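
For readers curious about the manual approach anyway, here is a rough sketch of the counting idea. The sub name add_missing_open_tags, the tokenising regexes, and the sample string are all assumptions made for this example. It treats self-closing tags such as <foo/> as already complete, and it only counts tags per name, so it does not check that tags are nested in the right order.

```perl
use strict;
use warnings;

# Hypothetical helper: put an opening tag in front of every closing tag
# that has no unmatched opening tag of the same name before it.
sub add_missing_open_tags {
    my ($xml) = @_;
    my %open;    # tag name => number of opening tags not yet closed

    # Split into tags and the text between them, keeping the tags.
    my @tokens = split /(<[^<>]+>)/, $xml;

    for my $token (@tokens) {
        # Only element tags are examined; text and comments pass through.
        next unless $token =~ m{\A<(/?)([\w.:-]+)[^<>]*?(/?)>\z};
        my ($closing, $name, $self_closing) = ($1, $2, $3);

        next if $self_closing;             # <foo/> needs no partner

        if (!$closing) {
            $open{$name}++;                # opening tag seen
        }
        elsif ($open{$name}) {
            $open{$name}--;                # closing tag has a partner
        }
        else {
            $token = "<$name>$token";      # orphan closing tag: add its opener
        }
    }
    return join '', @tokens;
}

print add_missing_open_tags('<order><state>HD</state></promo_code></order>'), "\n";
# prints <order><state>HD</state><promo_code></promo_code></order>
```

On the orders file from the question this sub changes nothing, since every closing tag there already has its opener; the empty elements are self-closing tags, not orphans.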