PerlPilgrim has asked for the wisdom of the Perl Monks concerning the following question:
I have spent several days perusing all of the good ideas on this site about how to parse and manipulate tables, but alas, no one has specifically asked about this particular situation, which I will describe.
Currently, I am using (with permission) vendor web sites which have product data in tabular form. My goal is to integrate their pages into our e-commerce system. I grab the pages with HTTP::Request, modify the pages, and then re-serve them as if they were our own. (This is a simplification - some pages are static, so we fetch them with wget from a cron job and store them locally to be polite.) The tables are consistent, and I need to extract the part number, description and application from each row, and then insert a form into each row, which contains a button to add the item to the shopping cart.
Thus far, my approach has been to parse the pages using regexps, split and join. It works nicely (today), but from what I have read it is not the proper approach. My ultimate goal is to use one of the modules to accomplish this in a cleaner, more robust fashion. It looks like HTML::ElementTable is the way to go, but most examples I have seen build the table from scratch. The CPAN docs say that this module operates on HTML::Element objects, but the only way I know of to build those from an HTML string is with HTML::TreeBuilder, which appears to be very CPU-hungry.
Is there a better way to create the HTML::Element objects from an HTML string? Also, once I do the necessary manipulation, will the as_HTML method recreate the original document satisfactorily? Is this even the direction I want to go with this?
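To make the question concrete, here is a minimal sketch of the HTML::TreeBuilder route as I understand it. The sample markup, the three-column layout, the /cart/add action and the field name "part" are all made up for illustration; the real vendor pages and our cart will differ. It only uses HTML::TreeBuilder and HTML::Element, not HTML::ElementTable.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical vendor markup; the real pages and column order will differ.
my $html = <<'HTML';
<html><body><table>
<tr><th>Part</th><th>Description</th><th>Application</th></tr>
<tr><td>AB-100</td><td>Oil filter</td><td>Small engines</td></tr>
<tr><td>AB-200</td><td>Air filter</td><td>Mowers</td></tr>
</table></body></html>
HTML

# Build the element tree from the HTML string.
my $tree = HTML::TreeBuilder->new_from_content($html);

my @items;
for my $row ($tree->look_down(_tag => 'tr')) {
    my @cells = $row->look_down(_tag => 'td');
    next unless @cells == 3;    # skips the header row, which has only <th> cells

    my ($part, $desc, $app) = map { $_->as_text } @cells;
    push @items, { part => $part, description => $desc, application => $app };

    # Assumed cart endpoint and field name, purely illustrative.
    my $form = HTML::Element->new('form', method => 'post', action => '/cart/add');
    $form->push_content(
        HTML::Element->new('input', type => 'hidden', name => 'part', value => $part),
        HTML::Element->new('input', type => 'submit', value => 'Add to cart'),
    );

    # Append a new cell holding the form to the end of the row.
    my $cell = HTML::Element->new('td');
    $cell->push_content($form);
    $row->push_content($cell);
}

# Serialize the modified tree; the empty hashref asks for all end tags to be written.
my $out = $tree->as_HTML(undef, '  ', {});
$tree->delete;    # free the tree explicitly

print $out;
```

The output is the parser's normalized view of the page rather than a byte-for-byte copy of the original, which is part of what I am asking about.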
Many thanks in advance for any wisdom that may be shared.
With kind regards,