
Parsing and converting HTML

by tevolo (Novice)
on Jul 26, 2012 at 18:40 UTC ( [id://983906]=perlquestion )

tevolo has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I am looking for suggestions/guidance on a problem I am working on.

I have multiple HTML files that are all slightly different, and each may contain different tags. I want to parse them and convert the tags to strings based on certain rules. For instance:

 <p>  some text </p>

would get converted to text="some text". At first I was doing this with loops, arrays, and regexes, but the code quickly started growing out of control. Then I tried HTML::TreeBuilder, but it didn't seem to work with the tags the way I needed.

Any advice would be greatly appreciated.

Thanks in advance.

Re: Parsing and converting HTML
by aitap (Curate) on Jul 26, 2012 at 18:50 UTC

    Try writing a recursive function using the content_list method of HTML::Element.

    For example,

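    use HTML::TreeBuilder;

    my $html = HTML::TreeBuilder->new_from_content($text)
        or die "Could not parse HTML\n";

    # Recurse into elements; print each text node as text="..."
    sub to_text {
        my ($node) = @_;
        # isa() also matches the HTML::TreeBuilder root object
        if (ref $node && $node->isa("HTML::Element")) {
            to_text($_) for $node->content_list;
        }
        else {
            print qq{text="$node"};
        }
    }
    to_text($html);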

    Sorry if my advice was wrong.

      Hello, thanks, but for some reason this did not seem to work, though it is probably something I am doing wrong.

      Here is my code:

      #!c:/strawberry/perl/bin/perl.exe
      use HTML::TokeParser;
      use HTML::Element;
      use HTML::TreeBuilder;
      use warnings;

      open(MYINPUTFILE, '<C:\acs\SA\content\acs\meetings\expositions\CNBP_028491');
      while (<MYINPUTFILE>) {
          my $text = $_;
          my $html = HTML::TreeBuilder->new_from_content("$text") || die "$@\n";
          sub to_text {
              if (ref $_[0] eq "HTML::Element") {
                  foreach my $sub_element ($_[0]->content_list) {
                      &to_text($sub_element);
                  }
              }
              else {
                  print qq{text="$_[0]"};
              }
          }
          &to_text($html);
      }

      Any other thoughts, or did I miss something? Thanks again.

        You are parsing your file line by line, so for every line a new HTML::Element tree is created and then destroyed, and a tag that spans several lines is never seen whole. You can use the new_from_file method of HTML::TreeBuilder instead.
        Sorry if my advice was wrong.
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_file( $filename );
        ...
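
To make the whole-file advice concrete, here is a minimal sketch. It assumes HTML::TreeBuilder is installed; the temporary file and its contents are made up for illustration and stand in for the real input file. It parses the file in one go with new_from_file and then pulls out every <p> element with look_down, rather than building a tree per line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Stand-in for the real input: a small multi-line HTML file (made-up content).
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html><body>\n<p>\n  some text\n</p>\n<p>more text</p>\n</body></html>\n";
close $fh or die "Cannot close $filename: $!";

# Parse the whole file at once, so elements spanning several lines stay intact.
my $tree = HTML::TreeBuilder->new_from_file($filename);

# Collect the trimmed text of every <p> element anywhere in the tree.
my @paragraphs = map { $_->as_trimmed_text } $tree->look_down(_tag => 'p');
print qq{text="$_"\n} for @paragraphs;

$tree->delete;   # release the tree's memory when done
```

Note that the first <p> is split across three lines in the sample file, which is exactly the case a line-by-line loop cannot handle.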

      Thanks!!! I will give it a try and report back.
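
For the "convert tags to a string based on certain rules" part of the question, one possible approach is a lookup table of rules keyed by tag name, combined with the recursive walk over content_list suggested earlier in the thread. This is only a sketch: the %rule_for table, the convert function, the title="..." rule for <h1>, and the sample HTML are assumptions for illustration, not something from the original posts.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical rules: tag name => code ref that formats that tag's text.
my %rule_for = (
    p  => sub { qq{text="$_[0]"} },
    h1 => sub { qq{title="$_[0]"} },
);

# Walk the tree recursively and return one string per element that has a rule.
sub convert {
    my ($node) = @_;
    return () unless ref $node;    # bare text is picked up via its parent element
    if (my $rule = $rule_for{ $node->tag }) {
        # Text of the element with surrounding whitespace trimmed
        return $rule->($node->as_trimmed_text);
    }
    # No rule for this tag: look for matches among its children.
    return map { convert($_) } $node->content_list;
}

my $tree = HTML::TreeBuilder->new_from_content('<h1>Expo</h1><p>  some text </p>');
my @out  = convert($tree);
print "$_\n" for @out;
$tree->delete;
```

Adding support for another tag then means adding one entry to the table instead of another branch of loops and regexes. Elements nested inside a matched tag are folded into that tag's text here; a different policy would need a different convert.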
