PerlMonks  

html2pyx

by OeufMayo (Curate)
on Aug 31, 2001 at 03:13 UTC

Category: HTML Utility
Author/Contact Info Briac Pilpré - briac@cpan.org
Description:

Pyxie (PYX) is an alternative way of representing XML data. The data is represented very simply, with one piece of information per line.
The nice thing about PYX is how easy its output is to parse. On the other hand, many features of the XML format can't be represented in PYX (CDATA, entities, ...).
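
To illustrate how little work a PYX consumer has to do, here is a minimal sketch of a reader in plain Perl. The helper name parse_pyx and the sample lines are made up for this illustration, and the sketch assumes the usual PYX layout in which the first character of each line gives its type and an attribute line reads "Aname value".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): turn PYX lines into a
# list of events, each an array ref of [type, payload...].
sub parse_pyx {
    my (@lines) = @_;
    my @events;
    for my $line (@lines) {
        chomp $line;
        next unless length $line;
        # The first character is the line type, the rest is the payload
        my ($type, $rest) = (substr($line, 0, 1), substr($line, 1));
        if    ($type eq '(') { push @events, [start => $rest] }
        elsif ($type eq ')') { push @events, [end   => $rest] }
        elsif ($type eq '-') { push @events, [text  => $rest] }
        elsif ($type eq 'A') {
            # Attribute lines are assumed to look like "Aname value"
            my ($name, $value) = split / /, $rest, 2;
            push @events, [attr => $name, defined $value ? $value : ''];
        }
        # Other line types (such as '?') are simply skipped in this sketch
    }
    return @events;
}

# Sample input, invented for the example
my @pyx = (
    "(a",
    "Ahref http://www.perlmonks.org/",
    "-PerlMonks",
    ")a",
);
print join(' ', @$_), "\n" for parse_pyx(@pyx);
```

Each line is handled on its own, with no nesting or quoting rules to track, which is what makes the format convenient for line-oriented tools.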

Now, I know the module XML::PYX exists, and it even comes with a script called pyxhtml, which does pretty much what this code does.
But XML::PYX by itself isn't very flexible if you want finer control over what is kept from the HTML file and what is dropped.

Hopefully, this code can be easily customized to suit your needs, provided you know how to use HTML::Parser (which is really fun to use, especially version 3).
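
As one possible customization, the sketch below keeps only the links of a page. It is an illustration rather than part of the script: the helper name links_to_pyx and the sample HTML are invented, the filtering is done by hand inside the handlers, and it assumes HTML::Parser 3 is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return PYX lines for <a> elements only.
sub links_to_pyx {
    my ($html) = @_;
    my @out;
    my $in_link = 0;    # greater than zero while inside an <a> element

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'a';
                $in_link++;
                push @out, "($tag";
                push @out, "A$_ $attr->{$_}" for sort keys %$attr;
            }, "tagname, attr"],
        end_h => [
            sub {
                my ($tag) = @_;
                return unless $tag eq 'a' && $in_link;
                $in_link--;
                push @out, ")$tag";
            }, "tagname"],
        text_h => [
            sub {
                my ($text) = @_;
                return unless $in_link;    # drop text outside links
                $text =~ s/^\s+|\s+$//g;
                push @out, "-$text" if length $text;
            }, "dtext"],
    );
    $parser->parse($html);
    $parser->eof;
    return @out;
}

# Sample input, invented for the example
print "$_\n" for links_to_pyx(
    '<p>See <a href="http://www.perlmonks.org/">the <b>Monastery</b></a> for more.</p>'
);
```

Tags nested inside a link (the b element here) are dropped but their text is kept. Whether that is the right behaviour depends on what you want to do with the output.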

And the really cool thing is that your HTML doesn't have to be a valid XML file! (I wouldn't try to feed it Word 2000 pseudo-HTML though...)

More info on PYX

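#!/usr/bin/perl -w
use strict;
use HTML::Parser ();

# See PYX format description
# http://www.xml.com/pub/a/2000/03/15/feature/index.html

my $parser = HTML::Parser->new(
        xml_mode        => 1,
        unbroken_text   => 1,
        ignore_elements => ['style', 'script'], # CDATA isn't supported
        start_h => [
                sub {
                        my ($tag, $attr) = @_;
                        print "($tag\n";
                        # PYX writes each attribute on one line as "Aname value"
                        print "A$_ $attr->{$_}\n" foreach keys %{$attr};
                }, "tagname, attr"],
        end_h   => [
                sub {
                        print ")" . shift() . "\n";
                }, "tagname"],
        text_h  => [
                sub {
                        my $text = shift;
                        # Trim surrounding whitespace and skip empty text
                        $text =~ s/^\s+|\s+$//g;
                        return unless length $text;
                        # Escape backslashes and newlines so that each text
                        # item stays on a single line
                        $text =~ s/\\/\\\\/g;
                        $text =~ s/\n/\\n/g;
                        print "-$text\n";
                }, "dtext"],
);

die "usage: $0 file1.html > file1.pyx\n" unless @ARGV;

foreach my $file (@ARGV){
        # parse_file() returns undef if the file can't be opened
        $parser->parse_file($file) or warn "Can't open $file: $!\n";
        $parser->eof();
}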
