Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re^2: Better way of finding HTML tags positions in HTML string

by phoenix007 (Sexton)
on May 14, 2019 at 13:50 UTC ( #1233757=note: print w/replies, xml ) Need Help??


in reply to Re: Better way of finding HTML tags positions in HTML string
in thread Better way of finding HTML tags positions in HTML string

Is there any other way to get tags and there position similar to HTML::TagReader. Or any other module which operate on string

  • Comment on Re^2: Better way of finding HTML tags positions in HTML string

Replies are listed 'Best First'.
Re^3: Better way of finding HTML tags positions in HTML string
by holli (Monsignor) on May 14, 2019 at 17:24 UTC
    I just patched the TagReader.xs file like so: Basically I added this function. I didnt test it thoroughly. It seems to work though.
    HTML::TagReader tr_new_from_io(class, pio) SV *class InputStream pio CODE: if (pio == NULL){ croak("ERROR: Help"); } /* malloc and zero the struct */ Newz(0, RETVAL, 1, struct trstuct ); /* malloc */ New(0, RETVAL->filename, 1, char ); strncpy(RETVAL->filename,newSVpv("",0),0); /* put a zero at the end of the string, perl might not do it */ *(RETVAL->filename + 1 )=(char)0; /* malloc initial buffer */ New(0, RETVAL->buffer, BUFFLEN+1, char ); RETVAL->currbuflen=BUFFLEN; RETVAL->fd=pio; RETVAL->charpos=0; RETVAL->tagcharpos=0; RETVAL->fileline=1; RETVAL->tagline=0; OUTPUT: RETVAL
    And then you can use it as
    my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n"; open my $io, "<", \$str; my $p = HTML::TagReader->new_from_io($io); my @tag; while(@tag = $p->gettag(1)){ print "line: $tag[1]: col: $tag[2]: $tag[0]\n"; }
    Which gives you
    line: 1: col: 2: <blockquote> line: 2: col: 1: <i> line: 2: col: 13: </i> line: 3: col: 1: </blockquote>
    Note, the module is buggy (or maybe to the spec i dont know), but if the html does not end with a newline the last tag gets "forgotten".
    my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>"; #no newline + at the end open my $io, "<", \$str; my $p = HTML::TagReader->new_from_io($io); my @tag; while(@tag = $p->gettag(1)){ print "line: $tag[1]: col: $tag[2]: $tag[0]\n"; }
    Which gives you
    line: 1: col: 2: <blockquote> line: 2: col: 1: <i> line: 2: col: 13: </i>


    holli

    You can lead your users to water, but alas, you cannot drown them.
Re^3: Better way of finding HTML tags positions in HTML string
by talexb (Canon) on May 14, 2019 at 13:58 UTC

    Maybe HTML::Bare? Have a look around CPAN, there are plenty of options. That's just the first one that looked like it might do the job.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Re^3: Better way of finding HTML tags positions in HTML string (updated)
by haukex (Chancellor) on May 15, 2019 at 20:43 UTC

    What's wrong with File::Temp? It's a core module, and it cleans up after itself pretty reliably:

    use File::Temp qw/tempfile/; my ($tfh,$tfn) = tempfile(UNLINK=>1); print $tfh $contents; close $tfh; # File named $tfn will exist till end of program

    And if you want to control the filename, you can do something like tempfile( TMPDIR=>1, TEMPLATE=>'.something_XXXXXXXXXX', SUFFIX => '.html', UNLINK=>1 ), or if you wanted to create the file in the same directory and based on the same name as some other file (File::Basename is also a core module):

    use File::Basename qw/fileparse/; use File::Temp qw/tempfile/; my ($fn,$dir,$ext) = fileparse($filename, qr/\.[^.]+$/); my ($tfh,$tfn) = tempfile(DIR=>$dir, TEMPLATE=>'.'.$fn.'_XXXXXXXXXX', SUFFIX => $ext, UNLINK=>1 ); ...

    I also like to use something like Corion's Text::CleanFragment on the above $fn, but that's not necessarily required.

    Update: Since I'm already dumping some File::Temp snippets, here's two more that use a temporary directory instead, allowing you to keep the original file name. File::Spec is also a core module. tempdir supports the same TEMPLATE, DIR, and TMPDIR arguments as above. Note that if you use only TEMPLATE with a relative name, the resulting filename will also be relative to the current working directory, which is IMO not good, so I'd strongly recommend using an additional TMPDIR=>1 or DIR argument.

    use File::Temp qw/tempdir/; use File::Basename qw/fileparse/; use File::Spec::Functions qw/catfile/; my $tmpdir = tempdir(CLEANUP=>1); my $tfn = catfile($tmpdir, scalar fileparse($filename)); ... # - OR - my ($fn,$dir) = fileparse($filename); my $tmpdir = tempdir(DIR=>$dir, TEMPLATE=>'.XXXXXXXXXX', CLEANUP=>1 ); my $tfn = catfile($tmpdir, $fn); ...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1233757]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (6)
As of 2019-08-21 15:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?