http://www.perlmonks.org?node_id=1233756


in reply to Better way of finding HTML tags positions in HTML string

Looking at the source code for HTML::TagReader, it looks like the module only operates on files. That tells me that if you want to use this module, you'll need to create a (temporary) file, write your string into it, and go from there. The File::Temp module is a good choice for that.
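
For what it's worth, that workflow might look roughly like the sketch below. It assumes HTML::TagReader is installed (the parsing part is skipped otherwise). The helper name string_to_tempfile and the sample HTML are made up for illustration. The (tag, line, column) list returned by gettag is taken from the usage shown further down this thread, not verified here.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Hypothetical helper: write a string to a temp file and return the file name.
sub string_to_tempfile {
    my ($contents) = @_;
    my ($tfh, $tfn) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$tfh} $contents;
    close $tfh or die "Cannot close $tfn: $!";
    return $tfn;
}

my $html = "<p>\n<b>Hello</b>\n</p>\n";
my $file = string_to_tempfile($html);

# HTML::TagReader is not a core module, so only use it if it loads.
if (eval { require HTML::TagReader; 1 }) {
    my $p = HTML::TagReader->new($file);
    while (my @tag = $p->gettag(1)) {
        print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
    }
}
```

With UNLINK => 1 the temporary file is removed when the program exits, so there is nothing to clean up by hand.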

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Replies are listed 'Best First'.
Re^2: Better way of finding HTML tags positions in HTML string
by phoenix007 (Sexton) on May 14, 2019 at 13:50 UTC

    Is there any other way to get tags and their positions, similar to HTML::TagReader? Or is there any other module that operates on a string?

      I just patched the TagReader.xs file; basically, I added the function below. I didn't test it thoroughly, but it seems to work.
          HTML::TagReader
          tr_new_from_io(class, pio)
              SV *class
              InputStream pio
          CODE:
              if (pio == NULL) {
                  croak("ERROR: Help");
              }
              /* malloc and zero the struct */
              Newz(0, RETVAL, 1, struct trstuct);
              /* no file name for an IO handle: store an empty string
                 (one byte, just the terminating NUL) */
              New(0, RETVAL->filename, 1, char);
              RETVAL->filename[0] = '\0';
              /* malloc initial buffer */
              New(0, RETVAL->buffer, BUFFLEN+1, char);
              RETVAL->currbuflen = BUFFLEN;
              RETVAL->fd = pio;
              RETVAL->charpos = 0;
              RETVAL->tagcharpos = 0;
              RETVAL->fileline = 1;
              RETVAL->tagline = 0;
          OUTPUT:
              RETVAL
      And then you can use it as
          my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
          open my $io, "<", \$str;
          my $p = HTML::TagReader->new_from_io($io);
          my @tag;
          while(@tag = $p->gettag(1)){
              print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
          }
      Which gives you
          line: 1: col: 2: <blockquote>
          line: 2: col: 1: <i>
          line: 2: col: 13: </i>
          line: 3: col: 1: </blockquote>
      Note that the module is buggy (or perhaps it behaves to spec, I don't know): if the HTML does not end with a newline, the last tag gets "forgotten".
          my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>"; # no newline at the end
          open my $io, "<", \$str;
          my $p = HTML::TagReader->new_from_io($io);
          my @tag;
          while(@tag = $p->gettag(1)){
              print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
          }
      Which gives you
          line: 1: col: 2: <blockquote>
          line: 2: col: 1: <i>
          line: 2: col: 13: </i>
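
      One possible workaround, not tested against the module itself, is to make sure the string ends with a newline before handing it to the parser. A minimal sketch (ensure_trailing_newline is a made-up helper name):

```perl
use strict;
use warnings;

# Hypothetical helper: append a newline unless the string already ends in one.
sub ensure_trailing_newline {
    my ($str) = @_;
    $str .= "\n" unless $str =~ /\n\z/;
    return $str;
}

my $str = ensure_trailing_newline("<blockquote>\n<i>Perlmonks</i>\n</blockquote>");
open my $io, "<", \$str or die "Cannot open in-memory handle: $!";
```

      The resulting $io could then be passed to the patched new_from_io in the same way as in the snippets above.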


      holli

      You can lead your users to water, but alas, you cannot drown them.

      Maybe HTML::Bare? Have a look around CPAN, there are plenty of options. That's just the first one that looked like it might do the job.
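
      If adding another dependency is not wanted, a rough regex scan can also report positions straight from the string. This is only a sketch and not a real HTML parser: it will be fooled by "<" or ">" inside comments, scripts, or attribute values. The function name tag_positions and the 1-based line/column convention are my own choices, not taken from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return one hashref per tag-like match, with its
# 0-based character offset and 1-based line and column.
sub tag_positions {
    my ($html) = @_;
    my @found;
    while ($html =~ /(<[^<>]+>)/g) {
        my $tag     = $1;
        my $offset  = $-[1];                       # start of the match
        my $before  = substr($html, 0, $offset);
        my $line    = 1 + ($before =~ tr/\n//);    # count newlines before the tag
        my $last_nl = rindex($before, "\n");       # -1 when on the first line
        my $col     = $offset - $last_nl;
        push @found, { tag => $tag, line => $line, col => $col, offset => $offset };
    }
    return @found;
}

my $sample = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>";
for my $t (tag_positions($sample)) {
    print "line: $t->{line}: col: $t->{col}: $t->{tag}\n";
}
```

      Because it works on the string directly, a missing trailing newline does not matter here.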

      Alex / talexb / Toronto

      What's wrong with File::Temp? It's a core module, and it cleans up after itself pretty reliably:

          use File::Temp qw/tempfile/;
          my ($tfh,$tfn) = tempfile(UNLINK=>1);
          print $tfh $contents;
          close $tfh;
          # File named $tfn will exist till end of program

      And if you want to control the filename, you can do something like tempfile( TMPDIR=>1, TEMPLATE=>'.something_XXXXXXXXXX', SUFFIX => '.html', UNLINK=>1 ), or if you wanted to create the file in the same directory and based on the same name as some other file (File::Basename is also a core module):

          use File::Basename qw/fileparse/;
          use File::Temp qw/tempfile/;
          my ($fn,$dir,$ext) = fileparse($filename, qr/\.[^.]+$/);
          my ($tfh,$tfn) = tempfile(DIR=>$dir,
              TEMPLATE=>'.'.$fn.'_XXXXXXXXXX', SUFFIX => $ext, UNLINK=>1 );
          ...

      I also like to use something like Corion's Text::CleanFragment on the above $fn, but that's not necessarily required.

      Update: Since I'm already dumping some File::Temp snippets, here's two more that use a temporary directory instead, allowing you to keep the original file name. File::Spec is also a core module. tempdir supports the same TEMPLATE, DIR, and TMPDIR arguments as above. Note that if you use only TEMPLATE with a relative name, the resulting filename will also be relative to the current working directory, which is IMO not good, so I'd strongly recommend using an additional TMPDIR=>1 or DIR argument.

          use File::Temp qw/tempdir/;
          use File::Basename qw/fileparse/;
          use File::Spec::Functions qw/catfile/;
          my $tmpdir = tempdir(CLEANUP=>1);
          my $tfn = catfile($tmpdir, scalar fileparse($filename));
          ...

          # - OR -

          my ($fn,$dir) = fileparse($filename);
          my $tmpdir = tempdir(DIR=>$dir, TEMPLATE=>'.XXXXXXXXXX', CLEANUP=>1 );
          my $tfn = catfile($tmpdir, $fn);
          ...
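
      To illustrate the point about relative templates, here is a small sketch that combines TEMPLATE with TMPDIR=>1, so that the names end up under the system temporary directory instead of the current working directory. The '.demo_' prefix is just a placeholder.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile tempdir/;
use File::Spec;

# TEMPLATE plus TMPDIR=>1: the file is created under the system temp directory.
my ($tfh, $tfn) = tempfile(TEMPLATE => '.demo_XXXXXXXXXX', TMPDIR => 1, UNLINK => 1);
close $tfh;

# Same idea for a temporary directory.
my $tmpdir = tempdir(TEMPLATE => '.demo_XXXXXXXXXX', TMPDIR => 1, CLEANUP => 1);

print "system temp dir: ", File::Spec->tmpdir, "\n";
print "temp file:       $tfn\n";
print "temp dir:        $tmpdir\n";
```

      Both names should come out absolute on a typical setup, which makes them safe to use even if the program later changes its working directory.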