Better way of finding HTML tags positions in HTML string

phoenix007 has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following code to get the start and end positions of HTML tags. The problem is that HTML::TagReader requires a file name as its argument, but I have the HTML as a string in a variable. I don't want to create a file and then delete it just to use this module. Can anyone suggest a better solution where I can use a string instead of a file?

Note: The problem is that HTML::TagReader does not accept a string argument. I am only using this module to get the positions of HTML tags. Is there a better option?

  use strict;
  use warnings;
  use HTML::TagReader;

  # Instead of reading this file, I want to do the same thing with the HTML
  # held in a variable, say $html_string = 'content of test2.html'.
  my $filename = 'test2.html';
  my $p = HTML::TagReader->new($filename);
  open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";

  # $line_chars{$n} holds the total number of characters in lines 1..$n.
  my %line_chars;
  my $line_number = 1;
  while (my $row = <$fh>) {
        $line_chars{$line_number} = ($line_chars{$line_number - 1} // 0) + length($row);
        $line_number++;
  }
  close $fh;

  # Flag passed to getbytoken; a true value is assumed to enable HTML warnings.
  my $showerr = 0;
  my @atags;      # stack of positions of <a> tags that are still open
  my %atagrange;  # position of <a> => position of the matching </a>
  while (my ($tagOrText, $tagtype, $linenumber, $column) = $p->getbytoken($showerr)) {
        # Characters on all earlier lines plus the column on the current line.
        my $position = ($linenumber > 1 ? $line_chars{$linenumber - 1} : 0) + $column;
        # print "tagtype: $tagtype, line: $linenumber, column: $column, position: $position\n";
        if ($tagtype eq 'a') {
              push(@atags, $position);
        }
        elsif ($tagtype eq '/a') {
              my $a_start_tag_pos = pop(@atags);
              $atagrange{$a_start_tag_pos} = $position if defined $a_start_tag_pos;
        }
  }

thanks in advance...


Replies are listed 'Best First'.
Re: Better way of finding HTML tags positions in HTML string by talexb (Chancellor) on May 14, 2019 at 13:29 UTC
Looking at the source code for HTML::TagReader, it looks like the module only operates on files. That tells me that if you want to use this module, you'll need to create a (temporary) file, write your string into it, and go from there. The File::Temp module is a good choice for that.
Alex / talexb / Toronto
Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
Re^2: Better way of finding HTML tags positions in HTML string by phoenix007 (Sexton) on May 14, 2019 at 13:50 UTC
Is there any other way to get tags and their positions, similar to HTML::TagReader? Or is there any other module that operates on a string?
Re^3: Better way of finding HTML tags positions in HTML string by holli (Abbot) on May 14, 2019 at 17:24 UTC
I just patched the TagReader.xs file. Basically I added this function. I didn't test it thoroughly. It seems to work though.

  HTML::TagReader
  tr_new_from_io(class, pio)
      SV *class
      InputStream pio
  CODE:
      if (pio == NULL){
          croak("ERROR: Help");
      }
      /* malloc and zero the struct */
      Newz(0, RETVAL, 1, struct trstuct );
      /* malloc one byte: there is no file name, so store an empty string */
      New(0, RETVAL->filename, 1, char );
      /* put a zero at the end of the string, perl might not do it */
      *(RETVAL->filename)=(char)0;
      /* malloc initial buffer */
      New(0, RETVAL->buffer, BUFFLEN+1, char );
      RETVAL->currbuflen=BUFFLEN;
      RETVAL->fd=pio;
      RETVAL->charpos=0;
      RETVAL->tagcharpos=0;
      RETVAL->fileline=1;
      RETVAL->tagline=0;
  OUTPUT:
      RETVAL

And then you can use it as

  my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
  open my $io, "<", \$str;
  my $p = HTML::TagReader->new_from_io($io);
  my @tag;
  while(@tag = $p->gettag(1)){
      print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
  }

Which gives you

  line: 1: col: 2: <blockquote>
  line: 2: col: 1: <i>
  line: 2: col: 13: </i>
  line: 3: col: 1: </blockquote>

Note, the module is buggy (or maybe it follows the spec, I don't know): if the HTML does not end with a newline, the last tag gets "forgotten".

  my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>"; # no newline at the end
  open my $io, "<", \$str;
  my $p = HTML::TagReader->new_from_io($io);
  my @tag;
  while(@tag = $p->gettag(1)){
      print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
  }

Which gives you

  line: 1: col: 2: <blockquote>
  line: 2: col: 1: <i>
  line: 2: col: 13: </i>

holli
You can lead your users to water, but alas, you cannot drown them.
Re^3: Better way of finding HTML tags positions in HTML string by talexb (Chancellor) on May 14, 2019 at 13:58 UTC
Maybe HTML::Bare? Have a look around CPAN, there are plenty of options. That's just the first one that looked like it might do the job.
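As one illustration of the "plenty of options" on CPAN, here is a minimal sketch using HTML::Parser, which is not mentioned elsewhere in this thread and is not a core module. It assumes the `tag`, `line`, `column`, `offset` and `text` argspec names behave as described in the HTML::Parser documentation (lines counted from 1, columns and byte offsets from 0). The sample string and the variable names are hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input, borrowed from the examples in this thread.
my $html_string = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>";

my @tags;    # one hash per start or end tag, in document order

# Shared handler: arguments arrive in the order named in $argspec.
my $argspec = 'tag, line, column, offset, text';
my $record  = sub {
    my ($tag, $line, $column, $offset, $text) = @_;
    push @tags, { tag => $tag, line => $line, column => $column,
                  offset => $offset, text => $text };
};

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ $record, $argspec ],
    end_h       => [ $record, $argspec ],
);
$parser->parse($html_string);   # parses a string directly, no file needed
$parser->eof;                   # flush anything still buffered

printf "%-12s line %d, column %d, offset %d\n",
    $_->{tag}, $_->{line}, $_->{column}, $_->{offset} for @tags;
```

Because `offset` is already an absolute position in the string, the separate table of cumulative line lengths from the original question would not be needed with this approach. End tags are reported with a leading slash (for example `/i`), which mirrors the `a` and `/a` comparison in the question.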
Re^3: Better way of finding HTML tags positions in HTML string (updated) by haukex (Archbishop) on May 15, 2019 at 20:43 UTC
What's wrong with File::Temp? It's a core module, and it cleans up after itself pretty reliably:

  use File::Temp qw/tempfile/;
  my ($tfh,$tfn) = tempfile(UNLINK=>1);
  print $tfh $contents;
  close $tfh;
  # File named $tfn will exist till end of program

And if you want to control the filename, you can do something like `tempfile( TMPDIR=>1, TEMPLATE=>'.something_XXXXXXXXXX', SUFFIX => '.html', UNLINK=>1 )`, or if you wanted to create the file in the same directory and based on the same name as some other file (File::Basename is also a core module):

  use File::Basename qw/fileparse/;
  use File::Temp qw/tempfile/;
  my ($fn,$dir,$ext) = fileparse($filename, qr/\.[^.]+$/);
  my ($tfh,$tfn) = tempfile(DIR=>$dir, TEMPLATE=>'.'.$fn.'_XXXXXXXXXX',
      SUFFIX => $ext, UNLINK=>1 );
  ...

I also like to use something like Corion's Text::CleanFragment on the above `$fn`, but that's not necessarily required.

Update: Since I'm already dumping some File::Temp snippets, here are two more that use a temporary directory instead, allowing you to keep the original file name. File::Spec is also a core module. `tempdir` supports the same `TEMPLATE`, `DIR`, and `TMPDIR` arguments as above. Note that if you use only `TEMPLATE` with a relative name, the resulting filename will also be relative to the current working directory, which is IMO not good, so I'd strongly recommend using an additional `TMPDIR=>1` or `DIR` argument.

  use File::Temp qw/tempdir/;
  use File::Basename qw/fileparse/;
  use File::Spec::Functions qw/catfile/;
  my $tmpdir = tempdir(CLEANUP=>1);
  my $tfn = catfile($tmpdir, scalar fileparse($filename));
  ...
  # - OR -
  my ($fn,$dir) = fileparse($filename);
  my $tmpdir = tempdir(DIR=>$dir, TEMPLATE=>'.XXXXXXXXXX', CLEANUP=>1 );
  my $tfn = catfile($tmpdir, $fn);
  ...
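Putting the first snippet together with the original question, a minimal sketch might look like the following. The helper name `string_to_tempfile` and the sample string are hypothetical, and the call to HTML::TagReader is left as a comment so that the example runs with core modules only.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Hypothetical helper: write a string to a temporary file and return its name.
# UNLINK => 1 asks File::Temp to remove the file when the program exits.
sub string_to_tempfile {
    my ($contents) = @_;
    my ($tfh, $tfn) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$tfh} $contents or die "Could not write to '$tfn': $!";
    close $tfh or die "Could not close '$tfn': $!";
    return $tfn;
}

# Hypothetical sample input.
my $html_string = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
my $filename    = string_to_tempfile($html_string);

# $filename can now be handed to code that insists on a file name, e.g.
#   my $p = HTML::TagReader->new($filename);
my $size = -s $filename;
print "wrote $size bytes to $filename\n";
```

Closing the handle before passing the name on matters here: it ensures the whole string has been flushed to disk before another reader opens the file.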
Re: Better way of finding HTML tags positions in HTML string by Anonymous Monk on May 14, 2019 at 11:32 UTC
Re: How do I treat a string like a filehandle?
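For reference, a minimal sketch of the technique that title refers to: Perl can open a filehandle on a reference to a scalar. The example below uses it to rebuild the table of line offsets from the original question without touching the disk. The sample string and the name `%line_start` are hypothetical, and the string is assumed to contain plain bytes (no wide characters).

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML string would do.
my $html_string = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";

# Open a read-only filehandle on the scalar itself (no file on disk).
open(my $fh, '<', \$html_string)
    or die "Could not open in-memory handle: $!";

# $line_start{$n} is the 0-based offset at which line $n begins.
my %line_start;
my $offset = 0;
while (my $row = <$fh>) {
    $line_start{$.} = $offset;    # $. is the current line number of $fh
    $offset += length $row;
}
close $fh;

printf "line %d starts at offset %d\n", $_, $line_start{$_}
    for sort { $a <=> $b } keys %line_start;
```

This only replaces the file-reading loop that computes line lengths. It does not help with a module that opens the file itself from a name.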
Re^2: Better way of finding HTML tags positions in HTML string by phoenix007 (Sexton) on May 14, 2019 at 11:42 UTC
HTML::TagReader only accepts a file name as a string scalar and opens the file on its own. It does not accept filehandles, so this will not work for HTML::TagReader.
