PerlMonks  

Better way of finding HTML tags positions in HTML string

by phoenix007 (Sexton)
on May 14, 2019 at 11:23 UTC

phoenix007 has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following code to get the start and end positions of HTML tags. The problem is that HTML::TagReader requires a file as its argument, but I have the HTML as a string in a variable. I don't want to create a file and delete it just to use this module. Can anyone suggest a better solution where I can use a string instead of a file?

Note: The problem is that HTML::TagReader does not accept a string argument. I am only trying to get the positions of HTML tags using this module. Is there any better option?

    use HTML::TagReader;

    my $filename = 'test2.html';
    # Here instead of using this file I want to do same thing using HTML as a
    # string in some variable say $html_string = 'content of test2.html'
    my $p = new HTML::TagReader "$filename";
    open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";

    my %line_chars;
    my $line_number = 1;
    while (my $row = <$fh>) {
        if ($line_number > 1) {
            $line_chars{$line_number} = $line_chars{$line_number - 1} + length($row);
        }
        else {
            $line_chars{$line_number} = length($row);
        }
        $line_number++;
    }

    my @atags;
    my %atagrange;
    while (my ($tagOrText, $tagtype, $linenumber, $column) = $p->getbytoken($showerr)) {
        my $position;
        my $a_start_tag_pos;
        if ($linenumber > 1) {
            $position = $line_chars{$linenumber - 1} + $column;
        }
        #print "\ntagOrText:" . $tagOrText . "\ntagtype : " . $tagtype . "\nline number :" . $linenumber . "\ncolumn : " . $column . "\nposition : " . $position . "\n";
        if ($tagtype eq "a" or $tagtype eq '/a') {
            if ($tagtype eq "a") {
                push(@atags, $position);
            }
            else {
                $a_start_tag_pos = pop(@atags);
                $atagrange{$a_start_tag_pos} = $position;
            }
        }
    }

thanks in advance...

Replies are listed 'Best First'.
Re: Better way of finding HTML tags positions in HTML string
by talexb (Canon) on May 14, 2019 at 13:29 UTC

    Looking at the source code for HTML::TagReader, it looks like the module only operates on files. That tells me that if you want to use this module, you'll need to create a (temporary) file, write your string into it, and go from there. The File::Temp module is a good choice for that.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Is there any other way to get tags and their positions, similar to HTML::TagReader? Or is there any other module which operates on a string?

        I just patched the TagReader.xs file like so. Basically I added this function. I didn't test it thoroughly, but it seems to work.
            HTML::TagReader
            tr_new_from_io(class, pio)
                SV *class
                InputStream pio
            CODE:
                if (pio == NULL){
                    croak("ERROR: Help");
                }
                /* malloc and zero the struct */
                Newz(0, RETVAL, 1, struct trstuct );
                /* malloc */
                New(0, RETVAL->filename, 1, char );
                strncpy(RETVAL->filename,newSVpv("",0),0);
                /* put a zero at the end of the string, perl might not do it */
                *(RETVAL->filename + 1 )=(char)0;
                /* malloc initial buffer */
                New(0, RETVAL->buffer, BUFFLEN+1, char );
                RETVAL->currbuflen=BUFFLEN;
                RETVAL->fd=pio;
                RETVAL->charpos=0;
                RETVAL->tagcharpos=0;
                RETVAL->fileline=1;
                RETVAL->tagline=0;
            OUTPUT:
                RETVAL
        And then you can use it as
            my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
            open my $io, "<", \$str;
            my $p = HTML::TagReader->new_from_io($io);
            my @tag;
            while(@tag = $p->gettag(1)){
                print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
            }
        Which gives you
            line: 1: col: 2: <blockquote>
            line: 2: col: 1: <i>
            line: 2: col: 13: </i>
            line: 3: col: 1: </blockquote>
        Note, the module is buggy (or maybe it behaves to spec, I don't know): if the HTML does not end with a newline, the last tag gets "forgotten".
            my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>"; # no newline at the end
            open my $io, "<", \$str;
            my $p = HTML::TagReader->new_from_io($io);
            my @tag;
            while(@tag = $p->gettag(1)){
                print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
            }
        Which gives you
            line: 1: col: 2: <blockquote>
            line: 2: col: 1: <i>
            line: 2: col: 13: </i>
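The line and column pairs above still need converting to a single position, which the original post does by reading the file a second time. A minimal sketch of doing that conversion from the string itself, using the same in-memory filehandle trick, might look like this. The helper names `line_ends` and `absolute_position` are hypothetical (not part of HTML::TagReader), the arithmetic simply mirrors the original post (characters in all preceding lines plus the reported column), and the string is assumed to contain no wide characters, since in-memory filehandles work on bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: cumulative character counts per line, built from a string.
# $end->[$n] is the total number of characters in lines 1..$n; $end->[0] is 0.
sub line_ends {
    my ($html) = @_;
    open my $io, '<', \$html or die "Cannot open in-memory filehandle: $!";
    my @end = (0);
    while (my $row = <$io>) {
        push @end, $end[-1] + length $row;
    }
    close $io;
    return \@end;
}

# Same arithmetic as the original post: characters in all preceding
# lines plus the column. Because $end->[0] is 0, line 1 works too.
sub absolute_position {
    my ($end, $line, $column) = @_;
    return $end->[$line - 1] + $column;
}

my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
my $end = line_ends($str);

# Line 2, column 1 is the <i> tag in the output shown above.
print absolute_position($end, 2, 1), "\n";
```

Whether the result should be read as 0-based or 1-based depends on how the tokenizer counts columns, so it is worth checking one known tag with `substr` before relying on it.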


        holli

        You can lead your users to water, but alas, you cannot drown them.

        Maybe HTML::Bare? Have a look around CPAN, there are plenty of options. That's just the first one that looked like it might do the job.
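If only the positions of `<a>` and `</a>` are needed and the markup is simple, a core-only sketch that works directly on the string is also possible. The code below is a naive regex scan, not a real HTML parser: it will be fooled by comments, scripts, or attribute values containing `>`. The helper name `a_tag_ranges` is hypothetical, and it pairs start and end tags with a stack in the same way as the code in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module mentioned in this thread).
# Returns a hash mapping the 0-based offset of each <a ...> start tag
# to the 0-based offset of its matching </a> end tag.
sub a_tag_ranges {
    my ($html) = @_;
    my (@open, %range);
    while ($html =~ m{<(/?)a\b[^>]*>}gi) {
        my $start = $-[0];    # offset where the matched tag begins
        if ($1 eq '') {
            push @open, $start;              # start tag: remember its offset
        }
        elsif (@open) {
            $range{ pop @open } = $start;    # end tag: pair with latest open <a>
        }
    }
    return %range;
}

my $html_string = qq{<p><a href="x">one</a> and <a href="y">two</a></p>\n};
my %atagrange   = a_tag_ranges($html_string);

printf "%d => %d\n", $_, $atagrange{$_}
    for sort { $a <=> $b } keys %atagrange;
```

For anything beyond simple, well-formed input, a proper parser module from CPAN remains the safer choice.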

        Alex / talexb / Toronto

        Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

        What's wrong with File::Temp? It's a core module, and it cleans up after itself pretty reliably:

            use File::Temp qw/tempfile/;
            my ($tfh,$tfn) = tempfile(UNLINK=>1);
            print $tfh $contents;
            close $tfh;
            # File named $tfn will exist till end of program

        And if you want to control the filename, you can do something like tempfile( TMPDIR=>1, TEMPLATE=>'.something_XXXXXXXXXX', SUFFIX => '.html', UNLINK=>1 ), or if you wanted to create the file in the same directory and based on the same name as some other file (File::Basename is also a core module):

            use File::Basename qw/fileparse/;
            use File::Temp qw/tempfile/;
            my ($fn,$dir,$ext) = fileparse($filename, qr/\.[^.]+$/);
            my ($tfh,$tfn) = tempfile(DIR=>$dir,
                TEMPLATE=>'.'.$fn.'_XXXXXXXXXX', SUFFIX => $ext, UNLINK=>1 );
            ...

        I also like to use something like Corion's Text::CleanFragment on the above $fn, but that's not necessarily required.

        Update: Since I'm already dumping some File::Temp snippets, here's two more that use a temporary directory instead, allowing you to keep the original file name. File::Spec is also a core module. tempdir supports the same TEMPLATE, DIR, and TMPDIR arguments as above. Note that if you use only TEMPLATE with a relative name, the resulting filename will also be relative to the current working directory, which is IMO not good, so I'd strongly recommend using an additional TMPDIR=>1 or DIR argument.

            use File::Temp qw/tempdir/;
            use File::Basename qw/fileparse/;
            use File::Spec::Functions qw/catfile/;
            my $tmpdir = tempdir(CLEANUP=>1);
            my $tfn = catfile($tmpdir, scalar fileparse($filename));
            ...
            # - OR -
            my ($fn,$dir) = fileparse($filename);
            my $tmpdir = tempdir(DIR=>$dir, TEMPLATE=>'.XXXXXXXXXX', CLEANUP=>1 );
            my $tfn = catfile($tmpdir, $fn);
            ...
Re: Better way of finding HTML tags positions in HTML string
by Anonymous Monk on May 14, 2019 at 11:32 UTC

      HTML::TagReader only accepts a filename as a string scalar and opens the file on its own. It does not accept filehandles, so this will not work for HTML::TagReader.
