Better way of finding HTML tags positions in HTML string

by phoenix007 (Acolyte)
on May 14, 2019 at 11:23 UTC (#1233750)

phoenix007 has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following code to get the start and end positions of HTML tags. The problem is that HTML::TagReader requires a file as its argument, but I have the HTML as a string in a variable. I don't want to create a file and delete it just to use this module. Can anyone suggest a better solution where I can use a string instead of a file?

Note: The problem is that HTML::TagReader does not accept a string argument. I am only using this module to get the positions of HTML tags. Is there a better option?

    use HTML::TagReader;

    my $filename = 'test2.html';
    # Here, instead of using this file, I want to do the same thing using
    # HTML as a string in some variable, say
    # $html_string = 'content of test2.html'
    my $p = HTML::TagReader->new($filename);

    open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";

    # $line_chars{$n} holds the number of characters in lines 1..$n
    my %line_chars;
    my $line_number = 1;
    while (my $row = <$fh>) {
        if ($line_number > 1) {
            $line_chars{$line_number} = $line_chars{$line_number - 1} + length($row);
        }
        else {
            $line_chars{$line_number} = length($row);
        }
        $line_number++;
    }
    close($fh);

    # Flag passed to getbytoken(); assumed to toggle the module's HTML warnings
    my $showerr = 0;
    my @atags;
    my %atagrange;
    while (my ($tagOrText, $tagtype, $linenumber, $column) = $p->getbytoken($showerr)) {
        # Tokens on line 1 have no preceding lines, so the column is the position
        my $position = $column;
        if ($linenumber > 1) {
            $position = $line_chars{$linenumber - 1} + $column;
        }
        # print "\ntagOrText:" . $tagOrText . "\ntagtype : " . $tagtype
        #     . "\nline number :" . $linenumber . "\ncolumn : " . $column
        #     . "\nposition : " . $position . "\n";
        if ($tagtype eq "a" or $tagtype eq '/a') {
            if ($tagtype eq "a") {
                push(@atags, $position);
            }
            else {
                my $a_start_tag_pos = pop(@atags);
                $atagrange{$a_start_tag_pos} = $position;
            }
        }
    }

thanks in advance...
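
The line-length table in the code above does not itself need a file: Perl can open a read-only filehandle on a scalar reference, so the cumulative counts can be built straight from the string. What follows is a minimal sketch using only core Perl, assuming the HTML sits in a variable called $html_string and that columns are combined with line lengths the way the code above does it. The names line_offsets and absolute_position are illustrative only and are not part of HTML::TagReader. In-memory handles reject strings with code points above 0xFF, so such input would need encoding first.

```perl
use strict;
use warnings;

# Build a table where $offsets[$n] is the number of characters in lines 1..$n
# ($offsets[0] is 0, so line 1 needs no special case).
sub line_offsets {
    my ($html) = @_;
    my @offsets = (0);
    open(my $fh, '<', \$html) or die "Could not open in-memory handle: $!";
    while (my $row = <$fh>) {
        push @offsets, $offsets[-1] + length($row);
    }
    close $fh;
    return \@offsets;
}

# Convert a (line, column) pair into a position within the whole string.
sub absolute_position {
    my ($offsets, $line, $column) = @_;
    return $offsets->[$line - 1] + $column;
}

my $html_string = "<p>\n<a href=\"x\">link</a>\n</p>\n";
my $offsets     = line_offsets($html_string);

# Line 2, column 0 is where the <a ...> tag starts in this sample.
my $pos = absolute_position($offsets, 2, 0);
print substr($html_string, $pos, 2), "\n";   # prints "<a"
```

This only replaces the file-reading loop; getting the line and column of each tag still needs a tokenizer that can work on the string.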

Replies are listed 'Best First'.
Re: Better way of finding HTML tags positions in HTML string
by talexb (Canon) on May 14, 2019 at 13:29 UTC

    Looking at the source code for HTML::TagReader, it looks like the module only operates on files. That tells me that if you want to use this module, you'll need to create a (temporary) file, write your string into it, and go from there. The File::Temp module is a good choice for that.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Is there any other way to get tags and their positions, similar to HTML::TagReader? Or any other module which operates on a string?

        I just patched the TagReader.xs file. Basically, I added the function below. I didn't test it thoroughly, but it seems to work.
            HTML::TagReader
            tr_new_from_io(class, pio)
                SV *class
                InputStream pio
            CODE:
                if (pio == NULL){
                    croak("ERROR: Help");
                }
                /* malloc and zero the struct */
                Newz(0, RETVAL, 1, struct trstuct );
                /* malloc one byte: there is no file name, so store an empty string */
                New(0, RETVAL->filename, 1, char );
                *(RETVAL->filename)=(char)0;
                /* malloc initial buffer */
                New(0, RETVAL->buffer, BUFFLEN+1, char );
                RETVAL->currbuflen=BUFFLEN;
                RETVAL->fd=pio;
                RETVAL->charpos=0;
                RETVAL->tagcharpos=0;
                RETVAL->fileline=1;
                RETVAL->tagline=0;
            OUTPUT:
                RETVAL
        And then you can use it as
            my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
            open my $io, "<", \$str;
            my $p = HTML::TagReader->new_from_io($io);
            my @tag;
            while(@tag = $p->gettag(1)){
                print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
            }
        Which gives you
            line: 1: col: 2: <blockquote>
            line: 2: col: 1: <i>
            line: 2: col: 13: </i>
            line: 3: col: 1: </blockquote>
        Note that the module is buggy (or maybe it behaves to spec, I don't know): if the HTML does not end with a newline, the last tag gets "forgotten".
            my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>"; # no newline at the end
            open my $io, "<", \$str;
            my $p = HTML::TagReader->new_from_io($io);
            my @tag;
            while(@tag = $p->gettag(1)){
                print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
            }
        Which gives you
            line: 1: col: 2: <blockquote>
            line: 2: col: 1: <i>
            line: 2: col: 13: </i>
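
If the dropped last tag really is caused by the missing trailing newline, one possible workaround is to make sure the string ends with a newline before opening the in-memory handle. This is a minimal sketch using only core Perl. The helper name ensure_trailing_newline is hypothetical, and the commented-out line assumes the patched new_from_io constructor shown above, which this sketch does not exercise.

```perl
use strict;
use warnings;

# Return a copy of the string that is guaranteed to end with "\n".
sub ensure_trailing_newline {
    my ($text) = @_;
    $text .= "\n" unless $text =~ /\n\z/;
    return $text;
}

my $str = ensure_trailing_newline("<blockquote>\n<i>Perlmonks</i>\n</blockquote>");
open(my $io, '<', \$str) or die "Could not open in-memory handle: $!";

# With the patched module one would then do (untested assumption):
#   my $p = HTML::TagReader->new_from_io($io);

# Stand-in check: every line read from the handle now ends with a newline.
my @lines = <$io>;
close $io;
print scalar(@lines), " lines, last one ends with newline: ",
      ($lines[-1] =~ /\n\z/ ? "yes" : "no"), "\n";
```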


        holli

        You can lead your users to water, but alas, you cannot drown them.

        Maybe HTML::Bare? Have a look around CPAN, there are plenty of options. That's just the first one that looked like it might do the job.

        Alex / talexb / Toronto

        Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
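
For simple, well-formed input, the start and end positions of a tags can also be collected from the string with nothing but a regular expression and the @- array. This is a naive sketch, not a real HTML parser: it will be fooled by comments, scripts, and a '>' inside an attribute value. The helper name a_tag_ranges is hypothetical, and the positions are 0-based character offsets into the string rather than the line and column pairs that HTML::TagReader reports.

```perl
use strict;
use warnings;

# Map the offset of each <a ...> start tag to the offset of its </a> end tag.
sub a_tag_ranges {
    my ($html) = @_;
    my (@open, %range);
    # Match <a>, <a ...> or </a>; $1 is "/" for an end tag, "" for a start tag.
    while ($html =~ m{<(/?)a(?:\s[^>]*)?>}gi) {
        my $pos = $-[0];                 # offset where this match starts
        if ($1 eq '') {
            push @open, $pos;            # remember the start tag
        }
        elsif (@open) {
            $range{ pop @open } = $pos;  # pair end tag with latest start tag
        }
    }
    return \%range;
}

my $html_string = qq{<p><a href="x">one</a> and <a href="y">two</a></p>\n};
my $ranges = a_tag_ranges($html_string);
printf "%d => %d\n", $_, $ranges->{$_} for sort { $a <=> $b } keys %$ranges;
```

The stack of open tags mirrors the push/pop pairing in the original question, so nested or unbalanced tags are handled the same way as there.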

        What's wrong with File::Temp? It's a core module, and it cleans up after itself pretty reliably:

            use File::Temp qw/tempfile/;
            my ($tfh,$tfn) = tempfile(UNLINK=>1);
            print $tfh $contents;
            close $tfh;
            # File named $tfn will exist till end of program

        And if you want to control the filename, you can do something like tempfile( TMPDIR=>1, TEMPLATE=>'.something_XXXXXXXXXX', SUFFIX => '.html', UNLINK=>1 ), or if you wanted to create the file in the same directory and based on the same name as some other file (File::Basename is also a core module):

            use File::Basename qw/fileparse/;
            use File::Temp qw/tempfile/;
            my ($fn,$dir,$ext) = fileparse($filename, qr/\.[^.]+$/);
            my ($tfh,$tfn) = tempfile(DIR=>$dir, TEMPLATE=>'.'.$fn.'_XXXXXXXXXX',
                SUFFIX => $ext, UNLINK=>1 );
            ...

        I also like to use something like Corion's Text::CleanFragment on the above $fn, but that's not necessarily required.

        Update: Since I'm already dumping some File::Temp snippets, here's two more that use a temporary directory instead, allowing you to keep the original file name. File::Spec is also a core module. tempdir supports the same TEMPLATE, DIR, and TMPDIR arguments as above. Note that if you use only TEMPLATE with a relative name, the resulting filename will also be relative to the current working directory, which is IMO not good, so I'd strongly recommend using an additional TMPDIR=>1 or DIR argument.

            use File::Temp qw/tempdir/;
            use File::Basename qw/fileparse/;
            use File::Spec::Functions qw/catfile/;
            my $tmpdir = tempdir(CLEANUP=>1);
            my $tfn = catfile($tmpdir, scalar fileparse($filename));
            ...
            # - OR -
            my ($fn,$dir) = fileparse($filename);
            my $tmpdir = tempdir(DIR=>$dir, TEMPLATE=>'.XXXXXXXXXX', CLEANUP=>1 );
            my $tfn = catfile($tmpdir, $fn);
            ...

Re: Better way of finding HTML tags positions in HTML string
by Anonymous Monk on May 14, 2019 at 11:32 UTC

      HTML::TagReader only accepts a filename as a string scalar and opens the file on its own. It does not accept filehandles, so this will not work for HTML::TagReader.
