Re^2: Better way of finding HTML tags positions in HTML string

Re^3: Better way of finding HTML tags positions in HTML string by holli (Abbot) on May 14, 2019 at 17:24 UTC
I just patched the TagReader.xs file. Basically, I added this function. I didn't test it thoroughly, but it seems to work.

`HTML::TagReader
tr_new_from_io(class, pio)
    SV *class
    InputStream pio
CODE:
    if (pio == NULL){
        croak("ERROR: Help");
    }
    /* malloc and zero the struct */
    Newz(0, RETVAL, 1, struct trstuct );
    /* malloc */
    New(0, RETVAL->filename, 1, char );
    strncpy(RETVAL->filename,newSVpv("",0),0);
    /* put a zero at the end of the string, perl might not do it */
    *(RETVAL->filename + 1 )=(char)0;
    /* malloc initial buffer */
    New(0, RETVAL->buffer, BUFFLEN+1, char );
    RETVAL->currbuflen=BUFFLEN;
    RETVAL->fd=pio;
    RETVAL->charpos=0;
    RETVAL->tagcharpos=0;
    RETVAL->fileline=1;
    RETVAL->tagline=0;
OUTPUT:
    RETVAL`

And then you can use it as

`my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n";
open my $io, "<", \$str;
my $p = HTML::TagReader->new_from_io($io);
my @tag;
while(@tag = $p->gettag(1)){
    print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
}`

Which gives you

`line: 1: col: 2: <blockquote>
line: 2: col: 1: <i>
line: 2: col: 13: </i>
line: 3: col: 1: </blockquote>`

Note that the module is buggy (or maybe it behaves to spec, I don't know): if the HTML does not end with a newline, the last tag gets "forgotten".

`my $str = "<blockquote>\n<i>Perlmonks</i>\n</blockquote>"; # no newline at the end
open my $io, "<", \$str;
my $p = HTML::TagReader->new_from_io($io);
my @tag;
while(@tag = $p->gettag(1)){
    print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
}`

Which gives you

`line: 1: col: 2: <blockquote>
line: 2: col: 1: <i>
line: 2: col: 13: </i>`

holli
You can lead your users to water, but alas, you cannot drown them.
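One possible workaround for the "forgotten" last tag, going only by the two runs above: make sure the string ends with a newline before opening the in-memory filehandle. This is a minimal sketch using core Perl only. `open_html_string` is a hypothetical helper name, not part of HTML::TagReader, and `new_from_io` exists only in the patched module described above, so it appears here only in a comment.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TagReader): returns a read-only
# in-memory filehandle on a copy of $html, with a trailing newline
# appended if the string does not already end with one.
sub open_html_string {
    my ($html) = @_;
    $html .= "\n" unless $html =~ /\n\z/;
    open my $io, '<', \$html
        or die "Cannot open in-memory filehandle: $!";
    return $io;
}

my $io = open_html_string("<blockquote>\n<i>Perlmonks</i>\n</blockquote>");

# With the patched module one would presumably continue with
#   my $p = HTML::TagReader->new_from_io($io);
# Here the handle is just read back to show what the parser would see.
my @lines = <$io>;
print "last line ends with newline: ",
    ( $lines[-1] =~ /\n\z/ ? "yes" : "no" ), "\n";
```

The helper works on a copy, so the caller's string is left untouched. Whether the extra newline matters for the reported line and column numbers is not something the runs above can tell us.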
Re^3: Better way of finding HTML tags positions in HTML string by talexb (Chancellor) on May 14, 2019 at 13:58 UTC
Maybe HTML::Bare? Have a look around CPAN, there are plenty of options. That's just the first one that looked like it might do the job.

Alex / talexb / Toronto
Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
Re^3: Better way of finding HTML tags positions in HTML string (updated) by haukex (Archbishop) on May 15, 2019 at 20:43 UTC
What's wrong with File::Temp? It's a core module, and it cleans up after itself pretty reliably:

`use File::Temp qw/tempfile/;
my ($tfh,$tfn) = tempfile(UNLINK=>1);
print $tfh $contents;
close $tfh;
# File named $tfn will exist till end of program`

And if you want to control the filename, you can do something like `tempfile( TMPDIR=>1, TEMPLATE=>'.something_XXXXXXXXXX', SUFFIX => '.html', UNLINK=>1 )`, or if you wanted to create the file in the same directory and based on the same name as some other file (File::Basename is also a core module):

`use File::Basename qw/fileparse/;
use File::Temp qw/tempfile/;
my ($fn,$dir,$ext) = fileparse($filename, qr/\.[^.]+$/);
my ($tfh,$tfn) = tempfile(DIR=>$dir, TEMPLATE=>'.'.$fn.'_XXXXXXXXXX',
    SUFFIX => $ext, UNLINK=>1 );
...`

I also like to use something like Corion's Text::CleanFragment on the above `$fn`, but that's not necessarily required.

Update: Since I'm already dumping some File::Temp snippets, here are two more that use a temporary directory instead, allowing you to keep the original file name. File::Spec is also a core module. `tempdir` supports the same `TEMPLATE`, `DIR`, and `TMPDIR` arguments as above. Note that if you use only `TEMPLATE` with a relative name, the resulting filename will also be relative to the current working directory, which is IMO not good, so I'd strongly recommend using an additional `TMPDIR=>1` or `DIR` argument.

`use File::Temp qw/tempdir/;
use File::Basename qw/fileparse/;
use File::Spec::Functions qw/catfile/;
my $tmpdir = tempdir(CLEANUP=>1);
my $tfn = catfile($tmpdir, scalar fileparse($filename));
...
# - OR -
my ($fn,$dir) = fileparse($filename);
my $tmpdir = tempdir(DIR=>$dir, TEMPLATE=>'.XXXXXXXXXX', CLEANUP=>1 );
my $tfn = catfile($tmpdir, $fn);
...`
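To tie the snippets above back to the original problem (getting an HTML string into a module that only accepts a filename), here is a minimal sketch using only core modules. `html_to_tempfile` and the `tagreader_` template prefix are made-up names for illustration, and the HTML::TagReader call is only shown in a comment on the assumption that its constructor takes a filename.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Hypothetical helper: writes $contents to a temporary .html file in the
# system temp directory and returns the filename. UNLINK=>1 asks
# File::Temp to remove the file when the program exits.
sub html_to_tempfile {
    my ($contents) = @_;
    my ($tfh, $tfn) = tempfile(
        TMPDIR   => 1,
        TEMPLATE => 'tagreader_XXXXXXXXXX',   # prefix is an arbitrary choice
        SUFFIX   => '.html',
        UNLINK   => 1,
    );
    print {$tfh} $contents;
    close $tfh or die "Cannot close $tfn: $!";
    return $tfn;
}

my $tfn  = html_to_tempfile("<blockquote>\n<i>Perlmonks</i>\n</blockquote>\n");
my $size = -s $tfn;
print "wrote $size bytes to $tfn\n";

# A filename-only module could now be handed $tfn, for example
# (assuming the constructor takes a filename):
#   my $p = HTML::TagReader->new($tfn);
```

Closing the handle before passing the name on matters here, because otherwise buffered output may not have reached the file when the other module opens it.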


	PerlMonks