Dirk80 has asked for the wisdom of the Perl Monks concerning the following question:
I have a string which can contain several tags and I have an input list of tag names. Content which is not between tags shall always be kept. Content within a tag shall only be kept if the tag is active, i.e. is in the given tag list. Attention: tags can be nested.
my $str = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> w
+ord6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> word11';
## Examples:
# @tags = ('tag0', 'tag1'): 'word1 word2 word3 word4 word5 word6 word9
+ word11'
# @tags = ('tag3'): 'word1 word6 word9 word10 word11'
# @tags = ('tag1', 'tag2', 'tag3'): 'word1 word6 word7 word8 word9 wor
+d10 word11'
Dependent on the given tags you can see the desired results in the commented examples above.
Now I ask you because I'm confused how to solve it. It is neither XML nor is it HTML. It's a string with tags. Would you recommend regular expressions or any module from CPAN? It would be very kind of you if you could give me some advice. Thank you very much!
And it would be cool if the string would be denied if the tagging is invalid, e.g. if a start tag has no end tag or something like that. But this would be bonus.
Re: Parsing string with tags
by haukex (Archbishop) on May 31, 2022 at 11:47 UTC
|
You said "It is neither XML nor is it HTML", but at least the sample you showed is a valid XML document if parsed as a fragment, so that's what I did below. If your data is ever going to not be valid XML, you'd have to show an example of that. Mojo::DOM may still be able to parse it in that case while XML::LibXML is generally not very forgiving.
use warnings;
use strict;
# ##### Version with Mojo::DOM #####
# nicer API but less strict on parsing (useful for HTML)
use Mojo::DOM;
sub strip_tags_mojo {
my ($str, $activetags) = @_;
my %active = map {($_=>1)} @$activetags;
my $dom = Mojo::DOM->new->xml(1)->parse($str);
$dom->find('*')->grep(sub{ not $active{$_->tag} })->map('remove');
my $text = $dom->all_text;
$text =~ s/\s+/ /g;
return $text;
}
# ##### Version with XML::LibXML #####
# generally a more verbose API but strict XML-compliant parsing
# (in the following I've tried to shorten the code quite a bit)
use XML::LibXML;
sub strip_tags_xml {
my ($str, $activetags) = @_;
my %active = map {($_=>1)} @$activetags;
my $dom = XML::LibXML->new->parse_balanced_chunk($str);
$active{$_->nodeName} or $_->parentNode->removeChild($_)
for $dom->findnodes('*');
my $text = $dom->textContent;
$text =~ s/\s+/ /g;
return $text;
}
# ##### Tests #####
use Test::More;
sub exception (&) { eval { shift->(); 1 } ? undef : ($@ || die) } # po
+or man's Test::Fatal
my $input = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0>
+ word6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> word11';
my @tests = (
{ in=>['tag0', 'tag1'], exp=>'word1 word2 word3 word4 word5 word6
+word9 word11' },
{ in=>['tag3'], exp=>'word1 word6 word9 word10 word11' },
{ in=>['tag1', 'tag2', 'tag3'], exp=>'word1 word6 word7 word8 word
+9 word10 word11' },
);
plan tests => (@tests+1)*2;
for my $t (@tests) {
is strip_tags_mojo($input, $t->{in}), $t->{exp};
is strip_tags_xml($input, $t->{in}), $t->{exp};
}
ok not exception { strip_tags_mojo('foo <tag0> bar </tag1>') };
ok exception { strip_tags_xml('foo <tag0> bar') };
Update 2: Removed "Update" tags from the first paragraph for readability. | [reply] [d/l] |
|
| [reply] |
|
Neither nor don’t know what the OP was lurking for. Or was it xor? I abandon all hope. Holy shit.
«The Crux of the Biscuit is the Apostrophe»
| [reply] |
Re: Parsing string with tags
by Discipulus (Canon) on May 31, 2022 at 11:02 UTC
|
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);
my $str = '<tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> word6 <
+tag2>word7 word8</tag2> word9 <tag3>word10</tag3>';
# this leads to errors
# my $str = 'BEFORE <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0
+> word6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> AFTER';
my @res = extract_tagged( $str , '<tag0>', '</tag0>');
foreach (qw(extracted remainder prefix opening included closing)){
my $res = shift @res // 'NA';
print "$_:\t$res\n";
}
__END__
extracted: <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0>
remainder: word6 <tag2>word7 word8</tag2> word9 <tag3>word10</ta
+g3>
prefix:
opening: <tag0>
included: word2 <tag1>word3 word4</tag1> word5
closing: </tag0>
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
| [reply] [d/l] |
Re: Parsing string with tags
by tangent (Parson) on May 31, 2022 at 15:00 UTC
|
This is fairly verbose but I like to use HTML::Parser for parsing non-standard tagged text as it is very flexible and can be fine-tuned. Also good as a learning exercise.
use HTML::Parser;
use Data::Dumper;
my $tag_string = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </
+tag0> word6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> word11
+';
my $tags = parse_string( $tag_string );
print Dumper( $tags );
sub parse_string {
my $string = shift;
my %tags;
my %check_tags;
my $in_tag = 0;
my $current_tag = 'no_tag';
my $prev_tag;
# process start tag event
my $start = sub {
my ($tag) = @_;
$check_tags{$tag}++;
$in_tag++;
$prev_tag = $current_tag;
$current_tag = $tag;
};
# process text event
my $text = sub {
my ($text) = @_;
$text =~ s/^\s+//;
$text =~ s/\s+$//;
return if not length $text;
my @words = split(m/\s+/, $text);
push( @{ $tags{$current_tag} }, @words);
};
# process end tag event
my $end = sub {
my ($tag) = @_;
$check_tags{$tag}++;
$in_tag--;
$current_tag = $in_tag ? $prev_tag : 'no_tag';
};
my $parser = HTML::Parser->new(
api_version => 3,
start_h => [$start, "tagname"],
text_h => [$text, "text"],
end_h => [$end, "tagname"],
default_h => [$text, "text"],
);
$parser->parse($string);
$parser->eof;
# check each tag has an end tag
for my $tag (keys %check_tags) {
if ($check_tags{$tag} % 2) {
print "<$tag> is not valid\n";
}
}
return \%tags;
}
Output:
$VAR1 = {
'no_tag' => [
'word1',
'word6',
'word9',
'word11'
],
'tag0' => [
'word2',
'word5'
],
'tag1' => [
'word3',
'word4'
],
'tag2' => [
'word7',
'word8'
],
'tag3' => [
'word10'
]
};
| [reply] [d/l] [select] |
Re: Parsing string with tags
by tybalt89 (Monsignor) on May 31, 2022 at 20:41 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11144309
use warnings;
my $str = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> w
+ord6 <tag2>word7 word8</tag2>word9 <tag3>word10</tag3> word11';
print parsestringwithtags( $str, 'tag0', 'tag1' ), "\n";
print parsestringwithtags( $str, 'tag3' ), "\n";
print parsestringwithtags( $str, 'tag1', 'tag2', 'tag3' ), "\n";
sub parsestringwithtags
{
local $_ = shift;
my %tags = map { $_, 1 } @_;
my $active = 1;
my @state = ['', 1];
s{ <(/?)(\w+)> | ([^<>]+) | ([<>]) }{
$4
? die "rogue angle bracket $4 at $-[4]"
: $2 # tag
? $1 # it is an end tag
? do { $state[-1][0] eq $2 # matches current tag
? do { $active = (pop @state)->[1]; '' }
: die "mismatched tags $state[-1][0] vs $2" }
: do { push @state, [$2, $active]; $tags{$2} or $active = 0;
+ '' }
: $3 x $active # non-tag only if active
}gex;
@state > 1 and die "missing close tag for $state[-1][0]";
return $_;
}
Outputs:
word1 word2 word3 word4 word5 word6 word9 word11
word1 word6 word9 word10 word11
word1 word6 word7 word8word9 word10 word11
| [reply] [d/l] [select] |
Re: Parsing string with tags
by perlfan (Vicar) on Jun 01, 2022 at 17:28 UTC
|
Some moons ago I wanted to do a similar thing that I ended up not doing, and that was parsing tags that including spaces. I didn't get past what syntax I'd use but with all the tagging a lot of sites do now I am sure this is a solved problem. Additionally, I was looking for the flexibility to associate an "attribute" with a tag or set of tags. The application I was writing would have benefitted greatly from this sort of scheme.
For example, "tags"
- word1
- "word2Tag1 word3Tag1 word4Tag3"
- ATTRIBUTE:word5
- ATTRIBUTE:"word6Tag2 word7Tag2 word8Tag2"
Inline, a real example would be something like:
humans
"good picture for thumbnail preview"
boysEyes:brown
fathersHair:blonde
description:"a father and son are playing catch"
This is obviously not a solution but I wanted to expand a bit on the scope of the _problem_ space. Maybe some wisdom will enlighten how to approach this since I am still interested in figuring this out since it'd have applications for labeling things like images for the borg. At this point there might be a well established way to do this that is not SGML/tag-based. | [reply] [d/l] |
Re: Parsing string with tags
by Anonymous Monk on May 31, 2022 at 12:08 UTC
|
It is neither XML nor is it HTML.
Why not? Make it xml or html? Its literally what youre describing
xsh2, xsh2
And it would be cool if the string would be denied if the tagging is invalid
yeah dtd can specify such a denial :D or xsd
| [reply] |
|
|