demerphq has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks, recently I have been looking into a data structure to use for representing unicode character classes. The issue is that the current regex engine uses a fairly inefficient structure for representing such character classes and suffers from the poor performance that comes from such a design.

The data structure is called an inversion list, and has been discussed elsewhere* and is actually documented in a book on unicode programming that I unfortunately dont have yet. Nevertheless the algorithm seemed fairly straightforward from its description so I decided to do a perl implementation as a prototype to work out the kinks before I translate it to C.

The basic idea is that an inversion list is a sequence of codepoints, with the even elements indicating ranges of code points that are "in" the set, and the odd elements indicating ranges of code points that are out. When a single codepoint is in the set but not part of a range then its value would be on an even index, and its successor on the next odd. The elements are ordered in increasing order.

The nice thing about this structure is that it is fairly efficient, usually small compares to what it represents, and allows relatively efficient boolean set operations such a union, intersection, set difference etc. Negation for instance involves inserting or removing a 0 from the front of the array, and adding or removing 0x110000 from the end of the array, thus negation can be performed in O(N) time, where N is the number of elements in the original inversion list. In fact all set operations besides lookup and creation can be performed in O(N) time, with lookup potentially being O(log N) (if a binary search is used) and creation is O(N log N) (due to the requirement to sort the list after construction which take O(N) time).

Id really appreciate it anybody has any comments or suggestions, or new tests that would show that what I have is broken, or could be improved. Please remember this is prototype code, so its not intended to be beautiful, and as its intended for eventual conversion to C it is not intended to be idiomatic perl either.

Anyway, thoughts appreciated.

* Algorithm::InversionList and related article on IBM's site as well as in quite useful Unicode programming PowerPoint Presentation, (Works fine in open office).

package InvList; use strict; use warnings; #use DDS; use constant UNI_INF => 0x10FFFF+1; =pod Inversion Lists This code implements a form of inversion list for handling charclasses (or any numbered and orderable set) The structure of an inversion list is an ordered array, where the even elements indicate the minimum value that is in the charclass, and odd elements indicate the minimum value that is not in the set. Lookup is by scanning the list to find the pair which straddle the value being looked up. The list (64,66,68,70) represents the char class [ABEF], the list (0,64,66,68,70,0x110000) represents the char class [^ABEF]. =cut our $DEBUG=0; # note decode() is bogus code, but its a proof of concept, so its fine +. # the code to parse escapes properly is already in the perl core. sub decode { wantarray ? map { ord substr $_,-1 } @_ : ord substr $_[0],-1 } sub make_inv_list { my ($class) = shift; my @list; while ($class=~s/^(?:((\\?.)-(\\?.))|(\\.)|(.))//) { if (defined $1) { print "Range: $1\n" if $DEBUG; my ($l,$r)=decode($2,$3); die "Bad range $1\n" if $l>$r; push @list, [$l,$r+1]; } elsif (defined $4) { print "Escape: $4\n" if $DEBUG; my ($n) = decode($4); push @list, [$n,$n+1]; } elsif (defined $5) { print "Char: $5\n" if $DEBUG; my ($n) = decode($5); push @list, [$n,$n+1]; } else { Carp::confess("Failed to parse inv_list specifier: $class\ +n"); } } # make sure the elements in the list are ordered correctly @list=sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @list; #Dump(\@list) if $DEBUG; # merge any overlapping elements in the list, they could have writ +ten # something like [A-ZBBBBBBB] for instance. # Also flatten the list so its an array of numbers and not an AoAo +N my @ret; my ($in,$out)=@{$list[0]}; for my $idx (1..$#list) { my $rec = $list[$idx]; if ($rec->[0]<=$out && $in <= $rec->[1] ) { $in= $rec->[0] if $in > $rec->[0]; $out= $rec->[1] if $out < $rec->[1]; } else { push @ret,$in,$out; ($in,$out)=@$rec; } } push @ret,$in,$out; #Dump(\@ret) if $DEBUG; return bless \@ret; } # "not" a class sub invert { my $l=shift; my @ret=@$l; if (@ret && $ret[0]==0) { shift @ret; } else { unshift @ret,0; } if ($ret[-1]==UNI_INF) { pop @ret; } else { push @ret,UNI_INF; } return bless \@ret; } # "or" two classes together sub union { my ($l,$r)=@_; my @ret; # make sure $l has the leftmost element in it. ( $l, $r ) = ( $r, $l ) if (($l->[0] <=> $r->[0]) || ($l->[1] <=> $r->[1]))>0; my $lp=2; my $rp=0; my ($in,$out)=@{$l}[0,1]; # while both lists still have elements merge them together as need +ed while ($lp < @$l && $rp < @$r) { warn "w: $in,$out\n" if $DEBUG; if ( $l->[$lp] <= $out && $in <= $l->[$lp+1] ) { $in= $l->[$lp] if $in > $l->[$lp]; $out= $l->[$lp+1] if $out < $l->[$lp+1]; $lp += 2; } elsif ( $r->[$rp] <= $out && $in <= $r->[$rp+1] ) { $in= $r->[$rp] if $in > $r->[$rp]; $out= $r->[$rp+1] if $out < $r->[$rp+1]; $rp += 2; } else { push @ret,$in,$out; if ((($l->[$lp] <=> $r->[$rp]) || ($l->[$lp+1] <=> $r->[$rp+1]))<1) { ($in,$out)=@{$l}[$lp,$lp+1]; $lp += 2; } else { ($in,$out)=@{$r}[$rp,$rp+1]; $rp += 2; } } } # make sure any remaining elements in @$l are processed for ( ; $lp < @$l ; $lp +=2 ) { warn "$lp: l: ($in,$out) ($l->[$lp],$l->[$lp+1])\n" if $DEBUG +; if ( $l->[$lp] <= $out && $in <= $l->[$lp+1] ) { $in= $l->[$lp] if $in > $l->[$lp]; $out= $l->[$lp+1] if $out < $l->[$lp+1]; } else { push @ret,$in,$out; ($in,$out)=@{$l}[$lp,$lp+1]; } } # make sure any remaining elements in @$r are processed for ( ; $rp < @$r ; $rp +=2 ) { warn "$rp: r: ($in,$out) ($r->[$rp],$r->[$rp+1])\n" if $DEBUG +; if ( $r->[$rp] <= $out && $in <= $r->[$rp+1] ) { $in= $r->[$rp] if $in > $r->[$rp]; $out= $r->[$rp+1] if $out < $r->[$rp+1]; } else { push @ret,$in,$out; ($in,$out)=@{$r}[$rp,$rp+1]; } } push @ret,$in,$out; return bless \@ret; } # "and" two char classes together sub intersection { my ($l,$r)=@_; my @ret; my ($lp,$rp)=(0,0); while ($lp < @$l && $rp < @$r) { if ( $l->[$lp] < $r->[$rp+1] && $r->[$rp] < $l->[$lp+1]) { my $in = ($l->[$lp] < $r->[$rp]) ? $r->[$rp] : $l->[$lp]; my $out = ($l->[$lp+1] < $r->[$rp+1]) ? $l->[$lp+1] : $r-> +[$rp+1]; push @ret,$in,$out; } if ( $l->[$lp+1] < $r->[$rp+1] ) { $lp += 2; } else { $rp += 2; } } return bless \@ret; } # "subtract" one set from the other sub difference { my ($l,$r)=@_; my $inv = invert($r); return intersection($l,$inv); } # "xor" two sets sub symmetric_difference { my ($l,$r)=@_; return intersection(invert(intersection($l,$r)),union($l,$r)) } # check if a char is in the set sub has { my ($l,$char)=@_; my $v=decode($char); return 0 if $l->[0]>$v || $l->[-1] < $v; for ( my $i=0; $i<@$l; $i+=2 ) { if ( $l->[$i]<=$v && $v<$l->[$i+1] ) { return 1; } } return 0; } my %protect=map { ord($_)=>1 } qw( - \\ [ ] + & | ^ ); sub d2c { map { $_<32||$protect{$_} ? sprintf "\\%o",$_ : $_>127 ? sprintf "\\x{%X}",$_ : chr $_ } @_ } # stringify an invlsit as a charclass specification sub as_string { my $l = shift; my ($i,@c); #warn "@$l\n"; for ( $i = 0 ; $i < @$l ; $i += 2 ) { my ($in,$out) = @{$l}[$i,$i+1]; if ($in == $out-1) { push @c, d2c($in); } else { push @c, join "-",d2c($in, $out-1); } } #warn "@c\n"; return "[".join("",@c)."]"; } # test the code #die make_inv_list('L-NJ-O')->as_string; use Test::More 'no_plan'; while (<DATA>) { next if /^\s*#/ || !/\S/; my ($c1,$c2,$u,$i,$d) = split /\s+/,$_; my $l1 = make_inv_list($c1); my $l2 = make_inv_list($c2); my $ul = union($l1,$l2); my $il = intersection($l1,$l2); my $dl = difference($l1,$l2); for ( ['|',$u,$ul->as_string], ['&',$i,$il->as_string], ['-',$d,$dl->as_string] ) { is($_->[2],$_->[1],"[$c1] $_->[0] [$c2]"); } } my $l = make_inv_list("\2AEIOU"); my $str=""; for ("A".."Z") { $str.=$_ if $l->has($_); } is($str,'AEIOU','has()'); my $li=invert($l); is($li->as_string,'[\\0-\\1\\3-@B-DF-HJ-NP-TV-\\x{10FFFF}]',"invert([\ +\2AEIOU]) works"); my $lii=invert($li); my ($liii)=invert($lii); is($l->as_string,$lii->as_string,"Double invert works:".$l->as_string) +; is($li->as_string,$liii->as_string,"Tripple invert works:".$li->as_str +ing); __DATA__ ##C1 C2 | & - A-C C-E [A-E] [C] [A-B] ABEF B-E [A-F] [BE] [AF] A-C G-I [A-CG-I] [] [A-C] A-Z AEIOU [A-Z] [AEIOU] [B-DF-HJ-NP-TV-Z] ACE BDF [A-F] [] [ACE] YVES EVE [ESVY] [EV] [SY] A-F D-I [A-I] [D-F] [A-C] A-C D-G [A-G] [] [A-C] A-D D-G [A-G] [D] [A-C] A-D C-G [A-G] [C-D] [A-B] D-G A-C [A-G] [] [D-G] D-G A-D [A-G] [D] [E-G] C-G A-D [A-G] [C-D] [E-G] ABC B [A-C] [B] [AC] A-D BC [A-D] [B-C] [AD] A-DX-Z ONML [A-DL-OX-Z] [] [A-DX-Z] A-DE-FG-H A-Z [A-Z] [A-H] [] E-J A-C [A-CE-J] [] [E-J] E-J B-D [B-J] [] [E-J] E-J C-E [C-J] [E] [F-J] E-J D-F [D-J] [E-F] [G-J] E-J E-G [E-J] [E-G] [H-J] E-J F-H [E-J] [F-H] [EI-J] E-J G-I [E-J] [G-I] [E-FJ] E-J H-J [E-J] [H-J] [E-G] E-J I-K [E-K] [I-J] [E-H] E-J J-L [E-L] [J] [E-I] E-J K-M [E-M] [] [E-J] E-J L-N [E-JL-N] [] [E-J] A-CU-Z U-WA-F [A-FU-Z] [A-CU-W] [X-Z] B-DT-Y S-UC-H [B-HS-Y] [C-DT-U] [BV-Y] C-ES-X Q-SE-J [C-JQ-X] [ES] [C-DT-X] D-FR-W O-QG-L [D-LO-W] [] [D-FR-W] E-GQ-V M-OI-N [E-GI-OQ-V] [] [E-GQ-V] F-HP-U K-MK-P [F-HK-U] [P] [F-HQ-U] G-IO-T I-KM-R [G-KM-T] [IO-R] [G-HS-T] H-JN-S G-IO-T [G-JN-T] [H-IO-S] [JN] I-KM-R E-GQ-V [E-GI-KM-V] [Q-R] [I-KM-P] J-LL-Q C-ES-X [C-EJ-QS-X] [] [J-Q] K-MK-P A-CU-Z [A-CK-PU-Z] [] [K-P] L-NJ-O B-DT-Y [B-DJ-OT-Y] [] [J-O] M-OI-N D-FR-W [D-FI-OR-W] [] [I-O] N-PH-M F-HP-U [F-U] [HP] [I-O] O-QG-L H-JN-S [G-LN-S] [H-JO-Q] [GK-L] P-RF-K J-LL-Q [F-R] [J-KP-Q] [F-IR] Q-SE-J L-NJ-O [E-OQ-S] [J] [E-IQ-S] R-TD-I N-PH-M [D-PR-T] [H-I] [D-GR-T] S-UC-H P-RF-K [C-KP-U] [F-H] [C-ES-U] T-VB-G R-TD-I [B-IR-V] [D-GT] [B-CU-V] U-WA-F T-VB-G [A-GT-W] [B-FU-V] [AW]

note: Added more tests. Fixed a bug. Updated code (yesterday i accidentally deleted it :-(

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re: Inversion list prototype
by jbert (Priest) on Feb 16, 2007 at 17:34 UTC
    If I've understood your data structure correctly, I've written similar code to this in the past (that code isn't mine to release I'm afraid).

    The initial code was brittle as hell, in that it was hard to fix a bug without introducing others. I'd *really* recommend a good set of unit tests for this data type.

    There are a lot of different cases to consider, with edge cases where the ends of the unions are equal, overlap in different ways, are subsets, proper subsets etc.

    However, I've also written some related code (not identical, this is for a half open interval, so the top endpoint isn't included) as part of exmap. If you're interested, it's Range.cpp/Range.hpp in there, with the tests in t_range.cpp.

    In fact now I come to think of it, there's a perl version in there too, in the perl-deprecated directory, "Range.pm and test-range.pl".

    If it's of any use, lift anything you like from that tarball. That project is GPL'd. (If you'd like it under Artistic License let me know, I'm the author and can re-license if required).

    (Edit - can you steal the unit tests from Algorithm::InversionList? It's often quite easy to translate unit tests from one language to another)

      There are a lot of different cases to consider, with edge cases where the ends of the unions are equal, overlap in different ways, are subsets, proper subsets etc.

      Yes indeed. I think i covered most of them in the embedded tests in the script, but perhaps not.

      can you steal the unit tests from Algorithm::InversionList?

      There actually arent that many, and they arent really relevent, since it seems his version does bitmap compression, whereas mine is working with codepoints. (Not that they arent at certain levels equivelent, but....) Also he hasnt implemented any set operations, just compression/lookup.

      Ill check out the exmap. And yeah, ive done code /very/ similar to this for $work, but wasnt looking at it as a set relation, rather as date ranges.

      ---
      $world=~s/war/peace/g

        I think there are more cases for union.

        The two ranges can be disjoint, they can have upper of one == lower of the other, partially overlap, partially overlap with an end-point in common, be equal, be a strict subset and unless you canonicalise the order somehow you also have all of these the other way round (so double the number of cases, plus some I missed probably).

        I can think of at least ten cases anyway.

        (Edit: after re-reading the tests, I think most cases are covered, so sorry for originally saying "a lot more cases", that was my error in reading the tests)

Re: Inversion list prototype
by jbert (Priest) on Feb 16, 2007 at 15:04 UTC
    On the subject of unicode performance, do you have any views on this?

    I don't mind paying the cost of a unicode regex engine when I need it, but it would be nice to use the byte engine when handling ASCII (without jumping through additional hoops).

      Looks like a bug to me.

      You should report it with perlbug.

      ---
      $world=~s/war/peace/g

Re: Inversion list prototype
by bart (Canon) on Mar 02, 2007 at 13:17 UTC
    A tiny remark: don't set your bar for the upper limit so articially low (0x110000). I expect that sooner or later, Unicode will be extended to a wider range, and I see no reason for not using an upper limit that is as high as you technically can without increasing the storage size. That way you'll never have to edit your program in case that ever happens.

    So, I assume that for storage in 24 bit, 0xFFFFFF would be a good value, and in 32 bit, 0x7FFFFFFF, the highest positive integer that will never pose a problem.