PerlMonks  

Re^2: Find what characters never appear

by Narveson (Chaplain)
on Sep 04, 2009 at 22:20 UTC ( [id://793611]=note )


in reply to Re: Find what characters never appear
in thread Find what characters never appear

If we've seen $chr once, can we somehow avoid repeating the assignment to $seen[ord($chr)] during the rest of the read?

Can we avoid even testing $seen[ord($chr)]?

I'd like to make a regex that matches any of our dwindling array of unseen characters, and update this regex every time I update $seen. Has anybody done this?

Re^3: Find what characters never appear
by kennethk (Abbot) on Sep 04, 2009 at 23:21 UTC
    If you want to avoid potential issues w/ regex metacharacters, you can use a set of hash keys to track what's been seen and rebuild the regex once for each character:

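    #!/usr/bin/perl
    use strict;
    use warnings;

    # Start with every candidate character marked as unseen.
    my %char_hash = ();
    $char_hash{ chr($_) } = undef foreach (33 .. 127);
    my $chars = join "", keys %char_hash;
    my $regex = "([\Q$chars\E])";   # \Q...\E escapes regex metacharacters

    while (<DATA>) {
        while (/$regex/g) {
            # $1 has now been seen: drop it and rebuild the character class.
            # Note: if every character gets seen, the class ends up empty
            # and the pattern no longer compiles.
            delete $char_hash{$1};
            $chars = join "", keys %char_hash;
            $regex = "([\Q$chars\E])";
        }
    }

    # Whatever is left was never seen.
    my @good_array = keys %char_hash;
    print @good_array;

    __DATA__
    !"#$%&'()*+,-./01234567
    89:;<=>?@ABCDE
    FGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefghijklmnop
    qrstuvwxyz{|}~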

    though I feel like there must be a simpler way of implementing this approach.

      This ran in just a few minutes against my big 2GB file.

      All I had to do was change the printable range to 33..126, change <DATA> to <>, and for my own curiosity, add print "$1 seen on line $.\n"; after delete $char_hash{$1};
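For anyone who wants to try the same thing, here is a rough sketch of what the script might look like with those three changes folded in. It is not the exact code that was run against the 2GB file: the sub names scan_unseen and class_for are made up for illustration, first sightings are collected in a hash instead of printed, and the early exit once every character has been seen is my own addition (without it the character class would become empty and the pattern would fail to compile).

```perl
use strict;
use warnings;

# Hypothetical helper: build a capturing character class from a list of
# characters, escaping regex metacharacters with \Q...\E.
sub class_for {
    my $chars = join "", @_;
    return "([\Q$chars\E])";
}

# Hypothetical wrapper around the loop from the parent node.
# Takes a filehandle; returns (sorted unseen chars, first-sighting line numbers).
sub scan_unseen {
    my ($fh) = @_;
    my %unseen = map { chr($_) => undef } 33 .. 126;   # printable range
    my %first_line;                                    # char => line of first sighting
    my $regex = class_for(keys %unseen);

    LINE: while (my $line = <$fh>) {
        while ($line =~ /$regex/g) {
            my $ch = $1;
            delete $unseen{$ch};
            $first_line{$ch} = $.;          # same information as "$1 seen on line $."
            last LINE unless %unseen;       # assumption: stop once nothing is left
            $regex = class_for(keys %unseen);
        }
    }
    return ([sort keys %unseen], \%first_line);
}
```

To scan files named on the command line, something like `my ($unseen, $first) = scan_unseen(\*ARGV);` should do, though I have only thought this through for a single input file.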

Re^3: Find what characters never appear
by almut (Canon) on Sep 04, 2009 at 23:01 UTC

    Maybe something like this  (demo with reduced charset):

    #!/usr/bin/perl

    my $s = "fccccaaaaeaaaddaaaaabbcccaaacaaabbaaaa";
    my $set = "[abcdefg]";

    while ($s =~ /($set)/g) {
        my $ch = $1;
        $set =~ s/$ch//;  # remove $ch from search set
        printf "found %s at %d -> regex now: %s\n", $ch, pos($s), $set;
    }

    __END__
    found f at 1 -> regex now: [abcdeg]
    found c at 2 -> regex now: [abdeg]
    found a at 6 -> regex now: [bdeg]
    found e at 10 -> regex now: [bdg]
    found d at 14 -> regex now: [bg]
    found b at 21 -> regex now: [g]

    Update: kennethk noted that you would run into complications with regex metacharacters with this simple approach (when using the full ASCII set) — which is of course correct...
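One possible way around the metacharacter problem, as a minimal sketch: keep the unseen characters in a plain string rather than editing the regex itself, and build the character class from that string with quotemeta on each pass. The sub name unseen_chars and its two-argument interface are assumptions for illustration, not something from the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the characters of $charset that never occur
# in $text, in their original order.
sub unseen_chars {
    my ($text, $charset) = @_;
    my $unseen = $charset;                    # plain string, never used as a regex directly

    while (length $unseen) {
        # quotemeta escapes ], ^, - and \ so they are literal inside [...]
        my $class = '[' . quotemeta($unseen) . ']';

        # Scalar-context /g resumes from pos($text), as in the demo above.
        last unless $text =~ /($class)/g;

        my $ch = $1;
        $unseen =~ s/\Q$ch\E//g;              # remove $ch from the unseen set
    }
    return $unseen;
}
```

Called as `unseen_chars('a-b^c]', 'abc-^]\\[')`, this should leave just the backslash and the opening bracket. Rebuilding the class on every hit costs the same as in the hash version, so this is about correctness with the full ASCII set, not speed.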
