http://www.perlmonks.org?node_id=976482

dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

The title almost gets at what I want to do. I have some code at the top of my scratchpad that almost does what I'd like. If I can use the Perl metachars, that would be okay; I'm not wedded to the chars I picked for the mapping.

I'd like to express runs of character classes as one char. For example, all contiguous A-Za-z0-9 would be replaced by an 'A', all contiguous \000 replaced by an 'E', all contiguous \377 replaced by an 'F'. I've almost got it, with the code on my scratchpad, but I'm wondering if there's a more efficient way, or a method that will let me get the last coverage I need.

Now, I'm doing this:

# main
while (read(FIN, $data, $bsize)) {
    while (length($data)) {
        if ($data =~ /^([\d\w]+)/) { notate('A', length($1)); }
        # more if's
    }
}

#
sub notate {
    my $c = shift;
    my $n = shift;
    if ((! defined $c) || (! defined $n)) {
        close(FIN);
        if ($opt_M) { close(FOUT); }
        print STDERR "$0 : FATAL ERROR in sub notate\n";
        exit;
    }
    $count{$c} += $n;
    substr($data, 0, $n) = '';    # dangerous
    if (! $opt_M) { $mstr .= $c; }
    else { print FOUT "$c"; }
    return(0);
}

Replies are listed 'Best First'.
Re: How to express contents of a file as regex metachars?
by jwkrahn (Abbot) on Jun 16, 2012 at 05:54 UTC
    if ($data =~ /^([\d\w]+)/ ) { notate('A', length($1)); }

    Your use of [\d\w] is redundant because the \w character class already includes every character in the \d character class. Also, the \w character class includes the _ character (underscore), which is not an alphanumeric character. You should just use [A-Za-z0-9], which matches only alphanumeric characters.

    You don't have to capture the match and get its length, you can just pass its length directly:

    if ($data =~ /^[A-Za-z0-9]+/ ) { notate('A', $+[0]); }
      Argh - I'm an idiot, I should have known \w wasn't correct. And I didn't know (or completely forgot) about $+ so thanks for that tip!
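
    For reference, `$+[0]` is element 0 of the `@+` array (the end offset of the whole match), which is a different variable from the scalar `$+` (the last matched capture group). A minimal sketch comparing the two ways of getting the run length; the sample string and the variable names `$via_capture` and `$via_offset` are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data: six alphanumerics, two NULs, one \377 byte.
my $data = "abc123\0\0\377";

my ($via_capture, $via_offset);

# Original approach: capture the run, then measure the capture.
if ($data =~ /^([A-Za-z0-9]+)/) {
    $via_capture = length($1);
}

# Suggested approach: no capture needed. $+[0] is the offset just past
# the end of the match; because the pattern is anchored at offset 0,
# that offset equals the length of the run.
if ($data =~ /^[A-Za-z0-9]+/) {
    $via_offset = $+[0];
}

print "capture: $via_capture, offset: $via_offset\n";   # capture: 6, offset: 6
```

    For a match that does not start at offset 0, the length would be `$+[0] - $-[0]` instead.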
Re: How to express contents of a file as regex metachars?
by SuicideJunkie (Vicar) on Jun 15, 2012 at 19:18 UTC

    If you don't mind consuming the input, you can use an anchored regex to eat the input as you generate the summary.

    my %mapping = (
        '[a-zA-Z0-9]+' => 'A',
    );
    my $summary = '';
    CHUNK: while (length $filedata) {
        foreach my $reg (keys %mapping) {
            if ($filedata =~ s/^$reg//) {
                $summary .= $mapping{$reg};
                next CHUNK;
            }
        }
        die "Chunk starting with '" . substr($filedata, 0, 10) . "' did not match any rules!";
    }
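
    The same idea can be sketched without consuming the input, by walking the string with `\G` and the `/gc` flags instead of deleting the front of it. This is only one possible variant, not the code from the scratchpad: the `@rules` list, the `summarize` function, and the `?` catch-all symbol are assumptions added for illustration, while `A`, `E`, and `F` are the symbols from the question. An array is used instead of a hash so the rules are tried in a predictable order.

```perl
use strict;
use warnings;

# Ordered rules: [ compiled pattern, summary symbol ].
# '?' is a hypothetical catch-all for bytes no other rule covers.
my @rules = (
    [ qr/[A-Za-z0-9]+/          => 'A' ],
    [ qr/\x00+/                 => 'E' ],
    [ qr/\xFF+/                 => 'F' ],
    [ qr/[^A-Za-z0-9\x00\xFF]+/ => '?' ],
);

sub summarize {
    my ($data) = @_;
    my $summary = '';
    my %count;
    pos($data) = 0;
    CHUNK: while (pos($data) < length $data) {
        for my $rule (@rules) {
            my ($re, $sym) = @$rule;
            # \G anchors at the current position; /c keeps that position
            # when a rule fails, so the next rule starts at the same place.
            if ($data =~ /\G$re/gc) {
                $summary .= $sym;
                $count{$sym} += $+[0] - $-[0];   # length of this run
                next CHUNK;
            }
        }
        die "No rule matched at offset " . pos($data) . "\n";
    }
    return ($summary, \%count);
}

my ($summary, $count) = summarize("abc123\x00\x00\xFF\xFF\xFFxyz--9");
print "$summary\n";   # AEFA?A
```

    Note that if the file is read in fixed-size blocks, a run that straddles a block boundary would produce two symbols rather than one unless the caller merges them.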

    If I'm confused as to your goal and you actually want to do the opposite of this, you can take the reverse of the %mapping hash and concatenate a big regex string by looking up the regex substring for each character in the summary in turn.
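
    A rough sketch of that reverse direction, assuming a symbol table like the one in the question. The names `%pattern_for` and `regex_from_summary` are hypothetical, and the `E` and `F` patterns are guesses at how the \000 and \377 runs would be written:

```perl
use strict;
use warnings;

# Inverted mapping: summary symbol => regex fragment for the run it stands for.
my %pattern_for = (
    A => '[A-Za-z0-9]+',
    E => '\x00+',
    F => '\xFF+',
);

# Build one anchored regex by concatenating the fragment for each symbol.
sub regex_from_summary {
    my ($summary) = @_;
    my $src = '';
    for my $sym (split //, $summary) {
        my $frag = $pattern_for{$sym};
        die "No pattern for symbol '$sym'\n" unless defined $frag;
        $src .= $frag;
    }
    return qr/\A$src\z/;
}

my $re = regex_from_summary('AEFA');
print "matches\n" if "abc\x00\x00\xFFz" =~ $re;
```

    Because the summary is lossy, the resulting regex matches every input with the same sequence of runs, not just the original file.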

      I think that's freaking brilliant. Using variables in regexes has always scared the pants off me, I can never get them to do what I want.

      As cavac asked, I'm building a crude lossy compression.

Re: How to express contents of a file as regex metachars?
by cavac (Parson) on Jun 15, 2012 at 21:50 UTC

    I don't have a solution for you, but I'm fascinated by the problem.

    So, basically, what you want to do is a (one way/lossy) table-based compression algorithm?

    "You have reached the Monastery. All our helpdesk monks are busy at the moment. Please press "1" to instantly donate 10 currency units for a good cause or press "2" to hang up. Or you can dial "12" to get connected directly to second level support."
      Yes! Exactly. I should have described it that way.