PerlMonks  

Compressing a text file using count of continous characters

by nirvana4ol (Initiate)
on Dec 14, 2007 at 16:46 UTC ( [id://657079]=perlquestion )

nirvana4ol has asked for the wisdom of the Perl Monks concerning the following question:

I need to count the runs of consecutive repeated characters (more than one occurrence) in a file and replace each run with a compressed version, like this:
XYZAAAAAAAADEFAAcdAA --> XYZ8ADEF2Acd2A
Thanks

Replies are listed 'Best First'.
Re: Compressing a text file using count of continous characters
by oha (Friar) on Dec 14, 2007 at 17:00 UTC
    s/((\D)\2+)/length($1).$2/ge;
    It searches for a non-digit, checks whether it is repeated one or more times, and then replaces the whole run with its length followed by the repeated character.
    Oha
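    A minimal sketch of the substitution above, applied to the sample string from the question. The sub name rle_multiples is hypothetical and only wraps the one-liner so it can be called on any string:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse each run of two or
# more identical non-digit characters into "<count><char>".
sub rle_multiples {
    my ($text) = @_;
    $text =~ s/((\D)\2+)/length($1) . $2/ge;
    return $text;
}

print rle_multiples("XYZAAAAAAAADEFAAcdAA"), "\n";   # prints XYZ8ADEF2Acd2A
```

    Characters that are not repeated are left untouched, which is why X, Y, Z, c and d pass through unchanged.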

    PS: it could be useful to limit the length of a counted run and to allow digits in the input. In that case, though, every digit must always carry a count:

    s/((\D)\2{1,8}|(\d)\3{0,8})/length($1).$2.$3/ge;
    This one matches a run of two or more identical non-digits, or one or more identical digits, and never more than nine characters per run, so each count is a single digit.
    This way the data decodes with no side effects even on odd input: X2AAAAAAAAAAAAAAAAAAAAAAA1111 becomes X129A9A5A41.
    Runs of A are split into groups of at most 9, and digits get a count even when they are not repeated.
    To decode, use:
    s/(\d)(.)/$2x$1/ge;

    Oha
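    A round-trip sketch of the bounded scheme above. The sub names encode_bounded and decode_bounded are hypothetical. Two small deviations from the posted one-liners are assumptions of this sketch: the replacement picks whichever of $2 or $3 matched (to avoid an undefined-value warning), and the decoder uses /s so that a counted newline also decodes.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two substitutions above.
sub encode_bounded {
    my ($text) = @_;
    # Non-digits: runs of 2..9. Digits: runs of 1..9, so every digit is counted.
    # Only one of $2/$3 is set per match, so pick the defined one.
    $text =~ s/((\D)\2{1,8}|(\d)\3{0,8})/length($1) . (defined $2 ? $2 : $3)/ge;
    return $text;
}

sub decode_bounded {
    my ($text) = @_;
    # A single digit is always a count for the character that follows it.
    # /s lets . match a newline, because \D in the encoder can match one.
    $text =~ s/(\d)(.)/$2 x $1/gse;
    return $text;
}

my $input   = "X2" . ("A" x 23) . "1111";
my $encoded = encode_bounded($input);
print "$encoded\n";                                        # prints X129A9A5A41
print decode_bounded($encoded) eq $input ? "ok\n" : "mismatch\n";   # prints ok
```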

Re: Compressing a text file using count of continous characters
by pfaut (Priest) on Dec 14, 2007 at 17:01 UTC
    $string =~ s/(.)\1+/length($&).$1/eg;
    90% of every Perl application is already written.
    dragonchild
Re: Compressing a text file using count of continous characters
by tuxz0r (Pilgrim) on Dec 14, 2007 at 18:48 UTC
    My guess is you'll need to decode it at some point too. This is all very similar to run-length encoding (RLE), though I think RLE encodes even single-character occurrences. So here's a short snippet to encode/decode, which you can modify to suit your needs:
    use strict;

    sub encode {
        s/((.)\2+)/(length $1) . $2/eg;
        $_;
    }

    sub decode {
        $_ = shift;
        my @list;
        while (/((\d+)?(.))/g) {
            push @list, [$2, $3];
        }
        join '', map {
            (defined $_->[0]) ? $_->[1] x $_->[0] : $_->[1];
        } @list;
    }

    while (<DATA>) {
        print;
        my $enc = encode($_);
        my $dec = decode($enc);
        print $enc;
        print $dec;
    }

    __DATA__
    XYZAAAAAAAADEFAAcdAA
    Which gives the following output:
    XYZAAAAAAAADEFAAcdAA
    XYZ8ADEF2Acd2A
    XYZAAAAAAAADEFAAcdAA

    ---
    s;;:<).>|\;\;_>?\\^0<|=!]=,|{\$/.'>|<?.|/"&?=#!>%\$|#/\$%{};;y;,'} -/:-@[-`{-};,'}`-{/" -;;s;;$_;see;
    Warning: Any code posted by tuxz0r is untested, unless otherwise stated, and is used at your own risk.

      You should at least make it symmetric...
      sub encode { $_ = shift; s/((\D)\2+)/length($1).$2/eg; $_ }
      sub decode { $_ = shift; s/(\d+)(\D)/$2 x $1/eg; $_ }
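      Since the question is about a file, here is a hedged sketch of running a symmetric pair like the one above over input line by line. The helper encode_lines and the in-memory filehandle are assumptions for illustration; a real script would open the actual file instead. The subs use lexical copies rather than the global $_, which is a stylistic choice of this sketch.

```perl
use strict;
use warnings;

sub encode { my ($s) = @_; $s =~ s/((\D)\2+)/length($1) . $2/ge; return $s }
sub decode { my ($s) = @_; $s =~ s/(\d+)(\D)/$2 x $1/ge;         return $s }

# Hypothetical helper: encode every line read from a filehandle.
sub encode_lines {
    my ($fh) = @_;
    my @out;
    while (my $line = <$fh>) {
        push @out, encode($line);
    }
    return @out;
}

# An in-memory filehandle stands in for the real file here.
my $sample = "XYZAAAAAAAADEFAAcdAA\nhello  world\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @encoded = encode_lines($fh);
close $fh;

print @encoded;                       # the encoded lines
print map { decode($_) } @encoded;    # the original lines again
```

      This only round-trips when the input contains no digits, which is the limitation raised elsewhere in this thread.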
Re: Compressing a text file using count of continous characters
by KurtSchwind (Chaplain) on Dec 14, 2007 at 21:10 UTC

    And by 'characters' I sincerely hope we are talking about only ALPHAs and no NUMERICs.

    Alphanumerics work for RLE because every character gets a count. But if you are only counting runs of two or more, you can't have digits as part of your character set.
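    To make the problem concrete, here is a small sketch using the multiples-only substitution from earlier in the thread. The sub name encode_multiples is hypothetical. It shows two different inputs that end up with the same encoded form:

```perl
use strict;
use warnings;

# Multiples-only encoding: only runs of two or more non-digits get a count.
sub encode_multiples {
    my ($text) = @_;
    $text =~ s/((\D)\2+)/length($1) . $2/ge;
    return $text;
}

# "AA" is a run and becomes "2A"; the literal text "2A" contains no run
# and is left alone, so both inputs produce the same output.
for my $input ("AA", "2A") {
    print "$input -> ", encode_multiples($input), "\n";
}
```

    Because both lines print 2A on the right-hand side, a decoder has no way to tell which input it came from.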

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
