http://www.perlmonks.org?node_id=810289

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'll keep it simple: I'm trying to write a quick and dirty parser of some Project Gutenberg etexts, and ran into a puzzle.

Each of the etexts is stored in a split directory structure that models the name of the etext. For example, the HTML version of etext 12345 exists in /1/2/3/4/12345/12345-h/12345-h.htm.

Here's what I have to split that out, when given just the 12345 as the argument to my parser:

my $etext = $ARGV[0];
my $site = 'http://pod/Gutenberg';
my $splitguten = join('/', split(/ */, $etext));
my $clipguten = substr($splitguten, -2, 2, '');
my $link = "$site/$splitguten/$etext/$etext-h/$etext-h.htm";

I'm trying to find a cleaner way to do this. Any ideas or suggestions?

Re: A Better Guten Split
by ikegami (Patriarch) on Dec 01, 2009 at 00:35 UTC

    I guess we can start by getting rid of the useless variable and shortening the silly pattern.

    my $split = join '/', split //, $id;
    substr($split, -2, 2, '');
    my $url = "$base_url/$split/$id/$id-h/$id-h.htm";
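
    To make the intermediate values concrete, here is a minimal sketch of the same three steps run on the sample etext number from the question. The value of $base_url is taken from the original post's $site; the sample $id is only an illustration.

    ```perl
    use strict;
    use warnings;

    my $base_url = 'http://pod/Gutenberg';   # site root from the original question
    my $id       = '12345';                  # sample etext number (illustrative)

    # Put a slash between every character: "1/2/3/4/5"
    my $split = join '/', split //, $id;

    # Four-argument substr replaces the last two characters ("/5") with ''
    # in place; the removed text it returns is simply discarded.
    substr($split, -2, 2, '');               # $split is now "1/2/3/4"

    my $url = "$base_url/$split/$id/$id-h/$id-h.htm";
    print "$url\n";   # http://pod/Gutenberg/1/2/3/4/12345/12345-h/12345-h.htm
    ```

    Because substr with a replacement argument modifies $split directly, there is no need to store its return value, which is why the extra variable could go.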

    Two lines to calculate and one line to assemble. I don't think length is really a problem here. We're dealing with readability issues if we try to shorten it any more. These are just too complicated:

    substr( ( my $split = join '/', split //, $id ), -2, 2, '');
    my $url = "$base_url/$split/$id/$id-h/$id-h.htm";

    ( my $url = join '/', split //, $id ) =~ s{(.*)/.}{$base_url/$1/$id/$id-h/$id-h.htm}s;

    I'm partial to this *longer* version:

    my $url = join('/', $base_url, $id =~ /(.)(?=.)/sg, $id, "$id-h", "$id-h.htm" );

    The flow is very simple, so it's easy to understand.
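
    As a rough sketch of why that version reads well, the match can be pulled out into its own step inside a small subroutine. The name gutenberg_url and the @dirs variable are hypothetical, introduced here only for illustration; the regex and the join are the ones shown above.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper; the name is an assumption, not from the thread.
    sub gutenberg_url {
        my ($base_url, $id) = @_;

        # In list context, a /g match with one capture group returns every
        # capture: each character that is followed by another character,
        # i.e. every character of $id except the last.
        my @dirs = $id =~ /(.)(?=.)/sg;

        return join '/', $base_url, @dirs, $id, "$id-h", "$id-h.htm";
    }

    print gutenberg_url('http://pod/Gutenberg', '12345'), "\n";
    # http://pod/Gutenberg/1/2/3/4/12345/12345-h/12345-h.htm
    ```

    One side effect of this form: for a one-character $id the match returns an empty list, so no directory components are added at all. Whether that matches how the mirror actually stores single-digit etexts is not something this sketch assumes.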