PerlMonks
Sort this data

by Anonymous Monk
on Nov 19, 2000 at 11:01 UTC ( #42407=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I've got this data:

Title Line
Author Line
Link Line
-blank-
... repeat ad nauseam

Only right now it's in an array, with the items separated by
blanks:
[ 'Title', 'Author', 'Link', '', 'Title', ... ]

I really would like it to end up as an array of hashes of the form:
[ { Title => 'Title Data', Author => 'Author Data', Link => 'Link Data' }, etc. ]
Any ideas how I go through the array, putting things
where they should go?
--Jay

Re: Sort this data
by extremely (Priest) on Nov 19, 2000 at 11:31 UTC
    Well, I'd probably use perlfunc:splice to rip 4 items at a time off the list and then push hashes onto a new list,
    tested now =):
    @bigarray = ... ; # your data
    my @LoH;
    while (my ($t, $a, $l, $j) = splice(@bigarray, 0, 4)) {
        # $j soaks up the blank separator line
        push @LoH, { Title => $t, Author => $a, Link => $l };
    }

    --
    $you = new YOU;
    honk() if $you->love(perl)

      I'd guard against splice()ing continually from the beginning of an array; I'd much rather call it from the end of the array. "But Jeff," you'd say, "what if the order of the hash references matters? You couldn't do:
      while (my ($a,$b,$c) = splice(@data, -3)) {
          push @hashrefs, { a => $a, b => $b, c => $c };
          pop @data; # null field
      }
      because then @hashrefs would be in reverse!" Yes, that's absolutely right. And it might be inefficient to reverse() the array when we're done with it, and it's also inefficient to keep unshift()ing to the array. So what possible efficient solution could I come up with to combine the splice() speed with the insertion speed?

      Pre-extend the array! (Dun, dun, DUN!)
      my @hashrefs;
      $#hashrefs = int(@data / 4);
      my $i = $#hashrefs;
      while (@data and my ($a,$b,$c) = splice(@data, -3)) {
          $hashrefs[$i--] = { a => $a, b => $b, c => $c };
          pop @data;
      }
      What a sneaky trick I've done.

      Update

      Oops, used -4 above, when I meant -3. Thanks, jcwren.

      Update

      splice() will wig out when the array is empty. The while loop has been adjusted.
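      The guard matters because splice() with a negative offset that reaches past the start of an array is a fatal error in Perl. A minimal sketch (with made-up sample data, not from the thread) showing both the failure mode and the guarded loop:

```perl
use strict;
use warnings;

# splice() with a negative offset on an empty array is a fatal error
# ("Modification of non-creatable array value attempted"), so trap it.
my @empty;
my $died = !eval { my @chunk = splice(@empty, -3); 1 };

# With the "@data and" guard, the loop stops cleanly before that happens.
# Sample data: Title/Author/Link records with blank separators and no
# trailing blank, matching the layout assumed above.
my @data = ('T1', 'A1', 'L1', '', 'T2', 'A2', 'L2');
my @recs;
while (@data and my ($t, $a, $l) = splice(@data, -3)) {
    unshift @recs, { Title => $t, Author => $a, Link => $l };
    pop @data;    # discard the blank separator line
}
```

      Here unshift() only ever touches the small result list, so the records stay in original order without a final reverse().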

      japhy -- Perl and Regex Hacker
        Nice solution! But a couple of questions:

        It would be interesting to benchmark if using unshift() into the $hashrefs array would be more or less efficient than messing with $i. If nothing else, it would be two lines shorter, and more Perlish.

        And perhaps this is too early in the morning, but why are you splicing by -4, and then popping @data?

        It's Sunday morning where I am, and I'm too lazy to try to write a coherent benchmark...

        --Chris

        e-mail jcwren
        I would have to check, but I thought that splice was only slow if it has to move the array around; I don't know whether pulling from the beginning of the array has been optimized. And (like jcwren) I am too lazy to benchmark it at the moment. In any case shift is fast and has the benefit of avoiding index bookkeeping. (The bug in your code that jcwren caught? Yes, I am talking about that.)
        while (@big_array) {
            my $href;
            @$href{'title', 'author', 'link'} = map shift(@big_array), 1..4;
            push @structs, $href;
        }
        This might be faster than your sneaky trick. It might be slower. It certainly has fewer indices.

        Also the cost of reverse is overstated. You have just walked through a list of n things in Perl. You then want to reverse a list of n/4 things. What is the relative cost of those two operations? Right.

        Pick up good material on optimization, such as this sample chapter from Code Complete, or RE: Efficient Perl Programming. You will find that experienced people understand that maintainable code with good algorithms yields better overall speed wins than trying to optimize every line.

        Now, noticing the splice: that matters. If it isn't optimized, then that is an O(n) operation performed n times - which is O(n^2) and therefore likely to be slow. But one reverse at the end is an O(n) operation performed once. And should the body of the loop turn out to be slightly more efficient doing the splice rather than repeated index manipulation (something I would have to benchmark to have a feel for either way), your attempt to optimize would actually lose.

        To summarize: don't worry about slow operations, worry about bad algorithms. A slow operation inside a loop may matter. A slow operation outside a loop which speeds up the loop can go either way. An O(n) (or worse) operation inside a loop - that is the one case which should make you care up front about optimizing the structure of the code!

        EDIT
        I had messed up the final paragraph.

        What if there is a blank line at the END of the list? You might want to sniff for that and pop off blank lines first.

        Also, what if fields are allowed to be null? If so, you HAVE to read from the front...
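        To illustrate the point (a hypothetical sketch, not code from the thread): if a record may be missing a field, counting fixed-size chunks from the back misaligns, but a front-to-back scan that treats blank lines as record separators still recovers every record:

```perl
use strict;
use warnings;

# Made-up data: the second record has no Link line at all.
my @data = ('T1', 'A1', 'L1', '', 'T2', 'A2');

my (@recs, @cur);
for my $line (@data, '') {       # trailing '' flushes the last record
    if ($line eq '') {
        push @recs, { Title => $cur[0], Author => $cur[1], Link => $cur[2] }
            if @cur;
        @cur = ();
    }
    else {
        push @cur, $line;
    }
}
# The short record comes through with Link left undef instead of
# stealing a field from its neighbor.
```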

        --
        $you = new YOU;
        honk() if $you->love(perl)

Re: Sort this data (alternate method)
by extremely (Priest) on Nov 19, 2000 at 11:42 UTC
    Or, if you just read that data into that array from disk, you can likely do it all in one pass. The trick is that you said the records are separated by blank lines. You can rig Perl up to read "lines" terminated by a double newline, and then split each record on the single newlines that divide your fields:
    my @LoH;
    {
        local $/ = "\n\n"; # may be "\r\n\r\n" under windows...
        while (<>) {
            my ($t, $a, $l) = split /\n/;
            push @LoH, { Title => $t, Author => $a, Link => $l };
        }
    }

    I should probably have added this to the other post but I had just edited it like 5 times. =P

    --
    $you = new YOU;
    honk() if $you->love(perl)
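    A related trick worth noting (an addition, not from the post above): setting $/ to the empty string puts Perl in paragraph mode, which splits on one or more consecutive blank lines and is therefore more forgiving than a literal "\n\n". A self-contained sketch, reading from an in-memory filehandle with made-up sample records:

```perl
use strict;
use warnings;

# Two records separated by blank lines (an extra blank between them,
# which paragraph mode absorbs).
my $text = "First Post\nJay\nhttp://example.com/1\n\n\n"
         . "Second Post\nAnonymous Monk\nhttp://example.com/2\n";
open my $fh, '<', \$text or die $!;

my @LoH;
{
    local $/ = "";                 # paragraph mode
    while (my $para = <$fh>) {
        chomp $para;               # in paragraph mode this strips all trailing newlines
        my ($t, $a, $l) = split /\n/, $para;
        push @LoH, { Title => $t, Author => $a, Link => $l };
    }
}
```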

Re: Sort this data
by japhy (Canon) on Nov 19, 2000 at 23:04 UTC

    UPDATE

    There was a huge error in this test (and I'm stupid for not using strict in it). I was testing @c where I should have been testing @a. I am now going to replace the bad results with the GOOD results.
    Here's a benchmark. Six methods were tested, and run for 5 seconds, three times; once on a set of 100 elements, once on 1000, and once on 10000.
    in_fr_pu
    slices the array in chunks of 4, push()es the hash ref to the new array
    140.9 Hz @ 100, 13.9 Hz @ 1000, 1.3 Hz @ 10000

    sh_fr_pu
    shift()s the array, push()es the hash ref to the new array
    156.7 Hz @ 100, 15.7 Hz @ 1000, 1.4 Hz @ 10000

    sp_bk_in
    presizes the new array, splice()s from the back, inserts the hash ref via index into the new array
    149.4 Hz @ 100, 14.8 Hz @ 1000, 1.4 Hz @ 10000

    sp_bk_rv
    splice()s from the back, push()es the hash ref to the new array, then reverse()s
    151.5 Hz @ 100, 14.7 Hz @ 1000, 1.3 Hz @ 10000

    sp_bk_un
    splice()s from the back, unshift()s the hash ref to the new array
    151.6 Hz @ 100, 11.9 Hz @ 1000, 0.4 Hz @ 10000

    sp_fr_pu
    splice()s from the front, push()es the hash ref to the new array
    160.0 Hz @ 100, 15.6 Hz @ 1000, 1.4 Hz @ 10000
    Notice how all of them appear rather linear in form, EXCEPT for sp_bk_un. This is because unshift() is rather inefficient. sp_fr_pu is the fastest because splice()ing from the front just moves the beginning-of-array marker forward in the array. Thus, the next fastest is sh_fr_pu, which just uses 4 calls to shift() instead of splice().
    use Benchmark;
    for $SIZE (100, 1000, 10000) {
        timethese(-5, {
            sp_fr_pu => sub {
                my @a = (1..$SIZE); my @b;
                while (my @c = splice(@a, 0, 4)) { push @b, { @c } }
            },
            sp_bk_un => sub {
                my @a = (1..$SIZE); my @b;
                while (@a and my @c = splice(@a, -4)) { unshift @b, { @c } }
            },
            sp_bk_in => sub {
                my @a = (1..$SIZE); my @b;
                $#b = int(@a / 4); my $i = $#b;
                while (@a and my @c = splice(@a, -4)) { $b[$i--] = { @c } }
            },
            in_fr_pu => sub {
                my @a = (1..$SIZE); my @b; my $i = 0;
                while ($i < @a) { push @b, { @a[$i .. $i + 3] }; $i += 4; }
            },
            sh_fr_pu => sub {
                my @a = (1..$SIZE); my @b;
                while (@a) { push @b, { map shift(@a), 1..4 } }
            },
            sp_bk_rv => sub {
                my @a = (1..$SIZE); my @b;
                while (@a and my @c = splice(@a, -4)) { push @b, { @c } }
                @b = reverse @b;
            },
        });
    }


    japhy -- Perl and Regex Hacker

      UPDATE

      These results were wrong. I'll get correct ones later.
      Those benchmarks were under 5.005_02. Here are the 5.6 results. They're for different sizes of the list, since the test hung horribly at 10,000 elements.
      Name       Hz @ 100   Hz @ 1000   Hz @ 2000
      in_fr_pu      466.1        45.2        22.5
      sh_fr_pu      509.7        49.5        24.6
      sp_bk_in        ???         ???         ???
      sp_bk_rv        ???         ???         ???
      sp_bk_un        ???         ???         ???
      sp_fr_pu      489.3        47.2        23.4


      japhy -- Perl and Regex Hacker
      Updated. I was stupid -- not using strict ended up kicking me in the ass with my benchmark. It's now working properly, and the results are far more... expected.

      japhy -- Perl and Regex Hacker
japhy looks at av.c (not av.h)
by japhy (Canon) on Nov 20, 2000 at 00:46 UTC

    UPDATE

    As per my revelations in sort this data, here is a bit of an adjusted report on the av.c source.
    Here's the relevant portion from av.c in the 5.6 source:
    /* this is Perl_av_unshift(); it unshifts 'num' undef values onto an array */

    /* determine how much unused space is left that has already been
       allocated for this array */
    i = AvARRAY(av) - AvALLOC(av);

    /* if there's room left... */
    if (i) {
        /* if there's more room than we need, just use 'num' */
        if (i > num)
            i = num;
        /* this will set 'num' to 0 if we had enough room */
        /* 'num' is now how many new undef values we still need added */
        num -= i;
        AvMAX(av)   += i;                     /* set the highest subscript??? */
        AvFILLp(av) += i;                     /* add to highest subscript */
        SvPVX(av) = (char*)(AvARRAY(av) - i); /* move where Perl's array starts */
    }

    /* if there wasn't enough room already... */
    if (num) {
        i = AvFILLp(av);                  /* highest subscript */
        av_extend(av, i + num);           /* extend array to i+num elements */
        AvFILLp(av) += num;               /* add to highest subscript */
        ary = AvARRAY(av);                /* get at the array */
        Move(ary, ary + num, i + 1, SV*); /* slide elements up */
        do {
            ary[--num] = &PL_sv_undef;    /* set new elements to undef */
        } while (num);
    }
    Ok, so unshift() has to do a lot of sliding the elements up if there's not already some free space generated by a call to shift().

    What shift() does is simply slide the pointer up one notch, which is very fast (and which provides free space for a later unshift()). It stores the value at *AvARRAY(av) in retval, sets *AvARRAY(av) to the C equivalent of Perl's undef, and then moves the pointer that marks the beginning of the array to AvARRAY(av) + 1. It also subtracts one from the max subscript and from the length of the array. Then it returns retval.

    <revelation>I'm beginning to grok the source code! Look out!</revelation>.

    japhy -- Perl and Regex Hacker
      Nice!

      For those who are not following the code, the logic here first tries to use elements for which there is already room (this is the "if (i)" bit). If it cannot fit them all, it then makes sure it has enough space, does some accounting, copies everything over, and inserts the new elements. Note that japhy's comment about unused space is slightly misleading: he means unused space at the beginning of the array. There is also unused space at the other end, but we cannot directly use that here.

      For full details you have to also look at av.h. AvMAX is just a macro to set xav_max, which is the largest element for which there is space (adding to the front increases that), and AvFILL sets xav_fill, which is how many elements you have (adding obviously increases that as well).

      The call to av_extend is where the array size increases. What it does is extends the definition of what the array is occupying. In fact it is in a buffer whose size is allocated in powers of two. If the extend takes the array beyond the size of the buffer, then a new buffer is allocated and the array is moved. If it does not then the array is left where it is.

      Now it is clear how to make repeated calls to unshift fairly efficient. Right now it moves stuff by the minimum necessary. What we need is to have a new variable for how much to move it. That variable should be the maximum of num and the length of the array. This will cause space wastage, but it will also mean that when you hit an unshift and it has to move the array, it will not hit that copying logic again right away.

      Even so, building up an array with n calls to push will still be faster than with unshift because there is less accounting to do. But both will be O(n), rather than n calls to unshift being O(n^2) as it is today.
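      The push-then-reverse idiom described above can be sketched like this (a toy illustration, not code from the thread): n push()es followed by one reverse() produce the same array as n unshift()s, while each push is O(1) amortized and each unshift may have to slide the whole array.

```perl
use strict;
use warnings;

# Build an array front-first the slow way: each unshift may move
# every existing element.
my @via_unshift;
unshift @via_unshift, $_ for 1 .. 5;   # ends up as (5, 4, 3, 2, 1)

# Build it the fast way: append with push, then do one O(n) reverse.
my @via_push;
push @via_push, $_ for 1 .. 5;
@via_push = reverse @via_push;         # also (5, 4, 3, 2, 1)
```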

      You know, from looking at the code elsewhere (look for av_extend), they have special code in there to stretch out the array for preextending. Maybe we need a slight syntax change. Maybe $#=-500; should preallocate space at the beginning of the array without changing the size of it. It is a shame the 6.0 process is frozen or I would hack up a quick RFC.

      --
      $you = new YOU;
      honk() if $you->love(perl)

Re: Sort this data
by princepawn (Parson) on Nov 20, 2000 at 04:48 UTC
    Boulder was made for this type of problem.
