Unpacking and converting

by dwalin (Monk)
on Feb 15, 2011 at 18:56 UTC

dwalin has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise brethren, I'm seeking your help again. What I need to do is to parse fixed-width records in arrays and send the resulting data over the network. Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values. I'm trying to find the most efficient way to do this in Perl; this is what I've managed so far:
my @types = qw(number number string number boolean...);
my @data  = unpack($template, $line);
my @newdata;
my $i = 0;
while (my $item = shift @data) {
    if ($types[$i] ne 'string') {
        $item = $item ? $item + 0 : 0;
    }
    push @newdata, $item;
}

This is weird but shifting items one by one into new array seems to be a lot faster than iterating over the array with canonical for (@list). I wonder if there is any deeper Perl magic that would make the processing even faster?

Thanks in advance.

Regards,
Alex.

Replies are listed 'Best First'.
Re: Unpacking and converting
by BrowserUk (Patriarch) on Feb 15, 2011 at 22:18 UTC

    Your code above never does anything with the modified $item--ie. it never pushes it into @newdata, which means that your loop does exactly nothing, but slowly. However, looking past that...

    This is weird but shifting items one by one into new array seems to be a lot faster than iterating over the array with canonical for (@list).

    Update: Ignore. The Benchmark is broken!

    Frankly, I thought you were talking through the top of your hat when you said that, until I benchmarked it. And, despite my spending a while trying to see a flaw in the benchmark, you seem to be right. Not only am I surprised that it seems to be true, but I'm utterly staggered by the difference in performance. And at an utter loss to explain why it should be the case.

    our @a = @b = @c = '0001' .. '1000';;

    cmpthese -1, {
        a => q[ $_ += 0 for @a; ],
        b => q[ my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
        c => q[ $c[ $_ ] += 0 for 0 .. $#c; ]
    };;

               Rate        c        a      b
    c        6220/s       --     -37%  -100%
    a        9893/s      59%       --  -100%
    b     4562313/s   73247%   46016%     --

    And the bigger the array, the more extraordinary the difference becomes:

    our @a = @b = @c = @d = '00001' .. '10000';;

    cmpthese -1, {
        a => q[ $_ += 0 for @a; ],
        b => q[ my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
        c => q[ $c[ $_ ] += 0 for 0 .. $#c; ],
        d => q[ my @new = map $_ += 0, @d ]
    };;

               Rate          d         c         a      b
    d         258/s         --      -58%      -72%  -100%
    c         615/s       138%        --      -34%  -100%
    a         932/s       261%       52%        --  -100%
    b     4651085/s   1800042%   756579%   499189%     --

    There is something amiss here, but if it is the benchmark I cannot see it.

    And if not, I'm loath to explain why creating a new array by pushing the elements one at a time, whilst destroying the old one, would be so much faster than iterating over the original in place.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I recommend re-creating the arrays inside each benchmarked snippet; otherwise case b empties @b on its first run, so every later run just times an empty loop. Try:

      cmpthese -1, {
          c => q[ my @c = '0001' .. '1000'; $c[ $_ ] += 0 for 0 .. $#c; ],
          a => q[ my @a = '0001' .. '1000'; $_ += 0 for @a; ],
          b => q[ my @b = '0001' .. '1000'; my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
      };

      Then the results look reasonable:

             Rate     b     c     a
      b     999/s    --  -22%  -28%
      c    1279/s   28%    --   -7%
      a    1382/s   38%    8%    --
        I think in this case the array creation and population has a significant influence on the results. Consider this:
        cmpthese -1, {
            d => q[ my @d = '0001' .. '1000'; ],
            c => q[ my @c = '0001' .. '1000'; $c[ $_ ] += 0 for 0 .. $#c; ],
            a => q[ my @a = '0001' .. '1000'; $_ += 0 for @a; ],
            b => q[ my @b = '0001' .. '1000'; my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
        };
        Results speak for themselves:
               Rate     b     c     a     d
        b    1267/s    --  -29%  -37%  -63%
        c    1794/s   42%    --  -10%  -47%
        a    2000/s   58%   11%    --  -41%
        d    3413/s  169%   90%   71%    --

        Regards,
        Alex.

      Your code above never does anything

      Oopsie, thanks for pointing out this braino. I ran my benchmarks against working code, you can be sure of that. :)

      As for the array processing, well, I'm not a Perl guru so I can only guess; and my guess is that in Perl, arrays are not really arrays internally but linked lists with external indices. That would at least explain why shifting and tossing is so blazingly fast - there's very little expense in following a pointer, after all - but it cannot explain why index lookup is so shockingly slow. And why for (@list) is 60% faster than index lookup but otherwise slow is beyond me. This was exactly why I asked my question in the first place: my first impression was that I was doing something wrong, but I couldn't pinpoint what exactly.
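      A working version of the loop would look something like this (a sketch only; the template, field types and sample record are made up):

      my @types    = qw(number number string boolean);
      my $template = 'A10 A6 A8 A1';

      # Build one sample fixed-width record so the sketch is self-contained.
      my $line = sprintf '%-10s%-6s%-8s%-1s', '1297796160', '4096', 'foo', '1';

      my @data = unpack($template, $line);
      my @newdata;
      my $i = 0;
      while (defined(my $item = shift @data)) {
          if ($types[$i] ne 'string') {
              # Empty numeric fields become 0; everything else is numified.
              $item = length($item) ? $item + 0 : 0;
          }
          push @newdata, $item;
          $i++;    # advance the type index alongside the data
      }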

      Regards,
      Alex.

Re: Unpacking and converting
by chrestomanci (Priest) on Feb 15, 2011 at 20:55 UTC
Re: Unpacking and converting
by ikegami (Patriarch) on Feb 15, 2011 at 19:20 UTC
      ikegami,

      Not so. Simple addition results in warnings "Argument isn't numeric" whenever an empty string is there for a field that is nevertheless numeric.

      The second option makes no sense either, as I said that this data will be sent over the network. Considering that one line of data can easily be 2-4k (yes, kilo) and there can be something like 5-7k lines per batch, the savings from converting Unix time from 10-character strings to 4-byte integers are significant. And no, Storable does not convert the data by itself; I checked before asking.
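      To illustrate the scale: packing a ten-character Unix timestamp into a 32-bit integer takes it from 10 bytes down to 4 on the wire (a sketch, not my actual sender code):

      my $stamp = "1297796160";        # ten ASCII characters, as unpack() returns them
      my $wire  = pack('N', $stamp);   # 32-bit unsigned, big-endian
      print length($stamp), " -> ", length($wire), "\n";   # prints "10 -> 4"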

      Anyway, the question was how to make array processing faster, not if there is any sense in doing it.

      Regards,
      Alex.

        Simple addition results in warnings "Argument isn't numeric" whenever an empty string is there for a field that is nevertheless numeric.

        And your code warns for the string "abc" in a numeric field. You didn't perform validation, so neither did I.

        savings from converting Unix time from 10-character strings to 4-byte integers are significant.

        A measly 16% savings.

        $ perl -MDevel::Size=total_size -E'my @a; for (1..3000*6000/10) { my $item = "1234567890"; push @a, $item; } say(total_size(\@a))'
        87588692

        $ perl -MDevel::Size=total_size -E'my @a; for (1..3000*6000/10) { my $item = "1234567890"; $item = 0+$item; push @a, $item; } say(total_size(\@a))'
        73188692

        Anyway, the question was how to make array processing faster

        So why are you complaining about a little extra memory?

Re: Unpacking and converting
by andal (Hermit) on Feb 16, 2011 at 08:11 UTC
    Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values.

    I really don't understand this part. Normally unpack is used after the data is received from the network, and the function pack is used to compress it. Why do you need to convert strings to numbers?

    Normally, if I want to send something over the network in compact form, I simply use pack. Example:

    my $one = "1234";
    my $two = 1234;
    my $d = pack("nn", $one, $two);

    # See the size of packed data (it is 4 bytes).
    print length($d), "\n";

    # See that both string and number are packed the same way.
    print unpack("H*", $d), "\n";
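    And on the receiving end, the same template recovers both values (a small follow-up to the example above):

    my ($got_one, $got_two) = unpack("nn", $d);
    print "$got_one $got_two\n";    # prints "1234 1234"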

      I really don't understand this part.

      There is a black box of software that dumps some data in text format every three seconds. The data is actually a representation of internal process state in a report format, with fixed-width fields containing numeric values or, in certain cases, empty strings instead. I need to gather this data, process it and put it in a database. There is also a side requirement of placing the minimal possible load on that server. This is why the database and all processing are on an external machine I can control.

      The amount of text is quite significant; as I said, each dump can easily contain 15-30 mb of text depending on server state, and these dumps are coming every three seconds. Some kind of compression is required even when the data is sent over a fast LAN to a nearby machine, otherwise I run the risk of not having enough time to send over and process one batch before the next one is ready.

      My first thought was to use agent software that would collect the data, parse it into arrays with unpack(), serialize it with Storable and send it over to the processing machine. Now it appears to me that this approach is wrong altogether; it will be much easier to compress the text with gzip before sending. Nevertheless, the question of the fastest array iteration remains, since I still need to process the data.

      Regards,
      Alex.

        Well, probably you are looking at this the wrong way. If your program connects directly to the DB using DBI or the like, then DBI takes care of all the necessary optimizations; your attempts to convert strings to numbers just make the system less efficient. If your program simply passes data from one server to another, where another program picks it up, then again it makes more sense to simply take the text, zip it (or bzip2 it :) and copy it to the remote machine. Probably rsync with option -z would be best for this.
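        For instance, a cron-driven script could just shell out to rsync (the paths here are made up):

        # -a preserves attributes, -z compresses the stream in transit.
        system('rsync', '-az', '/var/dumps/', 'collector@dbhost:/var/dumps/') == 0
            or die "rsync failed: $?";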

        The point is: converting a Perl variable from text to number does not add any compactness.

        The other point is: iterating over the elements of an array with foreach is faster.

        Alex,

        The amount of text is quite significant ... easily contain 15-30 mb. of text..

        You really have two problems that you are trying to solve with one script: first, you need to get the data onto a separate server, and second, you need to process it.

        For the first part of the problem, I would use IO::Compress::Gzip and then send the data to the second machine. Your mileage may vary, but I would expect your 15-30 MB file to compress to 1-3 MB. Fast and secure, and core code. Then use IO::Uncompress::Gunzip on the second machine to get the original data back.
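        A minimal sketch of that round trip, using in-memory buffers (real code would presumably write the compressed data to a file or socket):

        use IO::Compress::Gzip     qw(gzip   $GzipError);
        use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

        my $dump = "timestamped report text ...\n" x 1_000;     # stand-in for a real dump

        gzip   \$dump       => \my $compressed or die "gzip failed: $GzipError";
        gunzip \$compressed => \my $restored   or die "gunzip failed: $GunzipError";

        printf "%d bytes -> %d bytes\n", length $dump, length $compressed;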

        For the second problem, IMO, use the power of Unix. Use multiple scripts to process the data in parallel. The data should be time-stamped, so the data going into the database will be correct, which is more important than having the fastest script. I would use cron to check on the status of the running scripts: save your pids ($$) in a common place and use a small, simple Perl script to check that they are still running, as sketched below. It's quite simple to send a "text message" to multiple admins if you discover problems with the scripts! And use 'sleep' or 'usleep' between script passes; you'll get a lot more work done in the long run.
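        A tiny sketch of such a watchdog (the pid-file location is made up; kill with signal 0 only tests whether a process exists, it does not disturb it):

        # One pid per line, written by each worker at startup.
        open my $fh, '<', '/var/run/report-workers.pids' or die "no pid file: $!";
        while (my $pid = <$fh>) {
            chomp $pid;
            next unless $pid =~ /^\d+$/;
            warn "worker $pid is not running\n" unless kill 0, $pid;
        }
        close $fh;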

        Good Luck!

        "Well done is better than well said." - Benjamin Franklin
