dwalin has asked for the wisdom of the Perl Monks concerning the following question:
Hello wise brethren, I'm seeking your help again.
What I need to do is to parse fixed-width records in arrays and send the resulting data over network. Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values. I'm trying to find the most effective way to do this in Perl, this is what I managed so far:
my @types = qw(number number string number boolean...);
my @data  = unpack($template, $line);
my @newdata;
my $i = 0;
while (defined(my $item = shift @data)) {
    if ($types[$i] ne 'string') {
        $item = $item ? $item + 0 : 0;
    }
    push @newdata, $item;
    $i++;
}
This is weird, but shifting items one by one into a new array seems to be a lot faster than iterating over the array with the canonical for (@list). I wonder if there is any deeper Perl magic that would make the processing even faster?
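[Editor's note: for comparison, here is a minimal sketch of the conventional in-place loop the thread discusses, with a guard for empty fields; the type list and sample values are made up for illustration.]

```perl
# In-place numification guided by a parallel type list (a sketch).
# Empty numeric fields become 0 instead of warning.
my @types = qw(number number string number);
my @data  = ('0012', '0034', 'abc', '');

for my $i (0 .. $#data) {
    next if $types[$i] eq 'string';
    $data[$i] = length $data[$i] ? $data[$i] + 0 : 0;
}
```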
Thanks in advance.
Regards,
Alex.
Re: Unpacking and converting
by BrowserUk (Patriarch) on Feb 15, 2011 at 22:18 UTC
Your code above never does anything with the modified $item; i.e. it never pushes it into @newdata, which means that your loop does exactly nothing, but slowly. However, looking past that...
This is weird but shifting items one by one into new array seems to be a lot faster than iterating over the array with canonical for (@list).
Update: Ignore. The Benchmark is broken!
Frankly, I thought you were talking through the top of your hat when you said that, until I benchmarked it. And, despite my spending a while trying to see a flaw in the benchmark, you seem to be right. Not only am I surprised that it seems to be true, but I'm utterly staggered by the difference in performance. And at an utter loss to explain why it should be the case.
our @a = @b = @c = '0001' .. '1000';;
cmpthese -1,{
a => q[ $_ += 0 for @a; ],
b => q[ my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
c => q[ $c[ $_ ] += 0 for 0 .. $#c; ]
};;
Rate c a b
c 6220/s -- -37% -100%
a 9893/s 59% -- -100%
b 4562313/s 73247% 46016% --
And the bigger the array, the more extraordinary the difference becomes:
our @a = @b = @c = @d = '00001' .. '10000';;
cmpthese -1,{
a => q[ $_ += 0 for @a;],
b => q[ my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
c => q[ $c[ $_ ] += 0 for 0 .. $#c; ],
d => q[ my @new = map $_ += 0, @d ]
};;
Rate d c a b
d 258/s -- -58% -72% -100%
c 615/s 138% -- -34% -100%
a 932/s 261% 52% -- -100%
b 4651085/s 1800042% 756579% 499189% --
There is something amiss here, but if it is the benchmark, I cannot see it.
And if not, I'm loath to explain why creating a new array by pushing items one at a time, whilst destroying the old one, would be so much faster than iterating over the original in place.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
cmpthese -1,{
c => q[ my @c = '0001' .. '1000'; $c[ $_ ] += 0 for 0 .. $#c; ],
a => q[ my @a = '0001' .. '1000'; $_ += 0 for @a; ],
b => q[ my @b = '0001' .. '1000'; my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
};
Then the results are plausible:
Rate b c a
b 999/s -- -22% -28%
c 1279/s 28% -- -7%
a 1382/s 38% 8% --
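[Editor's note: a likely explanation of the broken benchmark, as a sketch; this inference is mine and is not stated explicitly in the thread. The shift-based variant consumes its source array on the first pass, so every subsequent timed iteration loops over an empty array and does no work at all.]

```perl
# First pass: converts all 1000 elements, but destroys @b in doing so.
my @b = '0001' .. '1000';
my @new;
push @new, $_ + 0 while defined( $_ = shift @b );

# A second "benchmark iteration" over the same @b now does nothing,
# which is why the original benchmark reported millions of runs/sec.
my @again;
push @again, $_ + 0 while defined( $_ = shift @b );
```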
I think in this case the array creation and population has a significant influence on the results. Consider this:
cmpthese -1,{
d => q[ my @d = '0001' .. '1000'; ],
c => q[ my @c = '0001' .. '1000'; $c[ $_ ] += 0 for 0 .. $#c; ],
a => q[ my @a = '0001' .. '1000'; $_ += 0 for @a; ],
b => q[ my @b = '0001' .. '1000'; my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
};
Results speak for themselves:
Rate b c a d
b 1267/s -- -29% -37% -63%
c 1794/s 42% -- -10% -47%
a 2000/s 58% 11% -- -41%
d 3413/s 169% 90% 71% --
Regards,
Alex.
Your code above never does anything
Oopsie, thanks for pointing out this braino. I ran my benchmarks against working code, you can be sure of that. :)
Considering array processing, well, I'm not a Perl guru so I can only guess; and my guess is that in Perl, arrays are not really arrays internally but linked lists with external indices. That would at least explain why shifting and tossing is so blazingly fast (there's very little expense in following a pointer, after all), but it cannot explain why index lookup is so shockingly slow. And why for (@list) is 60% faster than index lookup but otherwise slow is beyond me. This was exactly the reason I asked my question: my first impression was that I was doing something wrong, but I couldn't pinpoint what exactly.
Regards,
Alex.
Re: Unpacking and converting
by chrestomanci (Priest) on Feb 15, 2011 at 20:55 UTC
Re: Unpacking and converting
by ikegami (Patriarch) on Feb 15, 2011 at 19:20 UTC
$item = $item ? $item + 0 : 0;
can be written as
$item += 0;
or just as
# This space intentionally left empty
ikegami,
Not so. Simple addition produces "Argument isn't numeric" warnings whenever a numeric field contains an empty string.
The second option makes no sense either, as I said this data will be sent over the network. Considering that one line of data can easily be 2-4k (yes, kilo) and there can be 5-7k lines per batch, the savings from converting Unix times from 10-character strings to 4-byte integers are significant. And no, Storable does not convert the data by itself; I checked before asking.
Anyway, the question was how to make array processing faster, not whether there is any sense in doing it.
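[Editor's note: the numeric warning Alex mentions can also be silenced lexically instead of guarded per element; a sketch, not proposed in the thread itself.]

```perl
use strict;
use warnings;

# Numify all fields in place; empty strings become 0 without
# emitting "Argument isn't numeric" warnings, because the
# 'numeric' warning category is disabled for this block only.
my @fields = ('123', '', '0042');
{
    no warnings 'numeric';
    $_ += 0 for @fields;
}
```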
Regards,
Alex.
$ perl -MDevel::Size=total_size -E'my @a; for (1..3000*6000/10) { my $item = "1234567890"; push @a, $item; } say(total_size(\@a))'
87588692
$ perl -MDevel::Size=total_size -E'my @a; for (1..3000*6000/10) { my $item = "1234567890"; $item = 0+$item; push @a, $item; } say(total_size(\@a))'
73188692
Anyway, the question was how to make array processing faster
So why are you complaining about a little extra memory?
Re: Unpacking and converting
by andal (Hermit) on Feb 16, 2011 at 08:11 UTC
Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values.
I really don't understand this part. Normally unpack is used after the data is received from the network, and pack is the function used to compress the data. Why do you need to convert strings to numbers?
Normally, if I want to send over network something in compact form I simply use pack. Example
my $one = "1234";
my $two = 1234;
my $d = pack("nn", $one, $two);
# See the size of packed data (it is 4 bytes).
print length($d), "\n";
# See that both string and number are packed the same way.
print unpack("H*", $d), "\n";
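[Editor's note: extending andal's example to the OP's use case, a 10-character epoch string can travel as 4 bytes; a sketch with a hypothetical record layout, not one given in the thread.]

```perl
# Hypothetical fixed record: 32-bit epoch ("N"), 16-bit count ("n"),
# 8-bit flag ("C") -- 7 bytes on the wire instead of the original
# fixed-width text. pack numifies the string epoch automatically.
my $packed = pack("N n C", "1297800000", 512, 1);

# The receiving side unpacks with the same template.
my ($epoch, $count, $flag) = unpack("N n C", $packed);
```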
I really don't understand this part.
There is a black box of software that dumps some data in text format every three seconds. The data is actually a representation of internal process state in a report format, with fixed-width fields containing numeric values or, in certain cases, empty strings. I need to gather this data, process it and put it in a database. There is also a side requirement of placing the minimal possible load on that server, which is why the database and all processing are on an external machine I can control.
The amount of text is quite significant: as I said, each dump can easily contain 15-30 MB of text depending on server state, and these dumps arrive every three seconds. Some kind of compression is required even when the data is sent over a fast LAN to a nearby machine; otherwise I run the risk of not having enough time to send and process one batch before the next one is ready.
My first thought was to use agent software that would collect the data, parse it into arrays with unpack(), serialize it with Storable and send it over to the processing machine. Now it appears to me that this approach is wrong altogether; it will be much easier to compress the text with gzip before sending. Nevertheless, the question of the fastest array iteration remains, since I still need to process the data.
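[Editor's note: a minimal sketch of compressing a batch in memory with the core IO::Compress::Gzip module before sending; the sample text and sizes are illustrative only.]

```perl
use IO::Compress::Gzip qw(gzip $GzipError);

# Fixed-width report text is highly repetitive, so it compresses
# very well; here we gzip a batch into a scalar ready for sending.
my $batch = "PROC  0042  1297800000  OK\n" x 5000;
my $compressed;
gzip \$batch => \$compressed
    or die "gzip failed: $GzipError";
```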
Regards,
Alex.
Well, you are probably looking at this the wrong way. If your program connects directly to the DB using DBI or the like, then DBI takes care of all the necessary optimizations; your attempts to convert strings to numbers just make the system less efficient. If your program simply passes data from one server to another, where another program picks it up, then again it makes sense to simply take the text, zip it (or bzip2 it :) and then copy it to the remote machine. Probably rsync with the -z option would be best for this.
The point is: converting a Perl variable from text to number does not add any compactness.
The other point is: iterating over the elements of an array using foreach is faster.
Alex,
The amount of text is quite significant ... easily contain 15-30 MB of text...
You really have two problems that you are trying to solve with one script. First, you need to get the data onto a separate server, and second, you need to process the data.
For the first part of the problem, I would use IO::Compress::Gzip and then send the data to the second machine. Your mileage may vary, but I would expect your 15-30 MB file to compress to 1-3 MB. Fast, secure, and core code. Then use IO::Uncompress::Gunzip on the second machine to get back the original data.
For the second problem, IMO, use the power of Unix. Use multiple scripts to process the data in parallel. The data should be time-stamped, so the data going into the database will be correct, which is more important than having the fastest script. I would use cron to check on the status of the running scripts: save your pids ($$) in a common place and use a small, simple Perl script to check that they are still running.
It's quite simple to send a "text message" to multiple admins if you discover problems with the scripts! And use 'sleep' or 'usleep' between script passes; you'll get a lot more work done in the long run.
Good Luck!
"Well done is better than well said." - Benjamin Franklin