Unpacking and converting

by dwalin (Monk)
on Feb 15, 2011 at 18:56 UTC

dwalin has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise brethren, I'm seeking your help again. What I need to do is to parse fixed-width records in arrays and send the resulting data over the network. Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values. I'm trying to find the most efficient way to do this in Perl; this is what I've managed so far:
my @types = qw(number number string number boolean...);
my @data  = unpack($template, $line);
my @newdata;
my $i = 0;
while (my $item = shift @data) {
    if ($types[$i] ne 'string') {
        $item = $item ? $item + 0 : 0;
    }
    push @newdata, $item;
}

This is weird but shifting items one by one into new array seems to be a lot faster than iterating over the array with canonical for (@list). I wonder if there is any deeper Perl magic that would make the processing even faster?

Thanks in advance.

Regards,
Alex.

Replies are listed 'Best First'.
Re: Unpacking and converting
by BrowserUk (Patriarch) on Feb 15, 2011 at 22:18 UTC

    Your code above never does anything with the modified $item--ie. it never pushes it into @newdata, which means that your loop does exactly nothing, but slowly. However, looking past that...

    This is weird but shifting items one by one into new array seems to be a lot faster than iterating over the array with canonical for (@list).

    Update: Ignore. The Benchmark is broken!

    Frankly, I thought you were talking through the top of your hat when you said that, until I benchmarked it. And, despite my spending a while trying to see a flaw in the benchmark, you seem to be right. Not only am I surprised that it seems to be true, but I'm utterly staggered by the difference in performance. And at an utter loss to explain why it should be the case.

    our @a = @b = @c = '0001' .. '1000';;

    cmpthese -1, {
        a => q[ $_ += 0 for @a; ],
        b => q[ my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
        c => q[ $c[ $_ ] += 0 for 0 .. $#c; ]
    };;

               Rate        c        a      b
    c        6220/s       --     -37%  -100%
    a        9893/s      59%       --  -100%
    b     4562313/s   73247%   46016%     --

    And the bigger the array, the more extraordinary the difference becomes:

    our @a = @b = @c = @d = '00001' .. '10000';;

    cmpthese -1, {
        a => q[ $_ += 0 for @a; ],
        b => q[ my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
        c => q[ $c[ $_ ] += 0 for 0 .. $#c; ],
        d => q[ my @new = map $_ += 0, @d ]
    };;

               Rate          d         c         a      b
    d         258/s         --      -58%      -72%  -100%
    c         615/s       138%        --      -34%  -100%
    a         932/s       261%       52%        --  -100%
    b     4651085/s   1800042%   756579%   499189%     --

    There is something amiss here, but if it is the benchmark I cannot see it.

    And if not, I'm loath to explain why creating a new array by pushing the elements one at a time, whilst destroying the old one, would be so much faster than iterating over the original in place.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I recommend re-creating the arrays inside each benchmarked snippet; otherwise case b empties @b on its first run, so every later run just times an empty loop. Try:

      cmpthese -1, {
          c => q[ my @c = '0001' .. '1000'; $c[ $_ ] += 0 for 0 .. $#c; ],
          a => q[ my @a = '0001' .. '1000'; $_ += 0 for @a; ],
          b => q[ my @b = '0001' .. '1000'; my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
      };

      Then the results look reasonable:

             Rate     b     c     a
      b     999/s    --  -22%  -28%
      c    1279/s   28%    --   -7%
      a    1382/s   38%    8%    --
        I think in this case the array creation and population has a significant influence on the results. Consider this:
        cmpthese -1, {
            d => q[ my @d = '0001' .. '1000'; ],
            c => q[ my @c = '0001' .. '1000'; $c[ $_ ] += 0 for 0 .. $#c; ],
            a => q[ my @a = '0001' .. '1000'; $_ += 0 for @a; ],
            b => q[ my @b = '0001' .. '1000'; my @new; push @new, $_ + 0 while defined( $_ = shift @b ) ],
        };
        Results speak for themselves:
               Rate     b     c     a     d
        b    1267/s    --  -29%  -37%  -63%
        c    1794/s   42%    --  -10%  -47%
        a    2000/s   58%   11%    --  -41%
        d    3413/s  169%   90%   71%    --

        Regards,
        Alex.

      Your code above never does anything

      Oopsie, thanks for pointing out this braino. I ran my benchmarks against working code, you can be sure of that. :)

      As for the array processing, well, I'm not a Perl guru so I can only guess; and my guess is that in Perl, arrays are not really arrays internally but linked lists with external indices. That would at least explain why shifting and tossing is so blazingly fast - there's very little expense in following a pointer, after all - but it cannot explain why index lookup is so shockingly slow. And why for (@list) is 60% faster than index lookup but otherwise slow is beyond me. This was exactly why I asked my question in the first place: my first impression was that I was doing something wrong, but I couldn't pinpoint what exactly.
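      A working version of the loop would look something like this (a sketch only; the template, field types and sample record are made up):

      my @types    = qw(number number string boolean);
      my $template = 'A10 A6 A8 A1';

      # Build one sample fixed-width record so the sketch is self-contained.
      my $line = sprintf '%-10s%-6s%-8s%-1s', '1297796160', '4096', 'foo', '1';

      my @data = unpack($template, $line);
      my @newdata;
      my $i = 0;
      while (defined(my $item = shift @data)) {
          if ($types[$i] ne 'string') {
              # Empty numeric fields become 0; everything else is numified.
              $item = length($item) ? $item + 0 : 0;
          }
          push @newdata, $item;
          $i++;    # advance the type index alongside the data
      }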

      Regards,
      Alex.

Re: Unpacking and converting
by chrestomanci (Priest) on Feb 15, 2011 at 20:55 UTC
Re: Unpacking and converting
by ikegami (Patriarch) on Feb 15, 2011 at 19:20 UTC
      ikegami,

      Not so. Simple addition results in warnings "Argument isn't numeric" whenever an empty string is there for a field that is nevertheless numeric.

      The second option makes no sense either, as I said that this data will be sent over the network. Considering that one line of data can easily be 2-4k (yes, kilo) and there can be something like 5-7k lines per batch, the savings from converting Unix time from 10-character strings to 4-byte integers are significant. And no, Storable does not convert the data by itself; I checked before asking.
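      To illustrate the scale: packing a ten-character Unix timestamp into a 32-bit integer takes it from 10 bytes down to 4 on the wire (a sketch, not my actual sender code):

      my $stamp = "1297796160";        # ten ASCII characters, as unpack() returns them
      my $wire  = pack('N', $stamp);   # 32-bit unsigned, big-endian
      print length($stamp), " -> ", length($wire), "\n";   # prints "10 -> 4"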

      Anyway, the question was how to make array processing faster, not if there is any sense in doing it.

      Regards,
      Alex.

        Simple addition results in warnings "Argument isn't numeric" whenever an empty string is there for a field that is nevertheless numeric.

        And your code warns for the string "abc" in a numeric field. You didn't perform validation, so neither did I.

        savings from converting Unix time from 10-character strings to 4-byte integers are significant.

        A measly 16% savings.

        $ perl -MDevel::Size=total_size -E'my @a; for (1..3000*6000/10) { my $item = "1234567890"; push @a, $item; } say(total_size(\@a))'
        87588692

        $ perl -MDevel::Size=total_size -E'my @a; for (1..3000*6000/10) { my $item = "1234567890"; $item = 0+$item; push @a, $item; } say(total_size(\@a))'
        73188692

        Anyway, the question was how to make array processing faster

        So why are you complaining about a little extra memory?

Re: Unpacking and converting
by andal (Hermit) on Feb 16, 2011 at 08:11 UTC
    Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values.

    I really don't understand this part. Normally unpack is used after the data is received from the network, and the function pack is used to compress it. Why do you need to convert strings to numbers?

    Normally, if I want to send something over the network in compact form, I simply use pack. Example:

    my $one = "1234";
    my $two = 1234;
    my $d = pack("nn", $one, $two);

    # See the size of packed data (it is 4 bytes).
    print length($d), "\n";

    # See that both string and number are packed the same way.
    print unpack("H*", $d), "\n";
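    And on the receiving end, the same template recovers both values (a small follow-up to the example above):

    my ($got_one, $got_two) = unpack("nn", $d);
    print "$got_one $got_two\n";    # prints "1234 1234"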

      I really don't understand this part.

      There is a black box of software that dumps some data in text format every three seconds. The data is actually a representation of internal process state in a report format, with fixed-width fields containing numeric values or, in certain cases, empty strings instead. I need to gather this data, process it and put it in a database. There is also a side requirement of placing the minimal possible load on that server. This is why the database and all processing are on an external machine I can control.

      The amount of text is quite significant; as I said, each dump can easily contain 15-30 mb of text depending on server state, and these dumps are coming every three seconds. Some kind of compression is required even when the data is sent over a fast LAN to a nearby machine, otherwise I run the risk of not having enough time to send over and process one batch before the next one is ready.

      My first thought was to use agent software that would collect the data, parse it into arrays with unpack(), serialize it with Storable and send it over to the processing machine. Now it appears to me that this approach is wrong altogether; it will be much easier to compress the text with gzip before sending. Nevertheless, the question of the fastest array iteration remains, since I still need to process the data.

      Regards,
      Alex.

        Well, probably you are looking at this the wrong way. If your program connects directly to the DB using DBI or the like, then DBI takes care of all the necessary optimizations; your attempts to convert strings to numbers just make the system less efficient. If your program simply passes data from one server to another, where another program picks it up, then again it makes more sense to simply take the text, zip it (or bzip2 it :) and copy it to the remote machine. Probably rsync with option -z would be best for this.
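        For instance, a cron-driven script could just shell out to rsync (the paths here are made up):

        # -a preserves attributes, -z compresses the stream in transit.
        system('rsync', '-az', '/var/dumps/', 'collector@dbhost:/var/dumps/') == 0
            or die "rsync failed: $?";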

        The point is: converting a Perl variable from text to number does not add any compactness.

        The other point is: iterating over the elements of an array with foreach is faster.

        Alex,

        The amount of text is quite significant ... easily contain 15-30 mb. of text..

        You really have two problems that you are trying to solve with one script: first, you need to get the data onto a separate server, and second, you need to process it.

        For the first part of the problem, I would use IO::Compress::Gzip and then send the data to the second machine. Your mileage may vary, but I would expect your 15-30 MB file to compress to 1-3 MB. Fast and secure, and core code. Then use IO::Uncompress::Gunzip on the second machine to get the original data back.
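        A minimal sketch of that round trip, using in-memory buffers (real code would presumably write the compressed data to a file or socket):

        use IO::Compress::Gzip     qw(gzip   $GzipError);
        use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

        my $dump = "timestamped report text ...\n" x 1_000;     # stand-in for a real dump

        gzip   \$dump       => \my $compressed or die "gzip failed: $GzipError";
        gunzip \$compressed => \my $restored   or die "gunzip failed: $GunzipError";

        printf "%d bytes -> %d bytes\n", length $dump, length $compressed;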

        For the second problem, IMO, use the power of Unix. Use multiple scripts to process the data in parallel. The data should be time-stamped, so the data going into the database will be correct, which is more important than having the fastest script. I would use cron to check on the status of the running scripts: save your pids ($$) in a common place and use a small, simple Perl script to check that they are still running, as sketched below. It's quite simple to send a "text message" to multiple admins if you discover problems with the scripts! And use 'sleep' or 'usleep' between script passes; you'll get a lot more work done in the long run.
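        A tiny sketch of such a watchdog (the pid-file location is made up; kill with signal 0 only tests whether a process exists, it does not disturb it):

        # One pid per line, written by each worker at startup.
        open my $fh, '<', '/var/run/report-workers.pids' or die "no pid file: $!";
        while (my $pid = <$fh>) {
            chomp $pid;
            next unless $pid =~ /^\d+$/;
            warn "worker $pid is not running\n" unless kill 0, $pid;
        }
        close $fh;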

        Good Luck!

        "Well done is better than well said." - Benjamin Franklin
