Re: Unpacking and converting

by andal (Hermit)
on Feb 16, 2011 at 08:11 UTC

in reply to Unpacking and converting

Bandwidth matters a lot in this case, so I'd like to "compress" the data by converting numbers from string representation given by unpack() to numerical values.

I really don't understand this part. Normally unpack is used after the data has been received from the network, and pack is what you use to make the data compact before sending it. Why do you need to convert strings to numbers?

Normally, if I want to send something over the network in compact form, I simply use pack. Example:

my $one = "1234";
my $two = 1234;
my $d = pack("nn", $one, $two);
# See the size of packed data (it is 4 bytes).
print length($d), "\n";
# See that both string and number are packed the same way.
print unpack("H*", $d), "\n";
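
For completeness, here is a minimal sketch of the round trip, with the receiving side included. The sub names pack_values and unpack_values are made up for illustration, and the "n" format only holds unsigned values from 0 to 65535, so it is an assumption that the real data fits that range.

```perl
use strict;
use warnings;

# Sender side: pack every value as a 16-bit unsigned big-endian integer.
# Numeric strings and numbers produce identical bytes.
sub pack_values {
    my @values = @_;
    return pack("n*", @values);
}

# Receiver side: the same template recovers the numbers.
sub unpack_values {
    my ($bytes) = @_;
    return unpack("n*", $bytes);
}

my $packed = pack_values("1234", 1234);
my @values = unpack_values($packed);

print length($packed), " bytes: ", unpack("H*", $packed), "\n";
print "values: @values\n";
```

Values outside that range would need a wider format such as "N" (32-bit), at the cost of two more bytes per value.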

Re^2: Unpacking and converting
by dwalin (Monk) on Feb 16, 2011 at 09:44 UTC
    I really don't understand this part.

    There is a black box of software that dumps some data in text format every three seconds. The data is actually a representation of internal process state in a report format, with fixed-width fields containing numeric values or, in certain cases, empty strings instead. I need to gather this data, process it and put it in a database. There is also a side requirement of placing the minimum possible load on that server. This is why the database and all the processing are on an external machine I can control.

    The amount of text is quite significant: as I said, each dump can easily contain 15-30 MB of text depending on server state, and these dumps arrive every three seconds. Some kind of compression is required even when the data is sent over a fast LAN to a nearby machine; otherwise I run the risk of not having enough time to send and process one batch before the next one is ready.

    My first thought was to use agent software that would collect the data, parse it into arrays with unpack(), serialize them with Storable and send them over to the processing machine. Now it appears to me that this approach is wrong altogether; it will be much easier to compress the text with gzip before sending. Nevertheless, the question of the fastest array iteration remains, since I still need to process the data.
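
    A minimal sketch of the unpack() parsing step might look like the following. The field widths (6, 4 and 8 characters), the sample line and the name parse_line are all assumptions made for illustration; the real report layout is not shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical layout: three fixed-width fields of 6, 4 and 8 characters.
my $TEMPLATE = "A6 A4 A8";

# Split one report line into fields. Blank fields become undef,
# everything else is converted to a number.
sub parse_line {
    my ($line) = @_;
    my @fields = unpack($TEMPLATE, $line);
    for my $field (@fields) {
        $field =~ s/^\s+//;    # "A" strips trailing spaces only
        $field = length($field) ? $field + 0 : undef;
    }
    return @fields;
}

# Made-up sample line: a number, an empty field, another number.
my @fields = parse_line("    42      1234.5");
print join(",", map { defined $_ ? $_ : "NULL" } @fields), "\n";
```

    Mapping blank fields to undef keeps "no value" distinct from zero, which matters once the rows go into a database.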


      Well, you are probably looking at this the wrong way. If your program connects directly to the DB using DBI or the like, then DBI takes care of all the necessary optimizations, and your attempts to convert strings to numbers just make the system less efficient. If your program simply passes data from one server to another, where another program picks it up, then again it makes sense to simply take the text, zip it (or bzip2 it :) and copy it to the remote machine. rsync with the -z option would probably be best for this.

      The point is: converting a Perl variable from text to number does not add any compactness.

      The other point is: iterating over the elements of an array with foreach is faster.
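
      One way to check the iteration claim on your own machine is the core Benchmark module. The sketch below assumes the comparison is between foreach and an index-based C-style loop, which is not spelled out above; the sub names and the array size are made up, and the timings will differ between perl builds.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = (1 .. 100_000);

# Variant 1: foreach aliases each element directly.
sub sum_foreach {
    my $sum = 0;
    $sum += $_ for @data;
    return $sum;
}

# Variant 2: C-style loop with an explicit index lookup per element.
sub sum_indexed {
    my $sum = 0;
    for (my $i = 0; $i < @data; $i++) {
        $sum += $data[$i];
    }
    return $sum;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'foreach' => \&sum_foreach,
    'indexed' => \&sum_indexed,
});
```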

        Probably you are looking the wrong way.

        Yes, I came to the same conclusion. In my case, I will be able to use SSH with compression and this indeed solves all my pains; somehow it wasn't the first choice for me. I was dumbstruck probably, as it seems obvious in hindsight.

        For consistency's sake, though, I'd like to mention that conversion from text to number does indeed add compactness to serialized data. Consider this:

        use Storable qw(freeze);
        my @a = '0001'..'1000';
        my $foo = freeze \@a;
        $_ += 0 for @a;
        my $bar = freeze \@a;
        print "before: ", length $foo, ", after: ", length $bar, "\n";

        before: 6016, after: 4635

        I fail to understand why so many people insist on ignoring the obvious. Granted, today's fast and abundant hardware resources may have spoiled us, but there are still situations where every byte counts. Make the dataset in the above example three orders of magnitude larger and the difference becomes quite distinct.
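
        To put several encodings side by side, a sketch along these lines could be used. The sub name serialized_sizes is made up. As far as I know, the Storable figures depend on the Storable version and on the integer width of the perl build, so the numbers printed on another machine may not match the ones above; the pack figure is fixed at two bytes per value.

```perl
use strict;
use warnings;
use Storable qw(freeze nfreeze);

# Return the serialized size in bytes for several encodings of the same data.
sub serialized_sizes {
    my @strings = ('0001' .. '1000');          # zero-padded strings
    my @numbers = map { $_ + 0 } @strings;     # same values as integers
    return (
        'freeze strings'  => length(freeze(\@strings)),
        'freeze numbers'  => length(freeze(\@numbers)),
        'nfreeze numbers' => length(nfreeze(\@numbers)),
        'pack n*'         => length(pack('n*', @numbers)),
    );
}

my %size = serialized_sizes();
printf "%-16s %5d bytes\n", $_, $size{$_} for sort keys %size;
```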

        Regarding array iteration, I feel this discussion was beneficial for me as it cleared some murky points. I wish all my questions were answered so productively in future. :)



      The amount of text is quite significant ... easily contain 15-30 mb. of text..

      You really have 2 problems that you are trying to solve with one script. First, you need to get the data on a separate server, and second, you need to process the data.

      For the first part of the problem, I would use IO::Compress::Gzip and then send the data to the second machine. Your mileage may vary, but I would expect your 15-30 MB file to compress to 1-3 MB. It is fast, secure, and core code. Then use IO::Uncompress::Gunzip on the second machine to get back the original data.
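
      A minimal in-memory sketch of that round trip is shown below. The sub names and the sample report are made up for illustration; in practice the compressed bytes would be written to a socket or file between the two steps, and the compression ratio depends entirely on the real data.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Sender side: compress a string in memory and return the gzip bytes.
sub compress_text {
    my ($text) = @_;
    my $compressed;
    gzip(\$text => \$compressed) or die "gzip failed: $GzipError\n";
    return $compressed;
}

# Receiver side: turn the gzip bytes back into the original text.
sub uncompress_text {
    my ($compressed) = @_;
    my $text;
    gunzip(\$compressed => \$text) or die "gunzip failed: $GunzipError\n";
    return $text;
}

# Made-up sample resembling a fixed-width numeric report.
my $report = join "", map { sprintf "%8d%8d%8s\n", $_, $_ * 2, "" } 1 .. 10_000;
my $gz     = compress_text($report);
printf "original: %d bytes, compressed: %d bytes\n", length($report), length($gz);
```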

      For the second problem, IMO, use the power of Unix. Use multiple scripts to process the data in parallel. The data should be time-stamped so that what goes into the database is correct, which is more important than having the fastest script. I would use cron to check on the status of the running scripts. Save your pids ($$) in a common place and use a small, simple Perl script to check that they are still running. It's quite simple to send a "text message" to multiple admins if you discover problems with the scripts! And use 'sleep' or 'usleep' between script passes; you'll get a lot more work done in the long run.
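
      The pid bookkeeping could be sketched roughly as follows. The directory, the file naming and the sub names write_pid_file and is_running are assumptions made for illustration; a real setup would pick its own common place and add the alerting.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical shared directory for pid files; adjust to taste.
my $PID_DIR = File::Spec->tmpdir;

# Called by each worker at startup: record its pid under a given name.
sub write_pid_file {
    my ($name) = @_;
    my $path = File::Spec->catfile($PID_DIR, "$name.pid");
    open my $fh, '>', $path or die "Cannot write $path: $!\n";
    print {$fh} "$$\n";
    close $fh or die "Cannot close $path: $!\n";
    return $path;
}

# Called from the cron-driven checker: true if the recorded process exists.
# Signal 0 sends nothing; it only tests whether the pid can be signalled,
# so a process owned by another user may be reported as not running.
sub is_running {
    my ($path) = @_;
    open my $fh, '<', $path or return 0;
    my $pid = <$fh>;
    close $fh;
    return 0 unless defined $pid && $pid =~ /^(\d+)$/;
    return kill(0, $1) ? 1 : 0;
}

my $path = write_pid_file("worker-demo");
print "worker-demo is ", (is_running($path) ? "running" : "not running"), "\n";
unlink $path;
```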

      Good Luck!

      "Well done is better than well said." - Benjamin Franklin
