http://www.perlmonks.org?node_id=158817

Parham has asked for the wisdom of the Perl Monks concerning the following question:

Currently, on projects where users get to input information, I ID that information using the following code:
my $id = "$^T$$"; print $id;
$^T is the basetime (measure of seconds since 1970)
$$ is the current process ID

I implemented this method thinking it was more reasonable than tracking and incrementing a number. Here's my question: is this a foolproof method of ID'ing user input? My concern isn't with $^T (because obviously time only moves forward), but with $$, which I think can be reset at one point or another.

Replies are listed 'Best First'.
Re: method of ID'ing
by Juerd (Abbot) on Apr 13, 2002 at 18:10 UTC

    $$ which i'm thinking can be reset at one point or another.

    As long as you're on a system that uses incremental process ID's, you will probably be safe. Process ID's cycle when the maximum has been reached (I think it's 65535 on my system) when they're incremental, but forking that often in a single second is very unlikely.

    However, not all PIDs are simply incremental. Some are chosen randomly, and in that case, especially with short-running scripts, you have a greater chance of getting two identical IDs.

    My concern doesn't deal with $^T

    It contains the start time of the program (read: interpreter), which can cause problems if your interpreter is a long running interpreter like mod_perl or one of the many fast-perl-CGI things that avoid forking interpreters. Better is to use time, which returns the current time.
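The difference between $^T and time can be sketched as follows. This is only an illustration; the sub names id_from_basetime and id_from_time are made up for this example and do not come from the thread.

```perl
use strict;
use warnings;

# $^T is captured once, when the interpreter starts.
# Every call in the same process returns the same string.
sub id_from_basetime { return $^T . '-' . $$ }

# time() is read on every call, so the value follows the clock.
sub id_from_time { return time() . '-' . $$ }

print "basetime: ", id_from_basetime(), "\n";
print "time:     ", id_from_time(),     "\n";
```

Under a persistent interpreter such as mod_perl, the first form keeps returning the interpreter's start time for every request served by that process, which is the problem described above.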

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

      Good point on the $^T bit, I missed that. I'd actually think it was more of a problem with a short-lived task though such as a CGI. For example, if a CGI took a maximum of ten seconds to run then there'd only be ten possible values, increasing the possibility of a clash?

      On another note, you may hit $$ problems if the same process wants multiple IDs, but by that point you'd be wise to look at something like Class::Singleton to guarantee a single-point ID manager.

Re: method of ID'ing
by tachyon (Chancellor) on Apr 13, 2002 at 19:37 UTC
    use MD5;
    my $MD5 = new MD5();
    $MD5->add( $$ . rand() . time() );
    my $id = $MD5->hexdigest;
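On current Perl installations the standalone MD5 module may not be available, so here is a minimal sketch of the same idea using the core Digest::MD5 module instead. The sub name md5_id is an assumption made for this example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hash the PID, a random number, and the current time
# into a 32-character hexadecimal string.
sub md5_id {
    return md5_hex( $$ . rand() . time() );
}

print md5_id(), "\n";
```

The hash only reshapes its input; any uniqueness still comes from the PID, the random number, and the clock.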

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Does MD5 have a one to one relationship between the plaintext and the cyphertext? In other words, is it impossible for two different strings to map to the same MD5 string? If not, you might be introducing a potential for collisions by using it.

        So they say. Check out the unofficial MD5 homepage here or read the full RFC1321

        cheers

        tachyon


        It has a one to (one of 2 to the 128th) possible values. Since the output domain of MD5 is limited to a 128 bit string, it is possible for more than one value to map to the same output value. It is a very small chance that two of the given inputs would ever map to the same string (unless there were a statistically significant percentage of 2^128 worth of entries) and even if there were, I don't believe this code is being used for something which is intended to be mission critical.
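To put a rough number on "very small chance", the usual birthday approximation can be used. The sketch below assumes the hash outputs behave like uniformly random values, and the sub name collision_estimate is made up for this example.

```perl
use strict;
use warnings;

# Birthday approximation: for $n random values drawn from 2**$bits
# possibilities, the chance of at least one collision is about
# n*(n-1) / (2 * 2**bits). Only meaningful while the result is far below 1.
sub collision_estimate {
    my ($n, $bits) = @_;
    return $n * ($n - 1) / (2 * 2**$bits);
}

# One billion IDs hashed into 128 bits.
printf "%.3g\n", collision_estimate(1e9, 128);
```

For a billion IDs the estimate is on the order of 1e-21, so collisions inside the hash are not the weak point here.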

        Another issue to consider with MD5 is that the input value needs to be fairly large if you're using it for 'important' purposes. Since MD5 processes its input in blocks of 512 bits, and pads anything shorter, it's important to make sure you have at least one full block, to retain computational protection.

        Hope that helped.

        -il cylic
Re: method of ID'ing
by gav^ (Curate) on Apr 13, 2002 at 21:12 UTC
Re: method of ID'ing
by Molt (Chaplain) on Apr 13, 2002 at 18:09 UTC

    It all depends on how quickly you're recycling numbers, I think. $$ will be reset at some point, this is true, but I seriously doubt it'll ever get reset and come back to its initial value within the one-second timeframe needed to stop this being a unique ID.

    I guess that if you want to be truly paranoid, you could look into how the better-coded hit counters work and use that kind of file handling to manage your ID. I think this should work in any realistic situation, though.

Re: method of ID'ing
by blakem (Monsignor) on Apr 14, 2002 at 12:57 UTC
    One thing not mentioned yet is that the amount of "uniqueness" contained in $$ drops significantly when you scale beyond a single webserver. If you have a group of load balanced webservers, you no longer have to roll-the-pid to get duplicate values of $$.

    Even if your whole project is running on a single machine today, a simple timestamp+pid identifier hampers the long term scalability of your site.

    -Blake

      webserver.

      If you're using Apache and have mod_unique_id, you can use $ENV{UNIQUE_ID}, which I like a lot.

      The link above links to the module documentation, a page that also has detailed information about IDing techniques.
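A minimal sketch of using that variable from a CGI script follows. The fallback to time plus PID when UNIQUE_ID is not set is an assumption of this example, as is the sub name request_id.

```perl
use strict;
use warnings;

# Prefer the token that Apache's mod_unique_id places in the environment.
# If it is missing (module not loaded, or not running under Apache),
# fall back to time + PID. The fallback is an assumption of this sketch.
sub request_id {
    return $ENV{UNIQUE_ID}
        if defined $ENV{UNIQUE_ID} && length $ENV{UNIQUE_ID};
    return time() . '-' . $$;
}

print request_id(), "\n";
```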


Re: method of ID'ing
by tmiklas (Hermit) on Apr 14, 2002 at 10:13 UTC
    I think it's OK if you have fewer than ~64k or ~32k requests per second (and of course you have to be able to answer all those requests really FAST) ;-) It depends on the system you use... the PID counter rolls over after some value (~32k or ~64k or other; you have to check it). I don't think that I'll ever have such traffic on my sites ;) so I use this method frequently.
    Hmmm... how about using Time::HiRes to increase precision of $^T (in reasonable cases of course)?!

    Greetz, Tom.

      Hmmm... how about using Time::HiRes to increase precision of $^T (in reasonable cases of course)?!

      $^T is set before Time::HiRes can be loaded, so it won't make a difference. $^T is not a magic variable that issues time, it is set when the interpreter starts (which can cause a lot of trouble when running under mod_perl, irssi, or any other long term perl embedder).

      Don't use $^T for IDing purposes, use time instead. To update existing scripts (but it might break some that depend on $^T to not change), you could use:

      package Tie::Time;
      use Carp;
      use strict;

      sub TIESCALAR { bless \my $dummy, shift }
      sub STORE     { croak 'Cannot set time this way' }
      sub FETCH     { time }

      =head1 NAME

      Tie::Time - Have a scalar return the current time()

      =head1 SYNOPSIS

          tie my $time, 'Tie::Time';   # New variable
          tie $^T, 'Tie::Time';        # Override existing $^T

      =head1 DESCRIPTION

      Guess :)

      =head1 URL

      http://perlmonks.org/?node_id=158912

      =cut
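As a usage sketch, the package can be inlined in a single file and tied to a fresh variable. The variable names $now and $id are made up for this example; only the new-variable form from the synopsis is exercised here.

```perl
use strict;
use warnings;

# Inline copy of the tied-scalar idea so the sketch runs as one file.
{
    package Tie::Time;
    use Carp;
    sub TIESCALAR { bless \my $dummy, shift }
    sub STORE     { croak 'Cannot set time this way' }
    sub FETCH     { return time }
}

tie my $now, 'Tie::Time';    # every read of $now calls time()

my $id = $now . '-' . $$;    # current time, not interpreter start time
print "$id\n";
```

Assigning to $now croaks, which guards against code that tries to set the time through the tied variable.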


        How about adding require Tie::Scalar and throwing this in the Code Catacombs? This is a nice, simple answer to retrieving the current time.

        ~Particle ;Þ

        As I see, your answers always get to the point... ;-)
        I've never used $^T for this task, always time(). Besides, for some time now I've been using the unique_id provided by Apache ;-) and since then have had almost nothing to worry about ;-).

        Greetz, Tom.
Re: method of ID'ing
by roboslug (Sexton) on Apr 15, 2002 at 04:28 UTC
    Time is an illusion. Lunchtime doubly So.
    - Douglas Adams

    Parham,

    I used a similar method on a network daemon and it worked very well until one day a "backup" time server was put into place that was not set to the right time. All of the nodes running the daemon migrated to this new time server because it more closely matched their own clocks (PST, not PDT), and the next thing you know, IDs were getting re-used and all hell broke loose.

    After examining time sync protocols, I also think there may be some error margin at startup, where the time fluctuates up and down as it adjusts to match the time server. I could be wrong here.

    So, my notes about unique IDs are as follows:

    * If you plan to use time(), use Time::HiRes instead. It provides more uniqueness and also seems to execute faster than time().

    * If you hash (e.g., with MD5), I would use SHA1 instead and remember to add buffer. I really see little reason to hash unless you prefer the string format of a hash. I avoid hashing when it isn't necessary due to the calculation time involved.

    * Add an internal increment...sorry, only way I could figure out how to deal with time "slipping". After $inc == MAXINC, reset so you don't get absurdly long numbers over time. Store the $inc to a file if you need to maintain persistence or allow other instances to grab it. Load $inc; $inc++; Save $inc. Remember to flock.

    * If you want to make it survive distributed systems (load balanced or whatever), attach a hostname, IP, or MAC address. A MAC address will protect you from "admin" mistakes.

    * Random is an ok thing to add to your string, but you shouldn't need it and since it is only "somewhat" random, doesn't help much more than time+PID+inc.

    * And/or if you really want to make sure nothing "bad" happens, store the ID and do a check. A quick way to do this is to make a file in /tmp or similar purpose area and do something like:

    do { [ generate ID code ] } while (-e "/tmp/$ID");
    [ create empty /tmp/$ID file ]
    Of course, this gets slow after thousands of IDs have been generated, so be sure to clean house in some fashion as well.
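Several of the notes above (high-resolution time, a locked persistent counter, and a hostname) can be combined in one sketch. This is only one possible arrangement; the sub names next_counter and make_id, the wrap-around value, and the counter file path in the usage comment are all assumptions made for this example.

```perl
use strict;
use warnings;
use Fcntl qw(:flock O_RDWR O_CREAT);
use Sys::Hostname qw(hostname);
use Time::HiRes ();

# Load $inc, increment it, and save it again, all under an exclusive lock.
# $max is the wrap-around point (hypothetical parameter).
sub next_counter {
    my ($file, $max) = @_;
    sysopen(my $fh, $file, O_RDWR | O_CREAT) or die "Cannot open $file: $!";
    flock($fh, LOCK_EX)                      or die "Cannot lock $file: $!";
    my $inc = <$fh>;
    $inc = 0 unless defined $inc && $inc =~ /^\d+$/;   # empty or damaged file
    $inc = ($inc + 1) % $max;
    seek($fh, 0, 0)  or die "Cannot seek $file: $!";
    truncate($fh, 0) or die "Cannot truncate $file: $!";
    print {$fh} $inc;
    close($fh)       or die "Cannot close $file: $!";  # closing releases the lock
    return $inc;
}

# hostname + high-resolution time + PID + persistent counter
sub make_id {
    my ($counter_file) = @_;
    return join '-', hostname(), Time::HiRes::time(), $$,
                     next_counter($counter_file, 1_000_000);
}

# Usage (path is a placeholder):
# print make_id('/tmp/myapp.counter'), "\n";
```

Because the counter survives in the file, two IDs from the same host stay distinct even if the clock steps backwards, as long as fewer than $max IDs are issued in the repeated interval.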

      * If you plan to use time(), use Time::HiRes instead. It provides more uniqueness and also seems to execute faster than time().

      Time::HiRes::time indeed provides more uniqueness, but it is not faster:

      Benchmark: running Time::HiRes::time, time, each for at least 1 CPU seconds...
      Time::HiRes::time:  2 wallclock secs ( 0.82 usr +  0.22 sys =  1.04 CPU) @ 1071260.58/s (n=1114111)
                   time:  0 wallclock secs ( 0.70 usr +  0.31 sys =  1.01 CPU) @ 1816838.61/s (n=1835007)
                             Rate Time::HiRes::time              time
      Time::HiRes::time 1071261/s                --              -41%
      time              1816839/s               70%                --
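A comparison of this kind can be reproduced with the core Benchmark module. The sketch below is an assumption about how such output might be generated, not the script that produced the figures above, and the numbers will differ by machine and operating system.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Time::HiRes ();

# Run each sub for at least 1 CPU second, then print a comparison table.
my $rows = cmpthese(-1, {
    'time'              => sub { my $t = time },
    'Time::HiRes::time' => sub { my $t = Time::HiRes::time() },
});
```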

      sorry, only way I could figure out how to deal with time "slipping". After $inc == MAXINC

      Try the modulo operator %. Example increments:

      ($counter += 1) %= 5;   # 0, 1, 2, 3, 4, 0, 1, 2..4, 0..4, 0..4, ...
      ($counter += 1) %= 256; # 0..255, 0..255, ...


        Actually, you and I are both correct. ;-) What I forgot was that the time I benchmarked it, it was under NT. I just did a benchmark on NT and Linux and got the following results:

        NT:
        Time::HiRes::time() - timethis 600000: 20 wallclock secs (19.99 usr + 0.00 sys= 19.99 CPU) @ 30015.01/s (n=600000)
        time() - timethis 600000: 67 wallclock secs (66.73 usr + 0.00 sys = 66.73 CPU) @ 8991.46/s (n=600000)

        Linux:
        Time::HiRes::time() - timethis 600000: 3 wallclock secs ( 1.22 usr + 0.24 sys = 1.46 CPU)
        time() - timethis 600000: 1 wallclock secs ( 0.27 usr + 0.17 sys = 0.44 CPU)

        Anyway, enough of the thread hijacking.

        The modulo operator is a great idea, good suggestion.
Re: method of ID'ing
by kappa (Chaplain) on Apr 16, 2002 at 15:03 UTC
    Just a little comment about hashes. I want to second roboslug and say that there's no point in using a cryptographically strong hash unless there's a chance your users will enter the ID themselves (think session ID in a URL). These functions are usually computationally expensive (often by design) and add nothing in terms of randomness (and therefore uniqueness) to your ID.

    But beware of users guessing your time()-based IDs in URLs (or even cookies).
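Where guessing is a concern, one option is to build the ID from the operating system's random source rather than from the clock. This is a sketch under the assumption of a Unix-like system that provides /dev/urandom; the sub name random_id is made up for this example.

```perl
use strict;
use warnings;

# Read $bytes random bytes from the kernel and return them hex-encoded.
# Assumes a Unix-like system that provides /dev/urandom.
sub random_id {
    my ($bytes) = @_;
    $bytes ||= 16;
    open(my $fh, '<:raw', '/dev/urandom')
        or die "Cannot open /dev/urandom: $!";
    my $got = read($fh, my $buf, $bytes);
    close $fh;
    die "Short read from /dev/urandom"
        unless defined $got && $got == $bytes;
    return unpack 'H*', $buf;
}

print random_id(), "\n";
```

Such an ID carries no timestamp or PID, so nothing in it can be predicted from the time of the request.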