http://www.perlmonks.org?node_id=940997

wrog has asked for the wisdom of the Perl Monks concerning the following question:

I.e., not a process ID, or a thread ID, but an Interpreter ID.

As I understand it (from my reading of perlguts/etc), whenever a new interpreter is created or cloned, there's a whole structure created that contains interpreter-specific stuff, and it seems to me there ought to be something in there that is unique to that interpreter instance and easy to get at....

... and perhaps even accessible from pure Perl (is there a magic variable I've overlooked?)

To be sure, I imagine I could write an XS function that, say, takes the address of PL_modglobal or Perl_get_context(), returns it as an integer and use that to distinguish my interpreters, but

  1. I'm not that well versed in XS and would prefer to stay in pure Perl if I possibly can.
  2. I'd like this to be as OS-independent as possible.
  3. Would taking those addresses even work?
    I.e., if the garbage collector is allowed to arbitrarily relocate interpreter structures, then that approach is doomed. (... granted, I would have thought interpreter relocation would be Very Hard to do given what (little) I know of Perl's current architecture, but maybe not; we'll skip the Small Matter of how this all changes in Perl 6...)
  4. Or, if taking these addresses would work, and there is no pure Perl way to do this, any idea which would be better to do? (i.e., thoughts about &PL_modglobal vs. &Perl_get_context vs. Something Else?)

. . . . .

For the curious, here's the problem I'm actually trying to solve, for which being able to get at an Interpreter ID seems the obvious, easy answer at the moment:

Consider a script that is running possibly in multiple processes and/or multiple threads at the same time (think mod_perl on MPM-Worker, but I don't want to limit myself to that particular situation).

The script needs to contain a counter method, e.g.,

    my @values = (0, $$, time(), other_initialization_stuff() );
    sub counter {
        ++$values[0];
        return @values;
    }

the idea being that every invocation of this method must return a distinct result, no matter which process or thread the invocation occurs on. (Why do I need this? Cryptography.)

It's also important to be able to do this without having to consult any kind of shared cache.

So what's the minimal data other_initialization_stuff() can be returning so that this is actually the case? Bonus points if it's something that's guaranteed to work on all OS's (though I'm happy enough if I can cover Linux and Win32).

Including $$ distinguishes different processes (**) (***); so it's really about how to deal with multiple interpreters in the same process.

Note that Thread ID does not help us because in some cases (e.g., mod_perl) there is not a fixed correspondence between threads and interpreters -- we cannot rule out multiple interpreters getting (consecutively) loaded into the same thread to run the counter initialization code.

My theory in this is that there will be a one-to-one correspondence between interpreter IDs and the distinct instances of @values within a given process, which is what I want.

  • (*) ignoring wraparound issues, which I can at least watch out for.
  • (**) yes, I'm aware process IDs eventually get re-used, that's why time() is in there.
  • (***) yes, I know Win32 Perl creates fake process IDs for $$ so that fork() "works"; not a problem here.
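The counter described above could be sketched as a closure, so that each instance of @values is private to the code that created it. This is only an illustration: `interpreter_id()` and `make_counter()` are hypothetical names, and `interpreter_id()` is a placeholder that returns 0 here, standing in for whatever per-interpreter value ends up being used.

```perl
use strict;
use warnings;

# Hypothetical placeholder: stands in for whatever per-interpreter
# value is eventually chosen (this is not a real Perl built-in).
sub interpreter_id { return 0 }

# Build a counter whose results stay distinct as long as the tuple
# (pid, start time, interpreter id) is distinct per @values instance.
sub make_counter {
    my @values = ( 0, $$, time(), interpreter_id() );
    return sub {
        ++$values[0];
        return @values;
    };
}

my $counter = make_counter();
my @first   = $counter->();
my @second  = $counter->();
print "@first\n@second\n";
```

Successive calls differ only in the first element; everything after it is fixed at initialization time, which is why the choice of that fixed part matters.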

Re: How do I get a unique Perl Interpreter ID?
by ikegami (Patriarch) on Dec 01, 2011 at 04:40 UTC
    use strict;
    use warnings;
    use feature qw( say );

    use Inline C => <<'__EOI__';

    IV get_interpreter_id() {
        return (IV)PERL_GET_THX;
    }

    __EOI__

    say get_interpreter_id();

    >perl a.pl
    3092452

    (Also perl -MInline=FORCE,NOISY,NOCLEAN a.pl)

    The XS is simply

    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    MODULE = Foo  PACKAGE = Foo

    IV
    get_interpreter_id()
        CODE:
            RETVAL = (IV)PERL_GET_THX;
        OUTPUT:
            RETVAL

      okay, this, with some minor changes, now exists as Thread::IID:
      • (UV) instead of (IV).
      • left off the 'get_'
      • returns PERL_GET_THX>>11 because PerlInterpreter structures are Just Huge (2800 bytes in my world) and this reduces the returned numbers somewhat.
      Enjoy. And thanks.
Re: How do I get a unique Perl Interpreter ID?
by BrowserUk (Patriarch) on Dec 01, 2011 at 03:38 UTC
    prefer to stay in pure Perl if I possibly can.

    AFAIK, outside of going into XS to create your own interpreter; thread == interpreter.

    So, if you're staying in pure Perl -- and therefore not creating any Interpreters yourself explicitly; or implicitly other than via threads (& Windows fork emulation) -- then the combination of pid & tid should be unique.

    Other than the possibility where the 32-bit tid wraps around, but if you're creating that many threads in a single process, then you could probably use the week number or maybe even month instead of the pid :)
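A minimal pure-Perl sketch of the pid-plus-tid combination suggested in this reply. The helper name `pid_tid_id` is made up for illustration. It only calls `threads->tid` when threads.pm has already been loaded and otherwise assumes tid 0, the main thread's id. It does not help for interpreters created without the threads module.

```perl
use strict;
use warnings;

# Return "pid-tid" for the current interpreter. threads->tid is only
# called if threads.pm is loaded; otherwise tid 0 is assumed.
sub pid_tid_id {
    my $tid = $INC{'threads.pm'} ? threads->tid() : 0;
    return join '-', $$, $tid;
}

print pid_tid_id(), "\n";
```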


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      As OP mentioned mod_perl under worker MPM I doubt that tid will work. Mod_perl creates its interpreters using perl_clone, without help of threads. Can't test at the moment though.
Re: How do I get a unique Perl Interpreter ID?
by BrowserUk (Patriarch) on Dec 02, 2011 at 07:59 UTC

    As zwon pointed out that mod_perl may create interpreters that do not have a tid, I had another thought.

    The address of any of perl's readonly built-in variables that are cloned for each interpreter should be unique to that interpreter.

    Here I've used $$:

    c:>perl -Mthreads -E"async{ say \$$; sleep 1e6 }->detach for 1 .. 10"
    SCALAR(0x3c379b0)
    SCALAR(0x3cbf8f0)
    SCALAR(0x3d3fa60)
    SCALAR(0x3dcd7d0)
    SCALAR(0x6e5b240)
    SCALAR(0x6f017e0)
    SCALAR(0x6f68a00)
    SCALAR(0x6ff5820)
    SCALAR(0x707a4f0)

    At least as long as the previous threads are still running. And if a previous thread terminates and perchance a new thread happens to reuse the exact same address for the same variable in a new interpreter, that probably doesn't matter right?
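To get that address as a plain number rather than parsing the SCALAR(0x...) string, `refaddr` from the core Scalar::Util module can be used. This is a sketch; the helper name `pid_var_address` is made up, and the assumption that the address differs between live interpreters in one process is the one made in this reply, not something the code verifies.

```perl
use strict;
use warnings;
use Scalar::Util qw( refaddr );

# Numeric address of this interpreter's $$ scalar. Assumed (per the
# suggestion above) to differ between interpreters that are alive at
# the same time in one process.
sub pid_var_address { return refaddr( \$$ ) }

printf "0x%x\n", pid_var_address();
```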



      ooooo.

      I'm really liking ikegami's XS code, which is far more simple than I expected, but this looks like a potential winner w.r.t. the original question.

      Still, I'm wondering what it is we're actually getting when we do \$$: some experimenting shows that it is not at a fixed offset from PERL_GET_THX, which suggests to me that there's a separate allocation to create the reference and we're not actually getting the address of the variable itself. (At which point I'd worry about the reference being gc'ed and the address getting reused in another thread. Then again, I suppose one could just make a point of holding onto the reference... hmm... I wonder what pack "p"... will do with this).

      I also think one would want to pick something other than $$ which, being a thread-independent constant, has no reason not to be shared across threads even if the current implementation is not doing that for whatever reason. But it's not like there aren't a whole mess of other things to choose from.

      And if a previous thread terminates and perchance a new thread happens to reuse the exact same address for the same variable in a new interpreter, that probably doesn't matter right?
      This is the same problem as process ID getting reused. I think as long as I've got time() in there we're okay (.. and I believe it's bullet-proof if I put in a sleep(1) between initializing the counter and creating the sub ...hmmm... and now we may have an argument for using microtime...)
        ...hmmm... and now we may have an argument for using microtime...

        If you've moved away from staying pure Perl, for a source of an unguessable counter, I'd use the Time Stamp Counter.

        Given that this changes by anything ranging from 1/2 a million to 10s of millions between successive calls in a tight loop, the odds of collisions even if you have 16 concurrent cores are negligible:

        #! perl -slw
        use strict;
        use Inline C => Config => BUILD_NOISY => 1;
        use Inline C => <<'END_C', NAME => 'rdtsc', CLEAN_AFTER_BUILD => 0;

        SV *rdtsc() {
            return newSVuv( (UV)__rdtsc() );
        }

        END_C

        my( $last, $this ) = ( 0, 0 );
        print( $this = rdtsc(), ' ', $this - $last ), $last = $this for 1 .. +20;

        __END__
        C:\test>rdtsc
        95054001389914 95054001389914
        95054002276396 886482
        95054003052862 776466
        95054004698944 1646082
        95054006658865 1959921
        95054008588537 1929672
        95054010420586 1832049
        95054012410180 1989594
        95054014268572 1858392
        95054016253981 1985409
        95054018070946 1816965
        95054050803874 32732928
        95054051382070 578196
        95054053061884 1679814
        95054054901610 1839726
        95054056870252 1968642
        95054058682825 1812573
        95054060659909 1977084
        95054062399501 1739592
        95054064304702 1905201

        Indeed, used alone with some suitable modulus operations, it would form the basis of a pretty damn good cryptographic rand() all by itself.

        Even if the bad guys had an identical system -- hardware and software -- it would be impossible to predict the next number coming from it. It is affected by every single thing that happens on the system -- interrupts from your nic; mouse movements; thermal loading; every piece of software running in the systems.

        Even if you put two totally identical systems side by side and synchronised them, I bet they would not stay in step for more than a few milliseconds.



Re: How do I get a unique Perl Interpreter ID?
by TJPride (Pilgrim) on Dec 01, 2011 at 07:29 UTC
    I still don't understand what the underlying requirement is. Are you just trying to generate unique values as a guid or seed for a randomizer? If so, I would think it would be sufficient to just use microtime and rand().
      Are you just trying to generate unique values as a guid or seed for a randomizer?

      Mainly guids.

      I would think it would be sufficient to just use microtime and rand().

      microtime can get screwed by a really fast machine.

      As I understand it, mod_perl clones all of its interpreters in one swell foop and if one is doing things the way they recommend by putting all of the nasty module loading and initialization into the master process so that the clone operation has almost nothing left to do beyond creating a bunch of entries in a page table with copy-on-write tags, it's not hard to imagine it eventually being possible for multiple interpreters to get cloned in the same microsecond.

      and if the seeding for rand() is likewise done in the master process so that all interpreters are proceeding from the same seed, that's also going to be a lose.

      As it happens I do need to seed a random number generator as well. /dev/urandom mostly takes care of that, however if you pull too much from there, then you reduce the entropy available to other processes making them less secure. So if we can get our distinct interpreter IDs from a different source, so much the better. There's also the small matter that /dev/urandom is somewhat broken on older versions of Linux, in which case having extra known-to-be-different junk to throw into the pot will be better than nothing.
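One way to picture "throwing extra junk into the pot": read a small, fixed number of bytes from /dev/urandom and hash them together with whatever known-to-be-different values are available. This is a sketch under assumptions: it needs a Unix-like system with /dev/urandom, the helper names `urandom_bytes` and `seed_material` are made up, and SHA-256 via the core Digest::SHA module is just one possible mixing function.

```perl
use strict;
use warnings;
use Digest::SHA qw( sha256_hex );

# Read exactly $count bytes from /dev/urandom using sysread, so that no
# more than requested is drawn (assumes the device exists).
sub urandom_bytes {
    my ($count) = @_;
    open my $fh, '<', '/dev/urandom' or die "Cannot open /dev/urandom: $!";
    binmode $fh;
    my $buf = '';
    while ( length($buf) < $count ) {
        my $got = sysread $fh, $buf, $count - length($buf), length($buf);
        die "Read from /dev/urandom failed: $!" unless $got;
    }
    close $fh;
    return $buf;
}

# Mix the random bytes with extra per-instance values into one seed.
sub seed_material {
    my (@extra) = @_;
    return sha256_hex( urandom_bytes(16), join( ',', @extra ) );
}

print seed_material( $$, time() ), "\n";
```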

      rand(), by the way, is completely inadequate for generating random numbers, at least not if you want to be secure about it (far too predictable...).

        Why do you think some interpreter ID (if such a thing did exist) would be fundamentally better from a randomness/predictability perspective than time+rand, or something similar?

        Do you actually need cryptographically secure randomness, or simply distinctness (as one might have inferred from your OP)? A simple counter (like a sequence in a DB) would be producing distinct values, although they're essentially 100% predictable. And as time is like a counter - automatically incremented externally of your program - it doesn't seem like a bad choice in case distinctness is all that you need (and if you're worried about being returned the same microsecond, just wait until it has advanced...)
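The "just wait until it has advanced" idea can be sketched with the core Time::HiRes module. The name `next_stamp` is made up for illustration. Note the limits: this guarantees strictly increasing values only within a single interpreter, it says nothing about two interpreters reading the clock in the same microsecond, and it would spin if the system clock were set backwards.

```perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday );

my $last_stamp = 0;

# Return microseconds since the epoch, spinning until the value is
# strictly greater than the last one handed out by this interpreter.
sub next_stamp {
    my $stamp;
    do {
        my ( $sec, $usec ) = gettimeofday();
        $stamp = $sec * 1_000_000 + $usec;
    } while ( $stamp <= $last_stamp );
    $last_stamp = $stamp;
    return $stamp;
}

my @stamps = map { next_stamp() } 1 .. 5;
print "$_\n" for @stamps;
```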

        clone operation has almost nothing left to do beyond creating a bunch of entries in a page table with copy-on-write tags, it's not hard to imagine it eventually being possible for multiple interpreters to get cloned in the same microsecond.

        Have you actually tried timing it? Only the kernel has access to the page table, and even if perl could mark pages as copy-on-write it wouldn't help, as it has to change addresses, so there is no COW when cloning interpreters. Perl has to duly copy all data from the original interpreter to the new one, and the more modules are loaded, the more data it has to copy and the more time it takes.

        Sounds like you're overthinking the problem. Add in the pid or whatever if you're that worried: pids can be reused, but you're not going to get the same pid and the same microtime unless the process takes less than 1 microsecond to run, which I highly doubt. Short of that, figure out a way to sample ambient sound. Or you could always hit yourself in the face until your IQ lowers to a point where you aren't worrying about this any more.
Re: How do I get a unique Perl Interpreter ID?
by tobyink (Canon) on Dec 01, 2011 at 20:44 UTC
    If you just need a unique identifier for each execution, wouldn't something like Data::UUID work? (I've heard Data::UUID itself plays havoc with multi-threaded scripts, but something along the lines of Data::UUID.)

      I peeked at Data::UUID. I see static variables within functions and I haven't yet spotted any evidence that they're doing any locking to make sure only one thread at a time is calling the functions in question => Big Multithreading Fail (which is fine if one doesn't have to play in that world, but I do, so...)

      At this point the question then becomes how you build something like Data::UUID that works in the multithreaded realm. As far as I can tell Data::UUID is relying entirely on clock gymnastics, in which case it's no wonder if they punted on multithreadedness.

      Also while universality is nice, I think I'd prefer to not have the IDs longer than they need to be. Something that'd be unique across a single server farm over some fixed interval of time like 100 days is good enough for my purposes. Granted, I'm also not sure how much effort I want to expend on optimizing that particular piece of the puzzle...

        I'd prefer to not have the IDs longer than they need to be. Something that'd be unique across a single server farm over some fixed interval of time like 100 days is good enough for my purposes.

        When I need unique IDs, I usually have an RDBMS running somewhere that has solved all of the nasty race and locking problems. What if you would simply (ab)use an RDBMS, create a sequence there, and get the sequence's nextval whenever you need an ID?

        Or, if you work across several RDBMS, you could concat a server ID (IP address or hostname, if you have no better idea) and a sequence number into a locally unique ID.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
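The suggestion above of concatenating a server ID and a sequence number might look like the following sketch. The function name `local_unique_id` is made up, and `$next_seq` is only a local stand-in for the database sequence: in a real setup the number would come from the RDBMS's nextval, which is what makes it safe across processes and interpreters.

```perl
use strict;
use warnings;
use Sys::Hostname qw( hostname );

# Local stand-in for a database sequence's nextval (assumption: a real
# setup would fetch this number from the RDBMS instead).
my $next_seq = 0;

# Join a server ID and the next sequence number into one ID string.
sub local_unique_id {
    my ($server_id) = @_;
    return sprintf '%s-%d', $server_id, ++$next_seq;
}

my $server = hostname();
print local_unique_id($server), "\n";
print local_unique_id($server), "\n";
```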