Potential hash key ids limitations

by Anonymous Monk
on Jan 19, 2023 at 01:51 UTC ( #11149677=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
I wanted to double-check something I have noticed (maybe I am wrong, please bear with me). If I have a hash with keys that contain special characters like $ or | or ., is this a problem? I am working with protein sequence data that has IDs like the following:
>tr|Q8PV56|Q8PV56_METMA PEP-CTERM sorting domain-containing protein OS=Methanosarcina mazei (strain ATCC BAA-159 / DSM 3647 / Goe1 / Go1 / JCM 11833 / OCM 88) OX=192952 GN=MM_2118 PE=4 SV=1

Can all of the above be stored as a hash key (length-wise, I mean)? Will there be any issue if, e.g., I try to execute a command like if exists($hash{$key})?
Thank you for your guidance!

Replies are listed 'Best First'.
Re: Potential hash key ids limitations
by tybalt89 (Monsignor) on Jan 19, 2023 at 02:32 UTC

    Nope, no problem. I have hashes where the key is an entire perl program and they work just fine.

      > I have hashes where the key is an entire perl program and they work just fine.

      :)

      dear tybalt89, you are really a fun monk to follow: can you open your treasure chest a bit more and show us how and why you have an entire perl program as a hash key? Golf obfuscations or regexes as keys, with some paragraphs explaining them as values? Just a wild guess..

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Potential hash key ids limitations
by hv (Prior) on Jan 19, 2023 at 02:34 UTC

    Everything you suggest should work fine. Generally, a hash key is just a string, and can be anything that a string can be - it can even have null bytes in it. I don't remember right now, but it is possible that a 32-bit build of perl may be limited to strings of length 2^32-1 and therefore to hash keys of that length. A 64-bit build should in any case be able to support strings up to length 2^64-1 (if you have that much memory).
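A minimal sketch of the point above, using the header line from the question (with the forum's line-wrap markers removed) as a key. The hash name %seen and the extra sample keys are made up for illustration.

```perl
use strict;
use warnings;

# Header from the question; single quotes keep $, @ and | literal.
my $id = '>tr|Q8PV56|Q8PV56_METMA PEP-CTERM sorting domain-containing protein OS=Methanosarcina mazei (strain ATCC BAA-159 / DSM 3647 / Goe1 / Go1 / JCM 11833 / OCM 88) OX=192952 GN=MM_2118 PE=4 SV=1';

my %seen;
$seen{$id}        = 1;
$seen{"a\0b"}     = 'key with an embedded null byte';
$seen{'price$|.'} = 'key made of punctuation';

print "found header\n"        if exists $seen{$id};
print "found null-byte key\n" if exists $seen{"a\0b"};
print "no such key\n"         unless exists $seen{'a'};
```

The lookup compares the whole string byte for byte, so a key that differs by a single character (or stops at the null byte) is simply a different key.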

    If you provide the key as a literal in your perl code, you will need to take the same care to quote it correctly as you would using it as a string.
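As a rough illustration of the quoting point, the sketch below stores four literal keys. The variable $x and the key names are invented for the example; the only thing it relies on is Perl's ordinary string-quoting rules.

```perl
use strict;
use warnings;

my %h;
my $x = 'VALUE';

$h{'cost$x'}    = 1;   # single quotes: the key is literally cost$x
$h{"cost$x"}    = 2;   # double quotes: $x interpolates, the key is costVALUE
$h{plain_word}  = 3;   # a bare identifier inside {} is quoted automatically
$h{'tr|Q8PV56'} = 4;   # | is not an identifier character, so quotes are needed

print "$_\n" for sort keys %h;
```

When the key arrives in a variable, as in exists($hash{$key}), none of this applies: the string is used exactly as it is.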

    If you provide something that isn't a string then it will be coerced to a string to make the key, the same as if you were printing it. If it was an object, that means you won't be able to get back the object reference from looking at the key; if it was a number, you won't necessarily get the identical number back, at least not to the same precision.
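A small sketch of that coercion, assuming an ordinary perl build with default number formatting. The variable names are made up for the example.

```perl
use strict;
use warnings;

# A reference used as a key is stored as its string form, e.g. ARRAY(0x...).
my %by_ref;
my $aref = [1, 2, 3];
$by_ref{$aref} = 'array ref';
my ($ref_key) = keys %by_ref;
print ref($ref_key) ? "still a reference\n" : "plain string: $ref_key\n";

# A number used as a key is stored as its printed form.
my %by_num;
my $num = 1.50;
$by_num{$num} = 'one and a half';
my ($num_key) = keys %by_num;
print "numeric key stored as '$num_key'\n";

# The printed form may carry fewer digits than the number itself.
my $third = 1 / 3;
my %by_third = ($third => 'third');
my ($third_key) = keys %by_third;
print "round trip ", ($third_key == $third ? "kept" : "lost"), " precision\n";
```

If the original object or number is needed later, one option is to keep it in the value rather than trying to recover it from the key.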

Re: Potential hash key ids limitations
by LanX (Sage) on Jan 19, 2023 at 04:01 UTC
    Short answer: any kind of string and string length can be a key. °

    > (length-wise I mean).

    Of course you can hit the memory limits and trigger the OS to do page swapping to the hard disc, which will slow down things considerably.

    But that's always the case for big data and not particular to the key's length.

    So yes, key lengths in the order of GBs will expose limitations, but so will simple strings too.

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

    °) IIRC: the way keys are internally stored is actually a bit complicated in order to make it memory efficient. If you reuse a very long key in multiple hashes, you'll notice that the corresponding string is only stored once globally and all equivalent hash-keys will point to that string.

      In case of hash keys in the GB range, i would also avoid anything like

      foreach my $key (sort keys %myhash) {

      But yeah, otherwise Perl just doesn't care what you put in hash keys (or any other kind of scalars). Last time i had to find duplicate thumbnails, i just used the png files as keys. You can't go much more "random binary junk" than compressed image data...
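A sketch of how duplicate detection with content-as-key might look. The helper find_duplicates and the fake in-memory "files" are assumptions for illustration, not the poster's actual code; real use would read each file in binary mode first.

```perl
use strict;
use warnings;

# Hypothetical helper: group names by identical content, using the raw
# bytes themselves as the hash key.
sub find_duplicates {
    my (%content_of) = @_;
    my %names_for;
    # Sorting the short names is cheap; the large content keys are never sorted.
    for my $name (sort keys %content_of) {
        push @{ $names_for{ $content_of{$name} } }, $name;
    }
    return grep { @$_ > 1 } values %names_for;
}

# Made-up binary blobs standing in for file contents.
my %blobs = (
    'a.png' => "\x89PNG\r\n\x1a\n\0\0\0\rIHDR\xff\xfe",
    'b.png' => "\x89PNG\r\n\x1a\n\0\0\0\rIHDR\xff\xfe",
    'c.png' => "\x89PNG\r\n\x1a\n\0\0\0\rIHDR\x01\x02",
);

my @dupes = find_duplicates(%blobs);
print "duplicates: @$_\n" for @dupes;
```

For very large files, keying on a digest of the content instead of the content itself would keep the keys short, at the cost of an extra hashing step.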

      PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Re: Potential hash key ids limitations
by erix (Prior) on Jan 19, 2023 at 14:59 UTC

    >tr|Q8PV56|Q8PV56_METMA PEP-CTERM sorting domain-containing protein OS=Methanosarcina mazei (strain ATCC BAA-159 / DSM 3647 / Goe1 / Go1 / JCM 11833 / OCM 88) OX=192952 GN=MM_2118 PE=4 SV=1

    You know the Q8PV56 part is already a unique UniProt identifier, right? UniProt calls it an 'Accession Number'. There would seem to be no need to use the whole line as a key.

    Here is the info in case you want to excise the accession number with a regular expression: accession_numbers

    Multiple isoforms can also exist; they have similar accession numbers, but suffixed with a dash and an integer, like so: P68250-3
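One way the accession might be excised, as a sketch. The helper name accession_of is hypothetical, and the pattern is deliberately looser than UniProt's published accession format: it only assumes a header of the form >db|ACCESSION|ENTRY_NAME with db being sp or tr, plus an optional dash-and-integer isoform suffix.

```perl
use strict;
use warnings;

# Hypothetical helper: return the accession (with any isoform suffix) from a
# UniProt-style FASTA header, or undef if the line does not look like one.
sub accession_of {
    my ($header) = @_;
    return $header =~ /^>?(?:sp|tr)\|([A-Z0-9]+(?:-\d+)?)\|/ ? $1 : undef;
}

my $header = '>tr|Q8PV56|Q8PV56_METMA PEP-CTERM sorting domain-containing protein';
my %protein;
my $acc = accession_of($header);
$protein{$acc} = $header if defined $acc;

print "stored under $acc\n" if defined $acc && exists $protein{$acc};
```

Using the short accession as the key and keeping the full header as the value preserves all the information while making lookups easier to type and debug.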

      «… already a unique UniProt identifier…»

      No joke, in fact. You always experience real surprises here - unbelievable.

      «The Crux of the Biscuit is the Apostrophe»

Node Type: perlquestion [id://11149677]
Approved by Athanasius
Front-paged by Corion