Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Password generator using a linguistic rule base

by ginseng (Pilgrim)
on Aug 01, 2001 at 01:58 UTC ( #101324=CUFP: print w/ replies, xml ) Need Help??

We all know the right way to select a password, right? Mix upper and lower case, pick random letters, mix in some numbers and punctuation, and try to remember the final result for longer than 30 seconds. If you assign passwords like this to users, it's a matter of minutes before the religious radio station emails you with a request that his/her password be changed to "Jesus".

So I tried to grok the problem and come up with a solution. Why not use random passwords that are actually based on real language? I mean, people are used to being able to pronounce things, more or less, and certain patterns are more comfortable to remember. It's like phone numbers. String 8562621937 together, and it's tough to remember. Format it as 856-262-1937 and most people can recall it without too much trouble. (The area code is South Jersey, but I don't know whose number that is...I wouldn't suggest you dial it ;)

The algorithm I proposed to some security-minded friends is as follows:

  • Parse a dictionary file (say, /usr/share/dict/words) into consonant-vowel patterns, and weigh each pattern by the number of times it appears in the dictionary.
  • Throw away the less frequently used combinations, and any combinations too short or too long to be used as a password.
  • While we're at it with the dictionary file, find out how frequently each letter is used, too.
  • Randomly select a pattern from the dictionary parse.
  • Randomly replace consonant-placeholders with consonants, and vowel placeholders with vowels. However, weight the selection of each letter based on the frequency with which it appeared in the dictionary.
  • Randomly pick a position to capitalize. Do so, but only if using the capital letter won't confuse things. (i.e. no capital 'O's and zeroes in the same password.)
  • Tack on a couple of numbers or punctuation marks.
They decided it was secure enough...not as good as random, much better than most. (I haven't calculated the number of possible outputs from this program. It's a lot.)

No, it's not as secure as randomly picking all characters in any order. But it is a lot easier to remember the result, since you can usually pronounce it. Often it sounds like a real word. (Sometimes this is a problem...users probably don't want a password that has "feces" in it ;)

Here are some of the passwords this program generated for me:

  • 9Vapadvi0
  • 8meleRe$
  • 2dudeRanl!
  • ?neroteTep!
  • 5revU?
  • 5teEcon$
  • 5Llanse6
  • !hAer0
  • 2rutRoar7
  • 8Tirceneri6
  • 6rescAdoler@
  • 0tinenEuh2
  • ?lavapBi%
  • @socozEh0
  • 8edRetnan8
  • 8Tenporor6
  • %tunrohcero%
  • $fepegRe0

Update 2 aug '01: Forgot my -w on the shebang line. Mea culpa. Applied said -w and was pleased there was nothing else to fix.

#!/usr/bin/perl -w use strict; # some constants useful for changing the configuration use constant MIN_LENGTH => 6; use constant MAX_LENGTH => 12; use constant MIN_SAMPLES => 750; # min samples is the minimum number o +f times # a vowel-consonant pattern appears i +n the dictionary sub parsedict { # this sub parses a dictionary file (specified at the command line # into a series of vowel-consonant patterns, weighted by the number # of times each pattern appears in the dictionary. It writes the # hash of patterns and weights to a file called "lingua". It also # tracks the frequency of use for each letter of the alphabet, and # stores that information in a file called "letters". my @consonants = split //, 'bcdfghjklmnpqrstvwxz'; my @vowels = split //, 'aeiouy'; my (%letters, %letterdist, %result, %stats); foreach (@consonants) { $letters{$_} = "c"; } foreach (@vowels) { $letters{$_} = "v"; } while (<>) { chomp; my @chars = split //, lc($_); my $mapped; foreach (@chars) { $mapped .= $letters{$_}; $letterdist{$_}++; } $result{$mapped}++; $stats{"words"}++; } open LINGUA, ">lingua"; foreach (sort { $result{$b} <=> $result{$a} } keys %result) { (length($_) >= MIN_LENGTH - 2 && length($_) <= MAX_LENGTH - 2 && $ +result{$_} >= MIN_SAMPLES) and do { print LINGUA "$_\t$result{$_}\n"; $stats{"patterns"}++; } } close LINGUA; open LETTERS, ">letters"; foreach (sort { $letterdist{$b} <=> $letterdist{$a} } keys %letterdi +st) { print LETTERS "$_\t$letterdist{$_}\n"; } close LETTERS; return "Parsed $stats{'words'} words into $stats{'patterns'} pattern +s within criteria.\n"; } sub genpass { # this sub chooses a pattern at random from the lingua file, and exc +hanges # 'c's and 'v's in the pattern with consonants and vowels, respectiv +ely, # based on a random letter selection weighted by the frequency of ea +ch # letter in the dictionary file. # first, choose a pattern from the lingua file srand; # not strictly necessary as current versions of perl do this +automatically my @pattern; open (LINGUA, "<lingua") or die "Could not open lingua file: $!"; rand($.) < 1 && (@pattern = (split /\t/)) while (<LINGUA>); close LINGUA; # second, parse the letters file and build a hash of letters and wei +ghts my (%cons, %vowels, $constotal, $voweltotal); open (LETTERS, "letters") or die "Could not open letter file: $!"; while (<LETTERS>) { chomp; my ($key, $value) = split /\t/; if ($key =~ /[aeiouy]/) { $voweltotal += $value; $vowels{$key} = $ +voweltotal; } else { $constotal += $value; $cons{$key} = $ +constotal; } } # build a couple of routines for randomly selecting vowels and conso +nants # these two routines could be combined into one, but i was too lazy +to do it # the most elegant it's like this. my $randomvowel = sub { my $index = rand($voweltotal); my $choice; foreach (sort { $vowels{$b} <=> $vowels{$a} +} keys %vowels) { $choice = $_; if ($vowels{$_} < $index) { last; } } return $choice; }; my $randomcons = sub { my $index = rand($constotal); my $choice; foreach (sort { $cons{$b} <=> $cons{$a} } ke +ys %cons) { $choice = $_; if ($cons{$_} < $index) { last; } } return $choice; }; # here's where we actually map random characters into the pattern my @tomap; my @orig = split //, $pattern[0]; foreach (@orig) { push @tomap, ($_ eq 'c') ? &$randomcons : &$random +vowel; } # good passwords will have at least one letter capitalized. choose o +ne here. # note that not all letters are given capital equivalents, making it + easier # to identify "confusing" letters. There are no capital O's, only ze +ros, # for example. my @case = split //, 'ABCDEFGHiJKLMNoPQRSTUVWXYZ'; my $ucpos = int (rand(@orig)); $tomap[$ucpos] = $case[ord($tomap[$ucpos]) - 97]; # good passwords also use some non-alpha characters, interspersed. t +his # algorithm tacks one on the front, and one on the back of the passw +ord # it just generated. not the most secure way to do it, but better th +an # not doing it (and still easy for the user to work with.) my @puncs = split //, '!?@#$%&0123456789'; my $mapped = $puncs[rand(@puncs)] . (join '', @tomap) . $puncs[rand( +@puncs)]; # finally, return the generated password. return $mapped . "\n"; } # simple enough main... # if an argument is given, parse it as the dictionary. if not, # generate a password. print @ARGV ? parsedict : genpass;

Replies are listed 'Best First'.
Re: Password generator using a linguistic rule base
by astaines (Curate) on Aug 02, 2001 at 01:40 UTC


    Very cute.

    There are some nice extensions to this idea, at least one of which gives better passwords using first, second and third order approximations to English. See an especially nice example at Tom Van Vleck's site.

    A nice feature of this type of program is that you can develop memorable, but hard to crack passwords for languages besides English, by simply stuffing in an appropriate word list.

    Thanks again

    -- Anthony Staines
      That was the intention, though I didn't comment on it, due to the usual US-centric egotism ;) and the fact that the vowel & consonant arrays aren't broken out to be easily configured. But internationalization was on my mind when I wrote it. (In fact, most of my passwords are based on Hangeul - Korean - words, since that language and culture is of interest to me.)

      I've briefly reviewed the third-order code on Tom Van Vleck's site - it's definitely an improvement. If I had bothered to research this at all before implementing it, I probably would have either integrated that idea, or decided the problem had already been solved and was becoming more complicated than I felt like dealing with ;)

      Actually, while I did not know whether it had been done before, I was not so bold as to suppose noone had ever had the idea. (I should have mentioned that in my post. "While I dreamed this program up without knowingly copying someone else's idea, it may have been done before, somewhere.") I just didn't research it at all. (Possibly I was afraid I'd find that some monopoly had a current patent on the method, and I'd be opening myself up to a lawsuit or worse, a DMCA violation and imprisonment, by posting it here ;)

      Thanks for the link. Maybe some day I'll modify the program to pay attention to frequent neighbors for each letter. Definitely a good idea.

Re: Password generator using a linguistic rule base
by bastard (Hermit) on Aug 02, 2001 at 02:53 UTC
    Congratulations, definitely a step above for the generation of passwords that are easy to remember, but hard to guess.

    One thing to keep in mind. While it is hard for a person to guess, the same concept can be used to narrow the initial selection set of passwords needed to guess. Implemented in a program (a simple modification of your code) it would have a much better success ratio of guessing the passwords. In the end though they are definitely tougher to guess than standard dictionary words, so it is definitely an improvement.

    I used to pick passwords of random characters that felt natural to type. Some combinations of characters are just awkward to enter quickly. Knowing that a password cracker to begin its search based on netural key combinations. This would likely shorten the search time.

      Knowing that letter patterns and character distribution are based on language should make the resultant passwords a little easier to guess. Obviously, knowing that a host under attack requires passwords created with this code would narrow down the universe of possible combinations. But how much would that help? I'm not good at probabilities, but it appears to me that a program guessing passwords generated by code like this would still have a huge job ahead of it. Maybe this is the wrong forum to ask for help with such analysis. I think it's a good algorithm.

      Ultimately, I am hoping for a routine where an attacker, knowing that the passwords were created with this code, would not get significant advantage from that knowledge. It's probably not there, yet, and might not get there if I don't bone up on probability.

      Thanks for your positive comments.

        Don't get me wrong, I think this is a great tool. I was just mentioning the pitfalls of something that excludes a set of passwords for the available selection set. (like my choosing of passwords that are comfortable to type quickly) When taken in the context of the entire net it will still be generating passwords that are probably an order of mangitude (or more), more difficult to crack than the average password out there. (on the other hand it is also probably an order of magnitude or more easier to guess than a truly random password).

        The only people who would really be able to take advantage of such a technique are those with some level of cryptanalytic ability. Who know a thing or two about character frequencies and the human element. Heck real cryptanalysts can take advantage of a faulty random number generator.

        back during wwii the germans broke the codes on a number of british one-time-pads. (Theoretically unbreakable).

        It happened like this.
        To create the one time pads someone would take balls with letters on them out of a spherical cage. After each ball was selected they would spin the cage (after closing the hatch). They were not supposed to be looking at the letters during the selection process. After a while they did indeed start looking at the balls. Sub-conciously they would pick letter combinations that they felt were random, but actually were not. Speculating that this was the case the germans did a bit of research, and discovered the the frequencies of combinations and were ultimately able to crack a number of the brittish one-time-pads.

Re: Password generator using a linguistic rule base
by Chrisf (Friar) on Aug 01, 2001 at 18:16 UTC
    Hey, that's pretty cool. Nicely done :)
Re: Password generator using a linguistic rule base
by I0 (Priest) on Aug 02, 2001 at 04:16 UTC
    The generated password can be guessed by guessing the srand seed
      This is true, but you would also need a copy of the script to know how the passwords are initially generated and have a password to verify the seed against, anything else would be so close to brute cracking that the time investment would be about the same, don't you think... and at least it encourages users not to pick bad passwords.

      Just Another Perl Backpacker

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://101324]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (5)
As of 2016-06-25 06:07 GMT
Find Nodes?
    Voting Booth?
    My preferred method of making French fries (chips) is in a ...

    Results (323 votes). Check out past polls.