Vigenere Cipher

Let's begin with some plaintext:

In [1]:
p='tobeornottobethatisthequestionwhethertisnoblerinthemindtosuffertheslingsandarrowsofoutrageousfortuneortotakearmsagainstaseaoftroublesandbyopposingendthem'
In [23]:
print p
tobeornottobethatisthequestionwhethertisnoblerinthemindtosuffertheslingsandarrowsofoutrageousfortuneortotakearmsagainstaseaoftroublesandbyopposingendthem

...and a key:

In [3]:
k='hamlet'
In [4]:
print k
hamlet

The following function performs encryption:

In [5]:
def vigencrypt(key,plaintext):
    keylength=len(key)
    ciphertext=''
    for j in range(len(plaintext)):
        base=ord(plaintext[j])-ord('a')
        shift=ord(key[j%keylength])-ord('a')
        ciphertext+=chr(ord('a')+(base+shift)%26)
    return ciphertext

Let's encrypt with the plaintext-key pair above:

In [6]:
c=vigencrypt(k,p)
print c
aonpskuofesulttlxbzttpunlsftsgdhqelxytudrhilqcmgahqxmgktadyymedelxzluyklhnplvkvwezjhbtdlkxvueqskauzpskaofloxhrydezhizdxtzemzjmyogmpxzazofrvpbzwbugqyhmoey

Decryption is the same as encryption, using the inverse key:

In [9]:
def inverse_key(key):
    inv=''
    for ch in key:
        shift=ord(ch)-ord('a')
        newshift=(-shift)%26
        newch=chr(ord('a')+newshift)
        inv+=newch
    return inv
In [10]:
invk=inverse_key(k)
print invk
taopwh

In [11]:
print vigencrypt(invk,c)
tobeornottobethatisthequestionwhethertisnoblerinthemindtosuffertheslingsandarrowsofoutrageousfortuneortotakearmsagainstaseaoftroublesandbyopposingendthem

Cryptanalysis

Let's plot the distribution of letters in the plaintext:

In [16]:
%matplotlib inline
In [19]:
import pylab
def plot_distribution(letterstring):
    dist=[0]*26
    for ch in letterstring:
        dist[ord(ch)-ord('a')]+=1
    pylab.xlim(-1,27)
    pylab.ylim(0,20)
    dist.sort()
    pylab.stem(range(26),dist)
    pylab.show()

The plot shows the frequencies of letters beginning with the least frequently occurring to the most frequently occurring. Even for this small sample, it has the characteristic skewed distribution of letters in English, with three letters accounting for 35% of the plaintext, and five letters not appearing at all.

In [20]:
plot_distribution(p)

We similarly plot the distribution of the letter frequencies in the ciphertext, scaled to the same height.

In [21]:
plot_distribution(c)

You can see that it's far more uniform. Every letter appears at least once in the ciphertext, and no letter appears more than 8% of the time. This makes a straightforward frequency analysis quite a bit more difficult to carry out.

The cryptanalysis takes place in two phases. In the first phase, we count how many ciphertext letters match the letter m positions further on, for m between 3 and 10. As argued in the notes, the value of m that gives the largest number of coincidences is the likely key length.

In [26]:
def count_coincidences(ciphertext,shift):
    count = 0
    for j in range(len(ciphertext)-11):
        if ciphertext[j]==ciphertext[j+shift]:
            count += 1
    return count

                
In [27]:
for shift in range(3,11):
    print shift,':',count_coincidences(c,shift)
3 : 8
4 : 4
5 : 3
6 : 13
7 : 9
8 : 6
9 : 7
10 : 5

There's no ambiguity about the result! The likely key length is 6.

This lets us partition the ciphertext into six subtexts. Each of the texts was obtained from the corresponding portion of the plaintext by encrypting with the same key letter, and so should have a frequency distribution resembling that of English. The function below returns a list of the subtexts:

In [30]:
def extract_subtexts(ciphertext,keylength):
    subtexts=['']*keylength
    for j in range(len(ciphertext)):
        subtexts[j%keylength]+=ciphertext[j]
    return subtexts

Print the subtexts. You should see that if you read the list vertically from top to bottom and then horizontally, you get back the original ciphertext.

In [31]:
s=extract_subtexts(c,6)
for j in range(6):
    print s[j]
aulzldyiakmzhvbvaahhzyzvuo
oottshtlhtelnwtuuorieoapge
nfttfquqqadupedezfyzmgzbqy
pelptedcxdeylzlqplddzmozy
ssxuslrmmylkvjkssoexjpfwh
kubngxhggyxlkhxkkxztmxrbm

The second phase of cryptanalysis works on these subtexts. Each of them was obtained from the plaintext by shifting using a single letter of the key. For each subtext we want to find the key letter that gives the most English-like letter distribution to the resulting decryption.

We assign a score to a text by computing the sum

pafa+pbfb++pzfz,

where pθ denotes the relative frequency of the letter θ in the text, and fθ denotes its relative frequency in English text. We then decrypt each subtext under each of the 26 possible shifts to find which one gives the highest score. This should reveal each symbol of the key.

The first function below just computes a score for a given text. The array 𝚏𝚛𝚎𝚚 is derived from published statistics on letter distributions in English.

In [42]:
def score(text):
    freq=[0.0817, 0.0153, 0.0224, 0.0470, 0.121, 0.0217, 0.0210, 0.0606, 0.0724, 0.0010, 0.0090, 0.0379, 0.0315, 0.0684, 0.0773, 0.0170, 0.0009, 0.0575, 0.0605, 0.0885, 0.0283, 0.0092, 0.0260, 0.0013, 0.0226, 0.0002]
    dist=[0]*26
    for ch in text:
        dist[ord(ch)-ord('a')]+=1
    result=0
    for j in range(26):
        result+=dist[j]*freq[j]
    return result/len(text)

Just to see how this works, let's calculate scores for the original plaintext, and for some gibberish (the original ciphertext).

In [44]:
print score(p)
print score(c)
0.0668320261438
0.0405816993464

In principle, ordinary English should give a score of about ppp2a++p2z , which is 0.644. Text in which the letters are distributed uniformly should give a score of about 1260.385, so these results are in fairly close accord.

The next function takes a text, decrypts it under all possible one-letter keys, and returns both the maximum score and the key letter that led to the maximum score.

In [52]:
def maxscore(text):
    maxval=0
    maxletter='?'
    for keyletter in 'abcdefghijklmnopqrstuvwxyz':
        newscore=score(vigencrypt(inverse_key(keyletter),text))
        if newscore>maxval:
            maxval=newscore
            maxletter=keyletter
    return(maxletter,maxval)
        

Now let's apply this to each of our 6 subtexts:

In [53]:
for j in range(6):
    print maxscore(s[j])
('h', 0.06791153846153845)
('a', 0.06818461538461539)
('m', 0.06870384615384616)
('l', 0.07245599999999999)
('e', 0.05285600000000001)
('t', 0.07070800000000001)

The cryptanalysis succeeded in recovering the key 'hamlet' starting only from the ciphertext.