Markov Chains From Scratch
16th January 2018Sometimes it pays to go back to basics. Data science is a massive, complicated field with seemingly endless topics to learn about. But in our rush to learn about the latest deep learning trends, it's easy to forget that there are simple yet powerful techniques right under our noses. In this post we'll explore one such technique called a Markov chain. By building one from scratch using nothing but standard Python libraries, we'll see how simplistic they can be while also yielding some cool results.
Markov chains are essentially a way to capture the probability of state transitions in a system. A process can be considered a Markov process if one can make predictions about the future state of the process based solely on its present state (or several of the most recent states for a higher-order Markov process). In other words, the history doesn't matter beyond a certain point. There are lots of great explainers out there so I'll leave that for the reader to explore independently (this one is my favorite). It will become clearer as we step through the code, so let's dive in.
For this example we're going to build a language-based Markov chain. More specifically, we'll read in a corpus of text and identify pairs of words that appear together. The pairings are sequential such that when a word \(w1\) is followed by a word \(w2\), then we say that the system has a probabilistic state transition from \(w1\) to \(w2\). An example will help. Consider the phrase "the brown fox jumped over the lazy dog". If we break this down by word pairings, our state transitions would look like this:
the: [brown, lazy]
brown: [fox]
fox: [jumped]
over: [the]
lazy: [dog]
This set of state transitions is called a Markov chain. With this in hand we can now choose a starting point (i.e. a word in the corpus) and "walk the chain" to create a new phrase. Markov chains built in this manner over large amounts of text can produce surprisingly realistic-sounding phrases.
In order to get started we need a corpus of text. Anything sufficiently large will do, but to really have some fun (and at the risk of bringing politics into the mix) we're going to make Markov chains great again by using this collection of text from Donald Trump's campaign speeches. Our first step is to import the text file and parse it into words.
import urllib2
text = urllib2.urlopen('https://raw.githubusercontent.com/ryanmcdermott/trump-speeches/master/speeches.txt')
words = []
for line in text:
line = line.decode('utf-8-sig', errors='ignore')
line = line.encode('ascii', errors='ignore')
line = line.replace('\r', ' ').replace('\n', ' ')
new_words = line.split(' ')
new_words = [word for word in new_words if word not in ['', ' ']]
words = words + new_words
print('Corpus size: {0} words.'.format(len(words)))
Corpus size: 166259 words.
I did some clean-up by converting it to ASCII and removing line breaks but that's about it, the rest of the text is just left as it appears in the source file. Our next step is to build the transition probabilities. We'll represent our transitions as a dictionary where the keys are the distinct words in the corpus and the value for a given key is a list of words that appear after that key. To build the chain we just need to iterate through the list of words, add it to the dictionary if it's not already there, and add the word proceeding it to the list of transition words.
chain = {}
n_words = len(words)
for i, key in enumerate(words):
if n_words > (i + 1):
word = words[i + 1]
if key not in chain:
chain[key] = [word]
else:
chain[key].append(word)
print('Chain size: {0} distinct words.'.format(len(chain)))
Chain size: 13292 distinct words.
It may come as a surprise that we're just naively inserting words into the transition list without caring if that word had appeared already or not. Won't we get duplicates, and isn't that a problem? Yes we will, and no it's not. Think of this as a simplistic way of representing the transition probability. If a word appears multiple times in the list, and we sample from the list randomly during a transition, there's a higher likelihood that we pick that word proportional to the number of times it appeared after the key relative to all the other words in the corpus that appeared after that key.
Now that we've built our Markov chain, we can get to the fun part - using it to generate phrases! To do this we only need two pieces of information - a starting word, and a phrase length. We're going to randomly select a starting word from the corpus and make our phrases tweet-length by sampling until our phrase hits 140 characters (assume we're part of the #never280 crowd). Let's give it a try.
import random
w1 = random.choice(words)
tweet = w1
while len(tweet) < 140:
w2 = random.choice(chain[w1])
tweet += ' ' + w2
w1 = w2
print(tweet)
Were not going to run by the 93 million people are, where were starting. New Hampshire." I PROMISE. I do so incredible, and be insulted, Chuck.
Not bad! The limitations of using only one word for context are readily apparent though. We can improve it by using a 2nd-order Markov chain instead. This time, instead of using simple word pairings, our "keys" will be the set of distinct tuples of words that appear in the text. Borrowing from the example phrase earlier, a 2nd-order Markov chain for "the brown fox jumped over the lazy dog" would look like:
(the, brown): [fox]
(brown, fox): [jumped]
(fox, jumped): [over]
(jumped, over): [the]
(over, the): [lazy]
(the, lazy): [dog]
In order to build a 2nd-order chain, we have to make a few modifications to the code.
chain = {}
n_words = len(words)
for i, key1 in enumerate(words):
if n_words > i + 2:
key2 = words[i + 1]
word = words[i + 2]
if (key1, key2) not in chain:
chain[(key1, key2)] = [word]
else:
chain[(key1, key2)].append(word)
print('Chain size: {0} distinct word pairs.'.format(len(chain)))
Chain size: 72373 distinct word pairs.
We can do a sanity check to make sure it's doing what we expect by choosing a word pair that appears somewhere in the text and then examining the transitions in the chain for that pair of words.
chain[("Its", "so")]
['great', 'great', 'easy.', 'preposterous.', 'important...', 'simple.', 'simple.', 'horrible.', 'out', 'terrible.', 'sad.', 'much', 'can', 'easy.', 'embarrassing', 'astronomical']
Looks about like what I'd expect. Next we need to modify the "tweet" code to handle the new design.
r = random.randint(0, len(words) - 1)
key = (words[r], words[r + 1])
tweet = key[0] + ' ' + key[1]
while len(tweet) < 140:
w = random.choice(chain[key])
tweet += ' ' + w
key = (key[1], w)
print(tweet)
there. They saw it. He talks about medical cards. He talks about fixing the VA health care. They want to talk to me from Georgia? "Dear So and
Better! Let's turn this into a function that we can call repeatedly to see a few more examples.
def markov_tweet(chain, words):
r = random.randint(0, len(words) - 1)
key = (words[r], words[r + 1])
tweet = key[0] + ' ' + key[1]
while len(tweet) < 140:
w = random.choice(chain[key])
tweet += ' ' + w
key = (key[1], w)
print(tweet + '\n')
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
East. But we have a huge subject. Ive been with the Romney campaign. Guys made tens of thousands of people didnt care about the vets in one hour. somebody is going to put American-produced steel back into the sky. It will be the candidate. But I think 11 is a huge problem. And Im on the THAT WE CAN ONLY DREAM ABOUT. THEY HAVE A VERY BIG BEAUTIFUL GATE IN THAT WALL, BIG AND BEAUTIFUL, RIGHT. NO. NO, I DON'T KNOW WHERE THEY HAVE We need to get so sick of me. I didnt want the world my tenant. They buy condos for tens of millions of dollars overseas. And too many executive Wont be as good as you know, started going around and were going to win. Were going to happen. Thank you. SPEECH 8 This is serious rifle. This
That's all there is to it! Incredibly simple yet surprisingly effective. It's obviously not perfect but it's not complete gibberish either. If you run it enough times you'll find some combinations that actually sound pretty plausible. These results could probably be improved significantly with a much more powerful technique like a recurrent neural net, but relative to the effort involved it's hard to beat Markov chains.