A quick foray into linear algebra and Python: tf-idf

My goal here is to introduce you, the reader, to two things: a beautiful language called Python and an algorithm for finding the words that most uniquely describe a document. Hopefully this will be easier than finding an anti-gravity repository, but who knows; I hated writing in college and nobody ever said I was very good at it. Anyways.

Linear Algebra

I'm not sure there's any other way to say this other than to just say it...term frequency--inverse document frequency. That's the name of the game, but I'll call it tf-idf instead of its catchy full name. Tf-idf is straight-up statistics for determining how unique a token or word is to a specific document. For instance, "battle school" and "toon leader" are probably very unique to Ender's Game; it's pretty unlikely they'll show up in something like Harry Potter or Lord of the Rings. What's brilliant about this algorithm is that common words such as "the" or "and" aren't unique to any one document, so their weight approaches zero rather quickly once their frequency in one document is compared against all the rest.

Before I scare anyone away with crazy-looking equations, let me break it down for you. The tf-idf weight of a word is calculated by multiplying together two things, tf and idf, so let's look at tf, or term frequency, first.

The term frequency is actually really simple: it's the number of times a given word occurs divided by the number of words in the document. Why do we divide by the number of words in the document? That's just a simple way to normalize away the bias toward longer documents. I mean, Harry Potter probably has the word "brown" more times than "the quick brown fox jumped over the lazy dog" does, but it's not as unique to that document. Hence we need a way to level the playing field for documents of all shapes and sizes. Maybe just sizes.
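To make that concrete, here's a quick sanity check in a Python 3 shell (a toy example, not the real code yet). "brown" occurs once in the nine-word fox sentence, so its term frequency is 1/9:

>>> doc = "the quick brown fox jumped over the lazy dog"
>>> doc.split(None).count("brown") / len(doc.split(None))
0.1111111111111111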

Idf, the other factor in the equation, is short for inverse document frequency. It normalizes the term frequency so that common words used by all documents are given very little importance. Examples include "the" for English documents or "for" for source code; neither term is unique to any particular document even though both probably appear very often in every document.
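Here's the intuition in numbers (the counts are made up for illustration): with ten documents, a word that appears in every document scores zero, while a word that appears in only one scores much higher.

>>> import math
>>> math.log(10 / 10)   # a word that appears in all ten documents
0.0
>>> math.log(10 / 1)    # a word that appears in only one document
2.302585092994046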

One example application of this algorithm would be listing the top ten most interesting words for a post on a blog. This would surface the words that more or less uniquely identify the post, which could be used as the keywords in the HTML header to maximize the SEO value of a permalink. But of course there are plenty of other uses.

Still confused? Let's just jump right into the code!

Putting tf-idf into action: using Python

As Randall Munroe puts it, Python is executable pseudocode. Take a for loop in PHP, for example:

for ($i=0; $i<10; $i++) {
  echo $i;
}

In PHP you use curly braces to indicate control flow. It's great and explicit. But when I'm prototyping something quickly, it's kind of annoying to have to type curly braces just to realize an idea in my head. Sheesh. If I just wanted to write the pseudocode for it now and write the real, executable code later, I would write something like this:

for i in range(10):
  print(i)

Oh wait, that's Python. And it's executable. Niiiiice. The whitespace is important because it carries the control flow information: everything indented under the for statement is part of the loop body, and the first non-indented statement afterward marks the end of the loop.
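To see that rule in action (a trivial sketch, nothing to do with tf-idf yet):

for i in range(3):
  print(i)        # indented: part of the loop body, runs on every pass
print("done")     # back at the left margin: runs once, after the loop ends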

OK, so let's put this tf-idf algorithm into pseudocode/Python! Let's build it up one function at a time, starting with the tf part of the equation:

def tf(word, document):
  return freq(word, document) / float(wordCount(document))

Hmm, looks like we'll need to write a couple helper functions...

def freq(word, document):
  return document.split(None).count(word)

def wordCount(document):
  return len(document.split(None))

There we go. If you're curious what split(None) does, it breaks a string into chunks wherever it hits whitespace. Alright, halfway there...now for idf:
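For example, in a Python shell:

>>> "ender   wiggin\tbattle school".split(None)
['ender', 'wiggin', 'battle', 'school']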

def idf(word, documentList):
  # float() guards against integer division under Python 2
  return math.log(len(documentList) / float(numDocsContaining(word, documentList)))

Again we'll need a helper function that counts the number of documents containing a word:

def numDocsContaining(word, documentList):
  count = 0
  for document in documentList:
    if freq(word, document) > 0:
      count += 1
  return count

Finally we've got both parts of the equation, so all we need is one last function!

def tfidf(word, document, documentList):
  return tf(word, document) * idf(word, documentList)

Tada! That's it...executable pseudocode for finding out how unique a word is to a specific document. If you want to know more, check out the Wikipedia entry for tf-idf.
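If you want a quick sanity check before running the full script, paste the functions above into a Python shell along with import math and try a tiny made-up corpus:

docs = ["the quick brown fox", "the lazy dog", "the brown bear"]
print(tfidf("brown", docs[0], docs))  # roughly 0.1014: tf is 1/4, idf is ln(3/2)
print(tfidf("the", docs[0], docs))    # exactly 0.0: "the" appears in every document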

Just copy and paste the following code, save it as tfidf.py, head over to your shell of choice, type "python3 tfidf.py" and you're good to go!

#!/usr/bin/env python3

import math
from operator import itemgetter

def freq(word, document):
  # Number of times the word appears in the document
  return document.split(None).count(word)

def wordCount(document):
  # Total number of words in the document
  return len(document.split(None))

def numDocsContaining(word, documentList):
  count = 0
  for document in documentList:
    if freq(word, document) > 0:
      count += 1
  return count

def tf(word, document):
  # Term frequency, normalized by document length
  return freq(word, document) / float(wordCount(document))

def idf(word, documentList):
  # float() guards against integer division under Python 2
  return math.log(len(documentList) / float(numDocsContaining(word, documentList)))

def tfidf(word, document, documentList):
  return tf(word, document) * idf(word, documentList)

if __name__ == '__main__':
  documentList = []
  documentList.append("""DOCUMENT #1 TEXT""")
  documentList.append("""DOCUMENT #2 TEXT""")
  documentList.append("""DOCUMENT #3 TEXT""")
  words = {}
  documentNumber = 0
  for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word, documentList[documentNumber], documentList)
  # Print every word in the chosen document, most unique first
  for item in sorted(words.items(), key=itemgetter(1), reverse=True):
    print("%f <= %s" % (item[1], item[0]))

Download or fork the source code on GitHub: http://github.com/timtrueman/tf-idf