I live in Silicon Valley. I'm really into machine learning and Formula 1. I'm 24 years old. More about me…

Friend me up! Here's where I'm at…

LinkedIn
Flickr
Delicious

Introduce yourself! Feel free to IM (Jabber/Gtalk) or email me.

Top Posts:
So I went to Startup School
Waking from a 64-bit nightmare
A quick foray into linear algebra and Python: tf-idf
Explaining my addiction
The importance of being abstract

Top Commentors:
Paul Stamatiou (4)
Gordon (4)
Arthur (2)
Anton Lindstrom (2)
m13 (2)

Friends
James Broad
Mike Speiser
Mike Heijmans
Annie Lausier
Jay Liew
Gabriel Serafini
Kevin Pratt
Ben Wann
Bart Chapin

Recommended Reading
Joel on Software
Rands in Repose
Paul Graham
Gary Vaynerchuk
Coding Horror
xkcd

All content © 2008 Tim Trueman unless otherwise noted.

Subscribe in a reader Article archives

The importance of being abstract

March 11, 2008 Link

powerbook

As an undergrad, I didn’t spend much time sleeping before a project was due. Then again I hadn’t spent the last two weeks working on it. Now it was 24 hours before it was due and I was hadn’t even thought about how I was going to code an mathematical expression interpreter.

The project was pretty open-ended on how it was implemented but it had to be able to parse and interpret a mathematical expression and evaluate it. The more features your software had the higher your grade, for instance having order of operations, square roots, or variables were all good ways to push your grade into the A range.

Googling turned up an interesting approach I had never heard of before, the recursive descent parser. It was a beautiful solution to the problem, but let me tell you it was hard to wrap my mind around. It’s abstract simplicity was what made it so hard to grasp how it was actually working. I had never thought of using recursion that way; layering function calls into four different levels according to the order of operation. It took hours for me to really fully understand how it worked but finally the “AH-HAH!” moment came to me. I submitted my program, drearily sleep-walked through the snow to my dorm room like a zombie, and fell asleep.

It turns out that recursive descent parser I wrote in a night let me blaze three weeks ahead of schedule in a ten week project to build a Modula-2 compiler in C++.

fluent cluster

Almost a year after graduation I find I still don’t get much sleep, although I guess that’s my own fault. I blame it on the absolute beauty of abstraction. When I started studying computer science it never dawned on me the sheer importance of abstraction. As we abstracted from assemblers to memory-managed languages, we increased the ease, manageability, and speed of development, allowing much more advanced software to be written as fast as Saturn V at escape velocity.

Anyways I sidetracked, let me get to the point I was really trying to share with you. It’s a pair of abstractions called map and reduce that go together like peanut butter and jelly.

So let’s imagine you’ve got the hankering to sum up the first ten perfect numbers. Normally you’d write something like this (in Python, as usual):

sum, array = 0, [2,3,5,7,11,13,17,19,23,29]

for num in array:
  sum = sum + (pow(2,(num-1)) * (pow(2,num) - 1))

What sucks about this is that it’s not very abstract. Why would that matter? Just bear with me and I’ll explain just how awesome this concept actually is. Let’s start by abstracting out the perfect number formula into a separate function:

array = [2,3,5,7,11,13,17,19,23,29]

def perfect(num):
  return (pow(2,(num-1)) * (pow(2,num) - 1))

perfectNums = map(perfect,array)

What’s happened here is we’ve defined a function for calculating perfect numbers, and now we’re mapping that function onto our data set, producing a resulting array of the first ten perfect numbers. But what’s the best way to sum up that array? This is where reduce comes into play:

array = [2,3,5,7,11,13,17,19,23,29,31]

def perfect(num):
  return (pow(2,(num-1)) * (pow(2,num) - 1))

def sum(x, y):
  return x + y

reduce(sum,map(perfect,array))

google datacenterNow we’ve mapped our reduce function onto our previous results, and we finally have the sum of the first ten perfect numbers. Awesome. But wait, isn’t that more code, that’s possibly even more difficult to understand than Lock, Stock and Two Smoking Barrels? Ahh, yes. Why would someone use the map/reduce abstractions? Look no further than Google, who refactored their search algorithm for map/reduce because it offered a leviathan leap forward in speed. Can you guess why?

Parallelization (what a tough word to say!). The beauty of map/reduce is that mapping a function to a data set can be broken up into many tasks. It doesn’t matter if you calculate the first perfect number then the second, or if you split it into two halves being simultaneously executed in parallel. Now imagine your data set is the web and your function is the PageRank algorithm. Apply that to your massively parallel cluster or commodity hardware and you have what is the heart of Google.

There you have it, functional programming’s gift to man: map and reduce. If you’re still confused or have questions, leave a comment!

Frigtarded Leopard DNS issue

February 28, 2008 Link

Audi R8So ever since Leopard shipped in October, I’ve had serious problems with using the Internet. I had just moved to Silicon Valley and the process of new OS, new ISP, and a new router complicated tracking down the issue. Let me assure you, AT&T is by no means a reliable (or fast or cheap) service. Their Uverse service requires you to use their 2Wire “residential gateway”. I’ve got no nice things to say about 2Wire but this time I think the ultimate blame goes to Leopard.

I’ll be the first to admit Apple probably let this one slip through their QA process and that I’m kind of disappointed (perhaps even upset). The problem lies in a newly implemented Request For Comments (RFC), numero 2308. This RFC outlines “Negative Caching of DNS Queries”. Negative caching of DNS queries is a proposal for a way to speed up the user’s experience when network problems arise. Instead of waiting for a timeout to occur, the negative cache resolves near-instantaneously and lets the request know that it’s not going to work. Brilliant idea…right?

Not so fast. Remember my shitty Internet connection? AT&T’s DNS servers are spotty at best and from time to time a lookup will take longer than expected, timing out and creating a negative cache entry. For the next hour. Even though I could perform the look up again and this time it would work, the negative cache entry stops the second lookup from ever happening. So that’s it, I’m hosed right? Nope, Apple has provided a handy command-line application that can manage these negative cache entries. It’s called dscacheutil and it can “gather information, statistics and initiate queries to the Directory Service cache”. For instance to see if there are any negative cache entries run this command:

dscacheutil -cachedump -entries

There’s a column labelled “Neg” and if you see a “YES” then you may be able to solve your problems by flushing the cache manually (instead of waiting an hour):

dscacheutil -flushcache

Thanks to this discussion on Apple’s support website for helping me out! Now I just want Apple, AT&T, and 2Wire to take note and nip this one in the bud.

Tell me if you’ve had this problem or another similar one!

What’s a homepage?

February 25, 2008 Link

I’m pretty crazy…I mean I have the URL structure for Wikipedia memorized. But seriously, I can’t think of the last time I used a homepage. I either type in the exact URL or I type a search into the location bar (in Firefox 2+ just make sure there’s a space and it automatically does a Google search). Does anyone actually set a homepage? What do you do?

Kickin’ ass: Firefox 3

February 23, 2008 Link

I’m sitting here writing this with twelve open tabs. If this was Firefox 2, I’d be somewhere between 300 and 500-something megabytes of RAM. I actually tried to switch to Safari (and WebKit nightly) unsuccessfully. What’s wrong with Safari? Honestly, there’s nothing wrong with it. I’m attached to command-option-arrow-key to switch tabs and I can’t retrain my brain to use command-shift-arrow-key. Oh and one more thing. I’m a developer. I am married to Firebug, which incidentally actually works in Firefox 3 if you get the 1.1 beta. So where was I? Oh yeah. So here I am with twelve open tabs. Twelve open tabs and it’s eating 137MB of RAM. It’s like I just doubled my entire system’s RAM. Holy shit it’s so much snappier and it’s got native-looking UI in Leopard (Although it’s not truly native, if it was I could command-control-d-mouse-over a word to look it up–try that in any native app). Anyways, if you’re running Firefox 2 and thinking of switching, do it now…to Firefox 3.
Dictionary

Updated: slight fix for location bar behavior courtesy of Richard Crowley

Why Linux needs .app

February 17, 2008 Link

A little over two months ago I started using RedHat Enterprise Linux 5 as my primary desktop. I won’t lie, I prefer Ubuntu, but I didn’t really have a choice. It was my work desktop. It was a neat experiment, learning to live with Linux everyday.

McLaren SLR

My biggest frustration was installing, updating, and removing applications. It’s 2008, why is this so hard? Well I’ll tell you. I don’t even know where the application lives, let alone any of its files or dependencies. Did I mention it took me three days to get Firefox 2 installed?

The problem here is that software is complex. There’s a million languages. There’s too many dependencies. Where should the help files go, and which format are they in? Honestly, it’s out of control. Linux it’s time for you to nip this one in the bud.

I know a lot of people suffer from the not-invented-here syndrome, but I’m curious, has anyone solved this problem? I mean, who’s taken UNIX-like operating system and made it easy to manage software? Anyone? Oh yeah…there’s that one Cupertino, CA company that came up with something. They use folders. Any folder with “.app” in its name. There’s a standard for where things go in that folder for all types of resource. Nice and simple. But there’s a problem. My application depends on Java 1.4.2 or Python 2.4.

Viva Las Vegas!

Well 1 Infinite Loop has a solution for that too. Any folder ending in “.framework” in the frameworks folder in the root of the system. These framework folders have a folder for each version. For example Java 1.4 and 1.5 each have their own folders. As minor versions are released symbolic links are used, for instance: Java 1.4.2 is the current version and the 1.4.2 folder has the latest release, while 1.4.1 and 1.4 are just symlinks to 1.4.2.

Don’t even get me started on Apple’s elegant solution to storing software settings. I’m not saying copy Mac OS X. Wait, yes I am. Seriously folks, just do it. Can you imagine how nice it would make life in Linux? I could actually delete an application and know it’s gone. I could easily restore application settings to the defaults. Applications could use different versions of dependent software, without making me care or worry about it. Updating applications could be trivial and automatic. Nobody friggin’ cares about doing software updates. This is why the web does it so well: software updates are completely transparent to the user. Tell me you want anyone to spend even a second thinking about software upgrades and you’re wrong.

I remember each of the last few years being touted as the year of Linux on the desktop. Sadly until the complexity and sheer ugliness of its current state is changed, this will never be the case.

Foosball, the official sport of Yahoo!

P.S. I think there’s a an opportunity here for the least likely party to win with Linux. Microsoft. I’m not even kidding. Just take a deep breath and think about it. Remember Mac OS 9? Remember what a piece of shit it was? No true multitasking, no protected memory, etc. Do you remember how Apple solved that problem? They just took FreeBSD and put their GUI on top of it.

Microsoft won the desktop OS war long ago. Today more and more software is web-based. This means the desktop OS is dying slowly. Stop wasting time and money Microsoft, just take Linux and make it usable, because honestly, all users want to do with their OS is launch the browser and hit the web. The best way to do that is probably make a few smart acquisitions  here and there (and let them keep their brands and attitudes) and redistribute your resources accordingly (by maybe spinning off certain products that have no long-term future as separate companies, Wall Street will totally fall for it).

Tab sizes

February 14, 2008 Link

So according to Bix, the preferred tab size is 2 spaces. I totally agree. I think that has something to do with something Claude Shannon said about information entropy. My hypothesis is 1 space is difficult to read and anything bigger than 2 is a waste of space. Efficiency of information FTW!

A quick foray into linear algebra and Python: tf-idf

February 10, 2008 Link

Home Office Workstation

My goal with this is to introduce you, the reader, to two things: a beautiful language called Python and an algorithm for finding which words most uniquely describe a document. Hopefully this will be easier than finding an anti-gravity repository, but who knows I hated writing in college and nobody said I was very good at it. Anyways.

Linear Algebra

I’m not sure there’s any other when to say this other than to just say it…term frequency–inverse document frequency. That’s the name of the game, but I’ll call it tf-idf instead of its catchy full name. Tf-idf is straight up statistics for determining how unique a token or word is to a specific document. For instance “battle school” and “toon leader” are probably very unique to Ender’s Game. I mean it’s pretty unlikely they’ll show in something like Harry Potter or Lord of the Rings. What’s brilliant about this algorithm is common words such as “the” or “and” aren’t unique to any one document, so their frequency in one document compared to all the rest actually approaches zero rather quickly.

Before I scare away anyone with crazy looking equations, let me break it down to you. The td-idf weight of a word is calculated by multiplying together two things, tf and idf, so let’s look at tf, or term frequency first.

The term frequency is actually really simple, it’s the number of times a given word occurs divided by the number or words in the document. Why do we divide by the number or words in the document? That’s just a simple way to normalize the bias toward longer documents. I mean Harry Potter probably has the word “brown” more times than “the quick brown fox jumped over the lazy dog” but it’s not as unique to that document. Hence we need a way to level the playing field for documents of all shapes and sizes. Maybe just sizes.

Huge fucking Storage Area Network porn

Idf, the other factor in the equation, also known as inverse document frequency, normalizes the term frequency value so that common words used by all documents are given very little importance. Examples include “the” for English documents or “for” for source code, as neither term is unique to any particular document even though they both probably appear very often in all documents.

One example application of this algorithm would be listing the top ten most interesting words for a post on a blog. This would surface words that more or less uniquely identify a post, and could be used for the keywords in the HTML header to maximize the SEO value of a permalink. But of course there’s plenty of other uses.

Still confused? Let’s just jump right into the code!

Putting tf-idf into action: using Python

As Randall Munroe puts it, Python is executable psuedocode. Take a for loop in PHP for example:

for ($i=0; $i<10; $i++) {
  echo $i;
}

In PHP you use curly braces to indicate control flow. It’s great and explicit. But when I’m prototyping something quickly it’s kind of annoying to have to use curly braces to realize an idea in my head. Sheesh. If I just wanted to write the pseudocode for it now and write the real, executable code later I would write something like this:

for i in range(10):
  print i

Oh wait, that’s Python. And it’s executable. Niiiiice. The whitespace is important as it contains the control flow information. Everything that’s indented after that for statement will be looped through and the first non-indented statement after the for loop will indicate the end of the for loop.

OK, so let’s put this tf-idf algorithm into psuedocode/Python! Let’s make it object-oriented so I’ll start with the tf part of the equation:

def tf(word, document):
  return (freq(word,document) / float(wordCount(document)))

Hmm, looks like we’ll need to write a couple helper functions…

def freq(word, document):
  return document.split(None).count(word)
def wordCount(document):
  return len(document.split(None))

There we go. If you’re curious what the split(None) does it breaks a string into chunks when it hits whitespace. Alright, half way there…now for idf:

def idf(word, documentList):
  return math.log(len(documentList) / numDocsContaining(word,documentList))

Again we’ll need a helper function that counts the number of documents containing a word:

def numDocsContaining(word,documentList):
  count = 0
  for document in documentList:
    if freq(word,document) > 0:
      count += 1
  return count

Finally we’ve got both parts of the equations so all we need is one last function!

def tfidf(word, document, documentList):
  return (tf(word,document) * idf(word,documentList))

Tada! That’s it…executable pseudocode for finding out how unique a word is to a specific document. If you want to know more check out the Wikipedia entry for tf-idf.

Just copy and paste the following code save it as tfidf.py, head over to your shell of choice and type “python tfidf.py” and you’re good to go!

#!/usr/bin/env python

import math
from operator import itemgetter

def freq(word, document):
  return document.split(None).count(word)

def wordCount(document):
  return len(document.split(None))

def numDocsContaining(word,documentList):
  count = 0
  for document in documentList:
    if freq(word,document) > 0:
      count += 1
  return count

def tf(word, document):
  return (freq(word,document) / float(wordCount(document)))

def idf(word, documentList):
  return math.log(len(documentList) / numDocsContaining(word,documentList))

def tfidf(word, document, documentList):
  return (tf(word,document) * idf(word,documentList))

if __name__ == '__main__':
  documentList = []
  documentList.append("""DOCUMENT #1 TEXT""")
  documentList.append("""DOCUMENT #2 TEXT""")
  documentList.append("""DOCUMENT #3 TEXT""")
  words = {}
  documentNumber = 0
  for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word,documentList[documentNumber],documentList)
  for item in sorted(words.items(), key=itemgetter(1), reverse=True):
    print "%f <= %s" % (item[1], item[0])

Note: I am a subscriber to the 2-space soft tab school of thought, however I guess you could use whatever as long as you’re consistent.

If this was (or wasn’t) helpful to you, or you have questions, comment or contact me!

Normally I’m patient

As I not-so-patiently wait for the iPhone SDK to ship later this month (that O’Reilly Rough Cut is so tempting), remember the pain I felt at the beginning of this year. First there was New Years. Wow, hot damn, it’s 2008. That’s crazy. Then the void came.

For two weeks life was boring and meaningless. Two weeks! I almost didn’t make it. Steve Jobs, the most I can wait for your keynote after New Years is 3-5 days. Take note, and don’t make me wait two weeks next year, OK?

P.S. Seriously stop with the 3G iPhone rumors people, it’s ridiculous. They’ll announce it for June 2009 in January 2009, maybe later and certainly not a second sooner.

Data detectors

January 18, 2008 Link

When Apple shipped Leopard they released a tiny feature contains furious potential. This barely-publicized feature of Mail.app hopefully is a sign of things to come. All it does is allow one to click on any written form of date and create a todo or event in iCal. Pretty minor feature, right?

liquid

Nifty feature Apple, but why only Mail.app? I think Giles Turnbull was on to something when he suggested creating a system-wide CoreTodo service. Imagine this: I’m in Adium, chatting with a friend and he tells me Paul van Dyk is coming to play in San Francisco on February 9th. I want to be able to just click on that date and create a Facebook event and invite all my friends to go with me. How cool would that be? Oh wait, I can only create an iCal todo or event. Which is why I seriously think there’s one addition to CoreTodo that would create a killer app.

iCal is great. No, seriously. I use it every day. But there’s other things I wish I could do with that data. Why can’t I add it to Google Calendar, check out that date on Upcoming.org, send an event invitation in Gmail, or create a Facebook event? Here’s what I’m saying: create CoreDate (the more logical name for its functionality) and let applications register with CoreDate as a possible action. Even if it’s a web application. Then that drop down menu won’t be so lonely.

Using Perforce for multiple projects

January 9, 2008 Link

Having used Perforce for multiple projects, I just wanted to share this little trick about using Perforce for multiple projects. You will want to create multiple clients for every logical project you are doing. To get started put this in your ~/.bash_profile:

export P4CONFIG='.P4CONFIG'

cluster

What that does is tells p4 to check for its configuration in whatever directory you execute the p4 command. It then keeps moving up in the directory structure until it finds a .P4CONFIG file. For instance, imagine you’re in a web application project and you are exploring the javascript files in a folder called “js” in the root of the project files and run a p4 sync from there. The p4 sync would look for a .P4CONFIG in the current “js” folder, fail to find one and then move up to the root where it would find the project specific .P4CONFIG. Now what shall we put in that project specific .P4CONFIG?

Save the follow snippet into a .P4CONFIG file inside a folder you want to check all the code out to, then run p4 client, edit it to include just the directories you want, then run p4 sync and you’re golden!

P4EDITOR=vim
P4CLIENT=jschmo-foo
P4USER=jschmo
P4PASSWD=iknowkungfoo
P4PORT=perforce.someinternalserver.domain.com:1666

I use that client name because it’s easy to track down who made a change and for what project. If this doesn’t work for you try to pull in your bash profile manually with this command:

source ~/.bash_profile


« Newer Posts