Randow stuff

let's build something beautiful

A First Look At the AOL data

Among all the controversy, the AOL data is still a unique source of very interesting data. Here is the account of my first explorations.

Exploiting the data

In order to see anything useful from a huge pile of data like that (around 500MB compressed), it is necessary to search for some repetitive patterns. For my first exploration, I have decided to explore the relationships between frequent search keywords. Using python and the graph package pydot (graphviz), it is possible to quite simply plot some interesting relations.

The algorithm selects some of the most popular keywords, excluding unintersting ones like "the" "for" or "and". It then constructs a weighted graph with edges denoting a link between two keywords. There is a link between two keywords if they were searched together. For example, if a user searched for "new york", there is a link between "new" and "york". The weight is the number of times the keywords were searched together.I then select only a number of the highest weighted edges, and draw the graph with pydot. Source code is included at the end of this post, I hope people will play with it and post other results.

Here is the kind of results you can obtain:

Huge high-res

This is still kind of raw and rather unexploitable. Fortunately, it is possible to vary the number of words and edges. Playing a little bit with the parameters, it is possible to to cluster the data. I found that 90% of searches fall down into three categories. Local US search (entertainement, real estate, universities, etc.), Porn, and Music. Here is a closer look at all three clusters:

US related stuff

US relatedhigh-res.

Porn cluster

Music and lyrics cluster

qsdfhigh-res

Central words

When you explore the data a little more, it becomes apparent that some words play the role of anchors: "county", "lyrics", or "new". The other words combine with them to specify there meaning: "one (more) time lyrics", "new york", "county court house". Here is a really huge map illustrating these concepts: (be patient, it will take a little while to load, imageshack is really slow)

Free Image Hosting at www.ImageShack.us

Source code

The code assumes you pass it a text file in the following format:

keyword1 keyword2 keyword3
keyword43 keyword1 
keyword7 keyword34 keyword23 keyword9
....

It's very easy to obtain a file in that format using the aol text files and standard unix tools like sed, grep, and cut

#!/usr/bin/python

import sys
import operator

def increment_dict(dict,key):
    if key in dict:
        dict[key]+=1
    else:
        dict[key]=1

def sort_freq(dict):
    """Sorts a frequency dict, returns sorted key/freq tuples"""
    tuples = dict.items()
    tuples.sort(key=operator.itemgetter(1),reverse=True)
    return tuples

def combo(list):
    """takes a list, and generates sorted combinations: [a,b,c] ==> [(a,b),(a,c),(b,c)]"""
    if len(list) == 2:
        list.sort()
        return [tuple(list)]
    else:
        l=[]
        for i in list[1:]:
            l.append((min(list[0],i),max(list[0],i)))
        l.extend(combo(list[1:]))
        return l

def tuples2graph(tuples,elements):
    """Returns a graph matrix from a list of frequency tuples, and a list of the elements contained in the pairs"""
    matrix=[]
    for i in range(len(elements)): #matrix creation and initialization
        matrix.append([])
        for j in range(len(elements)):
            matrix[i].append(0)
    for pair,count in tuples:
        a,b=pair
        try:
            x=elements.index(a)
            y=elements.index(b)
        except:
            pass
        else:
            matrix[x][y]=count
    return matrix

def print_matrix(matrix):
    for row in matrix:
        for line in matrix:
            for column in line:
                print column,
            print

def extract_links(lines,authorized_words):
    """Extract links between auhorized words"""
    word_links={}
    for line in lines:
        line = line[:-1] # get rid of \n
        words=line.split(" ")
        words =  [ word for word in words if word in authorized_words ]  #only consider authorized words
        if len(words) >= 2:
            for word in words:
                for combination in combo(words):
                    if combination[0] != combination[1] : increment_dict(word_links,combination) #remove links from word to word
    return word_links

def label_graph(graph,labels):
    """labels a numbered graph with text labels"""
    labeled_graph=[]
    for n1,n2 in graph:
        if not (n1 == -1 or n2 == -1):labeled_graph.append((labels[n1],labels[n2]))
    return labeled_graph        

def clean_freq_data(data,min_freq):
    """removes all entries with freq < min_freq"""
    for key,freq in data.items():
        if freq < min_freq: data.pop(key)

def opposite_matrix(matrix):
    """prim's algorithm finds the cheapest path, so we must change our matrix to use it"""
    max_value=0
    for line in matrix:
        line_max=max(line)
        if max_value < line_max : max_value = line_max
    for i in range(len(matrix)):
        for j in range(len(matrix[0])):
            matrix[i][j]=max_value-matrix[i][j]
    return matrix

def extract_n_edges(weighted_graph,n):
    """selects the n highest weighted branches and returns the corresponding unweighted graph"""
    weights=[]
    for edge,weight in weighted_graph:
       weights.append(weight)
    weights.sort(reverse=True)
    lowest_weight=weights[n]
    result=[]
    for edge,weight in weighted_graph:
        if weight >= lowest_weight: result.append(edge)
    return result

def analyze(filename):
    """analyze data file, give a file object"""
    excluded_words=["http","the","and","for","from","how","are","about","with","all","that","what","that","get","can","your","make","you","free","does","when","day"]
    number_of_words = 60
    number_of_edges = 40
    word_freq = {}
    fobj=open(filename)
    lines=fobj.readlines()
    fobj.close()
    for line in lines: #Extract word frequencies
        line = line[:-1] #remove \n at end
        words=line.split(" ")
        for word in words:
            increment_dict(word_freq,word)
    tuples=sort_freq(word_freq)
    authorized_words=[]
    i=0;n=0
    while(n < number_of_words): #Chose interesting words
        #print "%d:\t%d: %s" % (i,tuples[i][1],tuples[i][0])
        word=tuples[i][0]
        if len(word) >= 3 and word.isalpha() and word not in excluded_words: #only consider longer words
            authorized_words.append(word)
            n+=1
        i+=1

    #print len(tuples) #number of distinct words

    #Analyze word links
    word_links = extract_links(lines,authorized_words)
    #print word_links
    links=sort_freq(word_links)
    #print links
    #print links[:1000]
#    matrix=tuples2graph(links,authorized_words)
#    #print_matrix(matrix)
#    import prim_graph
#    tree = prim_graph.easyprim(opposite_matrix(matrix))
#    labeled_tree = label_graph(tree,authorized_words)
    import pydot
    #clean_freq_data(word_links,50)
    g=pydot.graph_from_edges( extract_n_edges(links,number_of_edges) )
    g.write_jpeg("aol_word_tree.jpg")
if __name__ == '__main__':

    if len(sys.argv) != 2:
        raise "Usage : extractdata.py filename"
    analyze(sys.argv[1])

Comments

YOUR IMAGES DONT WORK

They do, it's just very slow. Anyone know of better image hosting services?

Looks like the AOL users are coming here to try and find out just how bad this breach of their privacy is.

  • Thorsten, Privat.is

Just to clarify for AOL users reading my post, I do not mean Joel's posting of this analysis of the data, but I was referring to the breach of your privacy by AOL itself.

  • Thorsten, Privat.is

I prefer photobucket as a host.

thanks

Very cool, combining this with some data crunching using other python analytic tools (Matlplotlib, SciPy) would be interesting. There is a nice bundle here: http://code.enthought.com/enthon/

If you do anything, please tell me about it! joelthelion {at} laposte.net

Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..

http://data.aolsearchlogs.com/log/random.cgi

Very interesting. I've just added you to my AOL Search Data resource page: http://sergiorebelo.com/twodotfive/?page_id=25

thanks :)

How do spiders always get in my bath?

stinky

Hello! Good Site! Thanks you! itlpmdhkav

<a href='Police> Arrest Bronx Man in Killing of 2 New York Times</a>

<a href='China> Warns Banks on Lending Wall Street Journal</a>

<a href='BBC> head of audiences joins Clearcast Guardian Unlimited</a>

<a href='vod:> Five left on the sidelines Guardian Unlimited</a>

<a href='Holmer> says senior dialogue with China must be 'more than talking... Forbes</a>

<a href='BB&T> to host free payments webinar for business owners</a>

<a href='garden> gift ideas for glam gals</a>

Everybody is very recommend to visit the portal interesting sites: http://farise.cn

Hi! I found a nice online pharmacy! Viagra 10 pills x 25 mg

Add a comment

you're not logged in