Randow stuff
let's build something beautiful

A First Look At the AOL data

Amid all the controversy, the AOL data is still a unique source of very interesting information. Here is the account of my first explorations.

Exploiting the data

In order to see anything useful in a huge pile of data like that (around 500MB compressed), it is necessary to search for repetitive patterns. For my first exploration, I decided to look at the relationships between frequent search keywords. Using Python and the graph package pydot (graphviz), it is quite simple to plot some interesting relations.

The algorithm selects some of the most popular keywords, excluding uninteresting ones like "the", "for" or "and". It then constructs a weighted graph whose edges denote a link between two keywords. There is a link between two keywords if they were searched together. For example, if a user searched for "new york", there is a link between "new" and "york". The weight is the number of times the keywords were searched together. I then select only a number of the highest-weighted edges, and draw the graph with pydot.

Source code is included at the end of this post. I hope people will play with it and post other results. Here is the kind of result you can obtain:

This is still rather raw and hard to exploit. Fortunately, it is possible to vary the number of words and edges. By playing a little with the parameters, it is possible to cluster the data. I found that 90% of searches fall into three categories: local US search (entertainment, real estate, universities, etc.), porn, and music. Here is a closer look at all three clusters:

US related stuff

Porn cluster

Music and lyrics cluster

Central words

When you explore the data a little more, it becomes apparent that some words play the role of anchors: "county", "lyrics", or "new". The other words combine with them to specify their meaning: "one (more) time lyrics", "new york", "county court house".
Here is a really huge map illustrating these concepts (be patient, it will take a little while to load; imageshack is really slow):

Source code

The code assumes you pass it a text file in the following format:
It's very easy to obtain a file in that format from the AOL text files with standard Unix tools like sed, grep, and cut.
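What follows is a minimal sketch of the keyword-graph algorithm described earlier in the post, not the original script. It assumes a plain text file with one search query per line. That is an assumption, since the exact input format is not shown here. It uses only the standard library and writes Graphviz DOT text instead of calling pydot. The function names and the parameters `n_words` and `n_edges` are hypothetical, and the stop list contains only the three words named in the post.

```python
import sys
from collections import Counter
from itertools import combinations

# Only the three stop words named in the post; a real run would need more.
STOPWORDS = {"the", "for", "and"}


def top_keywords(queries, n_words):
    """Return the n_words most frequent words, skipping stop words."""
    counts = Counter()
    for query in queries:
        counts.update(w for w in query.lower().split() if w not in STOPWORDS)
    return {word for word, _ in counts.most_common(n_words)}


def cooccurrence_edges(queries, keywords, n_edges):
    """Count how often two keywords appear in the same query.

    Returns the n_edges heaviest edges as ((word_a, word_b), weight) pairs.
    """
    weights = Counter()
    for query in queries:
        # A set counts each keyword once per query; sorting makes
        # ("new", "york") and ("york", "new") the same edge.
        words = sorted({w for w in query.lower().split() if w in keywords})
        for a, b in combinations(words, 2):
            weights[(a, b)] += 1
    return weights.most_common(n_edges)


def to_dot(edges):
    """Render the weighted edges as an undirected Graphviz DOT graph."""
    def esc(word):
        return word.replace("\\", "\\\\").replace('"', '\\"')

    lines = ["graph keywords {"]
    for (a, b), weight in edges:
        lines.append(f'  "{esc(a)}" -- "{esc(b)}" [label="{weight}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed usage: python keyword_graph.py queries.txt > graph.dot
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        queries = [line.strip() for line in handle if line.strip()]
    keywords = top_keywords(queries, n_words=100)
    print(to_dot(cooccurrence_edges(queries, keywords, n_edges=150)))
```

The DOT output can then be rendered with Graphviz, for example `dot -Tpng graph.dot -o graph.png`. Varying `n_words` and `n_edges` corresponds to the parameter tuning mentioned in the post. The defaults of 100 and 150 are arbitrary placeholders.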
last updated 2 years ago

Comments

Looks like the AOL users are coming here to try and find out just how bad this breach of their privacy is.
Just to clarify for AOL users reading my post, I do not mean Joel's posting of this analysis of the data, but I was referring to the breach of your privacy by AOL itself.
Very cool, combining this with some data crunching using other Python analytic tools (Matplotlib, SciPy) would be interesting. There is a nice bundle here: http://code.enthought.com/enthon/

Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..

Very interesting. I've just added you to my AOL Search Data resource page: http://sergiorebelo.com/twodotfive/?page_id=25
YOUR IMAGES DONT WORK
2 years ago