Crawling the Gnutella Network

If you’ve ever used LimeWire, BearShare, morph500, FrostWire, or even eTomi, you’ve been on the Gnutella Network. If you used any of these services last week, chances are I found you while I was crawling it.

Before I discuss my project, I’d like to give a little background on the Gnutella Network. In its current state, the Gnutella Network is a diverse, expansive peer-to-peer filesharing network, the kind you always hear about the RIAA busting people for using to download music illegally.

The network and the service itself aren’t illegal; it only becomes illegal when one person distributes a copyrighted file to a person or group of people without permission from the copyright holder. I’m not here to debate one side or the other; that’s just how it is right now.

The Gnutella Network divides all users into two categories: Ultrapeers and Leaves. The picture below illustrates the arrangement.

[Figure: the division of the network into Ultrapeers and Leaves]

Ultrapeers act like always-on servers: reliable users who are willing to share some of their bandwidth. Leaves are what most users are: unreliable nodes that pop in and out of the network and generally don’t share much (if any) bandwidth. Leaves connect to Ultrapeers and request files. An Ultrapeer tracks which files (or chunks of files) its connected leaves have, as well as which requests are coming in. If no leaf connected to an Ultrapeer has the file a user requests, the Ultrapeer forwards the request to its neighboring Ultrapeers, which may or may not have it. That’s the quick and dirty of it.
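To make that routing concrete, here’s a toy model in Python. This isn’t the real wire protocol (Gnutella floods TTL-limited query messages between nodes); the class and method names below are purely illustrative of the Ultrapeer/Leaf division described above:

    # Toy model of the Ultrapeer/Leaf split: leaves hold files, ultrapeers
    # index their leaves and flood unanswered queries to neighboring
    # ultrapeers until a hop limit (TTL) runs out.

    class Leaf:
        def __init__(self, files):
            self.files = set(files)

    class Ultrapeer:
        def __init__(self):
            self.leaves = []       # connected leaves
            self.neighbors = []    # neighboring ultrapeers

        def query(self, filename, ttl=3):
            # Check our own leaves' shared files first...
            for leaf in self.leaves:
                if filename in leaf.files:
                    return leaf
            # ...then forward the query to neighbors until the TTL expires.
            if ttl > 0:
                for neighbor in self.neighbors:
                    hit = neighbor.query(filename, ttl - 1)
                    if hit is not None:
                        return hit
            return None

    # Two ultrapeers, one leaf each; a query on up1 is answered via up2.
    up1, up2 = Ultrapeer(), Ultrapeer()
    up1.neighbors.append(up2)
    up1.leaves.append(Leaf({"song_a.mp3"}))
    up2.leaves.append(Leaf({"song_b.mp3"}))
    print(up1.query("song_b.mp3") is not None)   # True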

My project was to build a bot that would connect to a cache of Ultrapeers (lists of caches are available at sites like Jon Atkins’ [ed: no longer maintained]). Using this initial list of Ultrapeers to bootstrap the search, the bot performs a breadth-first search of the network: it connects to an Ultrapeer, requests a list of neighboring Ultrapeers, puts those neighbors in a queue to contact later, and so on. It also requests the list of leaves attached to each Ultrapeer, but I didn’t attempt to connect to those; I merely recorded them for later processing.
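Here’s a rough sketch of that crawl loop in Python. It assumes the LimeWire-style crawler handshake extension, where sending a “Crawler: 0.1” header prompts an Ultrapeer to answer with its neighbors and leaves in “Peers:” and “Leaves:” headers; exact header names and behavior vary by client, so treat this as an outline rather than a drop-in crawler:

    import socket
    from collections import deque

    # Handshake asking a LimeWire-style ultrapeer for its neighbor lists.
    HANDSHAKE = (
        "GNUTELLA CONNECT/0.6\r\n"
        "User-Agent: GnutellaCrawler/0.1\r\n"   # hypothetical crawler name
        "Crawler: 0.1\r\n"
        "\r\n"
    )

    def crawl_one(host, port, timeout=10):
        """Handshake with one ultrapeer; return (user_agent, peers, leaves)."""
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(HANDSHAKE.encode("ascii"))
            raw = b""
            while b"\r\n\r\n" not in raw:       # read until the headers end
                chunk = sock.recv(4096)
                if not chunk:
                    break
                raw += chunk
        headers = {}
        for line in raw.decode("ascii", "replace").split("\r\n")[1:]:
            key, sep, value = line.partition(":")
            if sep:
                headers[key.strip().lower()] = value.strip()
        listify = lambda v: [p.strip() for p in v.split(",") if p.strip()]
        return (headers.get("user-agent", "unknown"),
                listify(headers.get("peers", "")),
                listify(headers.get("leaves", "")))

    def bfs_crawl(seeds, limit=250_000):
        """Breadth-first crawl from bootstrap ultrapeers given as 'host:port'."""
        queue, seen, results = deque(seeds), set(seeds), {}
        while queue and len(seen) < limit:
            addr = queue.popleft()
            host, _, port = addr.rpartition(":")
            try:
                agent, peers, leaves = crawl_one(host, int(port))
            except (OSError, ValueError):
                continue                        # unreachable or malformed; skip
            results[addr] = (agent, leaves)     # record leaves, don't visit them
            for peer in peers:                  # only ultrapeers join the queue
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        return results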

For my crawl, I attempted to contact 250,000 Ultrapeers, but mostly due to my university’s NAT (EDIT: I have since discovered that the university was actively throttling peer-to-peer packets at the time, which explains why it took ~48 hours to connect to as many peers as I did), I only successfully connected to about 52,000. From these Ultrapeers, I gathered a list of over 2.7 million peers, Ultrapeers and leaves alike.

The project’s task was to build a distribution of Gnutella Network users by the service they used to access the network (their ‘User-Agent’), the domain hosting their internet access, and the country they were connecting from. The User-Agent came back in the responses from contacted Ultrapeers, but to find domains and countries I had to perform a reverse DNS lookup, which works like this:

A normal DNS lookup translates a domain name (e.g. “google.com”) into an IP address (72.14.204.103).

A reverse DNS lookup goes the other way, i.e. it translates “72.14.204.103” back into “google.com”.
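In Python, the reverse lookup is one standard-library call; the domain/TLD split below is a naive suffix heuristic I’m using for illustration, not necessarily how the report’s numbers were produced:

    import socket

    def reverse_dns(ip):
        """Return the hostname for an IP address, or None if no PTR record exists."""
        try:
            hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
            return hostname
        except OSError:
            return None

    def domain_and_tld(hostname):
        """Naive split: last label is the TLD (or country code), last two the domain."""
        labels = hostname.rstrip(".").split(".")
        return ".".join(labels[-2:]), labels[-1]

    host = reverse_dns("72.14.204.103")
    if host:
        print(host, domain_and_tld(host))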

I’ll spare you some of the specific details and jump right into the pretty pictures (taken directly from my report).

Here’s a list of the top domains found:

And here’s a comparison of the countries users are connecting from:

And because I sort of cheated and lumped the generic TLDs (.com, .net, etc.) into USA, I also built a graph for just those TLDs:

Next up, we have our User-Agents, or what software the users were using to connect to the network:

As you can see, LimeWire pretty much dominates, accounting for over 95% of all User-Agents. Whatever happened to BearShare? Oh yeah.

Because so many versions of LimeWire were lumped together in that pie chart, I made another pie chart showing which versions of LimeWire people were using:

As of this writing, LimeWire 5.4.6 is the most recent release. Most users seem to stay pretty up-to-date on their installations. And that’s pretty much it!

Some other notes:

  • The project is multithreaded to make simultaneous connections and reduce downtime between connections (see the thread-pool sketch after this list).
  • My crummy, crummy laptop took over 100 hours to perform the crawl. A quad-core desktop with much better specs could perform it in an hour or less.
  • Parsing is by far the most tedious task of something like this. You need to follow the Gnutella protocol to send and receive messages, and then parse, parse, parse to get the info you want from the responses. I can see now why regular expressions were invented (a small example follows this list).
  • If you’re super-interested in what exact domains were found in the search, here you go:
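On the multithreading note above: here’s a minimal sketch of how the simultaneous connections could be structured, using Python’s standard concurrent.futures thread pool. The crawl_one function is the hypothetical per-peer handshake from the earlier crawler sketch, and the worker count is an arbitrary choice; threads suit this job because the crawl is network-bound, not CPU-bound:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def crawl_batch(addrs, workers=64):
        """Handshake with a batch of 'host:port' ultrapeers in parallel."""
        results = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {}
            for addr in addrs:
                host, _, port = addr.rpartition(":")
                futures[pool.submit(crawl_one, host, int(port))] = addr
            for fut in as_completed(futures):
                try:
                    results[futures[fut]] = fut.result()
                except OSError:
                    pass    # unreachable peer; skip it
        return results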
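And on the parsing note: a small regex example for splitting a User-Agent string into client name and version. The “Name/x.y.z” format is an assumption based on common Gnutella clients like LimeWire/4.18.8; real responses vary:

    import re

    # Matches the common "Name/major.minor[.patch]" User-Agent format.
    UA_PATTERN = re.compile(r"^(?P<name>[^/]+)/(?P<version>[\d.]+)")

    def parse_user_agent(ua):
        """Return (client_name, version), or (raw_string, None) if unparseable."""
        match = UA_PATTERN.match(ua.strip())
        if match:
            return match.group("name"), match.group("version")
        return ua, None

    print(parse_user_agent("LimeWire/4.18.8"))   # ('LimeWire', '4.18.8')
    print(parse_user_agent("BearShare 5.1.0"))   # ('BearShare 5.1.0', None)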