If you’ve ever used Limewire, Bearshare, morph500, Frostwire, or even eTomi, you’ve been on the Gnutella Network. If you used any of these services last week, chances are I found you while I was crawling it.
Before I discuss my project, I’d like to give a little background on the Gnutella Network. In its current state, the Gnutella Network is a large, diverse peer-to-peer filesharing network, the kind you hear about whenever the RIAA goes after people for illegal music downloads.
The network and service themselves aren’t illegal; what’s illegal is distributing a copyrighted file to a person or a group of people without permission from the copyright holder. I’m not here to debate one side or the other; that’s just how it is right now.
The way the Gnutella Network functions is that all users are divided into two categories: Ultrapeers and Leaves. The picture below illustrates what this means.
Ultrapeers are like always-on servers. They are reliable users who are willing to share some of their bandwidth. Leaves are what most users are: unreliable users who pop in and out of the network and generally don’t share much (if any) bandwidth. Leaves connect to Ultrapeers and request files. Each Ultrapeer tracks what files (or chunks of files) the leaves it’s connected to have, as well as what requests are coming in. If no other leaf connected to an Ultrapeer has the file a user requests, the Ultrapeer contacts its neighboring Ultrapeers, who may or may not have the file. That’s the quick and dirty of it.
My project was to build a bot that would connect to a cache of Ultrapeers (lists of caches are available at sites like Jon Atkins’ [ed: no longer maintained]). Using this initial list of Ultrapeers to bootstrap the search, it would begin a breadth-first search of the network: connect to an Ultrapeer, request a list of its neighboring Ultrapeers, put those neighboring Ultrapeers in a list to contact later, and so forth. It also requested the list of leaves attached to each Ultrapeer, but I did not attempt to connect to those; I merely recorded them for later processing.
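The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not my actual crawler: `fetch_neighbors` is a hypothetical callable standing in for the real Gnutella handshake, assumed to return a pair `(ultrapeers, leaves)` for a given peer and to raise `OSError` when the peer can't be reached.

```python
from collections import deque


def crawl(seed_peers, fetch_neighbors, max_contacts=250_000):
    """Breadth-first crawl of Ultrapeers starting from a bootstrap list.

    fetch_neighbors(peer) is a hypothetical stand-in for the network code.
    It should return (ultrapeers, leaves) or raise OSError if unreachable.
    """
    queue = deque(seed_peers)   # Ultrapeers waiting to be contacted (FIFO)
    seen = set(seed_peers)      # every Ultrapeer ever queued, to avoid repeats
    connected = []              # Ultrapeers that actually answered
    leaves = set()              # leaves are recorded but never contacted
    contacted = 0

    while queue and contacted < max_contacts:
        peer = queue.popleft()
        contacted += 1
        try:
            neighbors, peer_leaves = fetch_neighbors(peer)
        except OSError:
            continue            # unreachable peer: skip it and move on
        connected.append(peer)
        leaves.update(peer_leaves)
        for neighbor in neighbors:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)

    return connected, seen, leaves
```

Using a FIFO queue is what makes this breadth-first: every Ultrapeer from the bootstrap list is contacted before any of the neighbors they report.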
For my crawl, I attempted to contact 250,000 Ultrapeers but successfully connected to only about 52,000, mostly due to my university’s NAT. (EDIT: I have since discovered that the university was actively throttling peer-to-peer packets at the time, which explains why it took ~48 hours to connect to as many peers as I did.) From these Ultrapeers, I gathered a list of over 2.7 million peers, Ultrapeers and leaves alike.
The project’s task was to build a distribution of Gnutella Network users by the service they used to access the network (their ‘User Agent’), the domain hosting their internet access, and the country the user was from. The User Agent came back in the responses from contacted Ultrapeers, but to find domains and countries, I had to perform a reverse DNS lookup, which works like this:
A normal DNS lookup translates a domain (e.g. “google.com”) into an IP address.
A reverse DNS lookup goes the other way, i.e. it translates that IP address back into a domain like “google.com”.
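Here is a minimal sketch of that lookup using Python's standard `socket` module. The two helper functions are assumptions for illustration only: `naive_domain` simply keeps the last two labels of a hostname (which gets names like `example.co.uk` wrong), and `country_hint` treats a two-letter final label as a country code. I'm not claiming this is exactly how my report derived domains and countries.

```python
import socket


def reverse_lookup(ip):
    """Return the hostname registered for an IP address, or None if there isn't one."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None  # no reverse (PTR) record, or the lookup failed


def naive_domain(hostname):
    """Keep the last two labels of a hostname (assumption: good enough for a rough tally)."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def country_hint(hostname):
    """Return a two-letter final label as a country hint, else None."""
    tld = hostname.rstrip(".").lower().rsplit(".", 1)[-1]
    return tld if len(tld) == 2 and tld.isalpha() else None
```

Many peer IPs have no reverse record at all, so the `None` case has to be counted as its own bucket rather than dropped.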
I’ll spare you some of the specific details and jump right into the pretty pictures (taken directly from my report).
Here’s a list of the top domains found:
And here’s a comparison of the countries users are connecting from:
As you can see, Limewire pretty much dominates with over 95% of all User Agents. Whatever happened to BearShare? Oh yeah.
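A tally like this one can be sketched as below. It assumes each contacted Ultrapeer replies with a Gnutella 0.6 style handshake, where HTTP-like headers follow a status line and one of them is `User-Agent`. The version strings in the usage example are made up for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_headers(response):
    """Parse 'Name: value' header lines that follow the status line of a handshake."""
    headers = {}
    for line in response.split("\r\n")[1:]:
        if not line:
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def client_name(user_agent):
    """Strip the version: 'LimeWire/4.12.6' -> 'limewire'."""
    return re.split(r"[/\s]", user_agent.strip(), maxsplit=1)[0].lower()


def tally(user_agents):
    """Return each client's share of all non-empty User-Agent strings."""
    counts = Counter(client_name(ua) for ua in user_agents if ua)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}
```

For example, `tally(["LimeWire/4.12.6", "LimeWire/4.10.0", "LimeWire/4.12.6", "BearShare 5.1"])` puts `limewire` at 0.75 and `bearshare` at 0.25. Skipping the call to `client_name` would give a per-version breakdown instead.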
Because there were so many versions of Limewire lumped in together in that pie chart, I made another pie chart showing what versions of Limewire people were using:
Some other notes:
- The project is multithreaded to make simultaneous connections and reduce downtime between connections.
- My crummy, crummy laptop took over 100 hours to perform the crawl. A quad-core desktop with much better specs could perform it in an hour or less.
- Parsing is by far the most tedious part of something like this. You need to follow the Gnutella protocol to send and receive messages, and then parse, parse, parse to get the info you want out of the messages you receive. I can see now why regular expressions were invented.
- If you’re super-interested in what exact domains were found in the search, here you go: