For a graduate-level distributed systems project, I have written a crawler for two major file-sharing DHTs: Azureus/Vuze and Mojito (used by Limewire). I'm at a loss as to why it is failing only on EC2 and only for Mojito.
The crawling process is very similar to the crawler described in http://www.eurecom.fr/util/publidownload.en.htm?id=2495. Briefly, it works by breadth-first search: the crawler sends 8-16 find-node packets to a node it has not yet interrogated, and the response to each such packet is a list of more nodes in the network. It adds the nodes that it has not yet seen to its list of nodes to crawl, and repeats the process. The scan is limited to sending 9000 packets per second, and terminates once no packets have been sent for 30 seconds. netstat -su shows fewer than 100 UDP receive errors on the offending EC2 box.
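To make the loop concrete, here is a minimal sketch of the breadth-first crawl, not my actual code. The `find_node` callable, the `probes_per_node` parameter, and the toy network are hypothetical stand-ins for the real UDP transport. The sketch is also synchronous, so an empty queue stands in for the 30-second idle timeout:

```python
import time
from collections import deque


def crawl(bootstrap, find_node, probes_per_node=8, max_pps=9000.0, sleep=time.sleep):
    """Breadth-first crawl; returns (seen, responded) sets of nodes.

    find_node(node, probe) is a hypothetical transport hook: it sends one
    find-node packet and returns the nodes listed in the reply, or None if
    the node never answered.
    """
    bootstrap = list(bootstrap)
    seen = set(bootstrap)
    responded = set()
    queue = deque(bootstrap)
    interval = 1.0 / max_pps  # minimum spacing between packets

    while queue:  # an empty queue stands in for the 30-second idle timeout
        node = queue.popleft()
        for probe in range(probes_per_node):
            sleep(interval)  # crude pacing: at most max_pps packets per second
            reply = find_node(node, probe)
            if reply is None:
                continue  # stale entry, NATed node, dropped packet, ...
            responded.add(node)
            for peer in reply:
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
    return seen, responded


# Toy network: node -> routing table; "d" is known to others but never replies.
toy = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"]}
seen, responded = crawl(["a"], lambda node, probe: toy.get(node), sleep=lambda s: None)
print(len(seen), len(responded))  # prints: 4 3
```

The gap between `seen` and `responded` in the toy run corresponds to the gap between nodes seen and nodes responding in the real crawl.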
On a local box, the Mojito crawler sees 1,163,777 nodes and gets responses from 300,638 of them (the difference can be attributed to stale routing table entries, NATed nodes, etc.). On EC2, where I was hoping for greater network visibility, it sees only ~200,000 nodes and gets responses from only ~2,700. For Vuze, by contrast, EC2 does seem to have greater visibility.
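To put those figures side by side, here is a quick calculation of the response rate in each environment. The EC2 counts are the approximate values quoted above, so the result is only a rough estimate:

```python
# (nodes seen, nodes that responded); EC2 values are approximate.
counts = {
    "local": (1_163_777, 300_638),
    "ec2": (200_000, 2_700),
}

# Fraction of seen nodes that answered at least one find-node packet.
rates = {name: responded / seen for name, (seen, responded) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2%} of seen nodes responded")

print(f"local/EC2 response-rate ratio: {rates['local'] / rates['ec2']:.0f}x")
```

So the response rate on EC2 falls far more steeply than the number of nodes seen, which is why I suspect the replies themselves are not getting through.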
I am running on two high-CPU extra-large instances, one for Vuze and one for Mojito. Both are in us-east, but in different availability zones. The Vuze one has been working great for hours. Is there some kind of EC2 global bandwidth limit or DPI that could be whitelisting the Vuze traffic but not Mojito, or perhaps some other EC2 networking quirk? Any help would be appreciated, thanks!