
I disagree. Twitter is pushing upwards of 3000 tweets per second [1]; it'd be nearly impossible to scrape at that rate and go unnoticed, regardless of how many IPs you've got at your disposal.

[1] http://mashable.com/2010/06/25/tps-record/



You don't need to scrape for each tweet. Scrape every active user every couple of hours or days, depending on their tweet frequency. There's not much point in having a database that's up to date every second, since processing it in real time is almost impossible anyway.

Twitter would notice a major scraping operation, but if it's done correctly they wouldn't be able to distinguish between user IPs and bot IPs.

edit: Barracuda already did more than 10% of users just for a white paper: http://www.barracudanetworks.com/ns/news_and_events/index.ph...

150,000,000 registered users takes only about 170 days at 10 users a second for a first pass. Focus on frequent tweeters for subsequent scrapes. Even among the ~20% of Twitter accounts that are active, most don't need to be scraped daily, and the most active accounts are likely spam.
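A quick sketch of that first-pass estimate, using the figures assumed in this thread (150M registered users, 10 profile scrapes per second):

```python
# Back-of-envelope first-pass scrape time. Both inputs are the
# assumptions from the comment above, not measured numbers.
registered_users = 150_000_000
scrapes_per_second = 10

seconds = registered_users / scrapes_per_second
days = seconds / 86_400  # seconds in a day

print(f"first pass: {days:.0f} days")  # prints "first pass: 174 days"
```

So the exact figure is ~174 days, which the comment rounds to 170.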


> Scrape every active user

translates to "Scrape every user" unless you know of some magical way to get a list of "active" users.

Guess how many active users there are. Guess how many servers you'd need running to get through those in an hour. My guess is well over $360,000 worth per year.


> unless you know of some magical way to get list of "active" users

look for users who tweeted in last X days? also look for their repliers+buddies, since they too are likely to be active. doesn't seem hugely complicated to me?


> look for users who tweeted in last X days?

That requires you to check (i.e., scrape) every user to see which ones tweeted in the last X days.


You may be able to get away with scraping a portion of the site, but I still would find it very hard to believe you'd be able to scrape even 50%.

Let's do some math, these are all based on numbers from this past June which have likely only gone up since then [1]:

  65 million tweets per day / 20 tweets per page = 3.25 million page views per day
Just to keep up with the stream, you'd need to do about 3.25 million page views per day or a little over 1.5 million to get half of it. Again, I'd find it very hard to believe that nobody at Twitter would catch on.

[1] http://techcrunch.com/2010/06/08/twitter-190-million-users/
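The arithmetic above can be checked directly (figures from the thread: 65M tweets/day, 20 tweets per timeline page):

```python
# Page views needed to keep up with the tweet stream, per the
# comment's assumptions (65M tweets/day, 20 tweets per page).
tweets_per_day = 65_000_000
tweets_per_page = 20

pages_full = tweets_per_day / tweets_per_page  # full coverage
pages_half = pages_full / 2                    # 50% coverage

print(f"{pages_full:,.0f} pages/day full, {pages_half:,.0f} for half")
# prints "3,250,000 pages/day full, 1,625,000 for half"
```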


They would catch on to the fact that they were being scraped, but they wouldn't be able to identify which 3.25 million page views were scrapes.



