
I disagree. Twitter is pushing upwards of 3000 tweets per second [1]; it'd be nearly impossible to scrape at that rate and go unnoticed, regardless of how many IPs you've got at your disposal.

[1] http://mashable.com/2010/06/25/tps-record/



You don't need to scrape for each tweet. Scrape every active user every couple of hours or days, depending on their tweet frequency. There's not much point in having a database that's up to date every second, since processing it in real time is almost impossible anyway.

Twitter would notice a major scraping operation, but if it's done correctly they wouldn't be able to distinguish between user IPs and bot IPs.

edit: Barracuda already did more than 10% of users just for a white paper: http://www.barracudanetworks.com/ns/news_and_events/index.ph...

150,000,000 registered users takes only about 170 days at 10 users a second for a first pass. Focus on frequent tweeters for subsequent scrapes. Even among the ~20% of Twitter accounts that are active, most don't need to be scraped daily, and the most active accounts are likely spam.
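A quick sketch of that first-pass estimate, using the figures assumed in this thread (150M registered users, 10 profile scrapes per second):

```python
# Back-of-envelope first-pass scrape time. Both inputs are the
# assumptions from the comment above, not measured numbers.
registered_users = 150_000_000
scrapes_per_second = 10

seconds = registered_users / scrapes_per_second
days = seconds / 86_400  # seconds in a day

print(f"first pass: {days:.0f} days")  # prints "first pass: 174 days"
```

So the exact figure is ~174 days, which the comment rounds to 170.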


> Scrape every active user

translates to "Scrape every user" unless you know of some magical way to get a list of "active" users.

Guess how many active users there are. Guess how many servers you'd need running to get through those in an hour. My guess is well over $360,000 worth per year.


> unless you know of some magical way to get list of "active" users

look for users who tweeted in last X days? also look for their repliers+buddies, since they too are likely to be active. doesn't seem hugely complicated to me?


> look for users who tweeted in last X days?

That requires you to check (i.e., scrape) every user to see which ones tweeted in the last X days.


You may be able to get away with scraping a portion of the site, but I still would find it very hard to believe you'd be able to scrape even 50%.

Let's do some math, these are all based on numbers from this past June which have likely only gone up since then [1]:

  65 million tweets per day / 20 tweets per page = 3.25 million page views per day
Just to keep up with the stream, you'd need to do about 3.25 million page views per day or a little over 1.5 million to get half of it. Again, I'd find it very hard to believe that nobody at Twitter would catch on.

[1] http://techcrunch.com/2010/06/08/twitter-190-million-users/
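The arithmetic above can be checked directly (figures from the thread: 65M tweets/day, 20 tweets per timeline page):

```python
# Page views needed to keep up with the tweet stream, per the
# comment's assumptions (65M tweets/day, 20 tweets per page).
tweets_per_day = 65_000_000
tweets_per_page = 20

pages_full = tweets_per_day / tweets_per_page  # full coverage
pages_half = pages_full / 2                    # 50% coverage

print(f"{pages_full:,.0f} pages/day full, {pages_half:,.0f} for half")
# prints "3,250,000 pages/day full, 1,625,000 for half"
```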


They would catch on to the fact that they were being scraped, but they wouldn't be able to identify which 3.25 million page views were scrapes.



