Twitter to Sell 50% of All Tweets for $360,000 a Year Through Gnip

smithbits · on Nov 17, 2010

Let's see, at the 1000 tweets per second figure from the article that's 365 * 24 * 3600 * 500 = 15,768,000,000 tweets for $360,000. Or 0.002 cents per tweet. So 500ish tweets are worth about a penny, and your $360K would be for 15 gigatweets which would weigh in at about 2.2 TB[1]. So my tweets aren't worthless, they're just very very cheap.

[1] That's a hard drive manufacturer terabyte, not a real one
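The arithmetic above can be re-run as a quick sanity check. This is only a sketch: the 1000 tweets per second and the $360,000 price come from the comment, while the 140 bytes per tweet (maximum length, one byte per character, no metadata) is an assumption chosen because it reproduces the 2.2 TB figure.

```python
# Back-of-the-envelope check of the price-per-tweet estimate.
SECONDS_PER_YEAR = 365 * 24 * 3600
TWEETS_PER_SECOND = 1000   # figure quoted from the article
FEED_FRACTION = 0.5        # the 50% feed
PRICE_USD = 360_000
BYTES_PER_TWEET = 140      # assumption: 140 chars, 1 byte each, no metadata

tweets_per_year = int(SECONDS_PER_YEAR * TWEETS_PER_SECOND * FEED_FRACTION)
cents_per_tweet = PRICE_USD * 100 / tweets_per_year
tweets_per_cent = 1 / cents_per_tweet
terabytes = tweets_per_year * BYTES_PER_TWEET / 10**12  # decimal (drive-maker) TB

print(f"{tweets_per_year:,} tweets per year")
print(f"{cents_per_tweet:.4f} cents per tweet")
print(f"{tweets_per_cent:.0f} tweets per cent")
print(f"{terabytes:.1f} TB")
```

With these inputs a penny buys closer to 440 tweets than 500, which is still "500ish".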

bennysaurus · on Nov 17, 2010

Sweet, so my account is worth about a whole $1 in tweets

invisible · on Nov 17, 2010

That is $1 per sale :). Maybe your account is worth $10-$20 total.

aresant · on Nov 17, 2010

I'm trying to find my "privacy outrage" / "who actually owns your content" soapbox but having trouble.

In fact, I find it kind of refreshing that Twitter is just flat out saying all your base belong to us, and anybody that wants it gets access for $360,000 a year.

This in contrast to Google, Facebook, Yahoo etc that muddy the waters whenever it comes to how they're actually using / sharing your data.

tptacek · on Nov 17, 2010

I'm not sure how you'd manage "privacy outrage" over Twitter messages, since they're almost always public.

aresant · on Nov 17, 2010

The concept of "Privacy" means to me that the average user is aware of how their profile, their content, etc is going to be used by the host.

There's a difference between the perception of your feed being public, and Twitter selling your feed data to a corporation to utilize for targeting, advertising, reselling to employment agencies in the future long after your stupid teenage profile is deleted, whatever.

When a solid 10% of the users are kids (and a much much larger percentage is entirely clueless) it's worth questioning and the people that do know what's actually going on have a responsibility to ask questions.

That's my best attempt to get fired up about privacy on twitter, you forced my hand - oh won't you please think of the children?

Ref twitter age - http://royal.pingdom.com/2010/02/16/study-ages-of-social-net...

tptacek · on Nov 17, 2010

I think you have the concept of "Privacy" confused with the concept of "Transparency". And, my point is, you specifically don't have privacy on Twitter.

andreyf · on Nov 17, 2010

I'd rather not get into a discussion of what "privacy" means, but Twitter (and Facebook) are almost certainly violating most of their users' perceptions of what Twitter does with their tweets. That said, if they anonymize the tweets and hold the buyers contractually obligated not to de-anonymize them, I think most users would be OK with that. Each user owns their tweets, as does Twitter, but only Twitter owns all of the tweets.

tptacek · on Nov 17, 2010

It's a broadcast medium. People literally keep score with each other about how many people they can get to follow them, and how many people they can get to RT their messages. I think you're simply dead wrong about this.

pyre · on Nov 18, 2010

1. Reality and perceptions don't always collide. Like the people that post messages on Facebook about calling into work sick when they aren't... only to have their boss read the message and fire them over it.

2. What about the people whose Twitter accounts are private?

3. The people 'racing' with each other for followers or retweets are by definition more public than most other people. Unless you are going to claim that all or most Twitter users fall into that category. Trying to use them to categorize the user base of Twitter as a whole seems a bit off.

4. Twitter is a broadcast medium, but what we are talking about are the perceptions of the people using it, not the reality of the situation. There are plenty of people that broadcast stuff publicly that they wouldn't want their parents to read. Why would they do so? "My parents aren't on Twitter." I'm sure the same thing applies to bosses and the workplace.

timf · on Nov 17, 2010

In what sense do you think each user "owns" their tweets?

In most cases it's not at all, here's a good page on the topic: http://www.canyoucopyrightatweet.com

cbo · on Nov 17, 2010

Twitter has no privacy model, so there can't be any privacy outrage. All tweets are public, it's just a matter of whether or not they show up on your feed.

And if Twitter wants to make an extra (maybe morally gray) dime off of any "privacy outrage", they can offer certain users the option to pay a fee to have their tweets NOT included in these dumps.

chunkbot · on Nov 17, 2010

Twitter is public by design, so it makes a lot more sense.

By comparison, Google's data of my searches and Facebook's data of me and my friends is much more intimate than Twitter's database of my tweets.

Ryan Singer nails the distinction here:

http://37signals.com/svn/posts/2618-twitters-ux-separate-the...

"Public by default is better than public-by-surprise."

protomyth · on Nov 17, 2010

You might have some outrage over DMs or those private accounts, but Twitter is such a public yelling-in-the-town-square model that it is hard to get mad.

aheilbut · on Nov 17, 2010

We really need (and will inevitably get) an open, distributed protocol for status updates. It's insane for everything to be routed through one (or 2, or 3) companies.

jackolas · on Nov 17, 2010

Well, there's StatusNet: http://status.net/ . I assume it would be trivial to sync using PubSubHubbub or Wave-style XMPP extensions.

Edit: wiki says they already support this: "Supports Federation, which provides the ability to subscribe to notices by users on a remote service through the OpenMicroBlogging protocol."

mmavnn · on Nov 18, 2010

http://identi.ca/

Federated twitter clone, powered by status.net

seanalltogether · on Nov 17, 2010

And when we get it no one will use it.

rottendevice · on Nov 17, 2010

Who would host the servers though? A standards organization? I doubt they could afford it.

That, or it will be like email, where everyone has @example.com appended to their username.

Neither situation strikes me as desirable.

chunkbot · on Nov 17, 2010

In the future, it's quite likely that a standards organization could afford it.

Processor speed, disk capacity, network bandwidth, and available software are all growing much more rapidly than online populations.

In some years' time I might be able to run an operation like Google, Facebook, or Twitter from my bedroom.

qq66 · on Nov 18, 2010

There are theoretical lower bounds on the amount of heat generated by the kind of information processing that Google does, and this heat needs to come from energy sources, which are growing more scarce as software/bandwidth/disk capacity/CPU speed grows. Google is the largest non-manufacturing electricity buyer in the world and would never fit in any bedroom :)

eru · on Nov 17, 2010

The current Google perhaps, but not the Google of the future.

chunkbot · on Nov 23, 2010

I'm not sure I want the Google of the future.

bluelu · on Nov 17, 2010

1000 tweets per second (maybe 1 KB each with metadata) are just 100 KB/s of traffic (if compressed). You could stream this from one of the smaller EC2 instances.
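A rough sketch of that bandwidth estimate follows. The 1000 tweets per second and the 1 KB per tweet are the comment's figures. The 10x compression ratio is an assumption, inferred only because it is what turns 1 MB/s of raw data into the quoted 100 KB/s.

```python
# Rough bandwidth estimate for streaming the tweet feed.
TWEETS_PER_SECOND = 1000
BYTES_PER_TWEET = 1000    # "maybe with metadata 1 KB"
COMPRESSION_RATIO = 10    # assumption implied by the 100 KB/s figure

raw_kb_per_s = TWEETS_PER_SECOND * BYTES_PER_TWEET / 1000
compressed_kb_per_s = raw_kb_per_s / COMPRESSION_RATIO
gb_per_day = compressed_kb_per_s * 86_400 / 10**6  # KB/s -> GB/day

print(f"raw: {raw_kb_per_s:.0f} KB/s")
print(f"compressed: {compressed_kb_per_s:.0f} KB/s")
print(f"per day: {gb_per_day:.2f} GB")
```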

aheilbut · on Nov 17, 2010

I think it will be a lot like email - what is undesirable about appending a domain? That's how the internet works.

alextgordon · on Nov 17, 2010

You mean, with all the spam?

wmf · on Nov 17, 2010

Fortunately tweeting has the concept/expectation of following (whitelisting) and email doesn't.

alextgordon · on Nov 17, 2010

Only if you've protected your profile/using private messages. I regularly get @replied by people who I don't follow, and I don't particularly want to lose those messages, so it's probably not a realistic way of stopping spam (although it would certainly be effective).

arnabdotorg · on Nov 17, 2010

The key value is not as much in the data itself, as much as the _timeliness_ of the data. Access to the halfhose allows you to answer a _very_ valuable question:

"What's happening right now?"

This question is worth a lot of money, and it's something that doesn't have a good algorithmic solution (e.g. Google News). Twitter is probably the only company that has a privacy-compliant solution to this, hence making it a very monetizable product.

djb_hackernews · on Nov 18, 2010

But Twitter already provides an API for that. You can ask what are the current top 10 trending topics at any time.

harscoat · on Nov 17, 2010

All these platforms ... they all realize that the data they have is extremely valuable to everyone from API partners to marketers ... I think all these companies could see that there's more money in data services than there could be for them in advertising.

The value would be back in the API and not in weird sponsored trending topics. Twitter seems to be going back towards Alex Payne's vision (data hose platform) and away from Biz Stone's (Twitter as a media outlet with celebrities etc.). They could also set up separate Twitter hoses: for instance, there could be automatic sensor data input for the Internet of Things, separated from human input. Any links where Twitter people talk about this?

DumbledoreSnipe · on Nov 17, 2010

What are the technical measures, if any, to prevent people from just scraping 100% of the tweets?

dasil003 · on Nov 17, 2010

Simple rate-limiting / DOS prevention would be plenty given the volume of tweets.
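For illustration, per-client rate limiting is often done with a token bucket. The sketch below is a generic, hypothetical example keyed on client IP. It is not a description of what Twitter actually runs, and the names `TokenBucket` and `allow_request`, as well as the rate and capacity values, are invented here.

```python
# Minimal per-IP token bucket (hypothetical sketch, not Twitter's mechanism).
class TokenBucket:
    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of the last update

    def allow(self, now):
        # Refill for the elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # client IP -> TokenBucket


def allow_request(client_ip, now, rate=1.0, capacity=5):
    """Return True if this client may make a request at time `now` (seconds)."""
    bucket = buckets.setdefault(client_ip, TokenBucket(rate, capacity, now))
    return bucket.allow(now)


# One client bursting 7 requests at the same instant.
print([allow_request("203.0.113.7", 0.0) for _ in range(7)])
```

Because the limit is tracked per IP, a scraper that spreads requests over many addresses gets a fresh bucket for each one.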

ahi · on Nov 17, 2010

Not that hard to get around. IPs are easy to come by.

mrduncan · on Nov 17, 2010

I disagree. Twitter is pushing upwards of 3000 tweets per second [1], it'd be nearly impossible to scrape at that rate and go without notice, regardless of how many IPs you've got at your disposal.

[1] http://mashable.com/2010/06/25/tps-record/

ahi · on Nov 17, 2010

You don't need to scrape for each tweet. Scrape every active user every couple hours/days depending upon their tweet frequency. Not much point in having a database up to date every second since processing it in real time is almost impossible anyway.

Twitter would notice a major scraping operation, but if it's done correctly they wouldn't be able to distinguish between user IPs and bot IPs.

edit: Barracuda already did more than 10% of users just for a white paper: http://www.barracudanetworks.com/ns/news_and_events/index.ph...

150,000,000 registered users only takes 170 days at 10 users a second for a first pass. Focus on frequent tweeters for subsequent scrapes. Even among the ~20% of twitter accounts that are active, most don't need to be scraped daily, and the most active accounts are likely spam.
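The first-pass estimate can be checked with a few lines. The user count, the 10 users per second rate, and the ~20% active share are taken from the comment. Treating one user as one request and ignoring pagination is a simplifying assumption.

```python
# How long a first scraping pass over all registered users would take.
REGISTERED_USERS = 150_000_000
USERS_PER_SECOND = 10
ACTIVE_FRACTION = 0.20     # "~20% of twitter accounts that are active"
SECONDS_PER_DAY = 86_400

first_pass_days = REGISTERED_USERS / USERS_PER_SECOND / SECONDS_PER_DAY
active_users = round(REGISTERED_USERS * ACTIVE_FRACTION)
active_pass_days = active_users / USERS_PER_SECOND / SECONDS_PER_DAY

print(f"first pass over all users: {first_pass_days:.0f} days")
print(f"active users: {active_users:,}")
print(f"one pass over active users only: {active_pass_days:.1f} days")
```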

njharman · on Nov 17, 2010

> Scrape every active user

translates to "Scrape every user" unless you know of some magical way to get list of "active" users.

Guess how many active users there are? Guess how many servers you'd need running to get through those in 1 hour. My guess is something much more than $360,000 worth / yr.

borism · on Nov 18, 2010

> unless you know of some magical way to get list of "active" users

look for users who tweeted in last X days? also look for their repliers+buddies, since they too are likely to be active. doesn't seem hugely complicated to me?

njharman · on Nov 29, 2010

> look for users who tweeted in last X days?

Requires you to check aka scrape every user to see which ones tweeted in last X days.

mrduncan · on Nov 17, 2010

You may be able to get away with scraping a portion of the site, but I still would find it very hard to believe you'd be able to scrape even 50%.

Let's do some math, these are all based on numbers from this past June which have likely only gone up since then [1]:

  65 million tweets per day / 20 tweets per page = 3.25 million page views per day

Just to keep up with the stream, you'd need to do about 3.25 million page views per day or a little over 1.5 million to get half of it. Again, I'd find it very hard to believe that nobody at Twitter would catch on.

[1] http://techcrunch.com/2010/06/08/twitter-190-million-users/
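The page-view math can be sketched as follows, using the June figures from the comment. It assumes the best case for a scraper, where every fetched page contains 20 tweets not seen before. The per-second rate is an extra derived number, not something stated in the thread.

```python
# Page views needed to keep up with the public tweet stream.
TWEETS_PER_DAY = 65_000_000
TWEETS_PER_PAGE = 20
SECONDS_PER_DAY = 86_400

pages_full = TWEETS_PER_DAY / TWEETS_PER_PAGE   # 100% of the stream
pages_half = pages_full / 2                     # 50% of the stream
requests_per_second = pages_full / SECONDS_PER_DAY

print(f"full stream: {pages_full:,.0f} page views per day")
print(f"half stream: {pages_half:,.0f} page views per day")
print(f"sustained rate for full stream: {requests_per_second:.1f} requests/s")
```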

ahi · on Nov 17, 2010

They would catch on to the fact that they were being scraped, but they wouldn't be able to identify which 3.25 million were scrapes.

irons · on Nov 17, 2010

How would you even identify every scrapeable endpoint? Follow all the twitter users? I don't know if pre-Snowflake status IDs were reliably consecutive, but the current ones definitely aren't.

LiveTheDream · on Nov 17, 2010

Does this include private tweets and DMs?

irons · on Nov 17, 2010

In the past, firehose access has meant public tweets only. I presume Gnip is getting the same data as the Library of Congress and other firehose consumers, but details haven't been spelled out yet.

blhack · on Nov 17, 2010

I would imagine that it won't include private stuff or DMs. My guess is that, even if there is an "all your base" (as somebody gracefully put it above) clause in the ToS, private tweets and direct messages carry a reasonable expectation of privacy.

djb_hackernews · on Nov 18, 2010

I work for a media aggregator, and we do a lot of different media (microblogs like twitter included). We index close to 100M documents a month. We also charge a fraction of 360k/yr. I'd like to know who the target market is?

Is anyone here genuinely ready to spend that kind of cash for 50% of Twitter?

robryan · on Nov 18, 2010

I think a distinction needs to be made, I agree that a company using this many of the Tweets as part of a commercial effort should pay.

I also think that a research effort with the ability to process 100% of Tweets can most likely afford to pay. With something like the 5% feed, though, I think a great deal of research that could be done on the Twitter platform may be prevented because of the cost. For research, maybe you could charge for the bandwidth required to deliver the stream; I have no idea what kind of ballpark this would be in.

neilc · on Nov 18, 2010

> I also think that a research effort with the ability to process 100% of Tweets can most likely afford to pay

Processing 100% of all tweets is not actually very hard or expensive: 2000 messages/sec is small potatoes if you want to do it in realtime, and even easier if you are just doing batch analysis queries. You could do it (with reasonable performance) for much, much less than $360,000/year (let alone 2x or 3x that for the 100% feed).

robryan · on Nov 18, 2010

Not that amount, but they can afford to pay some amount for access. Obviously, at the massive full price, little if any research that can't be monetized will be conducted. It's a net loss for us all and no gain for Twitter.

twymer · on Nov 17, 2010

This is pretty interesting. There are a lot of interesting projects and data analysis that come from analyzing tweets but the price tag seems far beyond what any of these would be willing to pay.

Previously some serious time would be spent scraping content from the feeds, but given that it's only 2% of the content and would take months (at least) to gather a significant amount, that is less than ideal. Although I would assume that for a vast majority of cases, months are worth less than $360k.

kin · on Nov 17, 2010

isn't there already a free stream for new tweets that could be stored and analyzed? is this just a sale of the 50% of tweets that have already passed? why so pricey for historical market research data when current data is free?

Timothee · on Nov 17, 2010

From my understanding, the current data is not free. Yes, it's fully accessible, but to get access to the whole thing you would need to either scrape the site which Twitter will most likely disallow, or use the API, whose access Twitter controls.

wmf · on Nov 17, 2010

That hasn't been free for a while.

jrockway · on Nov 17, 2010

jrockway to post 50% less content to Twitter for free.

xstaticdev · on Nov 18, 2010

That's a lot of tweets about Justin Bieber!

zrgiu · on Nov 17, 2010

i'm selling my gtalk, Y!m and Skype statuses. $10000/year. Anyone interested?

pointillistic · on Nov 17, 2010

I can't believe anyone would pay coin for that crap. Might as well get one year of the NYT crossword puzzles for the same price, it's a better value. Or randomly generate "I am having a coffee with my cat" in all possible combinations.

P.S. If this is just a "display", does it mean there isn't any value to the stale links?

sfphotoarts · on Nov 17, 2010

This is so wrong. There is enormous value in data when you aggregate so much. Even if everything is about cats, that tells you lots of things. That the economy is in good shape, for example, or that people have so little bad stuff going on that they can focus on lunch plans. You'd actually have to do work on the data to really see its value. Even if you only used it to generate Markov chains it's valuable. Entity extraction, sentiment, brand feedback, etc.

sp4rki · on Nov 17, 2010

You missed the point of the whole thing. This is not about displaying tweets, it's about the ability to analyze trends, news, popularity, etc. via tweets. It's a huge deal to be able to get 50% (let alone 100%) of all the tweets at a given moment.

steveklabnik · on Nov 18, 2010

Not only is it not about displaying tweets, it's specifically disallowed:

> Customers will only be allowed to analyze the messages, not display them

user24 · on Nov 17, 2010

if you're trying to train a system for sentiment analysis, it's quite a useful corpus.

AndrewMoffat · on Nov 17, 2010

one man's trash is another man's treasure :)

isaacsu · on Nov 17, 2010

Anyone keen to discuss this issue further? Join the live chat at http://twich.me/twitter360k