Twitter to Sell 50% of All Tweets for $360,000 a Year Through Gnip (readwriteweb.com)
102 points by hornokplease on Nov 17, 2010 | 63 comments


Let's see, at the 1000 tweets per second figure from the article, half of that is 500 per second, so that's 365 * 24 * 3600 * 500 = 15,768,000,000 tweets for $360,000. Or 0.002 cents per tweet. So 500ish tweets are worth about a penny, and your $360K would be for 15 gigatweets, which at 140 bytes each would weigh in at about 2.2 TB[1]. So my tweets aren't worthless, they're just very very cheap.

[1] That's a hard drive manufacturer terabyte, not a real one
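
The arithmetic above can be checked with a short sketch. The 140-byte tweet size is an assumption (140 single-byte characters, no metadata), chosen because it reproduces the 2.2 TB figure; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def feed_economics(tweets_per_second, price_usd, bytes_per_tweet=140):
    """Return (tweets per year, cents per tweet, decimal terabytes per year)."""
    tweets = SECONDS_PER_YEAR * tweets_per_second
    cents_per_tweet = price_usd * 100 / tweets
    # Decimal terabytes (10**12 bytes), i.e. the hard drive manufacturer kind.
    terabytes = tweets * bytes_per_tweet / 1e12
    return tweets, cents_per_tweet, terabytes

# Half of the article's 1000 tweets per second, for $360,000 a year.
tweets, cents, tb = feed_economics(500, 360_000)
print(f"{tweets:,} tweets, {cents:.4f} cents each, {tb:.1f} TB")
```

At these inputs a penny buys roughly 440 tweets, which is the "500ish" of the comment.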


Sweet, so my account is worth about a whole $1 in tweets


That is $1 per sale :). Maybe your account is worth $10-$20 total.


I'm trying to find my "privacy outrage" / "who actually owns your content" soapbox but having trouble.

In fact, I find it kind of refreshing that Twitter is just flat out saying all your base belong to us, and anybody that wants it gets access for $360,000 a year.

This in contrast to Google, Facebook, Yahoo etc that muddy the waters whenever it comes to how they're actually using / sharing your data.


I'm not sure how you'd manage "privacy outrage" over Twitter messages, since they're almost always public.


The concept of "Privacy" means to me that the average user is aware of how their profile, their content, etc is going to be used by the host.

There's a difference between the perception of your feed being public, and Twitter selling your feed data to a corporation to utilize for targeting, advertising, reselling to employment agencies in the future long after your stupid teenage profile is deleted, whatever.

When a solid 10% of the users are kids (and a much much larger percentage is entirely clueless) it's worth questioning and the people that do know what's actually going on have a responsibility to ask questions.

That's my best attempt to get fired up about privacy on twitter, you forced my hand - oh won't you please think of the children?

Ref twitter age - http://royal.pingdom.com/2010/02/16/study-ages-of-social-net...


I think you have the concept of "Privacy" confused with the concept of "Transparency". And, my point is, you specifically don't have privacy on Twitter.


I'd rather not get into a discussion of what "privacy" means, but Twitter (and Facebook) are almost certainly violating most of their users' perceptions of what Twitter does with their tweets. That said, if they anonymize the tweets and hold the buyers contractually obligated not to de-anonymize them, I think most users would be OK with that. Each user owns their tweets, as does Twitter, but only Twitter owns all of the tweets.


It's a broadcast medium. People literally keep score with each other about how many people they can get to follow them, and how many people they can get to RT their messages. I think you're simply dead wrong about this.


1. Reality and perceptions don't always coincide. Like the people that post messages on Facebook about calling into work sick when they aren't... only to have their boss read the message and fire them over it.

2. What about the people whose Twitter accounts are private?

3. The people 'racing' with each other for followers or retweets are by definition more public than most other people. Unless you are going to claim that all or most Twitter users fall into that category. Trying to use them to categorize the user base of Twitter as a whole seems a bit off.

4. Twitter is a broadcast medium, but what we are talking about are the perceptions of the people using it, not the reality of the situation. There are plenty of people that broadcast stuff publicly that they wouldn't want their parents to read. Why would they do so? "My parents aren't on Twitter." I'm sure the same thing applies to bosses and the workplace.


In what sense do you think each user "owns" their tweets?

In most cases they don't at all; here's a good page on the topic: http://www.canyoucopyrightatweet.com


Twitter has no privacy model, so there can't be any privacy outrage. All tweets are public, it's just a matter of whether or not they show up on your feed.

And if Twitter wants to make an extra (maybe morally gray) dime off of any "privacy outrage", they can offer users the option to pay a fee to have their tweets NOT included in these dumps.


Twitter is public by design, so it makes a lot more sense.

By comparison, Google's data of my searches and Facebook's data of me and my friends is much more intimate than Twitter's database of my tweets.

Ryan Singer nails the distinction here:

http://37signals.com/svn/posts/2618-twitters-ux-separate-the...

"Public by default is better than public-by-surprise."


You might have some outrage over DMs or private accounts, but Twitter is such a public yelling-in-the-town-square model that it is hard to get mad.


We really need (and will inevitably get) an open, distributed protocol for status updates. It's insane for everything to be routed through one (or 2, or 3) companies.


Well, there's StatusNet: http://status.net/ . Which I assume would be trivial to sync using PubSubHubbub or Wave-style XMPP extensions.

Edit: wiki says they already support this: "Supports Federation, which provides the ability to subscribe to notices by users on a remote service through the OpenMicroBlogging protocol."


http://identi.ca/

Federated twitter clone, powered by status.net


And when we get it no one will use it.


Who would host the servers though? A standards organization? I doubt they could afford it.

That, or it will be like email, where everyone has @example.com appended to their username.

Neither situation strikes me as desirable.


In the future, it's quite likely that a standards organization could afford it.

Processor speed, disk capacity, network bandwidth, and available software are all growing much more rapidly than online populations.

In some years' time I might be able to run an operation like Google, Facebook, or Twitter from my bedroom.


There are theoretical lower bounds on the amount of heat generated by the kind of information processing that Google does, and this heat needs to come from energy sources, which are growing more scarce as software/bandwidth/disk capacity/CPU speed grows. Google is the largest non-manufacturing electricity buyer in the world and would never fit in any bedroom :)


The current Google perhaps, but not the Google of the future.


I'm not sure I want the Google of the future.


1000 tweets per second (maybe 1 KB each with metadata) are just 100 KB/s of traffic (if compressed). You could stream this from one of the smaller EC2 instances.
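
A rough check of that estimate, as a sketch. The 10:1 compression ratio is an assumption: it is what the comment's numbers imply when going from 1 KB per tweet to 100 KB/s, not a measured figure, and the function name is hypothetical.

```python
def stream_bandwidth_kb_per_s(tweets_per_second, kb_per_tweet=1.0, compression_ratio=10.0):
    """Estimated sustained bandwidth in KB/s for a stream of tweets."""
    raw_kb_per_s = tweets_per_second * kb_per_tweet
    return raw_kb_per_s / compression_ratio

compressed = stream_bandwidth_kb_per_s(1000)
uncompressed = stream_bandwidth_kb_per_s(1000, compression_ratio=1.0)
print(f"compressed: {compressed:.0f} KB/s, uncompressed: {uncompressed:.0f} KB/s")
```

Even the uncompressed case is about 1 MB/s, so the conclusion does not hinge on the compression guess.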


I think it will be a lot like email - what is undesirable about appending a domain? That's how the internet works.


You mean, with all the spam?


Fortunately tweeting has the concept/expectation of following (whitelisting) and email doesn't.


Only if you've protected your profile/using private messages. I regularly get @replied by people who I don't follow, and I don't particularly want to lose those messages, so it's probably not a realistic way of stopping spam (although it would certainly be effective).


The key value is not so much in the data itself as in the _timeliness_ of the data. Access to the halfhose allows you to answer a _very_ valuable question:

"What's happening right now?"

This question is worth a lot of money, and it's something that doesn't have a good algorithmic solution (e.g. Google News). Twitter is probably the only company that has a privacy-compliant solution to this, hence making it a very monetizable product.


But Twitter already provides an API for that. You can ask what are the current top 10 trending topics at any time.


All these platforms realize that the data they have is extremely valuable to everyone from API partners to marketers. I think all these companies could see that there's more money in data services than there could be for them in advertising.

The value would go back into the API and not into weird sponsored trending topics. Twitter seems to be going back towards Alex Payne's vision (data hose platform) and away from Biz Stone's (Twitter as a medium with celebrities etc.). They could also set up separate Twitter hoses: for instance, there could be automatic sensor data input for the Internet of Things, separated from human input. Any links where Twitter people talk about this?


What are the technical measures to prevent people from just scraping 100% of the tweets, if any?


Simple rate-limiting / DOS prevention would be plenty given the volume of tweets.
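
As an illustration of what "simple rate-limiting" could look like, here is a minimal token-bucket sketch keyed by client IP. This is a generic technique and not a description of what Twitter actually runs; the class, the function, the rates, and the per-IP keying are all assumptions.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per client IP (hypothetical keying)

def allow_request(ip, rate=1.0, capacity=5):
    """Return True if this IP may make another request right now."""
    if ip not in buckets:
        buckets[ip] = TokenBucket(rate, capacity)
    return buckets[ip].allow()
```

A limiter like this only constrains each key separately, so its usefulness depends on how hard it is for a scraper to obtain many keys.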


Not that hard to get around. IPs are easy to come by.


I disagree. Twitter is pushing upwards of 3000 tweets per second [1], it'd be nearly impossible to scrape at that rate and go without notice, regardless of how many IPs you've got at your disposal.

[1] http://mashable.com/2010/06/25/tps-record/


You don't need to scrape for each tweet. Scrape every active user every couple hours/days depending upon their tweet frequency. Not much point in having a database up to date every second since processing it in real time is almost impossible anyway.

Twitter would notice a major scraping operation, but if it's done correctly they wouldn't be able to distinguish between user IPs and bot IPs.

edit: Barracuda already did more than 10% of users just for a white paper: http://www.barracudanetworks.com/ns/news_and_events/index.ph...

150,000,000 registered users only takes 170 days at 10 users a second for a first pass. Focus on frequent tweeters for subsequent scrapes. Even among the ~20% of twitter accounts that are active, most don't need to be scraped daily, and the most active accounts are likely spam.
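
The first-pass estimate is a single division; a sketch with a hypothetical function name, using only the figures from the comment:

```python
SECONDS_PER_DAY = 24 * 3600

def first_pass_days(total_users, users_per_second):
    """Days needed to fetch every user once at a constant rate."""
    return total_users / users_per_second / SECONDS_PER_DAY

days = first_pass_days(150_000_000, 10)
print(f"{days:.1f} days")
```

This comes out at a little under 174 days, which the comment rounds to 170.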


> Scrape every active user

translates to "Scrape every user" unless you know of some magical way to get list of "active" users.

Guess how many active users there are? Guess how many servers you need running to get through those in 1 hour. My guess is something much more than $360,000 worth / yr.


> unless you know of some magical way to get list of "active" users

look for users who tweeted in last X days? also look for their repliers+buddies, since they too are likely to be active. doesn't seem hugely complicated to me?


> look for users who tweeted in last X days?

Requires you to check aka scrape every user to see which ones tweeted in last X days.


You may be able to get away with scraping a portion of the site, but I still would find it very hard to believe you'd be able to scrape even 50%.

Let's do some math, these are all based on numbers from this past June which have likely only gone up since then [1]:

  65 million tweets per day / 20 tweets per page = 3.25 million page views per day

Just to keep up with the stream, you'd need to do about 3.25 million page views per day or a little over 1.5 million to get half of it. Again, I'd find it very hard to believe that nobody at Twitter would catch on.

[1] http://techcrunch.com/2010/06/08/twitter-190-million-users/
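
The same math as a sketch, with the sustained request rate added for scale. The function name is hypothetical, and 20 tweets per page is the comment's own figure.

```python
SECONDS_PER_DAY = 24 * 3600

def page_views_per_day(tweets_per_day, tweets_per_page, fraction=1.0):
    """Page fetches per day needed to see `fraction` of the stream."""
    return tweets_per_day * fraction / tweets_per_page

full = page_views_per_day(65_000_000, 20)
half = page_views_per_day(65_000_000, 20, fraction=0.5)
per_second = full / SECONDS_PER_DAY

print(f"full: {full:,.0f}/day, half: {half:,.0f}/day, about {per_second:.0f} requests/s")
```

This assumes every fetched page is full of new tweets; in practice pages overlap and many are stale, so the real number of requests would be higher.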


They would catch on to the fact that they were being scraped, but they wouldn't be able to identify which 3.25 million were scrapes.


How would you even identify every scrapeable endpoint? Follow all the twitter users? I don't know if pre-Snowflake status IDs were reliably consecutive, but the current ones definitely aren't.
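
To see why Snowflake IDs can't be enumerated by counting, here is a sketch assuming the publicly documented layout: a 41-bit millisecond timestamp, a 10-bit machine ID, and a 12-bit sequence number. The helper names are hypothetical.

```python
TWITTER_EPOCH_MS = 1288834974657  # Snowflake's custom epoch, in Unix milliseconds

def make_snowflake(timestamp_ms, machine_id, sequence):
    """Pack a timestamp, machine ID and sequence number into one integer ID."""
    return ((timestamp_ms - TWITTER_EPOCH_MS) << 22) | (machine_id << 12) | sequence

def split_snowflake(status_id):
    """Unpack an ID into (timestamp_ms, machine_id, sequence)."""
    timestamp_ms = (status_id >> 22) + TWITTER_EPOCH_MS
    machine_id = (status_id >> 12) & 0x3FF
    sequence = status_id & 0xFFF
    return timestamp_ms, machine_id, sequence

# Two IDs issued one millisecond apart by the same machine.
a = make_snowflake(TWITTER_EPOCH_MS + 1000, machine_id=1, sequence=0)
b = make_snowflake(TWITTER_EPOCH_MS + 1001, machine_id=1, sequence=0)
print(b - a)  # size of the gap between the two IDs
```

IDs one millisecond apart differ by 2**22, so almost every integer in between refers to nothing, and guessing consecutive IDs is not a workable way to walk the stream.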


Does this include private tweets and DMs?


In the past, firehose access has meant public tweets only. I presume Gnip is getting the same data as the Library of Congress and other firehose consumers, but details haven't been spelled out yet.


I would imagine that it won't include private stuff or DMs. My guess is that, even if there is an "all your base" (as somebody gracefully put it above) clause in the ToS, private tweets and direct messages carry a reasonable expectation of privacy.


I work for a media aggregator, and we do a lot of different media (microblogs like twitter included). We index close to 100M documents a month. We also charge a fraction of 360k/yr. I'd like to know who the target market is?

Is anyone here genuinely ready to spend that kind of cash for 50% of Twitter?


I think a distinction needs to be made. I agree that a company using this many of the Tweets as part of a commercial effort should pay.

I also think that a research effort with the ability to process 100% of Tweets can most likely afford to pay. With something like the 5%, though, I think a great deal of research that could be done on the Twitter platform may be prevented because of the cost. For research, maybe you could charge for the bandwidth required to deliver the stream; no idea what kind of ballpark this would be in.


> I also think that a research effort with the ability to process 100% of Tweets can most likely afford to pay

Processing 100% of all tweets is not actually very hard or expensive: 2000 messages/sec is small potatoes if you want to do it in realtime, and even easier if you are just doing batch analysis queries. You could do it (with reasonable performance) for much, much less than $360,000/year (let alone 2x or 3x that for the 100% feed).


Not that amount, but they can afford to pay some amount for access. Obviously, at the massive full price, little if any research that can't be monetized will be conducted. It's a net loss for us all and no gain for Twitter.


This is pretty interesting. There are a lot of interesting projects and data analysis that come from analyzing tweets but the price tag seems far beyond what any of these would be willing to pay.

Previously some serious time would be spent scraping content from the feeds, but the fact that it's only 2% of the content and would take months (at least) to gather a significant amount makes it less than ideal. Although I would assume that for a vast majority of cases, months are worth less than $360k.


isn't there already a free stream for new tweets that could be stored and analyzed? is this just a sale of the 50% of tweets that have already passed? why so pricey for historical market research data when current data is free?


From my understanding, the current data is not free. Yes, it's fully accessible, but to get access to the whole thing you would need to either scrape the site which Twitter will most likely disallow, or use the API, whose access Twitter controls.


That hasn't been free for a while.


jrockway to post 50% less content to Twitter for free.


That's a lot of tweets about Justin Bieber!


i'm selling my gtalk, Y!m and Skype statuses. $10000/year. Anyone interested?


I can't believe anyone would pay coin for that crap. Might as well get one year of the NYT crossword puzzles for the same price, it's a better value. Or randomly generate "I am having a coffee with my cat" in all possible combinations.

P.S. If this is just a "display", does it mean there isn't any value in the stale links?


This is so wrong. There is enormous value in data when you aggregate so much. Even if everything is about cats that tells you lots of things. That the economy is in good shape for example, that people are having so little bad stuff going on that they can focus on lunch plans. You'd actually have to do work on the data to really see its value. Even if you only used it to generate Markov chains it's valuable. Entity extraction, sentiment, brand feedback, etc etc.


You missed the point of the whole thing. This is not about displaying tweets, it's about the ability to analyze trends, news, popularity, etc. via tweets. It's a huge deal to be able to get 50% (let alone 100%) of all the tweets at a given moment.


Not only is it not about displaying tweets, it's specifically disallowed:

> Customers will only be allowed to analyze the messages, not display them


if you're trying to train a system for sentiment analysis, it's quite a useful corpus.


one man's trash is another man's treasure :)


Anyone keen to discuss this issue further? Join the live chat at http://twich.me/twitter360k



