
A big problem for this would be the transfer of data to and from the API. Imagine an algorithm that has to analyze gigabytes or terabytes of data.

Also, protection of the data as it is being transferred, stored, and analyzed is an issue. This covers both data integrity and protection for privacy or confidentiality reasons.
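
A minimal sketch (in Java, with hypothetical file names) of the kind of client-side step that could address both concerns: hash the plaintext for an integrity check and encrypt it before it ever leaves your machine.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.SecureRandom;

public class SecureUpload {
    // Hash the plaintext so the service can verify integrity after transfer,
    // and encrypt it so the provider never sees the raw data.
    static byte[] hashAndEncrypt(String inPath, String outPath, SecretKey key) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");

        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        try (InputStream in = new FileInputStream(inPath);
             OutputStream raw = new FileOutputStream(outPath);
             OutputStream enc = new CipherOutputStream(raw, aes)) {
            raw.write(iv);                      // prepend the IV so the receiver can decrypt
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                sha256.update(buf, 0, n);       // integrity: digest of the original bytes
                enc.write(buf, 0, n);           // confidentiality: only ciphertext goes on the wire
            }
        }
        return sha256.digest();                 // send this alongside the upload for verification
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        byte[] digest = hashAndEncrypt("dataset.bin", "dataset.enc", kg.generateKey()); // hypothetical paths
        System.out.println("plaintext digest is " + digest.length + " bytes");
    }
}
```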



This is true. One possible solution: if the service were run in EC2, you could leverage the speed of Amazon's internal network when the data is already stored in S3.
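
A minimal sketch of that setup, assuming the AWS SDK for Java and a hypothetical bucket and key; run from inside an EC2 instance, the read stays on Amazon's network rather than crossing the public internet.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import java.io.InputStream;

public class S3Pull {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        S3Object obj = s3.getObject("example-bucket", "dataset.bin"); // hypothetical bucket/key
        long bytes = 0;
        byte[] buf = new byte[1 << 16];
        try (InputStream in = obj.getObjectContent()) {
            for (int n; (n = in.read(buf)) != -1; ) {
                bytes += n; // feed the analysis here instead of just counting
            }
        }
        System.out.println("streamed " + bytes + " bytes over Amazon's internal network");
    }
}
```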


I was thinking that as I was typing my comment. Another solution, which Amazon offers for S3, is to ship hard disks by courier. I guess the real metric here is cost per GB transferred per unit time, say $/GB-hr.
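
A back-of-the-envelope comparison of the two options under that metric; every number here is an illustrative assumption, not real pricing.

```java
public class TransferCost {
    public static void main(String[] args) {
        double dataGB = 5_000;                  // 5 TB, illustrative

        // Option 1: push it over the wire (assumed 100 Mbit/s sustained, $0.10/GB bandwidth).
        double mbitPerSec = 100;
        double netHours = dataGB * 8 * 1024 / mbitPerSec / 3600;
        double netCost  = dataGB * 0.10;

        // Option 2: ship disks by courier (assumed 48 h door to door, $100 flat shipping).
        double shipHours = 48;
        double shipCost  = 100;

        // The metric from the thread: dollars per GB moved, per hour in transit.
        System.out.printf("network: %.1f h, $%.0f, %.5f $/GB-hr%n",
                netHours, netCost, netCost / dataGB / netHours);
        System.out.printf("courier: %.1f h, $%.0f, %.5f $/GB-hr%n",
                shipHours, shipCost, shipCost / dataGB / shipHours);
    }
}
```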


At what point does it become ridiculous to move the data, which may be measured in TB or PB, when the algorithm itself would be measured in KB or MB?


Hush. Not in front of the VCs.


In clusters working on large amounts of in-memory data, the approach is often to load the data once, then move the code (e.g. a Java class implementing some data processing interface) to the data as required, rather than move the data to the code.
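
A minimal sketch of that pattern; the interface and class names here are made up for illustration, not taken from any particular framework.

```java
import java.io.Serializable;
import java.util.List;

// "Move the code, not the data": the task is a small serializable object shipped
// to whichever node already holds the partition in memory.
interface DataProcessor<T, R> extends Serializable {
    R process(List<T> localPartition);
}

class WordCountTask implements DataProcessor<String, Integer> {
    @Override
    public Integer process(List<String> localPartition) {
        int words = 0;
        for (String line : localPartition) {
            words += line.split("\\s+").length;
        }
        return words; // only this small result travels back over the network
    }
}

class Node {
    private final List<String> inMemoryPartition; // loaded once, stays put

    Node(List<String> partition) { this.inMemoryPartition = partition; }

    // The cluster serializes the task (a few KB of bytecode and state) to each node;
    // gigabytes of partition data never leave the machine.
    <R> R run(DataProcessor<String, R> task) {
        return task.process(inMemoryPartition);
    }
}
```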


There is always stuff that goes the other way, though, like how SETI@home does FFTs, which are computationally expensive and benefit from a distributed system even though the file sizes are quite small.


Yes, BOINC projects are cases where it is not ridiculous to move the data, because computation power is the scarce resource and the work units are typically only in the hundreds of kilobytes to single-digit megabytes.
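
A quick back-of-the-envelope check with assumed, illustrative numbers for such a work unit:

```java
public class WorkUnitRatio {
    public static void main(String[] args) {
        // Illustrative numbers only: a small BOINC-style work unit that keeps a CPU core busy for hours.
        double unitMB  = 0.35;         // ~350 KB download
        double linkMbit = 10;          // volunteer's connection
        double cpuHours = 6;           // crunch time per unit

        double transferSec = unitMB * 8 / linkMbit;
        double computeSec  = cpuHours * 3600;

        System.out.printf("transfer: %.2f s, compute: %.0f s, ratio 1:%.0f%n",
                transferSec, computeSec, computeSec / transferSec);
        // With compute dwarfing transfer by several orders of magnitude,
        // shipping the data out to idle CPUs is the sensible direction.
    }
}
```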


A local client collects summary statistics to send to The Algorithm over the net, kind of like how Google's mobile voice search works.
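
A minimal sketch of that idea: reduce the data locally and ship only a tiny summary. The particular statistics chosen here are just an example.

```java
import java.util.Arrays;

public class LocalSummary {
    // Reduce a large local dataset to a handful of numbers; only these cross the network.
    static double[] summarize(double[] samples) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        double sum = 0, sumSq = 0;
        for (double x : samples) {
            min = Math.min(min, x);
            max = Math.max(max, x);
            sum += x;
            sumSq += x * x;
        }
        double n = samples.length;
        double mean = sum / n;
        double variance = sumSq / n - mean * mean;
        return new double[] { n, min, max, mean, variance }; // ~40 bytes instead of the raw data
    }

    public static void main(String[] args) {
        double[] localData = new double[10_000_000];   // stays on the client
        Arrays.fill(localData, 1.0);                    // placeholder data
        double[] summary = summarize(localData);
        System.out.println("payload sent over the net: " + Arrays.toString(summary));
    }
}
```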



