Scientific Data Has Become So Complex, We Have to Invent New Math (wired.com)
113 points by gurupradhan on July 20, 2014 | 25 comments


Ayasdi's notion of topological data analysis has to be the most overhyped piece of "math" yet this century. Take a simplistic notion of topological invariance, gloss over how you actually infer things like neighborhood structure to get the topology, slap a slick UI on it, show it works on one out-of-date breast cancer data set that is tiny by modern standards, charge through the nose, and see how far you can ride it.

It is kind of shocking that they are now selling it as beneficial for heterogeneous data. Actually learning any structure from high-dimensional, heterogeneous, messy data in a way that doesn't just overfit is the crux and, last time I looked, their public papers and other materials suggest they are doing nothing beyond some standard distance metrics. There is lots of other exciting work going on in this area. Topological invariance may provide some assurance that what you are seeing is real, but it kind of just becomes a fancy name for doing parameter sweeps.

There are really exciting things going on in manifold learning and in the application of (smooth and topological) manifold theory to modern data. This includes applications of discrete exterior calculus to abstract simplicial complexes that may not even represent a manifold, allowing efficient algorithms for problems like ranking from graph flows (e.g. PageRank) and other cool things. Maybe Ayasdi has some new proprietary magic up their sleeves, but everything I've seen from them so far is just a showy wrapper around really basic math.


Disclaimer: I am in the research group at Ayasdi.

What we do is simple in many ways. But before I make specific comments I'd like to point out that simplicity is a virtue, not a demerit. To pull from another area of mathematics: the derivative is a simple idea - it's just the slope of a tangent line. And yet, much of modern physics is the result of understanding and applying this idea. With simplicity the difficulty is understanding how and why you use the technique - rather than describing the technique itself.

When you say "simplistic notion of topological invariance" I think you may be talking about persistent homology which is not currently the primary product sold by Ayasdi. I disagree that homology is simplistic - it's one of the central tools in modern mathematics.

In any case we sidestep the whole notion of inferring the neighborhood structure and focus on creating meaningful (open) covers of the data.

Instead of finding a neighborhood structure (which you can think of as a particular choice of covering of your data by metric balls), we create (open) covers that summarize some important aspect of the data. I mean summary in a technical sense that is beyond the scope of this comment. (I have some videos on YouTube that address this and other issues. I think the most recent is the best, but there is some important material in the earlier videos that I don't repeat in the later ones: https://www.youtube.com/results?search_query=anthony+bak - I apologize for the vanity post.)

I can briefly describe one technique for generating covers that we use instead of neighborhood structure but the details of how this fits into the bigger picture are best seen from materials on the Ayasdi resources page or from the videos above.

In the simplest (and most common) case we use a function on the data to get a map from the data to the real line. Using the function we induce a cover on our data from a cover on the real line by taking inverse images of the sets in the real line cover. This data cover is too coarse for most purposes so we break large sets into smaller sets by clustering within each inverse image set. In this way we build a useful cover of the space. To finish, we calculate the "nerve" of this cover to convert our (complex, high dimensional) space into a combinatorial object called a simplicial complex that "remembers" geometric and topological features of the original space (while forgetting others).
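
For HN readers who want something concrete, here is a minimal Python sketch of the construction just described. This is my own illustrative reading, not Ayasdi's implementation; the interval cover, the DBSCAN clusterer, and all parameter values are assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def mapper(X, f, n_intervals=10, overlap=0.3, clusterer=None):
        # X: (n_samples, n_features) point cloud; f: real-valued lens on the data.
        # Returns the nodes (clusters) and edges (1-skeleton of the nerve).
        if clusterer is None:
            clusterer = DBSCAN(eps=0.5, min_samples=3)  # any clusterer works here
        lo, hi = f.min(), f.max()
        length = (hi - lo) / n_intervals
        nodes = []  # each node is a set of data-point indices
        # Cover the line with overlapping intervals, pull each back through f,
        # then refine the (too coarse) preimage by clustering within it.
        for i in range(n_intervals):
            a = lo + i * length - overlap * length
            b = lo + (i + 1) * length + overlap * length
            idx = np.where((f >= a) & (f <= b))[0]
            if len(idx) == 0:
                continue
            labels = clusterer.fit(X[idx]).labels_
            for lbl in set(labels) - {-1}:  # -1 is DBSCAN's noise label
                nodes.append(set(idx[labels == lbl]))
        # Nerve (1-skeleton): connect clusters that share data points.
        edges = [(i, j) for i in range(len(nodes))
                 for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]]
        return nodes, edges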

Why this is the right thing to do is covered in the video and on the resources page. It's ok to be confused about the "why?".

I don't see how this is a parameter sweep, and I suspect again that you're talking about persistent homology, not the nerve construction described here.

Stepping back for a moment, topological spaces comprise a far richer set of spaces than manifolds. In my experience real-world data almost never looks like it's sampled from a manifold, and the tools we need to describe what is happening require far less rigidity than those coming from manifold learning. In particular, we want tools that make as few assumptions (manifold, homogeneity, statistical) as possible. As far as shape goes, topology has (arguably) the most relaxed notion of shape in mathematics - so the fewest assumptions are needed to study the shape of the data.

As an aside I'll mention that in the video I show an example of using topology a la Ayasdi to do manifold learning. We find a Klein bottle glued inside a sphere along a singular set. One of the reasons this is a nice example is that it was also solved using tools from manifold learning - but those methods required knowing the local structure of the singularity. None of those assumptions were used in the topological reconstruction - we didn't need to know ahead of time what we were looking for.

I go through a bunch of examples in the videos of using these ideas on actual data. I believe I show some examples of telco churn data, insurance fraud, and mobile-phone Parkinson's detection. These comprise a small selection of what you can find poking around the Ayasdi resources page but go well beyond the NKI cancer data set you refer to (although, personally, I find that example compelling).

Finally, I also like the other approaches you mention, but I see them as complementary, not oppositional, to the topological approach - and I think generally speaking there is a fair bit of overlap between the various communities. In particular, the Hodge-theoretic analysis on simplicial complexes is really nice, and Yuan Yao, who is a coauthor on the Hodge Ranking paper, was a postdoc with Gunnar Carlsson (Ayasdi cofounder), and I was just visiting him in Beijing. I regularly talk with collaborators of Larry Wasserman and his graduate students. Looking further afield, Vin de Silva, one of the inventors of Isomap, was also a postdoc with Gunnar Carlsson and is very active in the topological data community. In fact, on a technical level, like I mentioned above, the topological framework can use manifold learning techniques (such as Isomap) to help create the topology used as part of the nerve construction. So the fields cross-fertilize technical results as well. Yes, like you say, there are exciting things going on, and we are constantly integrating those ideas into our product.

What all of these methods share is a desire to bring a richer geometric (broadly speaking) toolset to modern data problems. I see particular value in the topological approach but support other work on bringing geometry to data.


Thanks for the answer. I'll take a look at more Ayasdi papers and things, but part of my distaste is that so much of the public material is non-technical marketing stuff that claims revolutionary broad applicability and makes it hard to tell what you guys are actually selling. Telling me that you can prove it should work and that it's ok for me to be confused about the why is total BS and ensures I will never pay for your product... you're just using advanced math to obfuscate.

The history of data analysis indicates exactly the opposite. Methods are often not shown to work theoretically until years after they are accepted to work well in real-world data analysis. Lots of popular methods may not even come with theoretical guarantees, and lots of theoretical guarantees are useless or misleading because they depend on assumptions about the data which are rarely true, or have other issues.

But I have to say you're still just sidestepping how you actually do the neighborhood learning, now by calling it open covers (I just used the term neighborhood structure to keep it less jargony for this audience). How do you map the text documents mentioned in the article to real space? If you are just integrating Isomap and other standard techniques and the added value is the simplicial complex visualization, that is fine, but you aren't developing any new math.

The procedure you are describing is similar to hierarchical clustering and will suffer from similar sensitivity to the selection of the initial splits and any parameters. Manifold forests, for example, are also hierarchical but use a bagged ensemble to partially address this. I'd like to see more public work from you guys on combating this sort of overfitting and sensitivity... I just picked on persistent homology because it is one of the only things topology seems to add in this area.

This stuff is really cool and has generated useful results. If you guys just want to be the main consulting company for doing manifold learning, that is great, but marketing articles like this that try to claim you're the exclusive purveyors of some new math are turning a large portion of the community away.


The process I outlined is sometimes called "Mapper" and is from published literature:

http://comptop.stanford.edu/u/preprints/mapperPBG.pdf

The more general concept of calculating the nerve of a covering is standard topology material where the technique is used to create combinatorial representations of topological spaces. http://en.wikipedia.org/wiki/Nerve_of_an_open_covering
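
For the curious, the nerve construction itself fits in a few lines of code. A minimal sketch (the helper name and representation of cover sets as Python sets of point indices are my own assumptions):

    from itertools import combinations

    def nerve(cover_sets, max_dim=2):
        # One vertex per cover set; a k-simplex for every (k+1) sets whose
        # common intersection is nonempty (computed here only up to max_dim).
        simplices = [(i,) for i in range(len(cover_sets))]
        for k in range(1, max_dim + 1):
            for combo in combinations(range(len(cover_sets)), k + 1):
                if set.intersection(*(cover_sets[i] for i in combo)):
                    simplices.append(combo)
        return simplices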

TDA is a framework, not an example or a method. It applies anywhere that you can define a notion of distance or similarity along with a function (not necessarily continuous) on your data. It's hard to think of a data situation where that doesn't apply. That's why we can handle so many different data types. The method has very few assumptions or requirements.

If you are saying that for some specific example you saw somewhere (or maybe it's in the article?) we didn't tell you the metric and function, I'm afraid I can't help you - I am not myself familiar with the details of the text analysis you're referring to. But generally speaking we are open about exactly what metric and function we've used.

I am not obfuscating how we create covers (what you're calling neighborhood learning) - in fact I spelled it out exactly. We pull back the cover from the real line (period!). On the real line we typically cover so that either all intervals are the same length or all contain the same number of points (in the software you pick the number of sets in the cover, their percentage overlap, and whether you want them the same length on the real line or containing (approximately) the same number of data points). How we cover the real line is a choice and isn't part of what TDA is. There are other reasonable choices.
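
A sketch of those two cover choices (equal-length intervals vs. intervals holding roughly equal numbers of points); the function names and parameters here are mine, not the software's:

    import numpy as np

    def uniform_cover(f, n=10, overlap=0.3):
        # n intervals of equal length, each widened by a fractional overlap
        lo, hi = f.min(), f.max()
        length = (hi - lo) / n
        return [(lo + i * length - overlap * length,
                 lo + (i + 1) * length + overlap * length) for i in range(n)]

    def balanced_cover(f, n=10, overlap=0.3):
        # n intervals containing (approximately) equal numbers of data points
        qs = np.quantile(f, np.linspace(0, 1, n + 1))
        pad = overlap * np.diff(qs)  # widen each interval proportionally
        return [(qs[i] - pad[i], qs[i + 1] + pad[i]) for i in range(n)]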

I think compared to most companies we are in fact open and transparent. The part of the software that isn't open primarily deals with how we've scaled this algorithm to deal with large datasets. The "mathy" parts are all either published or we give customers the formulas in our documentation.

I agree that in the real world we use things that aren't proven to work. Typically there is some range of applicability, but we can't pin down the requirements for when a technique will work and when it won't - for instance, we can prove something for Gaussian noise but in practice it works more broadly.

We don't use advanced math to obfuscate this issue, and we certainly confront it on a daily basis. The details of exactly where our methods apply are complicated (what function? how general a space can we use?) and to some degree unknown, but since topology itself has very flexible foundations we believe that lots (most) data problems lend themselves to being understood using these methods. This is clear without needing a formal proof of applicability.

I think one of the reasons that there's confusion is that TDA doesn't neatly fit into existing analytical boxes. It's not clustering, dimensionality reduction, manifold learning, or feature selection. It's different enough that when you say it sounds like hierarchical clustering I think you're missing the point of what a topological summary is. We are also not doing manifold learning (we do not need underlying manifold assumptions).

The way I think of it is as a framework for analytics. In the simplest case it takes a metric space and a real-valued function and produces a geometric summary. You can create the topology using any method you like (similarity spaces, standard metrics, manifold learning, metric learning, social network graph distance, some other method you know or invent) and you can use any function you might know or invent (mean of point coordinates, distance from the separating hyperplane in an SVM, local curvature estimates, age of a person, PageRank of a node in a graph, etc.). As long as the metric and the function are "sensible" for your data you'll extract some geometric/topological truth about your data.
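
To make that concrete, here is a toy usage sketch plugging two different lenses into the mapper sketch from earlier in the thread; the dataset and both lens choices are purely illustrative assumptions:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.decomposition import PCA

    X, _ = make_circles(n_samples=500, noise=0.05, factor=0.4)

    # Any "sensible" real-valued lens works; two illustrative choices:
    lens_pca = PCA(n_components=1).fit_transform(X).ravel()
    lens_dist = np.linalg.norm(X - X.mean(axis=0), axis=1)

    nodes, edges = mapper(X, lens_pca)  # re-using the sketch from above
    print(len(nodes), "clusters,", len(edges), "nerve edges")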

It's not that we have a new method like a new kind of SVM or a different kind of neural network. We have a very generic way of extracting the (up till now) overlooked geometric/topological content of existing methods. This geometric information adds fidelity to existing methods (like all good frameworks, it not only uses but improves existing methods). This is one way to think about what Ayasdi does.

(As an aside I'll point out that higher fidelity with existing methods also means that you can build more accurate models and we are currently automating the process of model improvement using TDA).

The kinds of methods you describe (manifold forests, hierarchical clustering, etc.) are all methods, and for the most part you can input them into the topological framework as well. TDA doesn't fix all of your problems (parameter selection, overfitting, etc.) but instead gives you more fidelity and geometric information about what you've chosen to do. We've built many standard methods into the software but allow you to extend them with your own custom ones if you choose.

In terms of "new math" or not, I guess I don't really understand the complaint. Articles on TDA are published in peer-reviewed math and science journals and are written by mathematicians (some prominent mathematicians doing and publishing on TDA: Gunnar Carlsson, Stanford Math; Robert Ghrist, UPenn Math; Shmuel Weinberger, UChicago Math; John Harer, Duke Math; Robert MacPherson, Institute for Advanced Study; Konstantin Mischaikow, Rutgers Math). People from Ayasdi publish and present results to other mathematicians who consider it new. For example, Gunnar just published a review article describing TDA in Acta Numerica: http://journals.cambridge.org/action/displayJournal?jid=ANU (subscription required). Some of this he invented, some others published elsewhere, but people in the community consider it new.

Using topology to understand point cloud data is new - the methods aren't copied from existing mathematical literature even if we are inspired by results in geometry, topology, algebraic topology, and algebraic geometry. Just because we work by analogy doesn't make this not new math - in fact, I would argue that's how most "new math" is created across all of mathematics.

As a postscript I would add that we also come up with new metrics and functions to solve problems we come across. But this is not the core of what TDA is.

Edit: Added point on data types in 3rd paragraph.


Any links for the curious?


If you want to read up on Ayasdi, go straight to the source: http://www.ayasdi.com/resources/ (click on publications)

There are probably some talks by Gunnar Carlsson around on the web too.

There is also a great overview from Larry Wasserman on his outstanding blog: http://normaldeviate.wordpress.com/2012/07/01/topological-da...

Other examples of manifold-inspired math in data analysis that I find more useful, as they describe ways to solve complex problems with simple systems:

http://web.stanford.edu/~yyye/hodgeRank2011.pdf http://www.cs.jhu.edu/~misha/Fall09/Hirani03.pdf

A really interesting method for learning manifolds from heterogeneous data (decision trees can be extended for categorical data, text data, almost anything) that doesn't boil down to just choosing a metric: http://research.microsoft.com/apps/pubs/default.aspx?id=1555...

Edit: Also, they do have a really slick UI and I like the math involved. I wanted to like their product; however, I worry that the math just serves to lend legitimacy when the results aren't as exciting/promising as advances being made in other areas of data analysis on similar datasets.

Also, them keeping it proprietary makes it hard for people to do legitimate comparisons. We implemented some of our own versions of their stuff to try on other cancer datasets, but it just didn't work very well because of numerous issues with lack of normalization, batch effects, bias, and noise.


Here is the mother load: http://www.math.upenn.edu/~ghrist/notes.html

I would recommend diving in with wild abandon, i.e., don't be afraid of the heavy-sounding math; Ghrist does a great job of holding your hand.

This field is called "Applied Topology" and is distinct from manifold learning, IIUC.

The photo from the article has a mention of "barcodes": http://www.math.upenn.edu/~ghrist/preprints/barcodes.pdf

The fascinating thing about all of this is how topological (connectivity) information is extracted from messy data.


I'm usually not that guy but thought it might be helpful: the correct phrase is "mother lode."


"Today’s big data is noisy, unstructured, and dynamic rather than static."

As was yesterday's big data. I've been reading through the literature from the 1960s and 1970s, and it complains about almost exactly the same issues, including coming up with new mathematical techniques.

Actually, "10 million words recorded during just under 200,000 trials" was completely within the techniques of the 1960s, which was a combination of human curation and computer data management. See for example the work at Chemical Abstracts and ISI in handling chemical publication data. I happen to know that the ISI organization in turn was based on citation techniques first developed in the late 1800s for tracking legal judgements - which is the topic discussed here.


This is a really fascinating subject, and as of last year I had read through essentially all the major papers in the area, implemented the central algorithms, and given a few more-or-less casual talks on it.

There are two things that strike me as very different about topological data analysis. The first and primary thing is that it's not a silver bullet by any means. It's not like most machine learning and data mining, where you pick some parameters and wham-bam-thank-you-ma'am you have 80% accuracy. No, this kind of analysis gives you qualitative features of your data, and after all the topology is done there's still years of work before you arrive at a mathematical model that admits an algorithm that has accuracy comparable to the state of the art.

An interesting case study in this is Carlsson's work on texture classification. They ran their 3x3 image-patch database through their topological analysis algorithms and it said (essentially) "your data looks like a Klein bottle!" That sounds interesting and fun, but how can you actually use that to do anything? This followup paper [1] then gave an actual model of image patches as a Klein bottle, but even then there is still a ton of math to trudge through (specifically, functional analysis and differential geometry) before they actually got to an algorithm, and it still wasn't strictly better than the leading methods. The real benefit seems to be entirely in the science part of everything. They have a novel hypothesis for how the world acts which is demonstrably accurate. It's not like a support vector machine with well-chosen features where it's as useful as a black box in terms of understanding the world.

The second thing is that this field has legitimate and nontrivial results that come from category theory, eventually leading to algorithms, or to proofs that you can't hope for an algorithm. Carlsson has a nice survey for starters [2]. So everyone who loves to talk about category theory and programming on the internet now has a less abstract quiver of arrows (pun intended, if you're familiar with this research area).

[1]: http://comptop.stanford.edu/u/preprints/KleinBottleTextureAn...

[2]: http://www.ayasdi.com/_downloads/Topology_and_Data.pdf


Sensitivity analysis (http://en.wikipedia.org/wiki/Sensitivity_analysis) is an interesting tradeoff between generating a specific algorithm and a black box, in that it tells you which parameters had the biggest impact on the output given some model like an SVM.
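
A crude one-at-a-time sensitivity sketch along those lines (perturb each input feature of a fitted SVM and watch the output move; dataset, perturbation size, and model are all illustrative assumptions):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    model = SVC(probability=True, random_state=0).fit(X, y)

    base = model.predict_proba(X)
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += X[:, j].std()  # nudge one feature by a standard deviation
        shift = np.abs(model.predict_proba(Xp) - base).mean()
        print("feature %d: mean probability shift %.3f" % (j, shift))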


I'm an undergrad in a lab that focuses on compressed sensing (and we're getting more into sparse machine learning).

Sparsity is a way of saying that most of the data you collect is related in some way. One advantage of sparsity (and there are many more) is sampling at sub-Nyquist rates [0] and still being able to reconstruct the signal precisely. Why? Because the signal is sparse in the Fourier domain -- there are only a few frequencies present in the signal.
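
As a minimal sketch of that idea, recovering a k-sparse signal from far fewer random measurements than samples (using orthogonal matching pursuit rather than the L1 solvers the literature usually uses; all dimensions are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    n, m, k = 256, 64, 5  # signal length, measurements (m << n), sparsity

    x = np.zeros(n)  # a k-sparse signal (sparse in the sampling basis here)
    x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)

    A = rng.standard_normal((m, n)) / np.sqrt(m)  # random sensing matrix
    y = A @ x  # m sub-Nyquist measurements

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(A, y)
    print(np.allclose(omp.coef_, x, atol=1e-6))  # exact recovery (w.h.p.)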

Reading this article, I had some issues with this quote:

> According to Candes, you could take half the samples (16) and rerun the test. If it is positive, the infected person is in this group; if negative, the culprit is in the other half. You can continue to whittle down the possibilities by once again dividing the group in half and running the test again.

That's just a simple binary search[1]. Instead, why not collect other information? Collect some genetic information, lifestyle habits, living location, etc. Then I believe (that's a strong I believe) you could optimize even further by testing patients that have some of these traits in common -- you'd have groups of unequal sizes based on who you think has the disease. This corresponds to sparsity. Most traits don't matter but a select few do.
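
The halving strategy from the quote, as a sketch (assuming a hypothetical pooled_test oracle that returns True iff anyone in the tested group is infected, and exactly one infected individual):

    def find_infected(samples, pooled_test):
        # Repeatedly test one half of the remaining pool; O(log n) pooled tests.
        lo, hi = 0, len(samples)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if pooled_test(samples[lo:mid]):
                hi = mid  # culprit is in the tested half
            else:
                lo = mid  # culprit is in the other half
        return samples[lo]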

To see real-world results of sparsity: we're able to reconstruct precise approximations after we assume our input is sparse in the wavelet basis (corresponding to many areas of the same color). We're able to push this sampling rate down to ~6% for simple images by doing a "sample, compute" method. The reconstructed approximation and sampled locations are shown in this image [2].

[0]: https://en.wikipedia.org/wiki/Nyquist_rate

[1]: https://en.wikipedia.org/wiki/Binary_search

[2]: http://imgur.com/PmYpze6


I'm the lead frontend developer at Ayasdi, and I figure I should take this opportunity to let the HN community know that we're actively hiring in engineering :)

If you're a frontend engineer with an interest in machine learning and data visualization, Ayasdi is a great place to build those skills. We use Backbone and D3 as our core stack on the client side, and we're pushing at the edge of what's possible when building rich data analysis applications for the web. (Incidentally, we're talking about our approach at the next Bay Area D3 User Group meeting http://www.meetup.com/Bay-Area-d3-User-Group/events/19268574...)

Feel free to contact me directly (contact info in profile) if you're interested in learning more about Ayasdi!


I'm a PhD student in pure mathematics (studying theoretical computer science), and I have a list of industry companies I might want to work at in the event I don't get a satisfactory post-doc.

To what extent do the folks at Ayasdi engage in the research side of the picture? Would you be interested in hiring someone with both strong programming skills and mathematical knowhow to work on both the research and development sides?


Definitely — our R&D team works on developing new mathematical approaches (for example, how to best leverage TDA for classification problems) and also applying existing approaches to new problem domains. The team also does plenty of prototyping (coding) and then works with engineering to move new algorithms to production. Drop me a line and I can put you in touch with someone who can talk in depth about what we're up to on the research side.


I'm in the research group at Ayasdi and can say that we actively do research in both TDA and machine learning and have developed an unpublished but cutting-edge fusion of the two fields - using TDA to enhance traditional machine learning models (not yet released publicly).

I regularly attend academic conferences, give university colloquia, and also talk at industry events (e.g. in the last month I gave two workshop sessions at ICML in Beijing, gave a lecture at a drug design conference in NJ, and presented at a Big Data conference in NYC).

We do basic research, but as a small company (and small group) we have to keep our eyes on the commercial applications and financial implications of our work. For myself, I've found this environment stimulating, and if anything it has helped my research productivity.

It's worth pointing out that Gunnar Carlsson - who is mentioned in the article - is both one of the inventors of TDA and someone who participates in the daily development of the company and product. We have an open seating plan and he currently sits diagonally from my desk, next to one of our most junior employees (not in the research group). In that sense, there's opportunity for people to step up as they desire, with access to people who aren't just leaders in TDA but who invented it.

As a company we would consider someone like yourself for a variety of roles - Data Science, Research, or Engineering. In all of those groups people develop novel techniques that could be qualified as "research", and all groups contain people with PhDs in technical fields - some coming straight out of grad school, some after postdocs, and some after holding faculty positions. We are broadminded regarding academic backgrounds, and there are people with undergraduate degrees making novel contributions as well. One of our presales engineers (who has a PhD) just had a paper published in Nature - which I mention to show how broadly research capability is spread through the organization.

If you want to talk more feel free to reach out to me.


Sounds great! I actually met Gunnar briefly at a conference at UChicago two years ago (doubt he remembers me). I'll reach out to you on LinkedIn.


From the article I'm having a hard time figuring out if this new technique is a clustering algorithm, a feature reduction algorithm, or a dimensionality reduction algorithm.


I would say that this is true for unstructured data where the experts who participated in creating that data are not available.

I noticed that by talking with the domain experts - for example, radiologists who read head MRIs - you can greatly simplify the problem. It usually comes at the cost of experts annotating the data, which might be too expensive for some businesses. But then, I think, the data is analyzed just by scholars until they build the expert knowledge into their algorithms.

That's at least the pattern in healthcare data. The first one to notice a paper or source code like that dominates the market - this is just my conjecture.


This identical headline could have been run hundreds of years ago. The process has never stopped. We always need new math.


Heck, the Egyptians originally developed geometry to help confirm the boundaries of the fields after the Nile's annual floods.


That is certainly an early application of geometry. I'm not sure if we know whether geometry was first developed as an applied math, a pure one, or a mystical one. That they were able to show geometry came in handy assuredly helped fund further research and development.


Recht and Candes may champion approaches like compressed sensing, while Carlsson and Coifman align themselves more with the topological approach, but fundamentally, these two methods are complementary rather than competitive.

Minor nitpick, but I think the article (and this quote in particular) might be mischaracterizing the difference. Every form of manifold learning relies first and foremost on the ability to compute some kind of low-dimensional embedding that preserves structure, for which compressed sensing -- and its more recent generalizations; see for instance the work of Wakin et al. [1] -- clarifies the conditions and transformations that make such embeddings feasible.

[1] http://papers.nips.cc/paper/3191-random-projections-for-mani...
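
For intuition, here is a toy sketch of the distance-preservation property these random-projection results rest on (plain Johnson-Lindenstrauss, not the manifold-specific result of [1]; dimensions are illustrative assumptions):

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.metrics import pairwise_distances

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 1000))  # high-dimensional point cloud

    proj = GaussianRandomProjection(n_components=50, random_state=0)
    Xp = proj.fit_transform(X)

    # Pairwise distances are approximately preserved after projection.
    d, dp = pairwise_distances(X), pairwise_distances(Xp)
    mask = ~np.eye(len(X), dtype=bool)
    print(np.median(dp[mask] / d[mask]))  # ratio concentrates near 1.0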


"10 million words recorded during just under 200,000 trials"

So an average of 50 words per trial? I would have expected a higher number.


Any suggestions, links, or research on how TDA or compressed sensing can be used to make sense of spatiotemporal datasets (gridded climatology time series data, change/anomaly detection in time-lapsed satellite imagery etc.)?



