I think we only claim to be able to *preprocess* a matrix at "up to" 100GB/s/core. The overall matrix product takes longer, and how much longer depends on the matrix shapes.
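To make the distinction concrete, here is a back-of-envelope sketch of what the "up to" figure implies for the preprocessing step alone. It is arithmetic, not a benchmark. The function name, the float32 element size, and the linear scaling across cores are all assumptions made for illustration; none of them come from the paper or the code.

```python
def min_preprocess_seconds(rows, cols, bytes_per_element=4, cores=1,
                           peak_bytes_per_sec_per_core=100e9):
    """Best-case time to preprocess a rows x cols matrix.

    Assumptions (illustrative only): float32 elements, the quoted
    "up to" 100GB/s/core peak rate, and perfect scaling across cores.
    The result is a lower bound on preprocessing time, and says nothing
    about the time for the full matrix product.
    """
    total_bytes = rows * cols * bytes_per_element
    return total_bytes / (peak_bytes_per_sec_per_core * cores)


# A 1,000,000 x 250 float32 matrix is 1 GB, so at peak rate on one core
# preprocessing takes at least 1e9 / 1e11 seconds.
print(min_preprocess_seconds(1_000_000, 250))
```

Since 100GB/s/core is a peak rate, real preprocessing time can only be at or above this number, and the rest of the product adds to it.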

To summarize Section 1.1, our method helps when:

1) You need to perform a matrix product more quickly and can tolerate approximation error

2) You have a training set for the larger matrix

3) The smaller matrix is either a) fixed or b) skinny relative to how tall the larger matrix is.

Re: "an impressive, but dangerous, tool to people who don't know what they're doing."

I believe you are overestimating the usability of my code :). But more seriously, I suspect that people attempting to use our method in contexts it wasn't designed for will quickly discover that they either can't actually call the API the way they wanted to, or that the method is no faster for their purposes. We also characterize our method at least as thoroughly as any approximate matrix multiplication algorithm I'm aware of, and have a variety of (admittedly loose) theoretical guarantees. So I hope that at least those who thoroughly read the paper will have a clear idea of what it can do. Overall, I guess my current thinking is that 1) I'm not sure how to introduce a method any more responsibly, but 2) if I can be of help in ensuring that it gets used well, feel free to reach out.