
Not even that. With KV-caching, it's linear in the size of the context; and if someone figured out a way to get, e.g., O(n log n) complexity, I imagine KV-caching might bring it down to O(log n). (If the new algorithm permits that.)


When people say that attention is quadratic, they mean that the cost to process n tokens is O(n²), so the amortized cost per token is indeed O(n). KV-caching is a way to maintain that amortized cost when appending tokens one at a time instead of ingesting the whole sequence at once. But in the end people want to be able to generate multiple tokens, so we're back at O(n²) total time again.
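A minimal cost-model sketch of the point above (not code from the thread): even with a KV cache, step t attends to all t cached keys, so per-step work is O(t) but the total over n generated tokens is still quadratic.

```python
# Count the query-key dot products done by cached attention
# when decoding n tokens one at a time.
def decode_cost_with_kv_cache(n):
    total = 0
    for t in range(1, n + 1):
        # step t: the new query attends to all t cached keys -> O(t) work
        total += t
    return total  # n*(n+1)/2, i.e. O(n^2) in total

print(decode_cost_with_kv_cache(1000))  # 500500: quadratic total despite linear per-step cost
```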

IIRC there are some FFT-based attention alternatives where encoding has complexity O(n log n), but there's no feasible way to cache anything, so after appending a single token it costs O(n log n) again; if you generate n tokens in sequence, the total cost is actually O(n² log n).
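A hypothetical cost model for that last claim (an illustration, not any specific FFT-attention method): with no incremental cache, generating token t re-runs an O(t log t) transform over the whole prefix, and summing that over n steps grows like n² log n.

```python
import math

# Assumed cost model: step t re-runs an O(t log t) FFT-style mixer
# over the prefix, since nothing can be cached between steps.
def generate_cost_fft(n):
    return sum(t * math.log2(t) for t in range(2, n + 1))

# The ratio to n^2 log n settles near a constant (~1/2),
# confirming the O(n^2 log n) total.
n = 4096
print(generate_cost_fft(n) / (n * n * math.log2(n)))
```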



