
When quantized to 4 bits, yes. You lose some quality by doing that, though, as compared to the full f16.
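To make the quality loss concrete, here's a minimal sketch of symmetric round-to-nearest 4-bit quantization. It isn't tied to any particular library's format (real 4-bit schemes typically quantize in small blocks, each with its own scale, which this omits); it just shows where the error comes from: every weight gets snapped to one of 16 levels and the rounding is irreversible.

    import numpy as np

    def quantize_4bit(w: np.ndarray):
        """Map float weights to int4 levels [-8, 7] with one per-tensor scale."""
        scale = np.abs(w).max() / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_4bit(q: np.ndarray, scale: float):
        """Reconstruct approximate weights from the 4-bit codes."""
        return q.astype(np.float32) * scale

    # Start from f16 weights, as in the comparison above.
    w = np.random.randn(1_000_000).astype(np.float16)
    q, scale = quantize_4bit(w.astype(np.float32))
    w_hat = dequantize_4bit(q, scale)
    print("mean abs rounding error:", np.abs(w.astype(np.float32) - w_hat).mean())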


From what I've gathered reading up on this topic, if RAM is your constraint, the common wisdom has been that higher-parameter models quantized down to smaller sizes outperform lower-parameter models running at higher precision. In other words, it may still be preferable to use the 70B Llama model quantized to 4 bits than something like an unquantized f16 Falcon 40B or the "coming soon" f16 33B Llama2. (See the back-of-the-envelope memory arithmetic below.)
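A rough sketch of the memory arithmetic behind that trade-off. This counts weight storage only, ignoring the KV cache, activations, and the small per-block overhead real quantization formats add for scales, so the absolute numbers are approximate:

    def weight_gib(params_billions: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GiB for a given parameter count and precision."""
        return params_billions * 1e9 * bits_per_weight / 8 / 2**30

    for name, params, bits in [
        ("Llama 2 70B @ 4-bit", 70, 4),
        ("Falcon 40B  @ f16",   40, 16),
        ("Llama 33B   @ f16",   33, 16),
    ]:
        print(f"{name}: ~{weight_gib(params, bits):.0f} GiB")

    # Roughly ~33 GiB vs ~75 GiB vs ~61 GiB: the 4-bit 70B model needs
    # less RAM than either f16 alternative, despite having more parameters.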


Yes, that is true! But you lose enough performance that comparisons to GPT-3.5 stop working.



