
When quantized to 4 bits, yes. You lose some quality by doing that, though, as compared to the full f16.
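To make the quality loss concrete, here's a minimal sketch of symmetric round-to-nearest 4-bit quantization. It isn't tied to any particular library's format (real 4-bit schemes typically quantize in small blocks, each with its own scale, which this omits); it just shows where the error comes from: every weight gets snapped to one of 16 levels and the rounding is irreversible.

    import numpy as np

    def quantize_4bit(w: np.ndarray):
        """Map float weights to int4 levels [-8, 7] with one per-tensor scale."""
        scale = np.abs(w).max() / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_4bit(q: np.ndarray, scale: float):
        """Reconstruct approximate weights from the 4-bit codes."""
        return q.astype(np.float32) * scale

    # Start from f16 weights, as in the comparison above.
    w = np.random.randn(1_000_000).astype(np.float16)
    q, scale = quantize_4bit(w.astype(np.float32))
    w_hat = dequantize_4bit(q, scale)
    print("mean abs rounding error:", np.abs(w.astype(np.float32) - w_hat).mean())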


From what I've gathered reading up on this topic, if RAM is your constraint, the common wisdom has been that higher-parameter models quantized down to smaller sizes outperform lower-parameter models running at higher precision. In other words, it may still be preferable to use the 70B Llama model quantized to 4 bits than something like an unquantized f16 Falcon 40B or the "coming soon" f16 33B Llama2. (See the back-of-the-envelope memory arithmetic below.)
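A rough sketch of the memory arithmetic behind that trade-off. This counts weight storage only, ignoring the KV cache, activations, and the small per-block overhead real quantization formats add for scales, so the absolute numbers are approximate:

    def weight_gib(params_billions: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GiB for a given parameter count and precision."""
        return params_billions * 1e9 * bits_per_weight / 8 / 2**30

    for name, params, bits in [
        ("Llama 2 70B @ 4-bit", 70, 4),
        ("Falcon 40B  @ f16",   40, 16),
        ("Llama 33B   @ f16",   33, 16),
    ]:
        print(f"{name}: ~{weight_gib(params, bits):.0f} GiB")

    # Roughly ~33 GiB vs ~75 GiB vs ~61 GiB: the 4-bit 70B model needs
    # less RAM than either f16 alternative, despite having more parameters.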


Yes, that is true! But you lose enough performance that comparisons to GPT-3.5 stop working.



