From what I've gathered reading up on this topic, if RAM is your constraint, the common wisdom is that higher-parameter models quantized down to smaller sizes will outperform lower-parameter models running at higher precision. In other words, it may still be preferable to use the 70B Llama model quantized to 4 bits rather than something like an unquantized f16 Falcon 40B or the "coming soon" f16 33B Llama2.
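
To make the RAM side of that trade-off concrete, here's a rough back-of-envelope sketch (my own illustration, not from any benchmark or library) that estimates weight memory as parameter count times bits per weight. It ignores the KV cache, activations, and quantization group overhead, so treat the numbers as loose approximations.

```python
# Rough estimate of model weight memory: params * bits_per_weight / 8 bytes.
# Ignores KV cache, activations, and per-group quantization overhead,
# so actual memory use at inference time will be somewhat higher.

def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a given parameter count and precision."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

# Hypothetical comparison of the models mentioned above.
configs = [
    ("Llama 70B @ 4-bit", 70, 4),
    ("Falcon 40B @ f16", 40, 16),
    ("Llama2 33B @ f16", 33, 16),
]

for name, params_b, bits in configs:
    print(f"{name}: ~{weight_memory_gib(params_b, bits):.0f} GiB")

# Approximate output:
# Llama 70B @ 4-bit: ~33 GiB
# Falcon 40B @ f16:  ~75 GiB
# Llama2 33B @ f16:  ~61 GiB
```

So under this crude estimate, the 4-bit 70B model actually fits in roughly half the memory of either f16 option, which is why it can be the better pick when RAM is the bottleneck.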