At 32B parameters, the model can run in under 20GB of VRAM with Q4 quantization (minimal loss of quality), or about 80GB unquantized at full fidelity. The quoted 160GB is for their recommended evaluation settings.
There are a few pre-quantized options[0], or you can quantize it yourself with llama.cpp[1]. The resulting GGUF runs with llama.cpp's `llama-cli` or `llama-server`, with LM Studio, or with Ollama.
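The llama.cpp workflow is roughly: convert the Hugging Face checkpoint to GGUF, quantize, then serve. A sketch, assuming llama.cpp is built locally and using an illustrative model directory name (substitute the actual checkpoint path):

```shell
# Convert the HF checkpoint to a full-precision GGUF
# (convert_hf_to_gguf.py ships in the llama.cpp repo)
python convert_hf_to_gguf.py ./model-32b --outfile model-32b-f16.gguf

# Quantize to Q4_K_M, a common quality/size tradeoff for Q4
./llama-quantize model-32b-f16.gguf model-32b-q4_k_m.gguf Q4_K_M

# Serve it over an OpenAI-compatible HTTP API (default port 8080)
./llama-server -m model-32b-q4_k_m.gguf
```

Q4_K_M typically lands a 32B model around the 18-20GB range, which is what makes single-GPU inference feasible.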
Anna doesn't run any advertising. Their income comes entirely from donations, most of which are part of subscription packages that offer faster downloads.