
7B models are so exciting. So much is happening with those smaller models.


BTW, for anyone who might not be aware of it, this model trained by Intel based on the Mistral architecture is probably the single best general 7B model available currently:

https://huggingface.co/Intel/neural-chat-7b-v3-2 (also see https://huggingface.co/Intel/neural-chat-7b-v3-1 from the previous version for more details)

It's licensed Apache 2.0 and unaligned (uncensored).


How is it better than the model from the team that made the dataset? https://huggingface.co/Open-Orca/Mistral-7B-SlimOrca


The Intel one had supervised fine-tuning with the SlimOrca dataset, and then DPO alignment on top of that using a preference dataset.

The technique for generating the preference data is what’s so interesting about that one. Instead of having human labelers choose a preferred response, they generated a response from a small model and a large model, and then always selected the large one’s as the preferred response.
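As a rough sketch, that pairing scheme might look like the following (the function and model names here are hypothetical, not from the actual pipeline):

```python
# Illustrative sketch: build DPO preference pairs by always treating the
# large model's response as "chosen" and the small model's as "rejected".
# No human labelers are involved.
def build_preference_pairs(prompts, small_model, large_model):
    pairs = []
    for prompt in prompts:
        pairs.append({
            "prompt": prompt,
            "chosen": large_model(prompt),    # assumed higher quality
            "rejected": small_model(prompt),  # assumed lower quality
        })
    return pairs

# Toy stand-ins for the two models:
small = lambda p: f"short answer to: {p}"
large = lambda p: f"detailed, higher-quality answer to: {p}"
data = build_preference_pairs(["What is DPO?"], small, large)
```

The interesting design choice is that "preference" is defined purely by model size, which makes the dataset cheap to scale compared to human labeling.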


I haven't personally tried that one, but on the HuggingFace LLM Leaderboard:

Open-Orca/Mistral-7B-SlimOrca - AVG: 60.37, ARC: 62.54, HellaSwag: 83.86, MMLU: 62.77, TruthfulQA: 54.23, Winogrande: 77.43, GSM8k: 21.38

Intel/neural-chat-7b-v3-2 - AVG: 68.29, ARC: 67.49, HellaSwag: 83.92, MMLU: 63.55, TruthfulQA: 59.68, Winogrande: 79.95, GSM8k: 55.12


I wish they were a smidge smaller since 7B LLMs just barely run on a 16GB VRAM GPU (like a T4 server GPU) without quantization shenanigans.

Fortunately, the lessons from building better 7B models will trickle down, or more will be done with distillation (e.g. Gemini Nano).


What's wrong with quantization?


There's still a (subjective) generative quality loss, even with recent tricks to minimize it.


Q8 is generally less than 1% degradation, and Q5_K_M is around 3%. Below that it starts to really degrade.


Can you explain for a noob why?


Easier to train, easier to experiment with. Most research and prototyping happens on the scale that is just barely out of the "toy" category.


Quantization means reducing the number of bits used to encode each floating-point number that makes up a parameter in the model. So instead of having billions of possible values per weight, you might have just 255. The model's weights get crammed into a much smaller set of possible values, which reduces its ability to produce good outputs.
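A minimal sketch of the idea, using simple symmetric 8-bit quantization (real schemes like GPTQ or llama.cpp's k-quants are more sophisticated, e.g. quantizing per block and keeping outliers in higher precision):

```python
# Symmetric int8 quantization: map each weight to one of 255 integer
# levels in [-127, 127], scaled so the largest weight lands at the edge.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]  # lossy: many floats -> one int
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -1.3, 0.5, 0.77]
q, scale = quantize_int8(w)
w2 = dequantize(q, scale)
# w2 only approximates w: the rounding error is what degrades output quality.
```

Each weight now needs 1 byte instead of 2 (fp16) or 4 (fp32), at the cost of the rounding error shown in the round trip.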


Sorry, my question is, why are the 7B models so exciting?


They don't require really expensive and power-hungry hardware to run: a mid-range GPU can run a 4- or 5-bit quantized 7B model at 50+ tokens/second, so running one on a small budget is completely feasible. They are easier to fine-tune because they are smaller, and you can even do CPU-only inference if you really want. There are good OSS implementations like llama.cpp and exllama. And there is a widespread belief that 7B models are not yet tapped out in terms of capability, so they will keep improving.


A quantized 7B model is also about the biggest you can run on an M1 MacBook. It's nowhere near that speed, but it does work.


To add some numbers to the sibling comment: if each parameter is fp16 (a half-precision float, which I think is what LLaMA was trained in), you need 16 bits × 7×10^9 parameters ≈ 13 GiB of RAM to fit a whole 7B model in memory. Current high-end consumer GPUs (4090) top out at 24 GB, so these small models fit in GPUs you can have at home.

For comparison, the next common size up is 13B, which at fp16 already takes ~24 GiB (some of which you'll be using for your regular applications like your browser, the OS, etc.).
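The back-of-the-envelope math above is just bytes per parameter times parameter count:

```python
# VRAM needed to hold a model's weights (ignoring activations and KV cache).
def model_bytes(n_params, bits_per_param):
    return n_params * bits_per_param / 8

GIB = 2**30
print(model_bytes(7e9, 16) / GIB)   # ~13 GiB: 7B at fp16
print(model_bytes(13e9, 16) / GIB)  # ~24 GiB: 13B at fp16
print(model_bytes(7e9, 4) / GIB)    # ~3.3 GiB: 7B quantized to 4-bit
```

The last line is why quantized 7B models fit on low-end GPUs and, eventually, phones.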

7B models are also faster, since fewer weights means less computation per generated token.

Training requires even more RAM (and the more RAM you have the faster you can train).

You could quantize a 13B model to make it fit in consumer cards without large quality losses (see e.g. the charts for k-quants LLaMA inference[0]), though quantization hurts training more than it hurts inference (I couldn't find charts for that; I'm on mobile). By the same logic, you can quantize 7B models to run them on even less powerful hardware: low-end consumer GPUs, or eventually even mobile phones (which are also power-sensitive because they run on batteries).

[0] https://github.com/ggerganov/llama.cpp/pull/1684



