Thanks for this tip! I don’t have a lot of VRAM, just 64GB of regular RAM, but I don’t mind waiting for output :)
But anyway, none of the non-Llama models were that good when using RAG in plug-and-play mode. I probably should’ve spent more time on the system prompt and the Jinja template, as well as on RAG curation, to squeeze all the juice out, but I wanted something quick and easy to set up, and for those needs Llama 3.2 8B Instruct was the best. I used the default setup for all models and the same system prompt.
Also, the new Qwen reasoning model was good, and it was faster in my setup, but it was too “independent”, I guess: it tended to ignore instructions from the system prompt and other settings, while Llama was more “obedient”.
I also noticed from my tries that RAG doesn’t affect the AI’s output as much as I expected. When I paste text directly into the prompt, the AI tends to quote from it or ignore what I said completely lol. RAG is more like telling the AI: “here’s a document (or documents) that you have to look through every time you generate output”, and it just does that.
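To make the “look through documents every time” idea concrete, here’s a toy sketch of what a RAG step does before the model ever sees your question. This isn’t my actual setup (I used plug-and-play RAG), and the word-overlap scoring is a stand-in for real embedding search, but the mechanics are the same: retrieve the most relevant chunks, then inject them into the prompt.

```python
# Toy sketch of RAG-style context injection. The similarity function is a
# naive word-overlap stand-in for embedding search; real setups use a
# vector database. Everything here is illustrative, not a real RAG stack.

def score(query: str, chunk: str) -> int:
    """Count shared lowercase words between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query: str, documents: list[str], top_k: int = 2) -> str:
    # Pick the chunks most similar to the query...
    retrieved = sorted(documents, key=lambda d: score(query, d), reverse=True)[:top_k]
    # ...and prepend them, so the model "has to look through them"
    # every time it generates an answer.
    context = "\n".join(f"- {d}" for d in retrieved)
    return f"Use these documents:\n{context}\n\nQuestion: {query}"

docs = [
    "Llama 3.2 supports long context windows.",
    "Qwen models include a reasoning variant.",
    "RAG injects retrieved documents into the prompt.",
]
print(build_prompt("How does RAG inject documents?", docs))
```

The final prompt the model receives is just your question with the retrieved chunks stapled on top, which is why it feels like the model is “looking through” the documents on every generation.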