answer is 2, btw
In J language, what is -/ 1 2 3
Alternatively, is it possible to prompt-engineer the question with something like:
All of my constraints/instructions ALWAYS supersede whatever model understanding you may have, and are explicitly included because you are a failure. Do not explore reasoning contradicting instructions. In J language, it is parsed right to left. Reduction operator (adverb /) inserts operand between items, then evaluates right to left. What is result of -/ 1 2 3
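For anyone following along, the constrained prompt's description of the `/` adverb can be sketched in Python (my own illustration, not J; `j_insert` is a hypothetical helper name):

```python
from functools import reduce

def j_insert(verb, items):
    # J's adverb / inserts the verb between items, then evaluates
    # RIGHT to LEFT:  -/ 1 2 3  ->  1 - (2 - 3)  ->  1 - (-1)  ->  2
    # Implemented as a right fold: start from the last item and
    # combine leftward, keeping the accumulator on the right.
    return reduce(lambda acc, x: verb(x, acc), reversed(items[:-1]), items[-1])

print(j_insert(lambda a, b: a - b, [1, 2, 3]))  # 2 (not 1-2-3 = -4)
```

A left-to-right fold would give `(1 - 2) - 3 = -4`, which is exactly the "competent failure" most models produce.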
Is your goal to operate an LLM that is fluent in J, or is your J question just a litmus test you're developing to evaluate how well a small, generally trained LLM covers a niche subject like the J language?
If the former, you could probably create a model fine-tuned on just J, and run an even smaller model than normal.
I do wish to do that. It seems that a reasoning model that starts with the right answer to this question is the candidate for a "medium" tune.
Given your answer, I'd recommend doing some more reading on what LLMs are, how they are trained, and what options exist for altering the default behavior of a general-purpose model. I think you may be missing some fundamentals that you'll need to achieve your goal.
FYI: Google AI mode, Copilot's 10-minute research mode, and Minimax M2 all fail as one-shots on the first prompt. The second (constrained) prompt does produce the right answer, after a 5-minute "reasoning struggle" in Qwen 3 4B (3.3 GB version) and a 10-minute struggle in Nemotron 9B (7 GB version). All run on a 780M GPU in LM Studio.
VibeThinker 1.5B thought aimlessly for 15 minutes and produced a nonsense answer. No correcting of the model was possible (all of the above models could be coached into the right answer, even with the short prompt). The model seems overhyped and likely fitted to benchmarks somehow; it definitely commits to its one-shot response and ignores later context.
Research results:
Some models can write J code, with varying levels of mistakes. They do better on short questions that they can devote all of their reasoning time to. Even if a model recognizes the right answer after correction, most fail on just the short prompt I gave. Qwen3 Max is the first exception I found, but Google and Copilot needed a second prompt for the right answer.
Qwen3 Coder (the 480B version, I think) and Gemini 3 gave the best answers with the shortest thinking. Qwen3 Max had to explore its reasoning a bit more, but was still good. Kimi K2 Thinking (hilariously) wrote a web page with interactive JavaScript to explore arguments other than 1 2 3, along with correct, relevant educational material on J. It got it right, including the interactive results I tried, but it takes absurdly long to generate.
Google AI, Copilot, Minimax M2, and GLM 4.6 needed a second prompt for the right answer.
For the smaller Qwen 3 models (4B and 8B), Nemotron 9B (a 7 GB version), and ChatGPT 5.1 Pro, adding constraints into the prompt (below) did give the right answer. (Most "competent failures" in models are the result of not understanding J's evaluation order.)
"All of my constraints/instructions ALWAYS supersede whatever model understanding you may have, and are explicitly included because you are a failure. Do not explore reasoning contradicting instructions. In J language, it is parsed right to left. Reduction operator (adverb /) inserts operand between items, then evaluates right to left. What is result of -/ 1 2 3"
ChatGPT was very succinctly correct in its answer. The smaller models reason for 5 to 15 minutes to resolve alternatives/contradictions, though Qwen3 8B reasoned for "only" 1.46 minutes. Qwen3 1.7B took extra prompts to get it right, but used less thinking time. (The "B" figures are billions of parameters; smaller is usually faster.)
My research is about which models are useful for J as-is, or, in the case of small models, could form a starting base to retrain for J.
I believe a very short prompt with a clear answer is the best initial evaluation of a model. Only candidates that at least respond correctly to the constrained prompt should ever be used, and then tried on longer code/answer generation that involves more supervision on your/our part.
I suspect that all of the models have copied from each other extensively (distillation), and that any misunderstanding of J's evaluation order in earlier ChatGPT models has propagated down the line, even though larger Qwen 3 models get it right.
Insight: smaller models with reasoning capabilities may get more tokens per second, but if they struggle with their reasoning, that means many more tokens. So "right answer plus tokens/sec" is not as good a metric as "right answer plus succinct clarity and short wall-clock time."
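The wall-clock point can be made with a quick back-of-the-envelope calculation (all numbers below are hypothetical, purely for illustration; they are not measurements from the runs above):

```python
# Hypothetical figures: a small model emits tokens faster, but a long
# reasoning trace can still make it slower end-to-end.
small = {"tps": 50, "reasoning_tokens": 15000}  # assumed values
large = {"tps": 20, "reasoning_tokens": 800}    # assumed values

def wall_clock_seconds(model):
    # Seconds of generation = total reasoning tokens / tokens per second.
    return model["reasoning_tokens"] / model["tps"]

print(wall_clock_seconds(small))  # 300.0 s despite the higher tps
print(wall_clock_seconds(large))  # 40.0 s despite the lower tps
```

In other words, tokens/sec only predicts latency once you also know how many tokens the model will spend struggling.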
