Real talk

I'm not going to try to sell you a story of a perfect solution. There are limitations, and I had to overcome a few challenges to get it to the point it's at now. GPU is still faster. In fact, this would improve GPU performance beyond anything it can do for CPU, but the real gain is there either way.

If you can currently talk to a 14B at 3 tok/sec, you may see a slight improvement in CPU performance, but the raw processing speed is still limited by reality. Where this system shines is on subsequent turns and in the model's overall behavior.

By saving the KV in memory, the model doesn't have to reprocess the parts of the conversation it has already processed, and on top of that the increased attention tends to leave the model with almost perfect recall. The model doesn't have to process a wall of text every turn and then figure out which part it's playing, what's going on, and how to respond, the way current AI systems do. It only has to process the new input. So instead of that 14B you're using (because it's actually smart enough to be worth using) having to process hundreds or thousands of extra tokens on the second turn, subsequent turns can at times actually be faster, because the model only touches the fresh tokens: no history, no system prompt or banner.
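
To make that concrete, here's a minimal sketch of the turn-by-turn pattern using llama-cpp-python (my choice for illustration; the actual engine is custom, and the model path is hypothetical). The Llama object keeps its KV cache alive between calls, so each turn only evaluates the new tokens:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="model-q4.gguf", n_ctx=8192)  # hypothetical path

def feed(text: str) -> None:
    """Evaluate only the new tokens on top of the existing KV cache."""
    tokens = llm.tokenize(text.encode("utf-8"), add_bos=(llm.n_tokens == 0))
    llm.eval(tokens)

feed("SYSTEM: You are a helpful assistant.\n")   # processed once
feed("USER: Hi there!\nASSISTANT:")              # turn 1: new tokens only
# ...sample the reply here; on turn 2, only the fresh text gets evaluated:
feed("\nUSER: What did I just say?\nASSISTANT:")
```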

That is the real thing that has killed personal AI until now: reprocessing everything is a killer. With this we have warm states. You can preprocess a system prompt or banner and save your model at that exact point, and then every single time you start a conversation you can load that ready-to-go model. Because we save the actual memory, it's literally ready to go. That's the same way we save conversations too: we save the KV, so when we load it into memory, even on a totally different computer, we don't have to reprocess the conversation.
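
Here's the warm-state idea in the same sketch form (again llama-cpp-python as a stand-in for the engine; file names and paths are made up). save_state() captures the evaluated tokens plus the KV cache, so a fresh process can restore it and skip straight past the system prompt:

```python
import pickle
from llama_cpp import Llama

llm = Llama(model_path="model-q4.gguf", n_ctx=8192)   # hypothetical path
llm.eval(llm.tokenize(b"SYSTEM: You are a helpful assistant.\n"))
with open("warm.state", "wb") as f:
    pickle.dump(llm.save_state(), f)                  # tokens + KV cache

# Later, even in a fresh process: restore and skip the reprocessing entirely.
llm2 = Llama(model_path="model-q4.gguf", n_ctx=8192)
with open("warm.state", "rb") as f:
    llm2.load_state(pickle.load(f))
llm2.eval(llm2.tokenize(b"USER: Hi!\nASSISTANT:", add_bos=False))
```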

Now let's talk limitations, because so far everything sounds great. We do have some.

- Saving KV produces substantially larger files than current text-based saves. A Q4 14B at 35,000 tokens is going to be about a 4 GB save (see the back-of-the-envelope math after this list).

- The system naturally keeps the model at high attention across its context window. This means the usable window is substantially smaller than a spread-attention large context window. I am having conversations with 3Bs getting into 10,000-15,000 tokens, 8Bs getting into 15,000-25,000 tokens, and 14Bs getting into the 30,000-40,000 token range before the model starts to fall apart: repeating itself, producing nonsense, etc. Other studies have found similar usable context windows with these models; the claimed larger contexts spread usable attention thin, so the model essentially only sees part of the context, where in nexus it sees everything.
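
For the curious, that 4 GB figure checks out on a napkin. Per token, the cache stores a K and a V vector for every layer, so the size is 2 × n_layers × n_kv_heads × head_dim × bytes per element. The layer and head counts below are my assumption for a typical GQA 14B, not measured from the engine:

```python
# Back-of-the-envelope KV-cache size (model dims are assumed, not measured:
# a typical 14B with GQA -- e.g. 48 layers, 8 KV heads, head_dim 128).
n_layers, n_kv_heads, head_dim = 48, 8, 128
n_tokens = 35_000
bytes_per_elem = 2  # FP16 cache; a quantized KV cache would shrink this

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * n_tokens / 1024**3
print(f"{per_token / 1024:.0f} KiB/token, {total_gib:.1f} GiB total")
# -> 192 KiB/token, ~6.4 GiB at FP16; an 8-bit KV cache halves that to
#    ~3.2 GiB, which is in the ballpark of the ~4 GB figure above.
```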

Off the top of my head, those are the big limitations, and plans are in place to help with at least the context window and attention. I have a system built into the engine that lets me mark certain sections of KV and remove or insert them at will. That will allow storing things outside active memory to increase effective context, and also manually adjusting attention slightly, though that proved tricky when I tried it with music generation. Overall, we have the start of something that can grow into something amazing. The limitations are nothing compared to the improvements.
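
To give a feel for what that marking system implies, here's my own sketch of the bookkeeping side (not the engine's actual code): named KV segments that can be dropped, with later segments shifted left the way an engine shifts cache positions after removing a token range:

```python
from dataclasses import dataclass

@dataclass
class KVSegment:
    """A marked span of the KV cache: [start, end) in token positions."""
    name: str
    start: int
    end: int

class SegmentedKV:
    """Sketch of segment bookkeeping; a real engine would also move the
    underlying KV tensors when a range is removed."""
    def __init__(self) -> None:
        self.segments: list[KVSegment] = []

    def mark(self, name: str, start: int, end: int) -> None:
        self.segments.append(KVSegment(name, start, end))

    def remove(self, name: str) -> None:
        """Drop a segment and shift every later segment left by its width."""
        seg = next(s for s in self.segments if s.name == name)
        width = seg.end - seg.start
        self.segments.remove(seg)
        for s in self.segments:
            if s.start >= seg.end:
                s.start -= width
                s.end -= width

kv = SegmentedKV()
kv.mark("system", 0, 120)
kv.mark("lore", 120, 2120)
kv.mark("chat", 2120, 9000)
kv.remove("lore")  # lore can live on disk; "chat" shifts to 120..7000
```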