Kimi K2.6 vs GLM-5.1: I ran both for hours so you don't have to

Two open-weight Chinese frontier models, two weeks apart, both claiming SOTA on agentic coding. Five real tasks via OpenCode, one verdict.

If you've been around the channel, you saw me run Kimi K2.6 against Claude Opus 4.7 last week. Kimi held its own. Opus still won, but K2.6 made it a fight, which a year ago would have been unthinkable for an open-weight model.

So when GLM-5.1 dropped two weeks before K2.6 with that flashy "tops SWE-Bench Pro, beats GPT-5.4 and Opus 4.6" headline, the question wrote itself. Which open-weight king is actually king?

I went in expecting a close fight. Up front: I think GLM might have been nerfed, or something happened on serving. It didn't perform anywhere near where the benchmarks suggested it should. Read on, see the receipts, draw your own conclusions.

Kimi K2.6 · Moonshot AI · 1T MoE · Apr 20, 2026 · 256K context · $0.95 / $4.00 per 1M tok · SWE-Bench Pro: 58.6
vs
GLM-5.1 · Z.ai · 744B MoE · Apr 7, 2026 · 200K context · $1.40 / $4.40 per 1M tok · SWE-Bench Pro: 58.4

5 prompts · OpenCode harness · one-shot

The setup

  • Harness: OpenCode, the same harness I use day to day.
  • Format: One-shot. No re-rolls. What you get is what you get.
  • Tests: 5 prompts picked to stress different muscles. Physics, design fidelity, complex UI, motion graphics, layout reproduction.
  • Spoiler: I tried to run more tests. OpenCode + OpenRouter just refused to play nice with GLM for some of them. We work with what we got.

A note on cost and speed before we dive in. GLM was consistently faster and cheaper per run than Kimi. That's worth saying out loud. If GLM had won on quality, cost-per-quality would have been a slam dunk. It didn't. That matters too.

01. Particle fluid simulation

The prompt. Build a particle-based fluid simulation with controls for gravity, viscosity, attract/repel, and edge containment. Should feel like fluid, not like dots flying around.

This is a physics test. Looks easy. Isn't. Getting particles to behave like a fluid means the model has to think about velocity damping, neighbor interactions, and how forces compose without going unstable.
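
To make that concrete, here's the shape of the per-frame update a working version needs. A minimal sketch with made-up constants, skipping the neighbor-pressure term a proper SPH sim would add, and not either model's actual code:

```ts
// Minimal per-frame integration step for a particle fluid toy.
type P = { x: number; y: number; vx: number; vy: number };

function step(ps: P[], g: number, visc: number, attract: number,
              px: number, py: number, w: number, h: number, dt: number) {
  for (const p of ps) {
    // Gravity: constant downward acceleration scaled by the slider.
    p.vy += g * dt;
    // Attract/repel: force toward (or away from) the pointer, falling
    // off with distance so nearby particles react more strongly.
    const dx = px - p.x, dy = py - p.y;
    const d = Math.hypot(dx, dy) + 1e-3;
    p.vx += (dx / d) * (attract / d) * dt;
    p.vy += (dy / d) * (attract / d) * dt;
    // Viscosity as velocity damping: without this the forces compound
    // every frame and the sim blows up, which is what GLM's looked like.
    p.vx *= 1 - visc * dt;
    p.vy *= 1 - visc * dt;
    p.x += p.vx * dt;
    p.y += p.vy * dt;
    // Edge containment: bounce with energy loss instead of escaping.
    if (p.x < 0 || p.x > w) { p.vx *= -0.5; p.x = Math.max(0, Math.min(w, p.x)); }
    if (p.y < 0 || p.y > h) { p.vy *= -0.5; p.y = Math.max(0, Math.min(h, p.y)); }
  }
}
```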

Kimi. ~5 minutes to generate. The simulation works. Crank gravity up and particles fall harder; pull it down and they float. Attract pulls them in. Repel pushes them out. Not as polished as Opus's version (I tested that earlier), but it works as advertised.

GLM. ~1 minute to generate. Faster, cheaper. But the particles lose their minds. Gravity barely registers. Attract and repel basically don't function. Viscosity is the only control where you can sort of see something happening at the edges. It isn't a fluid. It's chaos with a UI strapped on.

I genuinely sat there toggling GLM's gravity slider from -10 to +10 wondering if it was wired to anything at all. The slider moved. Nothing changed on screen. That was the moment I started getting suspicious that something was off with this model today.

TEST 01 / FLUID SIMULATION · Kimi: working physics · GLM: chaos · Winner: KIMI

Round 1: Kimi. Score: 1 to 0.

02. NASA-style solar system mission control

The prompt. Build a NASA-style mission control UI for a solar system. Click planets to see orbital parameters, classification, equatorial radius. Should feel like real software, not a tutorial.

I'd run this one before in the Opus comparison, so I had a strong reference point.

Kimi. 11 minutes. Slower than I'd like. But the result, chef's kiss. The planets had texture. Earth looked like Earth. Mercury looked like Mercury. Venus had that yellowish atmospheric haze. You could click any planet, get the orbital period (Earth ≈ 365 days, accurate), classification, physical properties. Dismiss the panel, fly back, click another. It felt like software.

GLM. 1 minute. Lightning fast. But the planets were literally just colored dots. Earth was a blue dot. Mars was a red dot. That's it. No texture, no atmosphere, no character. The planets also moved so fast around the sun that clicking on them was basically an aim trainer. Orbital data was there if you could catch one, but the experience felt unfinished in a way that 1 minute of generation time explains.

Kimi's Earth had visible continents. Not photorealistic, but you could see the idea of land masses on a sphere. GLM's Earth was a blue circle. That contrast is the whole story of this round. Both models had the same instructions. One reached for richness. The other shipped the minimum viable globe.

TEST 02 / SOLAR SYSTEM · Kimi's Earth has continents · GLM's Earth is a blue dot · Winner: KIMI

Round 2: Kimi. Score: 2 to 0.

03. Dashboard from a Dribbble screenshot

The prompt. Pasted a complex dashboard UI from Dribbble. "Build this." No further guidance.

A layout fidelity test. Can the model look at a designed interface and reproduce its structure, hierarchy, and visual language one-shot?

Kimi. Got most of it right. Messed up one section, a chart area where the styling fell apart. Layout structure was correct, color palette was close, typography was decent. That one botched section dragged the result down because dashboards are supposed to feel whole.

GLM. Nailed the structure. The whole thing held together. No section blew up. It wasn't a perfect 1:1 of the Dribbble shot (neither model was) but as a usable, functional dashboard, GLM's version shipped better. With one more pass and a tighter prompt, you could deploy it.

Kimi might have won this with a re-roll. We're not doing re-rolls. That's the deal. One shot, what you get is what you get. GLM showed up clean on the first try and that wins the round.

TEST 03 / DASHBOARD REPRO · GLM held it together · Kimi broke one panel · Winner: GLM

Round 3: GLM. Score: 2 to 1.

04. Remotion component

The prompt. Build a Remotion component with a date duration display and a clean motion design.

I've said this before and I'll say it again: Opus owns Remotion. Open-weight models all miss something with motion graphics. They land somewhere between passable and fine, but never quite "yes, ship that."

Kimi and GLM. The output was almost identical. Same date format approach. Same general design language. Same "looks fine but you wouldn't put it on a client deliverable" vibe. I could not pick a winner. Both showed the exact same blind spot.
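
For reference, the pattern both models converged on looks roughly like this. A minimal sketch with placeholder dates and styling, using standard Remotion APIs (useCurrentFrame, spring, interpolate), not either model's literal output:

```tsx
import React from "react";
import { AbsoluteFill, interpolate, spring, useCurrentFrame, useVideoConfig } from "remotion";

export const DateDuration: React.FC = () => {
  const frame = useCurrentFrame();
  const { fps } = useVideoConfig();

  // Hypothetical dates, stand-ins for whatever the prompt supplies.
  const start = new Date("2026-04-07");
  const end = new Date("2026-04-20");
  const totalDays = Math.round((end.getTime() - start.getTime()) / 86_400_000);

  // Count up with a spring; fade the label in over the first 20 frames.
  const progress = spring({ frame, fps, config: { damping: 200 } });
  const days = Math.round(progress * totalDays);
  const opacity = interpolate(frame, [0, 20], [0, 1], { extrapolateRight: "clamp" });

  return (
    <AbsoluteFill style={{ justifyContent: "center", alignItems: "center", background: "#111" }}>
      <div style={{ color: "#fff", fontSize: 120, fontFamily: "sans-serif", opacity }}>
        {days} days
      </div>
    </AbsoluteFill>
  );
};
```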

For comparison, when I ran this with Opus 4.7, the output was visibly different. Different date format, different design philosophy, different visual rhythm. There's something open-weight models converge on for motion design that I haven't fully figured out yet. Probably a training-data thing. Remotion tutorials online tend to look the same, and these models likely learned from the same well.

Side observation worth its own video: I want to do a dedicated Remotion comparison across all the open-weight models, Kimi vs GLM vs Qwen vs DeepSeek, to see if any of them break the convergence.

TEST 04 / REMOTION · Same convergence, same miss · Opus still owns this category · Winner: TIE

Round 4: Tie. Score: 2 to 1, one tie.

05. Pathfinding maze visualizer

The prompt. Build a maze with a pathfinder. Click "race" and watch a dot navigate to its goal. If no path exists, say so.
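
Strip away the canvas and the race button and this is a grid shortest-path problem. A breadth-first search is the simplest correct core, including the no-path case the prompt asks for. A sketch, not either model's code:

```ts
// BFS over a grid maze: 0 = open, 1 = wall. Returns the path from
// start to goal as [row, col] pairs, or null if no path exists,
// which is the case the prompt explicitly asks the UI to report.
function findPath(grid: number[][], start: [number, number], goal: [number, number]) {
  const rows = grid.length, cols = grid[0].length;
  const key = (r: number, c: number) => r * cols + c;
  const prev = new Map<number, number>();
  const queue: [number, number][] = [start];
  const seen = new Set([key(...start)]);

  while (queue.length) {
    const [r, c] = queue.shift()!;
    if (r === goal[0] && c === goal[1]) {
      // Walk parent pointers back to reconstruct the path.
      const path: [number, number][] = [];
      let k = key(r, c);
      while (true) {
        path.unshift([Math.floor(k / cols), k % cols]);
        if (!prev.has(k)) break;
        k = prev.get(k)!;
      }
      return path;
    }
    for (const [dr, dc] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
      const nr = r + dr, nc = c + dc;
      if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
      if (grid[nr][nc] === 1 || seen.has(key(nr, nc))) continue;
      seen.add(key(nr, nc));
      prev.set(key(nr, nc), key(r, c));
      queue.push([nr, nc]);
    }
  }
  return null; // no route: the "say so" branch
}
```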

Kimi. Built it. Maze worked, pathfinding worked, race button worked, "no path" message displayed correctly when I blocked all routes. Functional, complete, ships.

GLM. I could not get this test through. OpenCode + OpenRouter kept failing when I ran GLM on this prompt. I don't know if it was a routing issue, a rate limit, or something specific to the model's serving. After multiple retries across several hours, it never completed. That's a result in itself: when you're shipping, "the model doesn't run reliably through my harness" is a real strike, even if it isn't the model's fault directly.

I gave GLM the benefit of the doubt on the other tests. Kimi shipped this one. GLM didn't.

TEST 05 / PATHFINDER · Kimi shipped, GLM never ran · Harness reliability counts · Winner: KIMI

Round 5: Kimi. Final score: 3 to 1, with one tie.

The scoreboard

FINAL SCORECARD

  ROUND                  WINNER
  01 · Fluid Simulation  KIMI
  02 · Solar System      KIMI
  03 · Dashboard Repro   GLM
  04 · Remotion          TIE
  05 · Pathfinder        KIMI
  TOTAL                  Kimi 3 · GLM 1 · 1 tie

The verdict

Kimi K2.6 takes it. Not by a small margin: three clear wins, one tie, one loss.

Here's the part the score doesn't tell you: GLM-5.1 should have been more competitive than this. Two weeks ago it topped SWE-Bench Pro at 58.4, beating GPT-5.4, Opus 4.6, and Gemini 3.1 Pro on that benchmark. The benchmarks say frontier-class. The output, today, on these tests, did not.

A few things could explain it.

  1. The benchmarks measure a different thing. SWE-Bench Pro tests real GitHub issue resolution. My tests are creative-execution-heavy: physics, visual richness, layout fidelity. GLM might be tuned harder for the bug-fixing direction.
  2. Harness compatibility. OpenCode is a Claude-Code-style harness. GLM was designed with Claude Code compatibility in mind, but maybe OpenCode's small differences cost it. The pathfinder test that just refused to run is a real signal here.
  3. It might have been an off day on Z.ai's serving infrastructure. I don't know. I can only report what I saw.

What I do know: for day-to-day work right now, in Hermes Agent / OpenClaw / OpenCode workflows, Kimi K2.6 is my open-weight pick. Not even close. It's easier to trust, more vivid in its outputs, and it ships when you tell it to ship.

GLM-5.1 is still a frontier-class model on paper, and I want to test it again in a different harness before I write it off. But if you're picking one open-weight model to put into your daily flow today, you pick Kimi.

Kimi-piloted. My open-weight pick until something dethrones it.

Can you actually run either of these?

This is a canirunthis.ai post, so we have to ask the question.

Short answer: no. Not on a normal machine.

  • Kimi K2.6 is 1T total / 32B active. Even at INT4 you're looking at roughly 500GB of weights on disk and a multi-GPU rig (8× H100 territory) to serve it. Napkin math below.
  • GLM-5.1 is 744B total / 40B active. Similar deal. INT4 lands around 372GB, still deep in serious server hardware land.
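
The napkin math, if you want it: weights-on-disk is params times bytes per param, and INT4 is half a byte per param. Weights only; KV cache and activations come on top.

```ts
// Weights-only footprint: params × bytes per param; INT4 is 0.5 bytes.
// Rough napkin math, not vendor-published sizes.
const weightGB = (params: number, bits: number) => (params * bits) / 8 / 1e9;

console.log(weightGB(1.0e12, 4)); // Kimi K2.6: 500 GB
console.log(weightGB(744e9, 4));  // GLM-5.1:  372 GB
```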

For both, the realistic path is the API. K2.6 via Moonshot, GLM-5.1 via Z.ai. Both ship vLLM/SGLang support if you have the infra and need to self-host for data residency reasons, and GLM's MIT license makes that legally cleaner.

If you want to know what models your hardware can actually run, that's literally what canirunthis.ai is for. Plug in your specs, get an honest answer.

What's next

A few comparisons I want to run.

  • The dedicated Remotion showdown. Kimi vs GLM vs Qwen vs DeepSeek, just on motion design. Why do open-weight models converge here?
  • Local-inference comparison. Models that do fit on a Mac Mini M4 Pro. Different game, same rigor.
  • GLM-5.1 round two. Direct API instead of OpenRouter, fresh harness, see if today was a fluke.

If there's a comparison you want to see, the YouTube version of this post is the best place to drop it.