Handling Tokenization and Structured Inputs in LLMs
Large Language Models don't "see" JSON, CSV, XML, or Markdown the way we do. At the API layer, you send something that looks nicely structured. Inside the model...
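To make the point concrete, here's a minimal sketch. The `toy_tokenize` function is a hypothetical stand-in, not a real BPE tokenizer; it only illustrates that the model receives a flat stream of tokens, with nothing marking braces or keys as "structure":

```python
import re

def toy_tokenize(text):
    # Toy splitter (NOT real BPE): word runs, whitespace runs, and
    # single punctuation characters, mimicking how subword tokenizers
    # fragment structured text into a flat sequence.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

doc = '{"user": {"id": 42, "name": "Ada"}}'
print(toy_tokenize(doc))
# The quotes, braces, and colons are just more tokens in the stream;
# the JSON tree you sent at the API layer is invisible at this level.
```

A real tokenizer (tiktoken, SentencePiece) fragments things even less predictably, merging quotes with keys or splitting numbers mid-digit, but the underlying point is the same: structure survives only as a statistical pattern over tokens.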
Deploying Vesuvian Vesuvian is finally ready to be released into the wild. It's far from complete, and if it were up to me, I'd still need to spend anothe...
I've been trying to self-host some AI models on my Hetzner server, and it's been an experience. My first idea was to run big models like gpt-oss-120b or Kimi-De...
Terminal IDEs and coding agents like Claude Code are popping up like mushrooms lately. I don't know if it's because it's easier than building a full-fledged fat...
I'm building a system to experiment with local LLM models and host my apps. I'm using a dedicated Hetzner server with a Kubernetes cluster on it. I've always wa...
What do you do when you want to be part of the local AI gang but can't afford GPUs? You run CPU inference and hope to hit 10 tokens per second. Before I start w...
What do you do when you keep reading about open-source models, people getting excited about what comes out of the Alibaba labs, and z.ai releasing GLM 4.6, but yo...
There are maybe two things you can learn from this post: 1. The dangers of vibe coding and DevOps in a fairly complex environment. I'd say I know what I'm doing m...