Exploring Factory Droid | Computer Generated Reality

Terminal IDEs and coding agents like Claude Code have been popping up like mushrooms lately. I don’t know if it’s because they are easier to build than a full-fledged fat client, or because people are sick of seeing another VS Code fork. Or maybe it’s just another way to fit into the developer experience as an add-on rather than a change. Either way, they are increasingly popular, and for this post I’m looking at Droid.

It took me three days to write this post because of how conflicted I’ve been about the experience. It’s truly been a roller coaster - as I was changing models, use cases and prompts, things would go from good, to bad, to straight up disaster and then I would try a different use case and it would do amazingly well. I guess the lesson here is to use Droid for the right things and with the right model.

Before I go into details and tell you about my ups and downs, here’s a summary of what I think about it:

What I liked:

The branding and how it looks - the IDE has a vintage-futuristic vibe that reminds me of the Star Wars series Andor
The conversation compression - during long chats, Droid would automatically compress the conversation to save tokens and keep the context manageable. Most agents do that to some extent, but I was happy to see the model remembered the initial plan we had even after addressing other tasks
The ability to configure reasoning effort for models like GPT-5 and Claude Sonnet 4.5
Viewing past sessions and navigating through them
Referencing files with @filename

What I didn’t like:

It’s still a terminal. No matter how well it’s made, I don’t think a terminal can truly beat a fat client when it comes to complex codebases
It’s hugely dependent on the model - that part is obvious, but from my experiments, even using the same model in GitHub Copilot and Droid gave me considerably different results
The tools it’s using and the way it’s reading files need work. Many times I would mention a class that needed to be changed and it either failed to find it or read the wrong parts of a file
Context management is quite strange sometimes - when I tried it with a new framework, even though I kept providing documentation, it still wanted to treat the implementation as if it were just React but kind of different

Overall I don’t think I used all the features Droid has to offer, like custom commands, MCPs, VS Code integration and others, but I got a pretty good idea of how I can use it.

Trying Droid + Nue.js

Recently I’ve been introduced to Nue and I was, let’s say, apprehensively sold on it. The idea sounds great, it’s got quite a good number of people excited and the developers are working hard to ship a quality product.

I needed a landing page for my soon-to-be-released app, Vesuvian. I had 20M tokens from the Droid free trial and a free weekend - time to get building.

The initial setup is very simple: install the package and run nue create full, which gives you a complete, full stack app with authentication, authorization, blog and documentation page. Of course it’s all boilerplate, but it’s a decent starting point. After that I asked Droid to start browsing the Nue documentation so it would understand what needed to be done.

[Screenshot: Droid browsing the Nue.js docs to align on conventions.]

As development went on, I started realizing that things were not going in the right direction: the model was not using the Nue syntax and tools available and was instead trying to force a square peg through a round hole. Droid was able to understand my documentation for the content of the page, the overall style and the copy guidelines I provided. It just couldn’t figure out how to do things like center content or create sections.

[Screenshot: initial landing page attempt, structurally off and missing basic layout handling.]

I tried a variety of prompts, from specific instructions to broad requests like “the page needs more color and structure.” Droid would confirm it had completed the task, but the results were never satisfactory. I was left disappointed, wondering whether the issue was Droid itself, the GLM 4.6 model I had high hopes for, the Nue framework, or simply my own prompting skills.

My main question was whether the model’s failure was due to a lack of understanding of Nue, or if Factory Droid and GLM 4.6 were simply a disappointing combination.

To get a clearer picture, I ran the same initial prompt in GitHub Copilot with Claude 4, and the results were considerably better.

[Screenshot: first pass from Claude via GitHub Copilot, with a more framework-aligned layout.]

Obviously this landing page wouldn’t even convert my grandma, but Claude was able to, more or less, understand what I wanted and seemed to be able to make that happen within the confines of how Nue operates. Something funny happened as I was running this “experiment”: I received an email from the Factory CEO asking how it was going and if I had any feedback on Droid. You know what they say, “don’t judge an IDE by its open source model”, so I thought I’d give Droid another chance, but this time using Claude Sonnet just as with GitHub Copilot. That did improve things a lot, so I suppose the big issue here was actually GLM and not Droid.

I later found out that what Claude did differently was to brute-force CSS into the application until I was kind of happy with the results. I soon abandoned the idea of using Nue anyway, because I don’t believe current LLMs are ready for Nue (or the other way around). And just to reinforce this, I did try Crush + DeepSeek as well, with similar results and about $2 worth of tokens.

I think Nue will be great once it has matured enough and once models are trained on codebases that use it. Until then, it’s going to be the kind of tool I use for “hand-crafted” applications.

Droid + React

Because I wasn’t ready to give up on either Droid or GLM 4.6, I decided to try and build a standard React page instead. I also had more documentation, copy samples and other assets produced to help the model. Things went decently up to the point where what I thought was a simple prompt, “remove the old artifacts and sections x, y, z”, broke the entire site. The model was very excited about the great work it did and didn’t think anything was wrong, but visually it was a disaster.

[Screenshot: broken site after the refactor prompt. A simple cleanup request unintentionally broke the layout.]

This is where I realized something big seems to be missing from Droid. I don’t think it’s using a language server (LSP), or at least not fully, because I’ve seen a lot of linting and syntax errors in the code it generates and it did not seem aware of them.

The other thing that is missing is any kind of automated checkpointing. Copilot and Cursor allow you to go back to the previous state whenever things break. With Droid you are kind of left with a broken site, and unless you commit changes between prompts, the only solution is to keep iterating until it’s fixed.

Overall the experience here was ok, but again I was left disappointed by GLM 4.6. In my smaller tests it seemed to be doing really well, and according to benchmarks and general feedback on the internet it’s supposed to be great. Things on the ground seemed to be quite different though.
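Committing between prompts can be scripted so it takes no effort. The sketch below is a minimal manual workaround, not something Droid provides: it assumes the project is already a git repository, and the helper names (git, checkpoint, rollback) are made up for illustration.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its trimmed stdout."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def checkpoint(repo: Path, label: str) -> str:
    """Commit the whole working tree and return the new commit hash."""
    git(repo, "add", "-A")
    # --allow-empty keeps the call from failing when nothing has changed.
    git(repo, "commit", "--allow-empty", "-m", f"checkpoint: {label}")
    return git(repo, "rev-parse", "HEAD")


def rollback(repo: Path, commit: str) -> None:
    """Return the working tree to `commit`.

    Warning: this throws away all uncommitted changes and deletes
    untracked (non-ignored) files created since the checkpoint.
    """
    git(repo, "reset", "--hard", commit)
    git(repo, "clean", "-fd")
```

The idea is to call checkpoint(Path("."), "before cleanup prompt") before sending a risky prompt, keep the returned hash, and call rollback with it if the agent breaks the site.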

Droid + GPT-5

For my last experiment, I fired up Droid, selected the GPT-5 model with high reasoning effort and went rummaging through my full stack codebase. I was so happy! It went through my task list like an eager engineer during their first week on the job. Every single request was properly researched, planned and implemented. I burned through about 10M tokens in a day, but I had fully functional features in my app.

What I found especially funny is that when I’m using GPT-5 with GitHub Copilot, the results are generally good but really nothing to write home about. GPT-5 with Droid is the engineer I always wanted in my team. I’m pretty sure that has to do with the reasoning settings. Even better, with Droid, GPT-5 has 0.5x token usage as opposed to 1x for Copilot, so you get more, better and cheaper.

Conclusion

What I learned is that when it comes to tokens, you really get what you pay for. The 0.25x consumption rate of GLM is good for basic things, but when it comes to using a completely new framework that requires clear reasoning capabilities, it’s not quite there yet.
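To put the multipliers in perspective, here is a rough budget estimate. It assumes a multiplier simply scales the raw token count before it is deducted from the allowance, which is an assumption about how the accounting works rather than a documented billing formula. The function names are hypothetical, and the figures are the ones quoted in this post.

```python
# Multipliers as quoted in this post; illustrative, not official pricing.
MULTIPLIERS = {"GLM 4.6": 0.25, "GPT-5": 0.5}


def billed_tokens(raw_tokens: float, multiplier: float) -> float:
    """Tokens deducted from the allowance for a given raw usage."""
    return raw_tokens * multiplier


def raw_budget(allowance: float, multiplier: float) -> float:
    """Raw tokens a model can consume before the allowance runs out."""
    return allowance / multiplier


if __name__ == "__main__":
    trial_allowance = 20_000_000  # the 20M-token free trial
    for model, multiplier in MULTIPLIERS.items():
        budget = raw_budget(trial_allowance, multiplier)
        print(f"{model}: about {budget:,.0f} raw tokens")
```

Under that assumption, the 20M trial stretches to roughly 80M raw tokens on GLM 4.6 and 40M on GPT-5, which is why the cheaper model is tempting even when the results are weaker.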

I like Droid. I think it definitely has a place in anyone’s development stack, and I’ll use it more once I sort out my on-device AI model. I don’t think I will replace GitHub Copilot or Claude Code with it though. The amount of value I get from GitHub Copilot for $10/month is huge, and Droid (or similar products) just can’t beat that. It also doesn’t really make sense as an add-on to Copilot, since I already have a Claude subscription which gives me access to Claude Code. After that last session, though, I might replace Claude with ChatGPT and get Codex.