A postmortem on enstil.ai - a text-to-image site by Michael Truell and Aman Sanger.
It’s Monday, August 22nd. Michael and I have been working on a startup for a few months, but at the moment, we’re pretty blocked by some ML experiments running (they’ll take a few weeks to finish). Meanwhile, we see stable diffusion has just been released. We play with it that evening, and it’s really good. What if this is a unique opportunity to build a new text-to-image company?
Here’s the play-by-play of us building, then deciding to shut down enstil.ai. Hopefully, our learnings can be helpful to other people trying to build using stable diffusion. And hopefully, our users can forgive us for shutting down the site.
Tuesday, August 23: We built a scrappy site, bought some domains, and rented one A100 GPU for running stable diffusion. Out of the box, we could generate 3 images in 7.4 seconds from a single prompt. Once the site’s up, anyone can use it. We post it on Reddit and start getting a few users.
Today's Cost: ~$100
Today's Images Generated: 3,663
Today's Users: 17
Wednesday, August 24: We used OpenAI’s API to generate prompt suggestions. We were inspired by the prompt used in Anthropic’s paper for aligning AI assistants. We modified it with several examples of prompt improvements.
We did a ton of research to find ways to speed up the model. TorchScript or TensorRT seem like good options, but we’re busy building up the site infra since we’re seeing decent usage. OpenAI Davinci is also costing us quite a bit per query.
We finish the day by adding an invisible reCAPTCHA so we don’t get screwed by bots.
Today's Cost: ~$100
Today's Images Generated: 18,675
Today's Users: 337
Thursday, August 25: We are inspired by Lexica.art, so we also index 100K images/prompts from the stable diffusion discord for people to search through. We need to increase the load our backend can handle, so we rent 8 A100 GPUs. With the help of Exafunction, we also get a 1.6x speedup on inference time for our models via torchscript. This means 4.86 seconds to generate 3 images per GPU.
Today's Cost: ~$800
Today's Images Generated: 47,067
Today's Users: 1,671
Friday, August 26: We posted the site on r/internetisbeautiful last night.
Suddenly, our image generation times go from 20 seconds to 30 minutes. Our 8 GPUs are getting throttled, so we spin up some new ones. AND our prompt suggestion is using OpenAI’s biggest model, costing us 6 cents per suggestion. We switch to curie quickly before our costs get too high.
Trying to continue the free service (even if we tried to do advertising) wouldn’t be sustainable. We have to build a premium service or we’d risk burning through tons of money.
Today's Cost: ~$2000
Today's Images Generated: 190,320
Today's Users: 10,363
Saturday & Sunday (August 27-28): Both Michael and I are traveling this weekend, so not much work gets done. We just put out a waitlist for a premium version promising <10-second generation, img2img, sketching, etc…
Avg Daily Cost: ~$1600
Avg Daily Images Generated: 213,315
Avg Daily Users: 2,044
Monday, August 29: We take a second to think about what the future of the space looks like. We started this on a whim, but is there a long-term business to be built here?
The math doesn’t yet work out for ads. We would need a LOT of premium users to offset the cost of free users. Could we compete against more established companies like Dalle, MidJourney, or DreamStudio’s API? Would our margins just be competed away? Should we build with a particular user group in mind? No satisfying answers, but we schedule a ton of calls with artists/designers to better understand potential paid users. In the meantime, we heed the classic startup advice to follow the first derivative of growth.
Today's Cost: ~$800
Today's Images Generated: 265,353
Today's Users: 2,091
Tuesday, August 30: We decide to put up the pro service. This means building user authentication, payments, and a separate pro interface for image-to-image, inpainting, sketch-to-image, and fast generation. We spin up 12 more A100s for the pro service and open it in a “closed beta”.
We have over 300 people on our waitlist, but we email just 2 people to test it out. In the meantime, we start setting up GPU clusters on Coreweave since their pricing is insanely good.
Today's Cost: ~$2000
Today's Images Generated: 271,518
Today's Users: 1,606
Wednesday, August 31: We email everyone from the waitlist, but we get really mixed results. With $30 pricing, not many people are signing up, so we switch to $15. Then, we officially make the switch to Coreweave with cheaper GPUs (V100s) and autoscaling to dramatically bring down costs.
We get some feedback for the pro version: “I need a manual to use this”. We simplify it a ton and take out image-to-image.
Today's Cost: ~$450
Today's Images Generated: 269,589
Today's Users: 1,483
Thursday, September 1: From the waitlist, we get just 8 paid users. Far fewer than we expected. We continue reaching out to designers and reading the literature on diffusion models. Maybe we could finetune Stable Diffusion on useful subsets of images (for photorealism)? Maybe we should try subbing in a T5-XXL encoder for the CLIP text encoder and with a bit of finetuning, dramatically improve the quality?
Today's Cost: ~$450
Today's Images Generated: 318,741
Today's Users: 1,378
Friday-Monday, September 2-5: On Friday, we think this may not be worth pursuing. We have 8 paying users and we’re absolutely bleeding cloud credits. We’ll leave enstil running in the background and open the pro service to the general public in an “open beta”.
Since our original ML experiments from before enstil are almost finished, we continue working there where we left off.
Avg Daily Cost: ~$500
Avg Daily Images Generated: 386,967
Avg Daily Users: 2,085
Tuesday, September 6: Looking at our Stripe dashboard, we see… 70 paying users! Not too bad. Maybe enstil can be saved. We decide to add a few more features and leave enstil on the back-burner. We talk to our users and decide the best things to add would be upsampling, custom aspect ratios for generation, and a free trial. But we’re still lukewarm about enstil given how much it costs to run.
Today's Cost: ~$500
Today's Images Generated: 390,558
Today's Users: 2,240
Wednesday, September 7: Some pretty bad news. Doing some analytics, we are paying over $18/month per pro-user in GPU costs. $15/month pricing gives us negative margins for just the pro users! We need to increase the price to $30/month. And, it turns out we’ll need to hit almost $500,000 ARR in order to offset the cost of our free tier. This means we’ll only break even past 1.4K paying users!
Reaching this number will be an absolute grind. We decide to push one last set of features and look at growth in a few days.
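For anyone who wants to redo the arithmetic above, here is a rough sketch. It uses only the rounded figures quoted in this post ($500,000 ARR, $15 and $30 monthly prices, roughly $18/month in GPU cost per pro user); the function names are made up for illustration.

```python
# Figures quoted in the post; the 12-month year is the only added assumption.
TARGET_ARR = 500_000          # dollars per year needed to cover the free tier
PRICE_PER_MONTH = 30          # dollars, the raised pro price
GPU_COST_PER_PRO_USER = 18    # dollars per month (the post says "over $18")


def paying_users_needed(target_arr: float, price_per_month: float) -> float:
    """Subscribers whose yearly payments add up to target_arr."""
    return target_arr / (price_per_month * 12)


def monthly_margin(price_per_month: float, gpu_cost: float) -> float:
    """Revenue minus GPU cost for one pro user, per month."""
    return price_per_month - gpu_cost


users_needed = paying_users_needed(TARGET_ARR, PRICE_PER_MONTH)
print(f"Paying users needed at $30/month: {users_needed:.0f}")
print(f"Margin per pro user at $15/month: {monthly_margin(15, GPU_COST_PER_PRO_USER)}")
print(f"Margin per pro user at $30/month: {monthly_margin(30, GPU_COST_PER_PRO_USER)}")
```

With these inputs the user count comes out just under 1,400, which lines up with the 1.4K figure, and the $15 price shows the negative per-user margin.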
Today's Cost: ~$500
Today's Images Generated: 396,345
Today's Users: 3,954
Thursday, September 8: It’s kind of sad, but I start to really enjoy the product here. Our load times are scary fast for pro users. You can generate images in 5 seconds, upsample and edit in 1-2 seconds, and find prompts for inspiration instantly. I feel hooked. Maybe other people will too? We build out a free trial offering and release it.
Then, we take a step back and think again about the future. Do we have the conviction in the long-term success and vision of enstil to keep working on this? Even if it means burning lots of money for several months? It’s hard to say. After reading this article, the answer needs to be a resounding yes if we want to get through the hard times.
And the pricing model is very difficult to sustain. Right now, we are seeing a bit under 15 minutes of usage a day from our average pro user. In order to break even, we would need pro users to use the site for under 24 minutes a day. This is in direct conflict with building a product our users love and want to spend time on.
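The usage ceiling can be approximated with a small sketch. It assumes GPU cost scales linearly with the minutes a user spends on the site, and it plugs in the rounded numbers from this post (about 15 minutes a day costing about $18 a month, against a $30 price). The function name is hypothetical.

```python
def break_even_minutes(observed_minutes: float,
                       observed_monthly_cost: float,
                       monthly_price: float) -> float:
    """Daily usage at which GPU cost equals the subscription price.

    Assumes cost grows in proportion to minutes of usage, which is a
    simplification (idle GPUs and autoscaling overhead are ignored).
    """
    cost_per_daily_minute = observed_monthly_cost / observed_minutes
    return monthly_price / cost_per_daily_minute


limit = break_even_minutes(observed_minutes=15,
                           observed_monthly_cost=18,
                           monthly_price=30)
print(f"Break-even usage: about {limit:.0f} minutes a day")
```

The rounded inputs give about 25 minutes. Since actual usage was a bit under 15 minutes and actual cost a bit over $18, the true ceiling lands slightly lower, in line with the 24 minutes quoted above.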
So, despite some middling growth (90 total pro users), we decide that it would take very serious numbers to change our minds. Most of our paid users were literally losing us money. And increasing the price to $30 dramatically stunted growth.
Our last hope was the free trial. If the premium version really has enough appeal, some of our thousands of free users would try it out, and we should see a boost in growth. If not, we would shut down the site on Friday.
Today's Cost: ~$500
Today's Images Generated: 404,442
Today's Users: 3,660
Friday, September 9: Sadly, the free trial doesn’t have any effect on growth, and over the last 24 hours we see 1 new pro user. If our thousands of free users get to try out premium and decide it’s just not worth it for them, it will be very difficult to scale to $500K ARR.
We finally shut down the site and refund all paying users.
Today's Cost (half day): ~$250
Today's Images Generated (half day): 272,709
Today's Users (half day): 3,000
Saturday, September 10:
Images Generated: 0
- Cost of enstil.ai: $11,700 (don’t worry, almost all of this was cloud credits)
- Prompts submitted: ~ 1.6M
- Images Generated: ~4.8M
- Cost per Image: ~$0.0023
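As a sanity check on the totals above, here is a minimal sketch that recomputes the per-image cost from the rounded figures in the list. Because the inputs are rounded, the result differs slightly from the ~$0.0023 quoted.

```python
# Rounded totals from the summary list above.
total_cost_usd = 11_700
prompts_submitted = 1_600_000
images_generated = 4_800_000

cost_per_image = total_cost_usd / images_generated
images_per_prompt = images_generated / prompts_submitted

print(f"Cost per image: ${cost_per_image:.4f}")
print(f"Images per prompt: {images_per_prompt:.0f}")
```

The three images per prompt match the "3 images from a single prompt" setup described on the first day.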
I think there is probably something to build in place of enstil. With techniques like knowledge distillation, more efficient diffusion sampling (like k-Euler), highly optimized model code, and really good autoscaling, I could see the price per image drop from $0.002 to $0.0002. A 10x decrease in costs would mean advertisements for free users are possible and unlimited generations for pro users is a fine business model.
In fact, with $30/month pricing and 5-second generation time, pro users would have to spend over 2 hours a day every single day generating images for you to be losing money. Giving them 10-second generation time means that number increases to 4 hours. The pricing model can work!
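Here is one way the 2-hour and 4-hour figures can be reproduced. This is a sketch under stated assumptions, not necessarily the exact calculation we did: it assumes the optimistic $0.0002 per image, 3 images per generation (as on the site), a 30-day month, and a user who generates back-to-back with no pauses. The function name is hypothetical.

```python
def break_even_hours_per_day(price_per_month: float,
                             cost_per_image: float,
                             seconds_per_generation: float,
                             images_per_generation: int = 3,
                             days_per_month: int = 30) -> float:
    """Hours of nonstop generation per day at which cost equals price."""
    daily_budget = price_per_month / days_per_month
    images_per_day = daily_budget / cost_per_image
    generations_per_day = images_per_day / images_per_generation
    return generations_per_day * seconds_per_generation / 3600


hours_fast = break_even_hours_per_day(30, 0.0002, seconds_per_generation=5)
hours_slow = break_even_hours_per_day(30, 0.0002, seconds_per_generation=10)
print(f"5-second generations:  {hours_fast:.1f} hours/day")
print(f"10-second generations: {hours_slow:.1f} hours/day")
```

Under these assumptions the break-even points come out to roughly 2.3 and 4.6 hours a day.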
But we’re not there just yet, and we didn’t have the appetite to eat $450 a day in losses until we highly optimized the model code.
AND Michael and I have been working on something else entirely for several months. We weren’t ready to abandon it for something we didn’t have a ton of conviction in.
Future of Text-To-Image
Maybe the model matters
In these last few weeks, we worried a lot about the future of text-to-image. Maybe we are very far from the ceiling of how good these models are. If so, I worry a lot about OpenAI as a competitor. Dalle is effectively as good as GLIDE, a paper OpenAI released last December. And papers are usually released several months after the actual model is trained.
Who knows how much better their current models are? A simple recipe they could be taking to improve Dalle is massively scaling the text encoder, scaling the diffusion models, then running RL with human feedback.
If the ceiling of performance/quality is very high, there may be just 1-2 companies that win the space, and they will be the ones that can train and build the biggest and best models.
If Stability.AI can keep up with this pace, it’s still worrying for enstil if the text-to-image model is all that matters. Anyone can serve stability.ai’s models and there will be little differentiation between products.
Or maybe the packaging matters
Maybe performance differences don’t really matter to the average consumer. Or we’re near the limit of how good these models can get. Then, whoever can package these models with the best product will win.
I think this is a more likely outcome, but it’s also a reasonably tough place for enstil to be in. The barrier to entry is nonexistent since anyone can start out by using stable diffusion’s API or serving their models. And it’s unclear if any companies will build long-term moats.
But, there is also the opportunity for several companies to co-exist in the space. There could be the one for general consumers, the one geared for graphic artists, the one geared for web design, etc…
Where enstil lies
I think we probably live in the world where the packaging matters. And enstil was not targeted to any particular user group. It was a general consumer text-to-image model that our users loved.
There was a moment when building enstil.ai when I realized that text-to-image generation and editing is something that could genuinely capture mass consumer attention. I am not a particularly artistic person, and I never felt much of a wow moment initially playing with MidJourney or Dalle2. But, when our system quickly delivered stunning images, edits, and suggestions in seconds, I was hooked.
When you use a text-to-image model, you want to be wowed. The rough sketches of an idea in your mind are suddenly crystallized into a beautiful and vivid image. It’s a problem of mapping idea space to image space, in a way that shocks and amazes the user.
But, it needs to happen almost instantly to keep the user’s attention. And if the image doesn’t map to the idea in the user’s mind, the system needs to be able to quickly remedy its errors with minor user feedback. When talking to users, they agreed: speed is the key to success.
With a credit system, the user will always be anxiously counting down how many generations they have left instead of being mesmerized by the product. The pricing model needs to allow for unlimited generations.
I think someone will crack the code and get this consumer product right. As the price of running these models goes to zero, text-to-image will capture more and more consumer attention. I just don’t think enstil.ai was the way to get there.
I’m surprised more people haven’t cropped up as API providers. Replicate’s price per image is heinously high. Just a few days ago, their pricing worked out to just under 4 cents per image. Right now it looks like 1.6 cents per image. And Stability.ai is offering about 1 cent per image.
Yet, it takes literally 1 day of work to get an API endpoint with autoscaling running for 0.3 cents per image, and we got it running for 0.2 cents per image with a little more work. If someone is specifically building out an API, it wouldn’t be too hard to get pricing below 0.1 cents per image.
We thought a lot about training our own diffusion models. What’s insane to me is that if you do the math on Stable Diffusion’s FLOP utilization, they’re at about 5% for training the model at a cost of $300,000. Heavily optimized transformer codebases hit 40% utilization, so if you can just get utilization to 20% for stable diffusion, it would cost just $75K to train it from scratch.
Startups without any funding can get up to $100,000 of cloud credits from AWS or GCP. So you could feasibly train your own Stable Diffusion without spending a dollar of actual money!
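The training-cost estimate is simple proportional scaling, sketched below. It assumes the total FLOPs needed and the price per GPU-hour stay fixed, so cost is inversely proportional to utilization. The $300,000 and 5% figures are the rough estimates from this post, and the function name is made up.

```python
def training_cost_at(baseline_cost: float,
                     baseline_utilization: float,
                     target_utilization: float) -> float:
    """Scale a training bill to a different FLOP utilization.

    Assumes the same total compute and the same hardware price, so that
    doubling utilization halves the GPU-hours needed.
    """
    return baseline_cost * baseline_utilization / target_utilization


CLOUD_CREDITS = 100_000  # upper end of startup credits mentioned above

cost_at_20 = training_cost_at(300_000, 0.05, 0.20)
cost_at_40 = training_cost_at(300_000, 0.05, 0.40)
print(f"At 20% utilization: ${cost_at_20:,.0f}")
print(f"At 40% utilization: ${cost_at_40:,.0f}")
print(f"Fits within credits at 20%: {cost_at_20 <= CLOUD_CREDITS}")
```

At 20% utilization the estimate lands at about $75K, under the $100,000 credit ceiling.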
Stable Diffusion and Stability.ai made me rethink everything about ML commercialization. They’re planning on open-sourcing Chinchilla and other SOTA models that would have stayed closed-source for several years.
The implications are crazy, and I think they may singlehandedly unlock a full startup ecosystem that can build products around their models. For example, what company would you build if you could rely on a much better code-complete model than Codex/Copilot being open-sourced in a few months?