Saving Big on LLM API Calls with Hosted Ollama

Disappointed by those sky-high bills for API calls to large language models? Yeah, me too. Here’s the good news: you don’t have to settle for those costs anymore.


Introduction

When I first started experimenting with LLMs for my personal projects, I went straight to OpenAI’s API. It was super convenient—no setup, no hassle, just send a request and get a response. But then the bills started piling up. With charges based on tokens, even small projects quickly became expensive.

That’s when I decided to try running my own models. Now, I use Mistral on Ollama paired with a 1x RTX A4000 GPU rented on Vast.ai, and it costs me around 10 cents per hour. For small personal projects, this setup has been an absolute game-changer. The cost savings are huge, and I get more control over how the model runs.

Why This Setup Works for Me

For small-scale projects, OpenAI’s pricing model can be overkill. You’re essentially renting the convenience of their infrastructure, but for tinkering, testing ideas, or building lightweight tools, that level of expense doesn’t make sense.

With Ollama and Vast.ai, I get:

  1. Flat, predictable costs: at just 10 cents per hour for an A4000 instance, I know exactly what I’m spending.
  2. Unlimited queries: I’m not counting tokens anymore. Once the model is up and running, it’s all mine.
  3. Flexibility: I can tweak or update my setup anytime, and I’m not tied to any one provider’s API terms.

For projects that don’t need enterprise-scale infrastructure, this approach is simple, affordable, and practical.

Simple Configuration


After selecting the GPU, I chose the preconfigured Open WebUI + Ollama image. It’s an all-in-one solution that saves you from having to tinker with installations or dependencies. You just fire it up, and you’re ready to run your model. This streamlined setup is one of the reasons Vast.ai pairs so well with Ollama—it’s quick, hassle-free, and gets you straight to experimenting.

Once your instance is up, Vast.ai provides an SSH connection string to access your machine.
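
The exact string is shown on the instance card in the Vast.ai console; the host and port below are placeholders, so yours will differ:

ssh -p 12345 root@ssh5.vast.ai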

After logging in, you’ll need to install the model using Ollama. Ollama makes it super simple—just run:

ollama pull mistral

This will download and set up the Mistral 7B model on the instance.

Once installed, you can start using it right away.
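
A quick way to sanity-check the install is to run a one-off prompt straight from the instance’s shell (the prompt here is just an example):

ollama run mistral "Summarize why hosting an LLM yourself can be cheaper than a per-token API."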

Here’s an example of how you can make a request to your running model using curl:

curl http://instance-ip:11434/api/generate -X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "mistral",
    "prompt": "Write a short story about a brave cat.",
    "stream": false
}'

Replace instance-ip with your instance's public IP. Because "stream" is set to false, the model returns a single JSON object whose response field contains the generated text.
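
If you only want the generated text itself, you can pipe the output through jq (assuming jq is installed wherever you run the command):

curl -s http://instance-ip:11434/api/generate -X POST \
-H "Content-Type: application/json" \
-d '{"model": "mistral", "prompt": "Write a short story about a brave cat.", "stream": false}' \
| jq -r '.response'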

And that's really it. If you’re building smaller-scale tools or experimenting with AI, give this setup a shot.

It’s less intimidating than it sounds, and your wallet will thank you!

If you’re new to Vast.ai or Ollama, here’s some handy reading to get you started:

Ollama docs

Vast.ai

Happy coding!