# How to Set Up a Private OpenAI-Compatible LLM on Google Cloud Run

For those passionate about privacy and control, the trajectory of improvement in open-weight LLMs and the ecosystem around them has been extremely encouraging:

* The gap in raw reasoning capability between the best OSS models (DeepSeek, Qwen) and the best models from frontier labs has continued to shrink.
    
* OSS models now support key usability features such as function calling and structured output.
    
* Thanks to tools such as [Ollama](https://ollama.ai/) and the proliferation of GPUs in consumer laptops, more people than ever can discover and run models on their own hardware.
    

If this trend continues, even distilled versions of the best OSS models will become “good enough” for an increasing percentage of tasks. However, when it comes time to do something like serve an app to actual users, there’s a missing piece in this AI-hacker fairytale: **how do I easily deploy a model on infrastructure I control**?

When I saw that Google recently [brought serverless GPUs into GA](https://cloud.google.com/blog/products/serverless/cloud-run-gpus-are-now-generally-available), it piqued my interest, and I was able to adapt one of their examples to support API key auth and make it fully OpenAI and LangChain-compatible. I wrote up a small guide here, along with [a GitHub repo](https://github.com/jacoblee93/personallm) you can use to deploy your own!

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1749149071019/684386a2-eae6-4d7f-8040-827394f5d10e.jpeg align="center")

Once live, your endpoint can be used as a drop-in substitute for clients and code that use these interfaces. It also requires no infrastructure management and scales down to zero instances when not in use.

In theory, you can serve any open source model from [Ollama's registry](https://ollama.com/search), including [DeepSeek](https://ollama.com/library/deepseek-r1:14b), [Gemma](https://ollama.com/library/gemma3:4b), and [Qwen](https://ollama.com/library/qwen3). In practice, caps on Cloud Run resources will limit effective model size. For more on this, see the section on model customization below.

Let’s dive in!

## Quickstart

### Setting up Google Cloud resources

> The initial setup for this project is the same as the official Cloud Run guide [here](https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama).

If you don't already have a Google Cloud account, you will first need to [sign up](https://cloud.google.com/).

Navigate to the [Google Cloud project selector](https://console.cloud.google.com/projectselector2/home/dashboard) and select or create a Google Cloud project. You will need to [enable billing for the project](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled#confirm_billing_is_enabled_on_a_project), since GPUs are currently not part of Google Cloud's free tier.

Next, you must enable access to **Artifact Registry**, **Cloud Build**, **Cloud Run**, and **Cloud Storage APIs** for your project. [Click here](https://console.cloud.google.com/apis/enableflow?apiid=artifactregistry.googleapis.com,cloudbuild.googleapis.com,run.googleapis.com,storage.googleapis.com), select your newly created project, then follow the instructions to do so.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1749056695474/4a827e90-fb01-4b7a-a9f8-66f83da864c4.png align="center")

GPUs are not part of the default project quota, so you will need to submit a quota increase request. From [this page](https://console.cloud.google.com/projectselector2/iam-admin/quotas), select your project, then filter by `Total Nvidia L4 GPU allocation without zonal redundancy, per project per region` in the search bar. Find your desired region (Google currently recommends `europe-west1`, though [pricing](https://cloud.google.com/run/pricing) may vary by region), then click the side menu and press `Edit quota`:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1749056716918/e5f77afe-0825-4113-b1f4-bf8e10033c17.png align="center")

Enter a value (e.g. `5`), and submit a request. Google claims that increase requests may take a few days to process, but you may receive an approval email almost immediately in practice.

Finally, you will need to set up proper IAM permissions for your project. Navigate to [this page](https://console.cloud.google.com/projectselector2/iam-admin/iam) and select your project, then press `Grant Access`. In the resulting modal, paste the following roles into the filter window and add them one by one to a principal on your project:

* `roles/artifactregistry.admin`
    
* `roles/cloudbuild.builds.editor`
    
* `roles/run.admin`
    
* `roles/resourcemanager.projectIamAdmin`
    
* `roles/iam.serviceAccountUser`
    
* `roles/serviceusage.serviceUsageConsumer`
    
* `roles/storage.admin`
    

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1749056734052/e6855c89-5c6d-4ac0-a9f8-82836a05e401.png align="center")

By the end, your screen should look something like this:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1749056774649/f049ca22-ba8f-49aa-ac7a-821e9a2b610d.png align="center")

### Deploying your endpoint

Now, clone [this repo](https://github.com/jacoblee93/personallm) and switch your working directory to be the cloned folder:

```bash
git clone https://github.com/jacoblee93/personallm.git
cd personallm
```

The repo extends Google’s official guide with a lightweight proxy server, which runs in the Cloud Run instance. This proxy handles auth and forwards requests to a concurrently running [Ollama](https://ollama.ai/) instance.
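To make the auth step concrete, here is a minimal sketch of the kind of check such a proxy could run on each incoming request. It is not the repo's actual code: the function names `extract_bearer_token` and `is_authorized` are made up for illustration, and I'm assuming the key arrives as a standard `Authorization: Bearer <key>` header, which is what the OpenAI SDKs send.

```python
import hmac
from typing import Iterable, Optional


def extract_bearer_token(authorization_header: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' header value."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def is_authorized(authorization_header: Optional[str], valid_keys: Iterable[str]) -> bool:
    """Check the presented token against every configured key."""
    token = extract_bearer_token(authorization_header)
    if token is None:
        return False
    presented = token.encode("utf-8")
    # compare_digest avoids leaking how many leading characters matched.
    matches = [hmac.compare_digest(presented, key.encode("utf-8")) for key in valid_keys]
    return any(matches)
```

A request that fails this check would be rejected before it ever reaches Ollama; anything that passes gets forwarded unchanged.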

Rename the `.env.example` file to `.env`. Run something similar to the following command to randomly generate an API key:

```bash
openssl rand -base64 32
```

Paste this value into the `API_KEYS` field. You can provide multiple API keys by separating them with commas, so make sure that none of your key values contain commas.
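The comma restriction is easier to see in code. The sketch below shows one plausible way to split an `API_KEYS` value into individual keys, plus a Python alternative to the `openssl` command for generating one. Both helper names are hypothetical, and whether the repo's proxy trims whitespace or ignores empty entries the way this does is an assumption on my part.

```python
import secrets


def parse_api_keys(raw: str) -> set[str]:
    """Split a comma-separated API_KEYS value into a set of non-empty keys."""
    return {key.strip() for key in raw.split(",") if key.strip()}


def generate_api_key(num_bytes: int = 32) -> str:
    """Generate a random URL-safe key; its alphabet contains no commas."""
    return secrets.token_urlsafe(num_bytes)
```

Because the split happens on every comma, a key containing one would be silently cut into two different keys, neither of which matches what your client sends.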

Install and initialize the `gcloud` CLI if you haven't already by [following these instructions](https://cloud.google.com/sdk/docs/install). If you already have the CLI installed, you may need to run `gcloud components update` to make sure you are on the latest CLI version.

Next, set your `gcloud` CLI project to be your project name:

```bash
gcloud config set project YOUR_PROJECT_NAME
```

Then set the region to the one where you requested GPU quota:

```bash
gcloud config set run/region YOUR_REGION
```

Finally, run the following command to deploy your new inference endpoint!

```bash
gcloud run deploy personallm \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600
```

When prompted with something like `Allow unauthenticated invocations to [personallm] (y/N)?`, you should respond with `y`. The internal proxy will handle authentication, and we want our endpoint to be reachable from anywhere for ease of use.

Note that deployments are quite slow because the model weights are baked into the container image at build time, so expect this step to take upwards of 20 minutes. Once it finishes, your terminal should print a `Service URL`, and that's it! You now have a personal, private LLM inference endpoint!

## Trying it out

You can call your endpoint the same way you'd call an OpenAI model, but with your generated API key and the `Service URL` from your deployment. Here are some examples:

### OpenAI Python SDK

```bash
uv add openai
```

```python
from openai import OpenAI

# Note the /v1 suffix
client = OpenAI(
    base_url="https://YOUR_SERVICE_URL/v1",
    api_key="YOUR_API_KEY",
)

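response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[
        {"role": "user", "content": "What is 2 + 2?"}
    ],
)

# The generated text lives on the first choice's message
print(response.choices[0].message.content)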
```

See [OpenAI's SDK docs](https://platform.openai.com/docs/overview) for examples of advanced features such as [function/tool calling](https://platform.openai.com/docs/guides/function-calling?api-mode=chat).
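If you'd rather not depend on an SDK, the same call can be made over plain HTTP. The sketch below builds the request with only the standard library. The `/v1/chat/completions` path and `Bearer` header are what the OpenAI SDK sends when given a `/v1` base URL; the function names `build_chat_request` and `send_chat_request` are illustrative, and I haven't verified the exact error responses the proxy returns, so error handling is left out.

```python
import json
import urllib.request


def build_chat_request(service_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

You would call it as `send_chat_request(build_chat_request("https://YOUR_SERVICE_URL", "YOUR_API_KEY", "qwen3:14b", "What is 2 + 2?"))`.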

### LangChain

```bash
uv add langchain-ollama
```

```python
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="qwen3:14b",
    base_url="https://YOUR_SERVICE_URL",
    # Extra kwargs are passed through to the underlying HTTP client
    client_kwargs={
        "headers": {
            "Authorization": "Bearer YOUR_API_KEY"
        }
    },
)

response = model.invoke("What is 2 + 2?")
print(response.content)
```

See [LangChain's docs](https://python.langchain.com/) for examples of advanced features such as [function/tool calling](https://python.langchain.com/docs/how_to/tool_calling/).

### OpenAI JS SDK

```bash
npm install openai
```

```typescript
import OpenAI from "openai";

// Note the /v1 suffix
const client = new OpenAI({
  baseURL: "https://YOUR_SERVICE_URL/v1",
  apiKey: "YOUR_API_KEY",
});

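const result = await client.chat.completions.create({
  model: "qwen3:14b",
  messages: [{ role: "user", content: "What is 2 + 2?" }],
});

// The generated text lives on the first choice's message
console.log(result.choices[0].message.content);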
```

See [OpenAI's SDK docs](https://platform.openai.com/docs/overview) for examples of advanced features such as [function/tool calling](https://platform.openai.com/docs/guides/function-calling?api-mode=chat).

### LangChain.js

```bash
npm install @langchain/ollama @langchain/core
```

```typescript
import { ChatOllama } from "@langchain/ollama";

const model = new ChatOllama({
  model: "qwen3:14b",
  baseUrl: "https://YOUR_SERVICE_URL",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
  },
});

const result = await model.invoke("What is 2 + 2?");
console.log(result.content);
```

See [LangChain's docs](https://js.langchain.com/) for examples of advanced features such as [function/tool calling](https://js.langchain.com/docs/how_to/tool_calling/).

### Latency

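Keep in mind that because the service scales down to zero instances when idle, the first request after a period of inactivity has to wait for a new instance to start and load the model, so expect additional cold start latency.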
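If your client uses a short timeout, that first request may fail before the instance is ready. One way to cope is to retry with exponential backoff, sketched below. The helper `call_with_retries` and its defaults are my own illustration rather than anything from the repo; the OpenAI SDKs also accept `timeout` and `max_retries` options on the client, which may be all you need.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff if it raises."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise ValueError("attempts must be at least 1")
```

Wrap any of the calls above in a zero-argument function or lambda, for example `call_with_retries(lambda: model.invoke("What is 2 + 2?"))`. In real code you would likely narrow the `except` clause to timeout and connection errors so that genuine failures such as a bad API key are not retried.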

## Model customization

The base configuration in this repo serves a 14 billion parameter model ([Qwen 3](https://ollama.com/library/qwen3:14b)) clocked at ~20-25 output tokens per second. This model is quite capable and also supports [function/tool calling](https://ollama.com/blog/tool-support), which makes it more useful when building agentic flows, but if speed becomes a concern you might try smaller models such as Google's 4 billion parameter [Gemma 3](https://ollama.com/library/gemma3). You can also run the popular [DeepSeek-R1](https://ollama.com/library/deepseek-r1:14b) if you do not need tool calling.
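To get a feel for what ~20-25 tokens per second means in practice, here is some back-of-the-envelope arithmetic. It assumes a steady decoding rate and ignores cold starts and prompt processing time, so treat the numbers as rough guides; both function names are purely illustrative.

```python
def estimated_generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Rough time to generate a reply at a steady decoding rate."""
    return output_tokens / tokens_per_second


def max_tokens_within_timeout(timeout_seconds: float, tokens_per_second: float) -> int:
    """Upper bound on output tokens that fit inside one request timeout."""
    return int(timeout_seconds * tokens_per_second)
```

By this estimate, a 500-token reply takes roughly 20 to 25 seconds, and the `--timeout=600` flag from the deploy command caps a single response at around 12,000 output tokens at the slower rate.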

To customize the served model, open your `Dockerfile` and modify the `ENV MODEL qwen3:14b` line to be a different model from [Ollama's registry](https://ollama.com/search):

```dockerfile
# Store the model weights in the container image
# ENV MODEL gemma3:4b
# ENV MODEL deepseek-r1:14b
ENV MODEL qwen3:14b
```

Note that you will also have to change your client-side code to pass the new model name as the `model` parameter.
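If you switch models often, one option is to read the model name straight from the `Dockerfile` so that client code and deployment can't drift apart. This is a small sketch under the assumption that the active model is set by an uncommented `ENV MODEL ...` line like the one above; `active_model` is a hypothetical helper, not part of the repo.

```python
import re
from typing import Optional

_ENV_MODEL = re.compile(r"^\s*ENV\s+MODEL[\s=]+(\S+)\s*$")


def active_model(dockerfile_text: str) -> Optional[str]:
    """Return the last uncommented `ENV MODEL` value, or None if absent."""
    model = None
    for line in dockerfile_text.splitlines():
        match = _ENV_MODEL.match(line)
        if match:
            model = match.group(1)
    return model
```

You could then pass `active_model(open("Dockerfile").read())` as the `model` argument in any of the client examples above.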

## 🙏 Thank you!

[This GitHub repo](https://github.com/jacoblee93/personallm/) contains the source code for this guide.

If you have any questions or comments, please open an issue there. You can also follow me [@Hacubu](https://x.com/Hacubu) on X (formerly Twitter).
