
How to Run Kimi K2 on RunPod - Step by Step Setup Guide

Why Choose RunPod?

RunPod is a cloud GPU platform trusted by the open source AI community. It’s:

  • Affordable (per-minute GPU pricing)
  • Simple to launch (no DevOps needed)
  • Perfect for LLMs like Kimi K2, Mistral, Mixtral, LLaMA, etc.

RunPod offers prebuilt templates, JupyterLab and Docker container runtimes, making it ideal for developers and researchers.

Prerequisites

  • Free RunPod account: https://runpod.io
  • Hugging Face account (to accept the Kimi K2 license)
  • Basic familiarity with Docker or the command line (optional)

Step 1: Log in to RunPod & Choose a GPU

1. Go to https://runpod.io/console 

2. Click on 'Deploy a Pod'

3. Under Template, choose:

  • Container: Custom Image OR
  • Prebuilt: Hugging Face Text Generation

4. Select a GPU type (e.g. A100 or H100; note that Kimi K2 is a very large mixture-of-experts model, so the full checkpoint needs a multi-GPU node, while single cards such as an RTX 4090 or 3090 can only serve heavily quantized variants)

5. Choose Storage (40 - 80 GB is enough for a quantized variant; the full Kimi-K2-Instruct checkpoint weighs in at roughly 1 TB, so size your volume to the variant you plan to run)
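
If you prefer to script deployments instead of clicking through the console, the runpod Python SDK (pip install runpod) can create pods programmatically. A minimal sketch, assuming the SDK's create_pod helper; the gpu_type_id string and disk sizes are illustrative and should be checked against your RunPod console:

# Minimal sketch: deploy a pod with the runpod Python SDK (pip install runpod).
# Values like gpu_type_id are illustrative; look up the exact IDs in your console.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # generated in the RunPod console

pod = runpod.create_pod(
    name="kimi-k2-pod",
    image_name="nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04",
    gpu_type_id="NVIDIA A100 80GB PCIe",  # assumed ID; verify in your console
    container_disk_in_gb=80,
    volume_in_gb=80,
)
print(pod)  # contains the pod's id and connection details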

Step 2: Configure Your Pod

If using Custom Container:

Container Image:

nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04

If your image has no long-running process, set the container command (CMD) so the pod stays alive:

sleep infinity

Enable:

  • Public IP
  • Docker Support
  • Volume Persistence

Click 'Deploy Pod'

Step 3: Access the Pod Terminal

Once the pod is running:

1. Click 'Connect → Terminal'

2. Update the package index and install the basics:

apt update && apt install -y git-lfs python3-pip git

3. Install libraries:

pip3 install torch torchvision transformers accelerate huggingface_hub
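
Before pulling a large model, it is worth confirming that PyTorch can actually see the GPU you selected:

# Quick sanity check that CUDA and the selected GPU are visible to PyTorch
import torch

print(torch.cuda.is_available())      # should print True on a GPU pod
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # name of the first GPU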

Step 4: Clone & Load Kimi K2

1. Clone the Kimi K2 instruct repo:

git lfs install
git clone https://huggingface.co/moonshotai/Kimi-K2-Instruct
cd Kimi-K2-Instruct

2. Accept the model license at: https://huggingface.co/moonshotai/Kimi-K2-Instruct 
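
If you prefer to skip git-lfs, the same weights can also be fetched with the huggingface_hub library installed in Step 3; a minimal sketch:

# Alternative to git-lfs: download the weights via huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="moonshotai/Kimi-K2-Instruct",
    local_dir="Kimi-K2-Instruct",
    # token="hf_...",  # only needed if the repo is gated for your account
)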

3. Test the model with a short load-and-generate script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "moonshotai/Kimi-K2-Instruct"

# Kimi K2 ships custom model code, so trust_remote_code is required
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# device_map="auto" (via accelerate) spreads the weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "What is quantum computing?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
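
Since Kimi K2 Instruct is a chat-tuned model, prompts generally work better when run through the tokenizer's chat template. A short sketch reusing the model and tokenizer from the script above, assuming the repository ships a chat template (apply_chat_template is a standard transformers API):

# Chat-formatted generation via the tokenizer's chat template
# (reuses `model` and `tokenizer` from the script above)
messages = [{"role": "user", "content": "Explain quantum computing in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))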
 

Step 5 (Optional): Use vLLM as a Fast Inference Server

Install vLLM:

pip install vllm

Run an OpenAI-compatible server (Kimi K2's custom code requires --trust-remote-code; on a multi-GPU pod, add --tensor-parallel-size N to shard the model across N GPUs):

python3 -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2-Instruct \
  --tokenizer moonshotai/Kimi-K2-Instruct \
  --trust-remote-code
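
Once the server is up (it listens on port 8000 by default), any OpenAI-compatible client can talk to it; for example, with the openai Python package:

# Query the vLLM server with the OpenAI Python client (pip install openai)
from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)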

Bonus: JupyterLab UI

Use RunPod's Jupyter template.

Paste your Hugging Face token into a .env file or log in with:

huggingface-cli login

Load and run Kimi K2 from a notebook (great for rapid prototyping).
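
From a notebook cell, you can also authenticate programmatically instead of using the CLI:

# Programmatic Hugging Face login from a notebook cell
from huggingface_hub import login

login(token="hf_...")  # paste your Hugging Face access token here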

Final Thoughts

Running Kimi K2 on RunPod gives you a blazing fast, budget-friendly setup to experiment with one of the most powerful open-source LLMs without needing DevOps or hardware. Whether you’re building AI tools, researching language models, or just exploring prompts, RunPod + Kimi K2 is a perfect match.

Need enterprise-grade deployment or DevOps help? Contact our expert AI Squad at OneClick IT Consultancy and let’s build something powerful together.
