
How to Run Kimi K2 on RunPod - Step by Step Setup Guide

Why Choose RunPod?

RunPod is a cloud GPU platform trusted by the open source AI community. It’s:

  • Affordable (per-minute GPU pricing)
  • Simple to launch (no DevOps needed)
  • Perfect for LLMs like Kimi K2, Mistral, Mixtral, LLaMA, etc.

RunPod offers prebuilt templates, JupyterLab and Docker container runtimes, making it ideal for developers and researchers.

Prerequisites

  • Free RunPod account: https://runpod.io
  • Hugging Face account (to accept the Kimi K2 license)
  • Basic familiarity with Docker or the command line (optional)

Step 1: Log in to RunPod & Choose a GPU

1. Go to https://runpod.io/console 

2. Click on 'Deploy a Pod'

3. Under Template, choose:

  • Container: Custom Image OR
  • Prebuilt: Hugging Face Text Generation

4. Select a GPU type (e.g. A100 or H100; note that Kimi K2 is a very large mixture-of-experts model, so the full checkpoint needs a multi-GPU node, while single cards such as an RTX 4090 or 3090 can only serve heavily quantized variants)

5. Choose Storage (40 - 80 GB is enough for a quantized variant; the full Kimi-K2-Instruct checkpoint weighs in at roughly 1 TB, so size your volume to the variant you plan to run)
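
If you prefer to script deployments instead of clicking through the console, the runpod Python SDK (pip install runpod) can create pods programmatically. A minimal sketch, assuming the SDK's create_pod helper; the gpu_type_id string and disk sizes are illustrative and should be checked against your RunPod console:

# Minimal sketch: deploy a pod with the runpod Python SDK (pip install runpod).
# Values like gpu_type_id are illustrative; look up the exact IDs in your console.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # generated in the RunPod console

pod = runpod.create_pod(
    name="kimi-k2-pod",
    image_name="nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04",
    gpu_type_id="NVIDIA A100 80GB PCIe",  # assumed ID; verify in your console
    container_disk_in_gb=80,
    volume_in_gb=80,
)
print(pod)  # contains the pod's id and connection details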

Step 2: Configure Your Pod

If using Custom Container:

Container Image:

nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04

If your image has no long-running process, set the container command (CMD) so the pod stays alive:

sleep infinity

Enable:

  • Public IP
  • Docker Support
  • Volume Persistence

Click 'Deploy Pod'

Step 3: Access the Pod Terminal

Once the pod is running:

1. Click 'Connect → Terminal'

2. Update the package index and install the basics:

apt update && apt install -y git-lfs python3-pip git

3. Install libraries:

pip3 install torch torchvision transformers accelerate huggingface_hub
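
Before pulling a large model, it is worth confirming that PyTorch can actually see the GPU you selected:

# Quick sanity check that CUDA and the selected GPU are visible to PyTorch
import torch

print(torch.cuda.is_available())      # should print True on a GPU pod
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # name of the first GPU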

Step 4: Clone & Load Kimi K2

1. Clone the Kimi K2 instruct repo:

git lfs install
git clone https://huggingface.co/moonshotai/Kimi-K2-Instruct
cd Kimi-K2-Instruct

2. Accept the model license at: https://huggingface.co/moonshotai/Kimi-K2-Instruct 
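
If you prefer to skip git-lfs, the same weights can also be fetched with the huggingface_hub library installed in Step 3; a minimal sketch:

# Alternative to git-lfs: download the weights via huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="moonshotai/Kimi-K2-Instruct",
    local_dir="Kimi-K2-Instruct",
    # token="hf_...",  # only needed if the repo is gated for your account
)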

3. Test the model with a short load-and-generate script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "moonshotai/Kimi-K2-Instruct"

# Kimi K2 ships custom model code, so trust_remote_code is required
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# device_map="auto" (via accelerate) spreads the weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "What is quantum computing?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
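
Since Kimi K2 Instruct is a chat-tuned model, prompts generally work better when run through the tokenizer's chat template. A short sketch reusing the model and tokenizer from the script above, assuming the repository ships a chat template (apply_chat_template is a standard transformers API):

# Chat-formatted generation via the tokenizer's chat template
# (reuses `model` and `tokenizer` from the script above)
messages = [{"role": "user", "content": "Explain quantum computing in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))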
 

Step 5 (Optional): Use vLLM as a Fast Inference Server

Install vLLM:

pip install vllm

Run an OpenAI-compatible server (Kimi K2's custom code requires --trust-remote-code; on a multi-GPU pod, add --tensor-parallel-size N to shard the model across N GPUs):

python3 -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2-Instruct \
  --tokenizer moonshotai/Kimi-K2-Instruct \
  --trust-remote-code
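
Once the server is up (it listens on port 8000 by default), any OpenAI-compatible client can talk to it; for example, with the openai Python package:

# Query the vLLM server with the OpenAI Python client (pip install openai)
from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)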

Bonus: JupyterLab UI

Use RunPod's Jupyter template.

Paste your Hugging Face token into a .env file or log in with:

huggingface-cli login

Load and run Kimi K2 from a notebook (great for rapid prototyping).
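
From a notebook cell, you can also authenticate programmatically instead of using the CLI:

# Programmatic Hugging Face login from a notebook cell
from huggingface_hub import login

login(token="hf_...")  # paste your Hugging Face access token here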

Final Thoughts

Running Kimi K2 on RunPod gives you a blazing fast, budget-friendly setup to experiment with one of the most powerful open-source LLMs without needing DevOps or hardware. Whether you’re building AI tools, researching language models, or just exploring prompts, RunPod + Kimi K2 is a perfect match.

Need enterprise-grade deployment or DevOps help? Contact our expert AI Squad at OneClick IT Consultancy and let’s build something powerful together.
