| .. | ||
| testNanoGPT | ||
| find_models_on_github.py | ||
| quick_checks.sh | ||
| README.md | ||
Kimi “2.5” local deployment on macOS (what’s реально possible)
Executive summary
- Running the real “Kimi 2.5 / Kimi K2-class” model fully locally on a Mac is not practical: the official open-weight Kimi K2 deployment guidance targets multi-node NVIDIA GPU clusters (vLLM/SGLang/TensorRT-LLM) and assumes CUDA.
- What is practical on a Mac: use Kimi Code CLI (official) or the Moonshot API (official). That’s not “local inference” (weights on your laptop), but it is “local usage” (client runs on your laptop).
- VPN / proxy: yes, clients can usually work through VPN/proxy, but you must be able to reach the provider’s API endpoints. If your network blocks access, you’ll need allowlisting or a different route.
What GitHub shows (official sources)
1) Kimi K2 (open-weight model series)
- Official repo: https://github.com/MoonshotAI/Kimi-K2
- The repo includes a deployment guide at
docs/deploy_guidance.md. - Key reality check from the guide: the smallest FP8 128k-seqlen deployment is described as ~16 GPUs (H200/H20-class) for mainstream setups (vLLM/SGLang). This is fundamentally not macOS-laptop friendly.
2) Kimi Code CLI (best option for macOS)
- Official repo: https://github.com/MoonshotAI/kimi-cli
- Docs: https://moonshotai.github.io/kimi-cli/en/guides/getting-started.html
- Kimi Code CLI is an agent client (terminal tool) that talks to a remote provider.
- First-run authentication options:
/login(browser login; auto-configures models)/setup(API key flow)
Can we locally deploy “Kimi 2.5” on this Mac?
Practically: no (for K2 / K2-class)
- macOS has no CUDA, so GPU-first inference stacks referenced for K2 deployment (vLLM / TensorRT-LLM) are not an option.
- Even if you tried CPU inference, the K2-class model scale (MoE, 1T total params / 32B active) is far beyond what is reasonable on a laptop.
What’s feasible instead
- Use Kimi Code CLI on macOS (recommended)
- Use Moonshot/Kimi APIs from your own scripts (Python/Node/etc) once your network allows access
- If you truly need local weights on Mac, use a Mac-friendly model/runtime (e.g., MLX/Ollama) — but that would be a different model, not Kimi 2.5/K2.
Options: smaller, Mac-runnable open models (local inference)
If your goal is “no network calls at runtime”, pick a local runtime + a small-enough model + quantization.
If you can only access GitHub
If your enterprise network only allows github.com, that severely limits how you obtain model weights because most model hosting is not on GitHub.
What still works:
- GitHub Releases assets: some projects publish quantized model files (often
.gguf) as release assets. - Git LFS inside a repo: occasionally weights are stored in-repo via LFS (still uncommon for large models).
What usually won’t work:
- Downloading weights from external model registries / hosting sites (blocked by policy).
Reality check: GitHub has practical size limits, so very large models are rarely hosted there. Your best bet is to use small models (7B–14B) in quantized form.
How to find downloadable models on GitHub
- Use oss_llm/find_models_on_github.py to search GitHub repos and list any release assets that look like model files.
- Prefer assets ending in
.ggufif you plan to run withllama.cpp.
Example:
python3 oss_llm/find_models_on_github.py --query "gguf qwen 2.5" --limit 20
Recommended enterprise pattern
If you need a specific model but can’t download it directly:
- Ask security for an approved internal mirror / artifact store.
- Mirror the model files there.
- Point your local runtime (Ollama/llama.cpp/MLX) at the internal location.
“No security review” learning path (best effort): train a tiny model yourself
If you want to avoid downloading any third-party model weights at all, the cleanest option is to:
- clone training code from GitHub
- train a small model on a local/public-domain text file
This won’t produce a state-of-the-art assistant, but it’s excellent for learning tokenization, training loops, sampling, and basic eval.
One popular repo for this is karpathy/nanoGPT.
High-level steps:
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
# create a python env (choose your preferred method)
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# run the built-in Shakespeare example (it downloads a small text file)
python data/shakespeare_char/prepare.py
python train.py data/shakespeare_char --device=cpu --compile=False
# sample
python sample.py --out_dir=out-shakespeare-char --device=cpu --compile=False
If even small downloads are restricted, replace the dataset step with your own local text file and adjust the dataset script accordingly.
nanoGPT demo (this repo)
This workspace includes a reproducible nanoGPT workflow (CPU and Apple Silicon MPS) plus sampling demo commands:
If you do download GGUF weights from GitHub
It can work (some repos commit a .gguf directly), but treat it like any third-party binary:
- Prefer repos with clear licensing (repo license + explicit model license/provenance)
- Prefer “original publisher” repos over re-uploads
- Keep models small (e.g., ~100M–3B) for macOS learning
Recommended runtimes (macOS)
- Ollama: easiest “download + chat + local HTTP API” experience.
- LM Studio: easy GUI; also exposes a local API server.
- llama.cpp: most portable; great for CPU/Metal, quantized GGUF models.
- MLX (Apple): best when you want Python-native workflows on Apple Silicon.
Model size guidance (rule of thumb)
Quantized models are what make laptops viable.
- 8–10B @ 4-bit: typically comfortable on 16GB unified memory.
- 14B @ 4-bit: better with 24–32GB unified memory.
- 30B+: usually needs 64GB+ and will still be slow.
Good “starter” model families (pick one)
These are widely supported by the runtimes above and have strong general utility:
- Llama 3.x (8B class): strong general chat + coding for the size.
- Qwen 2.5 (7B/14B class): strong multilingual + coding.
- Mistral 7B class: fast and solid baseline.
- Gemma 2 (9B class): good general-purpose quality.
- Phi-3.x (mini/small class): very fast and lightweight.
Suggested picks by Mac memory
- 16GB unified memory: start with an 8–9B model at 4-bit.
- 32GB unified memory: 14B at 4-bit is a good sweet spot.
- 64GB unified memory: 27–34B at 4-bit becomes feasible (still slower).
Practical setup: Ollama (quickest)
- Install runtime (choose the install method your enterprise allows; Homebrew is common):
brew install ollama
- Start the local service:
ollama serve
- Pull/run a model (example placeholder):
ollama run <model-name>
- Use the local API (optional):
curl http://127.0.0.1:11434/api/tags
Practical setup: llama.cpp (most controllable)
If you can obtain a quantized GGUF model file via an approved internal mirror:
brew install llama.cpp
llama-cli -m /path/to/model.gguf -p "Hello" -n 256
Practical setup: MLX (Python-centric on Apple Silicon)
If your environment allows Python packages and you have an MLX-converted model available internally:
python3 -m venv .venv && source .venv/bin/activate
pip install mlx mlx-lm
python -m mlx_lm.generate --model /path/to/mlx-model --prompt "Hello" --max-tokens 256
Enterprise note (important)
Because model hosting sites may be blocked in your network category, the usual pattern in enterprise is:
- Security-approved model list
- Internal artifact store / mirror for model files
- Local runtime (Ollama/llama.cpp/MLX) pointing to those internally hosted artifacts
Steps (macOS): run Kimi Code CLI locally (client-side)
Source: Kimi CLI docs.
- Install
# Install via uv (Python package manager)
uv tool install --python 3.13 kimi-cli
- Verify
kimi --version
- Start in a project directory
cd /Users/sd9235/code/mygh/learning_ai_2nd_brain
kimi
- Authenticate
- Preferred:
- Run
/logininside the CLI and complete the browser auth.
- Run
- Alternative:
- Run
/setupand choose an API platform + API key + model.
- Run
- If models don’t show up
- Kimi CLI FAQ: verify network access to your configured provider’s API endpoints.
Steps (NOT macOS): deploy Kimi K2 weights (server-side)
If you have access to Linux + NVIDIA GPUs, use the official K2 deployment guide:
- vLLM (requires CUDA; the guide notes vLLM v0.10.0rc1+)
- SGLang
- TensorRT-LLM
This is the realistic path if you truly need “local” (self-hosted) K2/K2-class inference: run the model on a GPU box/cluster and call it from your Mac.
VPN / proxy: are we able to access through it?
What you need
You generally need outbound access (through your VPN/proxy) to at least:
- your chosen provider’s API host (varies by provider)
- and, if downloading open weights, the model hosting site you plan to use.
Quick connectivity checks
# DNS + HTTPS reachability
curl -I https://<YOUR_PROVIDER_API_HOST>
# If you plan to download weights later
curl -I https://<YOUR_MODEL_HOSTING_SITE>
Configure proxy in a shell (typical)
export HTTP_PROXY="http://127.0.0.1:7890"
export HTTPS_PROXY="http://127.0.0.1:7890"
export ALL_PROXY="socks5://127.0.0.1:7890"
export NO_PROXY="localhost,127.0.0.1"
Configure proxy for Git
git config --global http.proxy "$HTTPS_PROXY"
git config --global https.proxy "$HTTPS_PROXY"
Configure proxy for Python/pip
pip config set global.proxy "$HTTPS_PROXY"
Notes about your current network
From within this VS Code environment, requests to some provider/model-hosting sites were redirected to a corporate/web-filter “blockpage” URL. If you see that on your Mac too, you’ll need one of:
- VPN that routes around the filter
- proxy that’s allowed
- allowlist/exception for those domains
Recommendation
- If your goal is to use Kimi on this Mac: install Kimi Code CLI and make sure your VPN/proxy allows access to your configured provider.
- If your goal is “true local inference”: host Kimi K2 on a CUDA GPU server (or use a smaller Mac-native model instead).
Your Mac (detected)
- macOS: 15.7.3 (24G419)
- CPU arch: arm64
- Machine: MacBook Pro (Mac16,7)
- Chip: Apple M4 Pro (14 cores)
- Memory: 48 GB
- Python: 3.13.10
Will nanoGPT work on this laptop?
Yes for learning, with a couple of caveats.
What will work well:
- Small CPU/MPS runs (toy datasets like Shakespeare, short experiments, sampling).
- With 48GB RAM and Apple Silicon, you have plenty of headroom for nanoGPT-style demos.
What might block you:
- Installing dependencies (notably PyTorch) typically requires access to package indexes that may be blocked in your network.
- In this workspace, PyTorch was successfully installed into the venv and
mps_availableisTrue. - If you can’t reach package indexes in a different environment, use an internal Python package mirror, or install from a pre-approved wheelhouse.
- In this workspace, PyTorch was successfully installed into the venv and
Practical suggestion:
- Start with CPU (
--device=cpu) to keep it simple. - If your PyTorch build supports Apple Metal (MPS), you can later try
--device=mpsfor speed.
Quick verification (after installing torch):
python3 -c "import torch; print(torch.__version__); print('mps', torch.backends.mps.is_available())"
nanoGPT: validated end-to-end in this workspace
This is a minimal, fast run that was verified on this machine.
1) Install deps (workspace venv)
# from the workspace root
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python -m pip install torch numpy transformers datasets tiktoken wandb tqdm
2) Clone nanoGPT
git clone https://github.com/karpathy/nanoGPT.git oss_llm/nanoGPT
3) Prepare dataset
cd oss_llm/nanoGPT
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python data/shakespeare_char/prepare.py
4) Short CPU training (writes out-shakespeare-char/ckpt.pt)
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python train.py config/train_shakespeare_char.py \
--device=cpu --compile=False \
--eval_interval=10 --eval_iters=10 --log_interval=10 \
--block_size=64 --batch_size=12 \
--n_layer=4 --n_head=4 --n_embd=128 \
--max_iters=60 --lr_decay_iters=60 --dropout=0.0 \
--always_save_checkpoint=True
5) Sample
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python sample.py \
--out_dir=out-shakespeare-char --device=cpu --max_new_tokens=200
Tip: for speed on Apple Silicon, try --device=mps once you’re comfortable.