learning_ai_common_plat/__LOCAL_LLMs/oss_llm
2026-02-28 00:03:37 -08:00
..
testNanoGPT move oss_llm/ from learning_ai_2nd_brain 2026-02-28 00:03:37 -08:00
find_models_on_github.py move oss_llm/ from learning_ai_2nd_brain 2026-02-28 00:03:37 -08:00
quick_checks.sh move oss_llm/ from learning_ai_2nd_brain 2026-02-28 00:03:37 -08:00
README.md move oss_llm/ from learning_ai_2nd_brain 2026-02-28 00:03:37 -08:00

Kimi “2.5” local deployment on macOS (whats реально possible)

Executive summary

  • Running the real “Kimi 2.5 / Kimi K2-class” model fully locally on a Mac is not practical: the official open-weight Kimi K2 deployment guidance targets multi-node NVIDIA GPU clusters (vLLM/SGLang/TensorRT-LLM) and assumes CUDA.
  • What is practical on a Mac: use Kimi Code CLI (official) or the Moonshot API (official). Thats not “local inference” (weights on your laptop), but it is “local usage” (client runs on your laptop).
  • VPN / proxy: yes, clients can usually work through VPN/proxy, but you must be able to reach the providers API endpoints. If your network blocks access, youll need allowlisting or a different route.

What GitHub shows (official sources)

1) Kimi K2 (open-weight model series)

  • Official repo: https://github.com/MoonshotAI/Kimi-K2
  • The repo includes a deployment guide at docs/deploy_guidance.md.
  • Key reality check from the guide: the smallest FP8 128k-seqlen deployment is described as ~16 GPUs (H200/H20-class) for mainstream setups (vLLM/SGLang). This is fundamentally not macOS-laptop friendly.

2) Kimi Code CLI (best option for macOS)

Can we locally deploy “Kimi 2.5” on this Mac?

Practically: no (for K2 / K2-class)

  • macOS has no CUDA, so GPU-first inference stacks referenced for K2 deployment (vLLM / TensorRT-LLM) are not an option.
  • Even if you tried CPU inference, the K2-class model scale (MoE, 1T total params / 32B active) is far beyond what is reasonable on a laptop.

Whats feasible instead

  1. Use Kimi Code CLI on macOS (recommended)
  2. Use Moonshot/Kimi APIs from your own scripts (Python/Node/etc) once your network allows access
  3. If you truly need local weights on Mac, use a Mac-friendly model/runtime (e.g., MLX/Ollama) — but that would be a different model, not Kimi 2.5/K2.

Options: smaller, Mac-runnable open models (local inference)

If your goal is “no network calls at runtime”, pick a local runtime + a small-enough model + quantization.

If you can only access GitHub

If your enterprise network only allows github.com, that severely limits how you obtain model weights because most model hosting is not on GitHub.

What still works:

  • GitHub Releases assets: some projects publish quantized model files (often .gguf) as release assets.
  • Git LFS inside a repo: occasionally weights are stored in-repo via LFS (still uncommon for large models).

What usually wont work:

  • Downloading weights from external model registries / hosting sites (blocked by policy).

Reality check: GitHub has practical size limits, so very large models are rarely hosted there. Your best bet is to use small models (7B14B) in quantized form.

How to find downloadable models on GitHub

  • Use oss_llm/find_models_on_github.py to search GitHub repos and list any release assets that look like model files.
  • Prefer assets ending in .gguf if you plan to run with llama.cpp.

Example:

python3 oss_llm/find_models_on_github.py --query "gguf qwen 2.5" --limit 20

If you need a specific model but cant download it directly:

  1. Ask security for an approved internal mirror / artifact store.
  2. Mirror the model files there.
  3. Point your local runtime (Ollama/llama.cpp/MLX) at the internal location.

“No security review” learning path (best effort): train a tiny model yourself

If you want to avoid downloading any third-party model weights at all, the cleanest option is to:

  • clone training code from GitHub
  • train a small model on a local/public-domain text file

This wont produce a state-of-the-art assistant, but its excellent for learning tokenization, training loops, sampling, and basic eval.

One popular repo for this is karpathy/nanoGPT.

High-level steps:

git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# create a python env (choose your preferred method)
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# run the built-in Shakespeare example (it downloads a small text file)
python data/shakespeare_char/prepare.py
python train.py data/shakespeare_char --device=cpu --compile=False

# sample
python sample.py --out_dir=out-shakespeare-char --device=cpu --compile=False

If even small downloads are restricted, replace the dataset step with your own local text file and adjust the dataset script accordingly.

nanoGPT demo (this repo)

This workspace includes a reproducible nanoGPT workflow (CPU and Apple Silicon MPS) plus sampling demo commands:

If you do download GGUF weights from GitHub

It can work (some repos commit a .gguf directly), but treat it like any third-party binary:

  • Prefer repos with clear licensing (repo license + explicit model license/provenance)
  • Prefer “original publisher” repos over re-uploads
  • Keep models small (e.g., ~100M3B) for macOS learning
  • Ollama: easiest “download + chat + local HTTP API” experience.
  • LM Studio: easy GUI; also exposes a local API server.
  • llama.cpp: most portable; great for CPU/Metal, quantized GGUF models.
  • MLX (Apple): best when you want Python-native workflows on Apple Silicon.

Model size guidance (rule of thumb)

Quantized models are what make laptops viable.

  • 810B @ 4-bit: typically comfortable on 16GB unified memory.
  • 14B @ 4-bit: better with 2432GB unified memory.
  • 30B+: usually needs 64GB+ and will still be slow.

Good “starter” model families (pick one)

These are widely supported by the runtimes above and have strong general utility:

  • Llama 3.x (8B class): strong general chat + coding for the size.
  • Qwen 2.5 (7B/14B class): strong multilingual + coding.
  • Mistral 7B class: fast and solid baseline.
  • Gemma 2 (9B class): good general-purpose quality.
  • Phi-3.x (mini/small class): very fast and lightweight.

Suggested picks by Mac memory

  • 16GB unified memory: start with an 89B model at 4-bit.
  • 32GB unified memory: 14B at 4-bit is a good sweet spot.
  • 64GB unified memory: 2734B at 4-bit becomes feasible (still slower).

Practical setup: Ollama (quickest)

  1. Install runtime (choose the install method your enterprise allows; Homebrew is common):
brew install ollama
  1. Start the local service:
ollama serve
  1. Pull/run a model (example placeholder):
ollama run <model-name>
  1. Use the local API (optional):
curl http://127.0.0.1:11434/api/tags

Practical setup: llama.cpp (most controllable)

If you can obtain a quantized GGUF model file via an approved internal mirror:

brew install llama.cpp
llama-cli -m /path/to/model.gguf -p "Hello" -n 256

Practical setup: MLX (Python-centric on Apple Silicon)

If your environment allows Python packages and you have an MLX-converted model available internally:

python3 -m venv .venv && source .venv/bin/activate
pip install mlx mlx-lm
python -m mlx_lm.generate --model /path/to/mlx-model --prompt "Hello" --max-tokens 256

Enterprise note (important)

Because model hosting sites may be blocked in your network category, the usual pattern in enterprise is:

  1. Security-approved model list
  2. Internal artifact store / mirror for model files
  3. Local runtime (Ollama/llama.cpp/MLX) pointing to those internally hosted artifacts

Steps (macOS): run Kimi Code CLI locally (client-side)

Source: Kimi CLI docs.

  1. Install
# Install via uv (Python package manager)
uv tool install --python 3.13 kimi-cli
  1. Verify
kimi --version
  1. Start in a project directory
cd /Users/sd9235/code/mygh/learning_ai_2nd_brain
kimi
  1. Authenticate
  • Preferred:
    • Run /login inside the CLI and complete the browser auth.
  • Alternative:
    • Run /setup and choose an API platform + API key + model.
  1. If models dont show up
  • Kimi CLI FAQ: verify network access to your configured providers API endpoints.

Steps (NOT macOS): deploy Kimi K2 weights (server-side)

If you have access to Linux + NVIDIA GPUs, use the official K2 deployment guide:

  • vLLM (requires CUDA; the guide notes vLLM v0.10.0rc1+)
  • SGLang
  • TensorRT-LLM

This is the realistic path if you truly need “local” (self-hosted) K2/K2-class inference: run the model on a GPU box/cluster and call it from your Mac.

VPN / proxy: are we able to access through it?

What you need

You generally need outbound access (through your VPN/proxy) to at least:

  • your chosen providers API host (varies by provider)
  • and, if downloading open weights, the model hosting site you plan to use.

Quick connectivity checks

# DNS + HTTPS reachability
curl -I https://<YOUR_PROVIDER_API_HOST>

# If you plan to download weights later
curl -I https://<YOUR_MODEL_HOSTING_SITE>

Configure proxy in a shell (typical)

export HTTP_PROXY="http://127.0.0.1:7890"
export HTTPS_PROXY="http://127.0.0.1:7890"
export ALL_PROXY="socks5://127.0.0.1:7890"
export NO_PROXY="localhost,127.0.0.1"

Configure proxy for Git

git config --global http.proxy "$HTTPS_PROXY"
git config --global https.proxy "$HTTPS_PROXY"

Configure proxy for Python/pip

pip config set global.proxy "$HTTPS_PROXY"

Notes about your current network

From within this VS Code environment, requests to some provider/model-hosting sites were redirected to a corporate/web-filter “blockpage” URL. If you see that on your Mac too, youll need one of:

  • VPN that routes around the filter
  • proxy thats allowed
  • allowlist/exception for those domains

Recommendation

  • If your goal is to use Kimi on this Mac: install Kimi Code CLI and make sure your VPN/proxy allows access to your configured provider.
  • If your goal is “true local inference”: host Kimi K2 on a CUDA GPU server (or use a smaller Mac-native model instead).

Your Mac (detected)

  • macOS: 15.7.3 (24G419)
  • CPU arch: arm64
  • Machine: MacBook Pro (Mac16,7)
  • Chip: Apple M4 Pro (14 cores)
  • Memory: 48 GB
  • Python: 3.13.10

Will nanoGPT work on this laptop?

Yes for learning, with a couple of caveats.

What will work well:

  • Small CPU/MPS runs (toy datasets like Shakespeare, short experiments, sampling).
  • With 48GB RAM and Apple Silicon, you have plenty of headroom for nanoGPT-style demos.

What might block you:

  • Installing dependencies (notably PyTorch) typically requires access to package indexes that may be blocked in your network.
    • In this workspace, PyTorch was successfully installed into the venv and mps_available is True.
    • If you cant reach package indexes in a different environment, use an internal Python package mirror, or install from a pre-approved wheelhouse.

Practical suggestion:

  • Start with CPU (--device=cpu) to keep it simple.
  • If your PyTorch build supports Apple Metal (MPS), you can later try --device=mps for speed.

Quick verification (after installing torch):

python3 -c "import torch; print(torch.__version__); print('mps', torch.backends.mps.is_available())"

nanoGPT: validated end-to-end in this workspace

This is a minimal, fast run that was verified on this machine.

1) Install deps (workspace venv)

# from the workspace root
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python -m pip install torch numpy transformers datasets tiktoken wandb tqdm

2) Clone nanoGPT

git clone https://github.com/karpathy/nanoGPT.git oss_llm/nanoGPT

3) Prepare dataset

cd oss_llm/nanoGPT
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python data/shakespeare_char/prepare.py

4) Short CPU training (writes out-shakespeare-char/ckpt.pt)

/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python train.py config/train_shakespeare_char.py \
  --device=cpu --compile=False \
  --eval_interval=10 --eval_iters=10 --log_interval=10 \
  --block_size=64 --batch_size=12 \
  --n_layer=4 --n_head=4 --n_embd=128 \
  --max_iters=60 --lr_decay_iters=60 --dropout=0.0 \
  --always_save_checkpoint=True

5) Sample

/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python sample.py \
  --out_dir=out-shakespeare-char --device=cpu --max_new_tokens=200

Tip: for speed on Apple Silicon, try --device=mps once youre comfortable.