A shell powered by LLM completion.
pip install kenoma
Or from source:
git clone https://github.com/a9lim/kenoma
cd kenoma
pip install -e .
For bitsandbytes quantization (CUDA only):
pip install kenoma[quantize]
kenoma # defaults to Qwen/Qwen2.5-0.5B
kenoma google/gemma-3n-E4B
kenoma /path/to/local/model
The model argument is any Hugging Face model ID or a local path. kenoma is meant for base/completion models; instruction-tuned models may not work properly.
Settings are resolved in order of precedence: CLI flags, then KENOMA_* environment variables, then a TOML config file at $XDG_CONFIG_HOME/kenoma/config.toml (falling back to ~/.config/kenoma/config.toml).
Example config:
model = "google/gemma-3n-E4B"
device = "auto"
temperature = 1.0
top_p = 0.95
repetition_penalty = 1.05
max_new_tokens = 2048
context_chars = 6000
history = 20
tmux_lines = 300
quantize = "none"
kv_cache = true
compile = false
The env var for any key is KENOMA_<KEY> uppercased, so KENOMA_MODEL=gpt2 kenoma works.
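The precedence rules can be illustrated with a minimal Python sketch. This is not kenoma's actual implementation: the names KEYS, config_path, load_file, and resolve are hypothetical, environment values are left as raw strings, and the only default shown is the model default mentioned in the usage examples.

```python
from pathlib import Path

# Keys taken from the example config; the real tool may accept others.
KEYS = (
    "model", "device", "temperature", "top_p", "repetition_penalty",
    "max_new_tokens", "context_chars", "history", "tmux_lines",
    "quantize", "kv_cache", "compile",
)

def config_path(env):
    """Locate config.toml, falling back to ~/.config when XDG_CONFIG_HOME is unset or empty."""
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "kenoma" / "config.toml"

def load_file(path):
    """Read the TOML file if it exists; an absent file contributes nothing."""
    import tomllib  # standard library in Python 3.11+, imported lazily
    if not path.is_file():
        return {}
    with path.open("rb") as f:  # tomllib.load expects a binary file object
        return tomllib.load(f)

def resolve(cli, env, file_values, defaults=None):
    """Merge sources from lowest to highest precedence: defaults, file, env, CLI."""
    merged = dict(defaults or {"model": "Qwen/Qwen2.5-0.5B"})
    merged.update(file_values)
    for key in KEYS:
        name = "KENOMA_" + key.upper()
        if name in env:
            merged[key] = env[name]  # assumption: type conversion happens elsewhere
    # A CLI value of None is treated as "flag not given".
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged
```

With these assumptions, resolve({}, os.environ, load_file(config_path(os.environ))) would return the effective settings when no flags are passed.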
Flags:
--version: print version and exit.
--prompt: override the captured PS1. Multi-line prompts are not supported and fall back to a constructed user@host:cwd $.
--device {auto,cuda,mps,cpu}: auto resolves to cuda, then mps, then cpu.
--quantize {none,4bit,8bit}: bitsandbytes quantization. Requires CUDA and the quantize extra.
--no-kv-cache: disable KV cache reuse across turns.
--compile: torch.compile the model with a static KV cache for faster decode (best on CUDA). The first turn pays a compile cost; cross-turn KV cache reuse is forfeited because the static cache doesn't expose crop().
--history N: seed with the last N commands from shell history (0 disables).
--tmux-lines N: if inside tmux, seed with the last N lines of pane scrollback (0 disables).
--context-chars N: cap the rolling buffer at N chars.
--max-new-tokens N: per-turn cap on generated tokens.
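The auto fallback order for --device can be sketched as a small pure function. The name resolve_device and its boolean parameters are hypothetical and not taken from kenoma's source; they keep the example runnable without PyTorch installed.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Return the device to use; "auto" prefers cuda, then mps, then cpu."""
    if requested != "auto":
        # An explicit choice is passed through unchanged.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch, the two flags would typically come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(resolve_device("auto", cuda_available=False, mps_available=True))
```

Passing availability in as arguments, rather than querying the backend inside the function, makes the fallback order easy to check on any machine.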
Cancelling a turn. Ctrl-C during generation cancels the current turn, invalidates the KV cache, and redraws the prompt. Ctrl-C at the input prompt exits.
AGPL-3.0-or-later. See LICENSE.