Local, model-aware token counting for ruby_llm.
A facade over Hugging Face tokenizers, OpenAI tiktoken_ruby, and SentencePiece bindings that maps model identifiers (gpt-4o, llama-3, mistral, ...) to the right tokenizer for counting, analyzing, and truncating text locally.
No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.
bundle add ruby_llm-tokenizer
Or:
gem install ruby_llm-tokenizer
Requires Ruby >= 3.1.
require "ruby_llm/tokenizer"
# Count tokens
RubyLLM::Tokenizer.count("Hello, world!", model: "gpt-4o")
# => 4
# Detailed breakdown
analysis = RubyLLM::Tokenizer.analyze("Hello, world!", model: "gpt-4o")
analysis.ids # => [13225, 11, 2375, 0]
analysis.tokens # => ["Hello", ",", " world", "!"]
analysis.count # => 4
analysis.model # => "tiktoken:o200k_base"
# Truncate to fit a context window
RubyLLM::Tokenizer.truncate(
huge_log,
max_tokens: 30_000,
model: "gpt-4o",
overflow: :truncate_left # drop oldest content; default is :truncate_right
)
# Stream/Enumerable inputs work too
RubyLLM::Tokenizer.truncate(
File.foreach("huge_log.txt"),
max_tokens: 30_000,
model: "gpt-4o",
overflow: :truncate_left
)
For stream-like inputs, truncate accepts any Enumerable of chunks (for example
File.foreach(...)) and incrementally applies the same exact token-limit semantics as
string input. This avoids requiring callers to materialize the original source text up
front and avoids some duplicate tokenization work during truncation, though the
implementation may still retain the kept portion in memory.
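The idea behind left-truncation over a stream can be sketched as a sliding window that keeps only the newest tokens. This is not the gem's implementation: the whitespace tokenizer and the method name truncate_left_stream are illustrative assumptions, and a real backend would work on model-specific token ids.

```ruby
# Hypothetical stand-in tokenizer: splits on whitespace. A real backend
# would return model-specific token ids instead.
TOY_TOKENIZE = ->(text) { text.split }

# Keep only the last `max_tokens` tokens seen across all chunks.
# Only the kept window is held in memory, never the whole source.
def truncate_left_stream(chunks, max_tokens:, tokenize: TOY_TOKENIZE)
  window = []
  chunks.each do |chunk|
    window.concat(tokenize.call(chunk))
    # Drop the oldest tokens once the window overflows.
    overflow = window.length - max_tokens
    window.shift(overflow) if overflow.positive?
  end
  window
end

lines = ["alpha beta\n", "gamma delta\n", "epsilon\n"]
kept = truncate_left_stream(lines.each, max_tokens: 3)
puts kept.join(" ") # prints "gamma delta epsilon"
```

Memory use is bounded by max_tokens plus one chunk, which matches the note above that the kept portion may still be retained in memory.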
| Family | Backend | Encoding / Repo |
|---|---|---|
| All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | tiktoken_auto | resolved via Tiktoken.encoding_for_model |
| gemini | sentencepiece | bundled .model, override with GEMINI_TOKENIZER_MODEL_FILE |
| llama-3 / meta-llama | hugging_face | meta-llama/Meta-Llama-3-8B-Instruct |
| mistral / mixtral | hugging_face | mistralai/Mistral-7B-Instruct-v0.2 |
| deepseek | hugging_face | deepseek-ai/DeepSeek-V2 |
| qwen | hugging_face | Qwen/Qwen2.5-7B-Instruct |
OpenAI model resolution is delegated to tiktoken_ruby, so new OpenAI models become available after running bundle update tiktoken_ruby, with no change to this gem. Override a specific model at runtime with RubyLLM::Tokenizer.register(...).
OpenAI encodings are bundled with tiktoken_ruby (no network needed). Hugging Face tokenizer.json files are downloaded lazily on first use, then persisted under cache_dir for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see Configuration.
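The download-once, then reuse-offline behavior can be pictured with a small sketch. Everything here is an assumption made for illustration: the cache layout (cache_dir/repo/tokenizer.json), the method name resolve_tokenizer_file, the fetch callable, and the top-level CacheError stand-in. The gem's real internals may differ.

```ruby
require "pathname"
require "tmpdir"

# Stand-in for the gem's CacheError, defined here so the sketch runs alone.
class CacheError < StandardError; end

# Resolve a tokenizer.json path: reuse the cached file if present,
# otherwise fetch it once (unless offline) and persist it.
# `fetch` is a hypothetical callable that returns the file contents.
def resolve_tokenizer_file(repo, cache_dir:, offline:, fetch:)
  path = Pathname(cache_dir).join(repo, "tokenizer.json")
  return path if path.file?
  raise CacheError, "#{repo} is not cached and offline mode is on" if offline

  path.dirname.mkpath
  path.write(fetch.call(repo))
  path
end

Dir.mktmpdir do |dir|
  downloads = 0
  fetch = ->(_repo) { downloads += 1; "{}" }
  2.times do
    resolve_tokenizer_file("Qwen/Qwen2.5-7B-Instruct",
                           cache_dir: dir, offline: false, fetch: fetch)
  end
  puts downloads # prints 1: the second call is served from the cache
end
```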
If a model ships a SentencePiece .model file instead of tokenizer.json, register it with the sentencepiece backend:
RubyLLM::Tokenizer.register(
match: /^gemma-/,
backend: :sentencepiece,
model_file: "/path/to/tokenizer.model"
)
This backend uses the sentencepiece.rb gem. Add sentencepiece to your bundle and install the native SentencePiece library on your system.
Common install commands from the upstream project:
# macOS
brew install sentencepiece
# Ubuntu / Debian
sudo apt-get install sentencepiece libsentencepiece-dev
If you install the gem directly on Apple Silicon, upstream also notes that you may need to point RubyGems at Homebrew's prefix:
gem install sentencepiece -- --with-opt-dir=/opt/homebrew
Gemini uses the bundled lib/ruby_llm/tokenizer/data/gemini_tokenizer.model by default; set GEMINI_TOKENIZER_MODEL_FILE to override it.
Anthropic does not publish Claude's tokenizer. By default, model: "claude-..." raises UnknownModelError.
You can opt in to an approximate count (uses o200k_base as a stand-in; typically within 5–15% of the real number):
RubyLLM::Tokenizer.enable_claude_approximation!
RubyLLM::Tokenizer.count("Hello", model: "claude-3-5-sonnet-20241022")
# warns once, then returns an approximate Integer
Do not use approximate counts to enforce hard context limits; leave headroom, or call Anthropic's count_tokens endpoint for exact numbers.
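One way to leave headroom is to shrink the limit by the worst-case error before comparing it with an approximate count. The sketch below assumes the 15% upper end quoted above and a 200,000-token limit chosen purely for illustration; safe_token_budget is a hypothetical helper, not part of the gem. Because the 5–15% figure is only typical, this reduces the risk of overflow but does not eliminate it.

```ruby
# If the approximation can undercount by up to `max_error_percent` of the
# real value, keeping approximate counts at or below this budget keeps the
# real count at or below `context_limit`. Integer math avoids float rounding.
def safe_token_budget(context_limit, max_error_percent: 15)
  context_limit * (100 - max_error_percent) / 100
end

budget = safe_token_budget(200_000)
puts budget # prints 170000
```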
RubyLLM::Tokenizer.register(
match: /^my-finetuned-llama/,
backend: :hugging_face,
repo: "my-org/my-finetuned-llama-tokenizer"
)
RubyLLM::Tokenizer.register(
match: "gpt-4o-2024-internal",
backend: :tiktoken,
encoding: "o200k_base"
)
User registrations take precedence over built-ins.
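The precedence rule can be illustrated with a toy registry that checks user entries before built-ins. ToyRegistry is a hypothetical class written for this sketch. Treating String matchers as exact matches and checking the newest user registration first are assumptions, not documented behavior of the gem.

```ruby
# Toy registry: each entry pairs a matcher (String or Regexp) with a spec.
class ToyRegistry
  def initialize(builtins)
    @builtins = builtins
    @user = []
  end

  # Assumption: the newest user registration is checked first.
  def register(match:, **spec)
    @user.unshift([match, spec])
  end

  # `===` is equality for Strings and a pattern match for Regexps.
  def resolve(model)
    entry = (@user + @builtins).find { |match, _spec| match === model }
    entry ? entry.last : raise(KeyError, "no tokenizer registered for #{model}")
  end
end

registry = ToyRegistry.new([[/^gpt-4o/, { backend: :tiktoken_auto }]])
registry.register(match: "gpt-4o-2024-internal", backend: :tiktoken, encoding: "o200k_base")

p registry.resolve("gpt-4o-2024-internal") # user entry wins over the built-in regex
p registry.resolve("gpt-4o-mini")          # falls through to the built-in
```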
RubyLLM::Tokenizer.configure do |c|
c.cache_dir = Pathname("/tmp/ruby_llm_tokenizer") # default: ~/.cache/ruby_llm/tokenizer; stores downloaded HF tokenizers
c.offline = false # if true, never hits the HF Hub
c.hf_token = ENV["HF_TOKEN"] # also reads HUGGING_FACE_HUB_TOKEN
c.approximate_warn = true # warn on first approximate use
end
| Class | Raised when |
|---|---|
| RubyLLM::Tokenizer::UnknownModelError | No registered pattern matches the given model id |
| RubyLLM::Tokenizer::BackendError | Underlying tokenizer engine failed to load or encode |
| RubyLLM::Tokenizer::CacheError | offline: true and the local tokenizer.json is missing |
| RubyLLM::Tokenizer::ContextExceededError | Raised when a token count exceeds a defined limit (reserved for future use) |
bin/setup
bundle exec rspec
bin/console
SKIP_PUSH=1 ./build_release.sh
./build_release.sh
GEM_HOST_OTP=123456 ./build_release.sh
- SKIP_PUSH=1 builds the gem and verifies the release artifact without publishing.
- Running ./build_release.sh normally builds and pushes, letting gem push prompt for MFA.
- GEM_HOST_OTP=... passes an explicit RubyGems OTP when you want a non-interactive push.
Bug reports and pull requests are welcome on GitHub at https://github.com/washu/ruby_llm-tokenizer.
The gem is available as open source under the terms of the MIT License.