
ruby_llm-tokenizer

Local, model-aware token counting for ruby_llm. A facade over Hugging Face tokenizers, OpenAI tiktoken_ruby, and SentencePiece bindings that maps model identifiers (gpt-4o, llama-3, mistral, ...) to the right tokenizer for counting, analyzing, and truncating text locally. No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.

Installation

bundle add ruby_llm-tokenizer

Or:

gem install ruby_llm-tokenizer

Requires Ruby >= 3.1.

Usage

require "ruby_llm/tokenizer"

# Count tokens
RubyLLM::Tokenizer.count("Hello, world!", model: "gpt-4o")
# => 4

# Detailed breakdown
analysis = RubyLLM::Tokenizer.analyze("Hello, world!", model: "gpt-4o")
analysis.ids     # => [13225, 11, 2375, 0]
analysis.tokens  # => ["Hello", ",", " world", "!"]
analysis.count   # => 4
analysis.model   # => "tiktoken:o200k_base"

# Truncate to fit a context window
RubyLLM::Tokenizer.truncate(
  huge_log,
  max_tokens: 30_000,
  model: "gpt-4o",
  overflow: :truncate_left  # drop oldest content; default is :truncate_right
)

# Stream/Enumerable inputs work too
RubyLLM::Tokenizer.truncate(
  File.foreach("huge_log.txt"),
  max_tokens: 30_000,
  model: "gpt-4o",
  overflow: :truncate_left
)

For stream-like inputs, truncate accepts any Enumerable of chunks (for example File.foreach(...)) and incrementally applies exactly the same token-limit semantics as for string input. Callers do not need to materialize the original source text up front, and some duplicate tokenization work during truncation is avoided, though the implementation may still retain the kept portion in memory.
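
To make the two overflow modes concrete, here is a minimal sketch of chunk-by-chunk truncation. It is not the gem's implementation: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for a real backend, and `truncate_stream` is an illustrative helper that returns the kept tokens rather than text.

```ruby
# Toy stand-in for a real tokenizer: one token per whitespace-separated word.
# (Hypothetical; real backends produce subword tokens.)
def toy_tokenize(text)
  text.split
end

# Keep at most max_tokens tokens from an Enumerable of text chunks.
# :truncate_right keeps the first tokens and stops reading early;
# :truncate_left keeps the last tokens using a bounded buffer.
def truncate_stream(chunks, max_tokens:, overflow: :truncate_right)
  kept = []
  chunks.each do |chunk|
    toy_tokenize(chunk).each do |token|
      if overflow == :truncate_right
        return kept if kept.size >= max_tokens
        kept << token
      else
        kept << token
        kept.shift if kept.size > max_tokens
      end
    end
  end
  kept
end

lines = ["alpha beta\n", "gamma delta\n", "epsilon\n"]
truncate_stream(lines, max_tokens: 3)
# => ["alpha", "beta", "gamma"]
truncate_stream(lines, max_tokens: 3, overflow: :truncate_left)
# => ["gamma", "delta", "epsilon"]
```

In this sketch only the kept tokens are ever held in memory, so the source can be arbitrarily large as long as the kept portion fits.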

Supported model families (built-in)

| Family | Backend | Encoding / Repo |
| --- | --- | --- |
| All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
| `gemini` | `sentencepiece` | bundled `.model`, override with `GEMINI_TOKENIZER_MODEL_FILE` |
| `llama-3` / `meta-llama` | `hugging_face` | `meta-llama/Meta-Llama-3-8B-Instruct` |
| `mistral` / `mixtral` | `hugging_face` | `mistralai/Mistral-7B-Instruct-v0.2` |
| `deepseek` | `hugging_face` | `deepseek-ai/DeepSeek-V2` |
| `qwen` | `hugging_face` | `Qwen/Qwen2.5-7B-Instruct` |

OpenAI model resolution is delegated to tiktoken_ruby, so new OpenAI models become available after running bundle update tiktoken_ruby, with no change to this gem. Override a specific model at runtime with RubyLLM::Tokenizer.register(...).

OpenAI encodings are bundled with tiktoken_ruby (no network needed). Hugging Face tokenizer.json files are downloaded lazily on first use, then persisted under cache_dir for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see Configuration.
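
The download-once, reuse-offline behavior can be pictured with the following sketch. Everything in it is hypothetical: `fetch_tokenizer_json`, the `fetcher` callable (standing in for an HTTP request to the Hub), the `MissingCacheError` class, and the on-disk layout are illustrative assumptions, not the gem's internals.

```ruby
require "fileutils"
require "tmpdir"

# Illustrative error class for this sketch only.
class MissingCacheError < StandardError; end

# Return the path to a cached tokenizer.json for repo, fetching it on first use.
# fetcher is any callable that returns the file contents for a repo name.
def fetch_tokenizer_json(repo, cache_dir:, offline: false, fetcher:)
  path = File.join(cache_dir, repo.gsub("/", "--"), "tokenizer.json")
  return path if File.exist?(path)
  raise MissingCacheError, "#{repo} is not cached and offline mode is on" if offline

  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, fetcher.call(repo))
  path
end

Dir.mktmpdir do |dir|
  calls = 0
  fetcher = lambda do |_repo|
    calls += 1
    "{}"
  end
  2.times { fetch_tokenizer_json("my-org/my-model", cache_dir: dir, fetcher: fetcher) }
  puts calls # prints 1: the second call was served from the cache
end
```

Once a file is in the cache, a later call with offline: true succeeds without touching the fetcher, which is the property that makes offline reuse possible.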

If a model ships a SentencePiece .model file instead of tokenizer.json, register it with the sentencepiece backend:

RubyLLM::Tokenizer.register(
  match: /^gemma-/,
  backend: :sentencepiece,
  model_file: "/path/to/tokenizer.model"
)

This backend uses the sentencepiece.rb gem. Add sentencepiece to your bundle and install the native SentencePiece library on your system.

Common install commands from the upstream project:

# macOS
brew install sentencepiece

# Ubuntu / Debian
sudo apt-get install sentencepiece libsentencepiece-dev

If you install the gem directly on Apple Silicon, upstream also notes that you may need to point RubyGems at Homebrew's prefix:

gem install sentencepiece -- --with-opt-dir=/opt/homebrew

Gemini uses the bundled lib/ruby_llm/tokenizer/data/gemini_tokenizer.model by default; set GEMINI_TOKENIZER_MODEL_FILE to override it.

Claude / Anthropic

Anthropic does not publish Claude's tokenizer. By default, model: "claude-..." raises UnknownModelError.

You can opt in to an approximate count (uses o200k_base as a stand-in; typically within 5–15% of the real number):

RubyLLM::Tokenizer.enable_claude_approximation!
RubyLLM::Tokenizer.count("Hello", model: "claude-3-5-sonnet-20241022")
# warns once, then returns an approximate Integer

Do not use approximate counts to enforce hard context limits — leave headroom, or call Anthropic's count_tokens endpoint for exact numbers.
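
One way to leave headroom is to inflate the approximate count by the worst expected error before comparing it with the limit. The sketch below assumes the upper end of the "typically within 5–15%" range as the margin; that figure is not a guarantee, and both helper names are hypothetical.

```ruby
# Worst-case real token count if the approximation undercounts by `margin`.
# The default of 0.15 is an assumption, not a documented bound.
def worst_case_tokens(approx_count, margin: 0.15)
  (approx_count * (1 + margin)).ceil
end

# True when the worst-case estimate still fits within the limit.
def fits_with_headroom?(approx_count, limit:, margin: 0.15)
  worst_case_tokens(approx_count, margin: margin) <= limit
end

fits_with_headroom?(170_000, limit: 200_000) # => true
fits_with_headroom?(180_000, limit: 200_000) # => false
```

A larger margin trades usable context for safety; when the limit is hard, an exact count from the provider remains the safer choice.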

Registering custom models

RubyLLM::Tokenizer.register(
  match: /^my-finetuned-llama/,
  backend: :hugging_face,
  repo: "my-org/my-finetuned-llama-tokenizer"
)

RubyLLM::Tokenizer.register(
  match: "gpt-4o-2024-internal",
  backend: :tiktoken,
  encoding: "o200k_base"
)

User registrations take precedence over built-ins.
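
The precedence rule can be illustrated with a toy registry. This is a sketch under stated assumptions, not the gem's code: `ToyRegistry` is hypothetical, String patterns are assumed to match exactly, and newer user registrations are assumed to be checked before older ones.

```ruby
# Illustrative registry: entries are [matcher, settings] pairs.
class ToyRegistry
  def initialize(builtins)
    @builtins = builtins
    @user = []
  end

  # Newer user registrations are checked first (an assumption of this sketch).
  def register(match:, **settings)
    @user.unshift([match, settings])
  end

  # User entries are searched before built-ins.
  # Strings must match exactly; Regexps use match?.
  def resolve(model)
    entry = (@user + @builtins).find do |matcher, _settings|
      matcher.is_a?(Regexp) ? matcher.match?(model) : matcher == model
    end
    entry&.last
  end
end

registry = ToyRegistry.new([[/^gpt-4o/, { backend: :tiktoken_auto }]])
registry.register(match: "gpt-4o-2024-internal", backend: :tiktoken, encoding: "o200k_base")

registry.resolve("gpt-4o-2024-internal")[:backend] # => :tiktoken (user entry wins)
registry.resolve("gpt-4o-mini")[:backend]          # => :tiktoken_auto (built-in)
registry.resolve("unknown-model")                  # => nil
```

Here an unmatched model resolves to nil; the gem itself documents raising UnknownModelError in that situation.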

Configuration

RubyLLM::Tokenizer.configure do |c|
  c.cache_dir        = Pathname("/tmp/ruby_llm_tokenizer")  # default: ~/.cache/ruby_llm/tokenizer; stores downloaded HF tokenizers
  c.offline          = false                                # if true, never hits the HF Hub
  c.hf_token         = ENV["HF_TOKEN"]                      # also reads HUGGING_FACE_HUB_TOKEN
  c.approximate_warn = true                                 # warn on first approximate use
end
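
The comment on hf_token mentions two environment variables. A fallback of that kind could look like the sketch below; the lookup order and the `resolve_hf_token` helper are assumptions for illustration, since the source does not say which variable wins when both are set.

```ruby
# Pick the first non-empty token from the given environment hash.
# The lookup order here is an assumption, not documented gem behavior.
def resolve_hf_token(env = ENV)
  %w[HF_TOKEN HUGGING_FACE_HUB_TOKEN]
    .map { |key| env[key] }
    .find { |value| value && !value.empty? }
end

resolve_hf_token("HUGGING_FACE_HUB_TOKEN" => "hf_example") # => "hf_example"
resolve_hf_token({})                                       # => nil
```

Passing the environment as an argument keeps the helper easy to test without mutating ENV.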

Errors

| Class | Raised when |
| --- | --- |
| `RubyLLM::Tokenizer::UnknownModelError` | No registered pattern matches the given model id |
| `RubyLLM::Tokenizer::BackendError` | The underlying tokenizer engine failed to load or encode |
| `RubyLLM::Tokenizer::CacheError` | `offline: true` and the local tokenizer.json is missing |
| `RubyLLM::Tokenizer::ContextExceededError` | A token count exceeds a defined limit (reserved for future use) |

Development

bin/setup
bundle exec rspec
bin/console

Releasing

SKIP_PUSH=1 ./build_release.sh
./build_release.sh
GEM_HOST_OTP=123456 ./build_release.sh

SKIP_PUSH=1 builds the gem and verifies the release artifact without publishing.
Running ./build_release.sh normally builds and pushes, letting gem push prompt for MFA.
GEM_HOST_OTP=... passes an explicit RubyGems OTP when you want a non-interactive push.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/washu/ruby_llm-tokenizer.

License

The gem is available as open source under the terms of the MIT License.
