Datastrato

Datastrato · 2026-05-04T18:34:47.212Z

R2-D2 was the original AI agent. Traversed the galaxy. Carried sensitive plans. Interfaced with every system he encountered. Zero governance layer. Zero audit trail. Just vibes and beeping. In 2026, your AI agents deserve better than R2-D2's architecture. May the 4th be with you. #MayThe4thBeWithYou #AgenticAI

Software Development

San Mateo, CA 1,524 followers

Original creator of Apache Gravitino. Unified metadata platform for AI - multi-cloud, multi-engine and multi-modal.

Discover all 20 employees

About us

Datastrato is building the open data fabric platform to accelerate trusted AI. The company is the original creator of Apache Gravitino, a unified metadata platform for AI: multi-cloud, multi-engine, and multi-modal.

Website: https://datastrato.ai/
Industry: Software Development
Company size: 11-50 employees
Headquarters: San Mateo, CA
Type: Privately Held
Founded: 2023
Specialties: Apache Gravitino, Universal Data Catalog, Federated Metadata Lake, Unified Metadata Management, Iceberg REST Catalog, Agentic Data Architecture, Data Agents, AI Infrastructure, Model Registry, Multi-Modal Data Management, AI Data Fabric, Unstructured Data Management, Data Governance, Data Lakehouse, Single Source of Truth, Multi-cloud Data Management, Open Source Software, AI-Native Data Lakehouse, and Modern Data System

Locations

Primary

San Mateo, CA 94402, US

Palo Alto, US



Updates

Datastrato

1,524 followers
1d
Most data estates end up spread across clouds and catalogs. There are usually two ways to handle that, and both cost you.

Consolidate everything into one central catalog: a migration measured in quarters, resisted by every team, as it strips them of control over their own data.

Or stitch the systems together by hand: per-system credentials, access control reimplemented a dozen times, and no single audit trail that can say who reached what.

Apache Gravitino 1.3 takes a third path. Govern data where it lives, without copying it and without taking ownership away from the teams that hold it.

- One Iceberg REST Catalog serving S3, GCS, and ADLS simultaneously, issuing short-lived credentials on demand.
- Federated IRC: a Gravitino catalog federates other Iceberg REST catalogs with no metadata copied, and the owning catalog still authorizes the real user and keeps its own audit log.

Mark Hoerth wrote up the full release, including the GA metadata cache (6.6x faster reads in-region, past 20x as storage gets slower), the AWS Glue catalog, a built-in identity provider, and governed agent access over MCP. Read it here: https://lnkd.in/gzcHSeKy

Want the deeper version?
- Watch the deep dive: https://lnkd.in/gs_aSNDt
- Watch our interview with Mark Hoerth and Jerry (Saisai) Shao: https://lnkd.in/gvQkknWU
- Join the live Q&A on July 15: https://lnkd.in/gpE7KzsT

#ApacheGravitino #ApacheIceberg #DataEngineering #OpenSource
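To make "short-lived credentials on demand" concrete, here is a minimal sketch of the general idea: each request gets a fresh credential scoped to one storage location, and the credential stops being valid after a fixed lifetime. The names `VendedCredential` and `vend`, the 15-minute default, and the random token are all assumptions for illustration; this is not Gravitino's API or any cloud provider's token format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
import secrets


@dataclass(frozen=True)
class VendedCredential:
    """Hypothetical scoped credential; not a real cloud token."""
    location: str         # storage prefix the credential is scoped to
    token: str            # opaque placeholder value
    expires_at: datetime  # after this instant the credential is rejected

    def is_valid(self, now: datetime) -> bool:
        return now < self.expires_at


def vend(location: str, now: datetime,
         ttl: timedelta = timedelta(minutes=15)) -> VendedCredential:
    """Issue a fresh credential for one location that expires after `ttl`."""
    return VendedCredential(location, secrets.token_urlsafe(16), now + ttl)
```

Because nothing long-lived is handed to the client, a leaked token in this model is only useful until `expires_at`, and only for the location it was scoped to.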
Datastrato

1,524 followers
5d Edited
Caching table metadata is one of those features that's easy to get wrong. Serve a stale entry, and an engine can plan against a commit that's no longer there. Skip the access check on a cache hit and you've quietly opened an authorization gap.

Apache Gravitino 1.3 turns the cache on by default anyway, because the design makes the fast path and the safe path the same one: every hit revalidates against the backend and still runs the access check.

It's one of the smaller decisions in the release, and there are plenty more like it.
- How credentials get vended across clouds.
- What stays with a catalog when you federate it.
- Why a few things waited for the Iceberg standard.

Mark Hoerth and Jerry (Saisai) Shao, who built it, are hosting an open session to talk through any of them. Architecture and design-choice questions welcome.

Jul 15th | 8am PDT/3pm GMT
Sign up for reminders on Luma: https://lnkd.in/gBB_Kbck

#ApacheGravitino #ApacheIceberg #DataEngineering
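The pattern described in the post, where a cache hit still revalidates and still runs the access check, can be sketched in a few lines. This is an illustration of the general technique only, not Gravitino's implementation: `RevalidatingCache`, `TableMetadata`, and the three callbacks are hypothetical names, and the `version` string stands in for whatever cheap freshness marker a real backend exposes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TableMetadata:
    table: str
    version: str  # stand-in for a pointer to the table's current commit
    payload: str  # stand-in for the full (expensive to load) metadata


class RevalidatingCache:
    """Hypothetical cache: every lookup checks access and freshness."""

    def __init__(self,
                 current_version: Callable[[str], str],
                 load: Callable[[str], TableMetadata],
                 can_read: Callable[[str, str], bool]) -> None:
        self._current_version = current_version  # cheap backend call
        self._load = load                        # expensive backend call
        self._can_read = can_read                # authorization check
        self._entries: Dict[str, TableMetadata] = {}
        self.hits = 0
        self.misses = 0

    def get(self, user: str, table: str) -> TableMetadata:
        # The access check runs before the cache is consulted,
        # so a hit can never bypass it.
        if not self._can_read(user, table):
            raise PermissionError(f"{user} may not read {table}")
        # Revalidate: compare the cached version with the backend's.
        latest = self._current_version(table)
        cached = self._entries.get(table)
        if cached is not None and cached.version == latest:
            self.hits += 1
            return cached
        # Missing or stale entry: reload and replace it.
        self.misses += 1
        fresh = self._load(table)
        self._entries[table] = fresh
        return fresh
```

The saving in this sketch comes from replacing the expensive load with a cheap version lookup, not from skipping the backend entirely, which is what keeps a stale entry from being served.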

Ask the Builders: Apache Gravitino 1.3

www.linkedin.com

Datastrato

1,524 followers
1w
Your data estate is already plural. The usual fix makes it worse.

The standard answer to multi-cloud sprawl is to consolidate: pick one catalog, copy everything in, migrate. The copy drifts from the source, access controls get rebuilt by people who don't own the data, and your audit trail splits in two.

Gravitino 1.3 takes the other path: govern data where it lives. We're going live to walk through it.

Mark Hoerth (Product Head) and Jerry (Saisai) Shao (Co-Founder and CTO) on what's new in 1.3:
- One catalog across clouds, public and private
- IRC federation with no metadata copied
- AWS Glue governed in place, no migration
- Faster reads out of the box, with table metadata cached automatically

Plus first-class views, nested namespaces, and plenty of room for questions.

Built for the Gravitino community: contributors, maintainers, and anyone running it in production.

Live on June 24th at 8am PDT/3pm GMT
Sign up for reminders: https://lnkd.in/gy5NSZWK

#ApacheGravitino #ApacheIceberg #DataEngineering #OpenSource

Apache Gravitino 1.3 Deep Dive: Govern at the Source

www.linkedin.com

Datastrato

1,524 followers
1w
Good infrastructure rarely starts on a roadmap. It starts with someone hitting a wall.

Roku's wall: data spread across three clouds and their own data center, with no single place to govern it. The standard fix is a catalog per cloud. Do that, and you're suddenly running three or four, none of them agreeing with each other.

Roku didn't want three catalogs. They wanted one that could hold all of it.

That's what shipped in Apache Gravitino 1.3. One catalog, where a dataset on your on-prem storage grid sits in the same namespace as one on S3 or GCS. The difference between clouds comes down to a LOCATION clause.

Very few catalogs can do this today, and there's a reason. The hard part isn't the idea, it's the credentials: vending the right one for each cloud and getting a single client to work across all of them. That's the piece the Datastrato team had to solve.

In the clip, Mark Hoerth tells the origin story and Jerry (Saisai) Shao gets into how it actually works.

Mark and Jerry walk through all of 1.3 in the webinar, Govern at the Source: https://lnkd.in/gy5NSZWK

#ApacheGravitino #ApacheIceberg #MultiCloud #DataEngineering

Datastrato reposted this
Apache Gravitino

322 followers
1w Edited
Apache Gravitino 1.3 will be released soon! The release centers on governing data where it lives: across clouds and across catalogs, without copying metadata into a second source of truth.

- Apache Iceberg REST Catalog federation: a Gravitino IRC can federate other IRC services natively. No metadata is copied, and the owning catalog keeps its own authorization, credential vending, and audit log.
- Multi-cloud within a single catalog: a new MultiSchemeFileIO routes storage by URI scheme, so one Iceberg REST Catalog can serve a default warehouse alongside backends on other clouds at once, while staying IRC-compliant. Vended credentials now refresh across S3, GCS, OSS, and ADLS.
- AWS Glue catalog: Glue estates come under unified governance without migration, with Trino and Spark adapters in the same release. A Hologres JDBC catalog is also added.
- Table metadata cache reaches GA: enabled by default, with an access check and a staleness guardrail on every hit, and reads measured 4x faster under concurrent load.
- Views and hierarchical namespaces: logical views become first-class, versioned entities across the catalog surface, and both the core server and the IRC now support multi-level nested namespaces.
- Identity and access: a built-in identity provider, group-aware ownership, role inheritance, and function-level authorization.

The IRC also gets more production-ready: asynchronous cleanup, ETag-based freshness, strict mode by default, health endpoints, and a move to Apache Iceberg 1.11.0.

Want a closer look? Mark Hoerth and Jerry (Saisai) Shao walk through 1.3 in a live webinar hosted by Datastrato on June 24, with time for questions: https://lnkd.in/gy5NSZWK

Thanks to everyone who contributed code, reviews, testing, and feedback to 1.3.

#ApacheGravitino #ApacheIceberg #DataEngineering #OpenSource #Metadata
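Routing storage by URI scheme, as the release notes describe for MultiSchemeFileIO, can be illustrated with a short sketch. This shows only the general dispatch idea and is not the MultiSchemeFileIO API: `SchemeRouter`, `backend_for`, and the backend names are hypothetical, and a real implementation would return an I/O client rather than a label.

```python
from typing import Dict
from urllib.parse import urlparse


class SchemeRouter:
    """Hypothetical router; picks a storage backend from a location's URI scheme."""

    def __init__(self, default_backend: str, backends: Dict[str, str]) -> None:
        self._default = default_backend
        # Normalize configured schemes so lookups are case-insensitive.
        self._backends = {scheme.lower(): name for scheme, name in backends.items()}

    def backend_for(self, location: str) -> str:
        # urlparse returns an empty scheme for plain paths, which
        # falls through to the default warehouse backend.
        scheme = urlparse(location).scheme.lower()
        return self._backends.get(scheme, self._default)
```

With a mapping such as `{"s3": "aws", "gs": "gcp", "abfss": "azure"}`, tables in one catalog can point at different clouds purely through their locations, while anything without a recognized scheme stays on the default warehouse.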

Apache Gravitino 1.3: Govern at the Source · Luma luma.com

Datastrato

1,524 followers
2w Edited
Your data estate is already plural. Nobody planned it that way. It accumulates one reasonable decision at a time, until your data lives in more clouds and more catalogs than anyone intended.

The reflex is to consolidate: pick one catalog, copy everything in, migrate. But the copy becomes a second source of truth that drifts from the original. Access controls get rebuilt by people who don't own the data. The audit trail splits in two.

Apache Gravitino 1.3 does the opposite. It governs data where it lives:
- One catalog across every cloud
- IRC federation with zero metadata copied
- AWS Glue is governed in place, no migration
- Enterprise identity that works out of the box

Mark Hoerth, our Product Head for Gravitino, and Jerry (Saisai) Shao, our co-founder and CTO, walk through it all on June 24, with time for your questions.

Save your spot: https://lnkd.in/gy5NSZWK

#ApacheGravitino #DataGovernance #DataEngineering #OpenSource #Iceberg
Datastrato

1,524 followers
1mo Edited
Heading to #SnowflakeSummit? Don't miss a great conversation about bringing enterprise-wide #data to your #ai #agents!

Adi Wabisabi
1mo Edited

I'm excited to announce the panel for our next Data for AI on Jun 3rd at Yes SF! Four experts in the field will speak on our panel about unifying enterprise-wide data for agents, each covering a different dimension of this challenge at scale.

Josh W., LiveRamp: Josh is a Principal Architect at LiveRamp, where he's at the intersection of AI, data and MarTech. In case you missed his last Data for AI presentation, he's built a semantic middle layer that allows LiveRamp's agents to understand the context of data from over a dozen systems. They had to balance cost, security, auditability, and many other factors to make this initiative a success. Previously, he's held senior roles at Highnote, Coinbase, Twilio, and Salesforce, and holds a patent on database server access management.

Mark Hoerth, Datastrato: Mark leads product at Datastrato, working on Apache Gravitino and the next chapter of open table formats for AI. He joined from Dremio, where he held product and solution architecture roles spanning Apache Iceberg lakehouse deployments and AI semantic search, and led Dremio's efforts on security, Iceberg, and Apache Polaris. A Stanford alum based in the Bay Area, Mark is a longtime Silicon Valley startup builder.

Andrew Madson, Fivetran: Andrew leads Developer Relations at Fivetran, where he builds programs that help developers and data teams adopt modern data and AI tooling. He's the author of O'Reilly's Apache Polaris: The Definitive Guide, with two more books on the way: AI-Ready Data (Wiley) and Data Transformation (O'Reilly). Andrew previously built DevRel functions at Tobiko Data and Dremio, and he teaches data science and engineering as a graduate professor.

Alexy Khrabrov, LakeSail: Dr. Khrabrov is the Head of Community at LakeSail, building the Spark-compatible AI Lakehouse of tomorrow in Rust. He is also the founder of the Community Research Center for Reliable AI at Northeastern University, and the founder and organizer of AI By the Bay, Bay Area AI, and AI Agent SF, the longest-running, deepest technical OSS AI communities, conferences, and meetups in the San Francisco Bay Area. Previously, Alexy was the Director of Open-Source Science at IBM Research, AI Community Architect at Neo4j, Senior Software Engineer at Amazon, and a co-founder and engineer in several Bay Area startups.

Spaces are still available, but are running out! Sign up today! https://luma.com/8tvd2xla

Thank you to our sponsors, Datastrato, Fivetran & LakeSail, for making this event possible! See you there!

Snowflake Summit Side Event: How Production AI Agents Access Data Across the Entire Enterprise · Luma luma.com

Datastrato

1,524 followers
1mo Edited
Do you remember the first time you implemented OAuth? It's a rite of passage for many developers. Take 4 minutes and watch this great demo by Bharath Krishna, which explains how Roku uses Apache Gravitino to unify data access across their systems. Link to the full talk is in the comments. Follow Datastrato to stay locked in on what's going on with Apache Gravitino and don't miss our next Data for AI during Snowflake Summit on Jun 3rd at Yes SF, where we'll be discussing how to unify your Data Platform for the Agentic Era. Sign up here: https://luma.com/8tvd2xla #data #airflow #spark

Datastrato

1,524 followers
1mo Edited
When it comes to Apache Iceberg Materialized Views, it's all under the surface. Take 6 minutes for a deep dive into this performance-improving technique with Mark Hoerth. Follow Datastrato to keep up to date with the latest on Iceberg and Apache Gravitino. #apacheiceberg #data #techtalk



Employees at Datastrato

Andrew D.

Tom Tan

Mark Hoerth

Shi Shao Feng

