Datastrato

Software Development

San Mateo, CA 1,524 followers

Original creator of Apache Gravitino. Unified metadata platform for AI - multi-cloud, multi-engine and multi-modal.

About us

Datastrato is building the open data fabric platform to accelerate trusted AI. The company is the original creator of Apache Gravitino, a unified metadata platform for AI: multi-cloud, multi-engine, and multi-modal.

Website
https://datastrato.ai/
Industry
Software Development
Company size
11-50 employees
Headquarters
San Mateo, CA
Type
Privately Held
Founded
2023
Specialties
Apache Gravitino, Universal Data Catalog, Federated Metadata Lake, Unified Metadata Management, Iceberg REST Catalog, Agentic Data Architecture, Data Agents, AI Infrastructure, Model Registry, Multi-Modal Data Management, AI Data Fabric, Unstructured Data Management, Data Governance, Data Lakehouse, Single Source of Truth, Multi-cloud Data Management, Open Source Software, AI-Native Data Lakehouse, and Modern Data System

Updates

  • Most data estates end up spread across clouds and catalogs. There are usually two ways to handle that, and both cost you. Consolidate everything into one central catalog: a migration measured in quarters, resisted by every team, as it strips them of control over their own data. Or stitch the systems together by hand: per-system credentials, access control reimplemented a dozen times, and no single audit trail that can say who reached what. Apache Gravitino 1.3 takes a third path. Govern data where it lives, without copying it and without taking ownership away from the teams that hold it. One Iceberg REST Catalog serving S3, GCS, and ADLS simultaneously, issuing short-lived credentials on demand. Federated IRC: a Gravitino catalog federates other Iceberg REST catalogs with no metadata copied, and the owning catalog still authorizes the real user and keeps its own audit log. Mark Hoerth wrote up the full release, including the GA metadata cache (6.6x faster reads in-region, past 20x as storage gets slower), the AWS Glue catalog, a built-in identity provider, and governed agent access over MCP. Read it here: https://lnkd.in/gzcHSeKy Want the deeper version? Watch the deep dive: https://lnkd.in/gs_aSNDt Watch our interview with Mark Hoerth and Jerry (Saisai) Shao: https://lnkd.in/gvQkknWU Join the live Q&A on July 15: https://lnkd.in/gpE7KzsT #ApacheGravitino #ApacheIceberg #DataEngineering #OpenSource

  • Caching table metadata is one of those features that's easy to get wrong. Serve a stale entry, and an engine can plan against a commit that's no longer there. Skip the access check on a cache hit, and you've quietly opened an authorization gap. Apache Gravitino 1.3 turns the cache on by default anyway, because the design makes the fast path and the safe path the same one: every hit revalidates against the backend and still runs the access check. It's one of the smaller decisions in the release, and there are plenty more like it:
    - How credentials get vended across clouds.
    - What stays with a catalog when you federate it.
    - Why a few things waited for the Iceberg standard.
    Mark Hoerth and Jerry (Saisai) Shao, who built it, are hosting an open session to talk through any of them. Architecture and design-choice questions welcome. Jul 15th | 8am PDT/3pm GMT Sign up for reminders on Luma: https://lnkd.in/gBB_Kbck #ApacheGravitino #ApacheIceberg #DataEngineering

    Ask the Builders: Apache Gravitino 1.3
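
A minimal sketch of the cache behavior described in this post, where a hit takes the same path as the safe case: the access check runs first, then a cheap freshness check against the backend decides whether the cached entry can be served. This is an illustration under stated assumptions, not Gravitino's implementation; `Backend`, `MetadataCache`, `current_version`, and the `can_read` callback are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TableMetadata:
    table: str
    version: str  # hypothetical freshness token (for example, a metadata pointer)
    payload: str


class Backend:
    """Hypothetical stand-in for the catalog backend."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableMetadata] = {}
        self.full_loads = 0  # counts expensive metadata reads

    def commit(self, table: str, version: str, payload: str) -> None:
        self._tables[table] = TableMetadata(table, version, payload)

    def current_version(self, table: str) -> str:
        # Cheap check: returns only the freshness token.
        return self._tables[table].version

    def load(self, table: str) -> TableMetadata:
        # Expensive read: returns the full metadata.
        self.full_loads += 1
        return self._tables[table]


class MetadataCache:
    def __init__(self, backend: Backend, can_read: Callable[[str, str], bool]) -> None:
        self._backend = backend
        self._can_read = can_read
        self._entries: Dict[str, TableMetadata] = {}

    def get(self, user: str, table: str) -> TableMetadata:
        # The access check runs on every call, hit or miss.
        if not self._can_read(user, table):
            raise PermissionError(f"{user} may not read {table}")
        cached = self._entries.get(table)
        latest = self._backend.current_version(table)
        # Serve the cached entry only if it still matches the backend.
        if cached is not None and cached.version == latest:
            return cached
        fresh = self._backend.load(table)
        self._entries[table] = fresh
        return fresh


backend = Backend()
backend.commit("db.events", "v1", "schema-1")
cache = MetadataCache(backend, can_read=lambda user, table: user == "alice")
first = cache.get("alice", "db.events")   # miss: full load
second = cache.get("alice", "db.events")  # hit: revalidated, no full load
```

In this sketch a stale entry is never returned, because the version comparison happens before the cached value is handed back, and an unauthorized caller is rejected even when the entry is already cached.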


  • Your data estate is already plural. The usual fix makes it worse. The standard answer to multi-cloud sprawl is to consolidate: pick one catalog, copy everything in, migrate. The copy drifts from the source, access controls get rebuilt by people who don't own the data, and your audit trail splits in two. Gravitino 1.3 takes the other path: govern data where it lives. We're going live to walk through it. Mark Hoerth (Product Head) and Jerry (Saisai) Shao (Co-Founder and CTO) on what's new in 1.3:
    - One catalog across clouds, public and private
    - IRC federation with no metadata copied
    - AWS Glue governed in place, no migration
    - Faster reads out of the box, with table metadata cached automatically
    Plus first-class views, nested namespaces, and plenty of room for questions. Built for the Gravitino community: contributors, maintainers, and anyone running it in production. Live on June 24th at 8am PDT/3pm GMT Sign up for reminders: https://lnkd.in/gy5NSZWK #ApacheGravitino #ApacheIceberg #DataEngineering #OpenSource

    Apache Gravitino 1.3 Deep Dive: Govern at the Source


  • Good infrastructure rarely starts on a roadmap. It starts with someone hitting a wall. Roku's wall: data spread across three clouds and their own data center, with no single place to govern it. The standard fix is a catalog per cloud. Do that, and you're suddenly running three or four, none of them agreeing with each other. Roku didn't want three catalogs. They wanted one that could hold all of it. That's what shipped in Apache Gravitino 1.3. One catalog, where a dataset on your on-prem storage grid sits in the same namespace as one on S3 or GCS. The difference between clouds comes down to a LOCATION clause. Very few catalogs can do this today, and there's a reason. The hard part isn't the idea, it's the credentials: vending the right one for each cloud and getting a single client to work across all of them. That's the piece the Datastrato team had to solve. In the clip, Mark Hoerth tells the origin story and Jerry (Saisai) Shao gets into how it actually works. Mark and Jerry walk through all of 1.3 in the webinar, Govern at the Source: https://lnkd.in/gy5NSZWK #ApacheGravitino #ApacheIceberg #MultiCloud #DataEngineering

  • Datastrato reposted this

    Apache Gravitino 1.3 will be released soon! The release centers on governing data where it lives: across clouds and across catalogs, without copying metadata into a second source of truth.
    - Apache Iceberg REST Catalog federation: a Gravitino IRC can federate other IRC services natively. No metadata is copied, and the owning catalog keeps its own authorization, credential vending, and audit log.
    - Multi-cloud within a single catalog: a new MultiSchemeFileIO routes storage by URI scheme, so one Iceberg REST Catalog can serve a default warehouse alongside backends on other clouds at once, while staying IRC-compliant. Vended credentials now refresh across S3, GCS, OSS, and ADLS.
    - AWS Glue catalog: Glue estates come under unified governance without migration, with Trino and Spark adapters in the same release. A Hologres JDBC catalog is also added.
    - Table metadata cache reaches GA: enabled by default, with an access check and a staleness guardrail on every hit, and reads measured 4x faster under concurrent load.
    - Views and hierarchical namespaces: logical views become first-class, versioned entities across the catalog surface, and both the core server and the IRC now support multi-level nested namespaces.
    - Identity and access: a built-in identity provider, group-aware ownership, role inheritance, and function-level authorization.
    The IRC also gets more production-ready: asynchronous cleanup, ETag-based freshness, strict mode by default, health endpoints, and a move to Apache Iceberg 1.11.0.
    Want a closer look? Mark Hoerth and Jerry (Saisai) Shao walk through 1.3 in a live webinar hosted by Datastrato on June 24, with time for questions: https://lnkd.in/gy5NSZWK Thanks to everyone who contributed code, reviews, testing, and feedback to 1.3. #ApacheGravitino #ApacheIceberg #DataEngineering #OpenSource #Metadata
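
The idea of routing storage by URI scheme, mentioned in the post above, can be illustrated with a short Python sketch. It is not the MultiSchemeFileIO implementation and does not reflect its API; `SchemeRouter`, `backend_for`, and the backend labels are hypothetical names chosen for the example, and only the general technique (pick a storage backend from the scheme of a table location, falling back to a default warehouse) comes from the post.

```python
from typing import Dict
from urllib.parse import urlparse


class SchemeRouter:
    """Hypothetical router: chooses a storage backend from a location's URI scheme."""

    def __init__(self, default: str, by_scheme: Dict[str, str]) -> None:
        self._default = default
        # Normalize keys so lookups are case-insensitive.
        self._by_scheme = {scheme.lower(): backend for scheme, backend in by_scheme.items()}

    def backend_for(self, location: str) -> str:
        scheme = urlparse(location).scheme.lower()
        # Unknown or missing schemes fall back to the default warehouse.
        return self._by_scheme.get(scheme, self._default)


router = SchemeRouter(
    default="default-warehouse",
    by_scheme={"s3": "s3-io", "gs": "gcs-io", "abfss": "adls-io"},
)

s3_backend = router.backend_for("s3://sales-bucket/warehouse/orders")
gcs_backend = router.backend_for("gs://ml-bucket/warehouse/features")
local_backend = router.backend_for("/data/warehouse/events")
```

With a lookup like this, tables in one catalog can sit on different clouds while clients see a single namespace; per-cloud credential vending would hang off the selected backend, which the sketch leaves out.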

  • Your data estate is already plural. Nobody planned it that way. It accumulates one reasonable decision at a time, until your data lives in more clouds and more catalogs than anyone intended. The reflex is to consolidate: pick one catalog, copy everything in, migrate. But the copy becomes a second source of truth that drifts from the original. Access controls get rebuilt by people who don't own the data. The audit trail splits in two. Apache Gravitino 1.3 does the opposite. It governs data where it lives:
    - One catalog across every cloud
    - IRC federation with zero metadata copied
    - AWS Glue is governed in place, no migration
    - Enterprise identity that works out of the box
    Mark Hoerth, our Product Head for Gravitino, and Jerry (Saisai) Shao, our co-founder and CTO, walk through it all on June 24, with time for your questions. Save your spot: https://lnkd.in/gy5NSZWK #ApacheGravitino #DataGovernance #DataEngineering #OpenSource #Iceberg

  • Heading to #SnowflakeSummit? Don't miss a great conversation about bringing enterprise-wide #data to your #ai #agents!

    View profile for Adi Wabisabi

    I'm excited to announce the panel for our next Data for AI on Jun 3rd at Yes SF! We have four experts in the field who will speak on our panel on unifying enterprise-wide data for agents, each covering a different dimension of this challenge at scale.

    Josh W., LiveRamp: Josh is a Principal Architect at LiveRamp, where he's at the intersection of AI, data and MarTech. In case you missed his last Data for AI presentation, he's built a semantic middle layer that allows LiveRamp's agents to understand the context of data from over a dozen systems. They had to balance cost, security, auditability, and many other factors to make this initiative a success. Previously, he held senior roles at Highnote, Coinbase, Twilio, and Salesforce, and he holds a patent on database server access management.

    Mark Hoerth, Datastrato: Mark leads product at Datastrato, working on Apache Gravitino and the next chapter of open table formats for AI. He joined from Dremio, where he held product and solution architecture roles spanning Apache Iceberg lakehouse deployments and AI semantic search, and led Dremio's efforts on security, Iceberg, and Apache Polaris. A Stanford alum based in the Bay Area, Mark is a longtime Silicon Valley startup builder.

    Andrew Madson, Fivetran: Andrew leads Developer Relations at Fivetran, where he builds programs that help developers and data teams adopt modern data and AI tooling. He's the author of O'Reilly's Apache Polaris: The Definitive Guide, with two more books on the way: AI-Ready Data (Wiley) and Data Transformation (O'Reilly). Andrew previously built DevRel functions at Tobiko Data and Dremio, and he teaches data science and engineering as a graduate professor.

    Alexy Khrabrov, LakeSail: Dr. Khrabrov is the Head of Community at LakeSail, building the Spark-compatible AI Lakehouse of tomorrow in Rust. He is also the founder of the Community Research Center for Reliable AI at Northeastern University, and the founder and organizer of AI By the Bay, Bay Area AI, and AI Agent SF, the longest-running, deepest technical OSS AI communities, conferences, and meetups in the San Francisco Bay Area. Previously, Alexy was the Director of Open-Source Science at IBM Research, AI Community Architect at Neo4j, Senior Software Engineer at Amazon, and a co-founder and engineer in several Bay Area startups.

    Spaces are still available, but are running out! Sign up today! https://luma.com/8tvd2xla Thank you to our sponsors, Datastrato, Fivetran & LakeSail for making this event possible! See you there!

  • Do you remember the first time you implemented OAuth? It's a rite of passage for many developers. Take 4 minutes and watch this great demo by Bharath Krishna, which explains how Roku uses Apache Gravitino to unify data access across their systems. Link to the full talk is in the comments. Follow Datastrato to stay locked in on what's going on with Apache Gravitino and don't miss our next Data for AI during Snowflake Summit on Jun 3rd at Yes SF, where we'll be discussing how to unify your Data Platform for the Agentic Era. Sign up here: https://luma.com/8tvd2xla #data #airflow #spark
