Building block of collaborative data economy

ODF is a set of formats and protocols that enable the exchange of data between organizations and unlock global collaboration on data processing in a trustless environment. It's an open specification developed as a community effort.

Open Data Fabric

Introduction

ODF Introduction Video

FAQ

No. Databases focus on state and transactional processing, while ODF focuses on history of events and analytical processing. Databases store data in custom formats, while ODF proposes an interoperable format where data can be copied around as files and efficiently queried by multiple data engines, similar to the Open Lakehouse architecture. Databases are products, while ODF is a specification that allows multiple implementations. Think of it as the "HTTPS of data": a standardized way to package, share, process, and verify data with complete provenance tracking that works across organizational boundaries. It doesn't replace databases but is meant to be used alongside them, e.g. replicating your event stores for BI or maintaining projections in databases for faster queries.

ODF does not replace or compete with any data processing engines - it builds on top of them. It uses engines like Spark, Flink, Arroyo, Datafusion and others as plugins, imposing certain requirements on data format, metadata conventions, and protocols needed to make data processing verifiable and suitable for multi-party collaboration. ODF pipelines can even mix multiple different engines together!

ODF uses Parquet as its underlying data storage format. It shares many concepts with Iceberg, but with a few notable differences. Instead of a logical model of "tables", it models data as ledgers of events with explicit retractions and corrections for auditability. ODF preserves the complete history of changes by default; it is non-destructive. ODF is designed to work with immutable and content-addressable storage systems. And it goes beyond a data format to specify a complete model of how data can be processed while preserving provenance and verifiability.

Yes, ODF is designed to integrate with existing infrastructure and tools. You can ingest data from databases and run it alongside existing data lakehouses. Since ODF uses standard formats like Parquet and Arrow internally, you can also export data back to your existing systems. ODF implementations support standard query protocols like SQL, so your existing data science and BI tools and analytics platforms can query ODF datasets directly. The key benefit is that once data is in ODF format, it gains verifiability, provenance tracking, and collaboration capabilities that your existing infrastructure lacks.

ODF is designed with massive volumes of near-real-time data in mind (e.g. industrial IoT, finance). It builds on decades of advancements in analytical data management and modern lakehouse architectures. It uses a temporal processing model to allow highly efficient incremental processing of streaming data. Its scaling and performance characteristics should closely match those of the underlying engines, with minimal overhead introduced by metadata ledger upkeep and data hashing. Likewise, ODF is very efficient for data transfer and replication, including direct-to-storage access that fully utilizes your network bandwidth.

Consortium Objectives

ODF is a community effort to create a foundational data exchange mechanism that is not tied to any single vendor. Multiple organizations are contributing resources to this consortium driven by a set of common objectives:

1

Sovereign Exchange

Enable efficient data exchange between organizations without intermediaries. Handle peer-to-peer sharing and federated processing of large volumes of highly dynamic data.

2

Ownership & Accountability

Ensure all data is attributable to its rightful owner no matter how many hands it goes through, and that owners can be held accountable for the veracity of data they provide.

3

Collaboration & Trust

Enable global collaboration on data cleaning, enrichment, and integration in an environment of verifiable trust, where malicious actors are identified and exposed in a system with no central authority.

4

Complete Provenance

Allow all derivative data to be reliably traced back to its sources and all transformations to be audited, so that ensuring the validity of data takes minutes, not months.

5

Unified Privacy & Access Control

Unify how privacy and access control work across internal and external data and how they federate across organizations. Create a common foundation for integrating privacy-preserving compute technologies.

6

Quality, Recency, Reuse

Enable high degrees of data reuse by making high-quality and up-to-date data more readily available. Make data a grounding point of decision-making and societal debate over divisive issues.

7

Close Feedback Loops

Allow publishers to see how their data is used, and consumers to influence data design and delivery decisions in a spirit of open-source collaboration.

8

Equitable Economy

Provide mechanisms to fully align the incentives of publishers, collaborators, and consumers and build efficient and equitable supply chains that power data-driven society.

Join Discussion

Members

Implementations

ODF is an interoperability specification, not an end product.
To experience what it can do for you, try one of the tools that implement the spec.

Kamu is the official reference implementation of the ODF protocol. It provides a CLI for local-first data management and pipeline development, as well as a scalable, production-ready server implementation that can be used in cloud and on-premise environments. It serves as a proving ground for many ideas that progressively make their way into the core ODF spec.

Technology Pillars


Ledgerized Data

Mainstream analytical data formats represent data as tables that are edited, overwritten, and routinely lose history. This is not acceptable when data needs to move between organizations. ODF stores data and metadata as cryptographic ledgers. Consumers can see all aspects of dataset evolution, audit who changed data and when, use ledger hashes as stable references to datasets at a specific point in time without complex versioning or snapshotting, and hold publishers forever accountable for the data they provided, since any alteration of history would invalidate the hashes. ODF keeps all history by default and makes corrections, retractions, compactions, and GDPR erasures explicit.
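The tamper-evidence of a hash-linked ledger can be sketched in a few lines. This is a toy illustration, not the actual ODF block format: the `prev_block_hash` field and JSON encoding are assumptions made for the example.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a canonical JSON encoding of a block (illustrative encoding)."""
    return hashlib.sha3_256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, event: dict) -> list:
    """Append a block that references its predecessor by hash."""
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"prev_block_hash": prev, "event": event})
    return chain

def verify(chain: list) -> bool:
    """Recompute hashes front-to-back; any rewrite of history breaks a link."""
    prev = None
    for block in chain:
        if block["prev_block_hash"] != prev:
            return False
        prev = block_hash(block)
    return True

chain = []
append_block(chain, {"kind": "AddData", "rows": 100})
append_block(chain, {"kind": "SetWatermark", "ts": "2024-01-01T00:00:00Z"})
assert verify(chain)

chain[0]["event"]["rows"] = 99  # tamper with history...
assert not verify(chain)        # ...and any consumer can detect it
```

This is why a single head hash serves as a stable reference to the entire dataset history: changing any past block changes every hash downstream of it.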

Read more

Temporal Processing

Reading data from a database and writing it back as a new table is a common ETL practice today, but ODF avoids it. It keeps data processing within the system to ensure complete provenance and auditability of derivative data. Using the temporal processing paradigm, it can efficiently work with dynamic, potentially infinite streams of data. It makes the tradeoff between consistency and latency explicit and configurable, and allows corrections and retractions in source data to automatically propagate through multi-stage pipelines. This highly autonomous, self-correcting nature makes it uniquely suitable for multi-party collaboration.

Read more

Verifiable Computing

When someone presents you data as the result of a query or multi-stage processing, ODF allows you to verify that this data can be trusted. It fully automates verification through reproducibility: a result is traced back to trusted sources, and the raw data is used to replay all transformations and compare the resulting hashes. For private data this approach can be extended with a variety of confidential computing techniques, like TEEs, Zero-Knowledge proofs, and FHE. ODF serves as a foundation for integrating advanced privacy-preserving computing techniques and making them interoperable.
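The replay-and-compare check can be shown in miniature. The transformation, data, and hashing scheme below are all illustrative assumptions, not the ODF engine protocol:

```python
import hashlib
import json

def result_hash(rows) -> str:
    """Stable hash over a canonical encoding of the result (illustrative)."""
    return hashlib.sha3_256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

def transform(rows):
    """A deterministic derivation: keep positive values, order by id."""
    return sorted((r for r in rows if r["value"] > 0), key=lambda r: r["id"])

# A publisher claims this result was derived from a trusted source dataset:
source = [{"id": 2, "value": 5}, {"id": 1, "value": -3}, {"id": 3, "value": 7}]
claimed = transform(source)
claimed_hash = result_hash(claimed)

# A consumer verifies the claim by replaying the same transformation
# against the trusted source and comparing hashes:
assert result_hash(transform(source)) == claimed_hash
```

Because the transformation is deterministic, anyone with access to the sources can re-derive the result; a hash mismatch exposes either a different input or a different computation.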

Read more

Protocol Stack

The building blocks of ODF are presented bottom-up, expanding from simple to more complex behaviors:

Federated Network
Processing Pipelines
Core Format
Federated Querying
Query Proofs
Planning
Federated Querying

Enables distributed query execution across multiple datasets and repositories with cryptographic proofs of correctness. Query planning optimizes execution paths while maintaining provenance tracking.

Discovery
Registry
Anchoring
Discovery

Dataset registry and discovery mechanisms for locating data across decentralized networks. Anchoring provides immutable references to dataset states via blockchain or distributed ledgers.

Decentralized Access Control
Threshold Cryptography
Decentralized Access Control

Threshold cryptography enables shared access control where multiple parties must cooperate to grant permissions, eliminating single points of failure in access management.

Substrait

Query plan representation

Substrait

Cross-platform query intermediate representation enabling execution of queries across different engines while maintaining portability and reproducibility of transformations.

Flight SQL / ADBC

High-performance data transfer

Flight SQL

Apache Arrow Flight SQL protocol for high-performance database access with native Arrow data transfer, enabling efficient query execution and result streaming.

Confidential Computing

Privacy-preserving techniques

Confidential Computing

TEEs, FHE, Zero-Knowledge proofs.

Derivative Datasets
Lineage
Cell-level Provenance
Migrations
Derivative Datasets

Datasets created through deterministic transformations of inputs. Complete lineage tracking traces data back to sources, while cell-level provenance explains which input values influenced each output.

Verification
Verification

Cryptographic verification of transformations through reproducibility checks and logical hash validation. Ensures data integrity and enables detection of computation tampering.

Engine Interface
gRPC
Sandboxing
Engine Interface

Language-agnostic query engine communication via gRPC. Sandboxed OCI containers ensure reproducibility by preventing external network access and enforcing deterministic execution.

Temporal Processing
Event-time Processing
Watermarks
Checkpoints
Streaming SQL
Temporal Processing

Third-generation stream processing with bitemporal modeling distinguishing event time from system time. Watermarks signal event boundaries, windowing aggregates data, and checkpointing enables pause/resume with exactly-once semantics.
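A watermark can be illustrated with a simple heuristic. The bounded-lateness policy below is one common choice in stream processors, not something mandated by ODF:

```python
from datetime import datetime, timedelta

def watermark(observed_event_times, allowed_lateness=timedelta(minutes=5)):
    """Heuristic watermark: no events earlier than this are expected anymore."""
    return max(observed_event_times) - allowed_lateness

times = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 7),
    datetime(2024, 1, 1, 12, 3),  # out of order, but within the lateness bound
]
wm = watermark(times)  # 12:02

def window_closed(window_end: datetime, wm: datetime) -> bool:
    """A tumbling window can be finalized once the watermark passes its end."""
    return window_end <= wm

assert window_closed(datetime(2024, 1, 1, 12, 0), wm)
assert not window_closed(datetime(2024, 1, 1, 12, 5), wm)
```

Distinguishing event time from system time is what makes this work: the watermark advances in event time, while records arrive in system-time order.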

Verifiable Computing
Determinism
Reproducibility
Verifiable Computing

Ensures all transformations produce identical results when re-executed with the same inputs and engine version. Deterministic execution combined with reproducibility guarantees enables trustless verification.

Sources
Polling
Push
State
Sources

External data ingestion mechanisms for root datasets. Polling fetches data on schedule, push accepts real-time submissions, and state management tracks ingestion progress for incremental updates.

Ingestion
Merge Strategies
Ingestion

Three merge strategies handle external data: Append for raw insertion, Ledger for event streams with deduplication, and Snapshot for change data capture converting periodic exports into event logs.
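The Snapshot strategy amounts to diffing consecutive keyed exports into a changelog. A minimal sketch, with the caveat that the operation names and record shape here are illustrative, not the ODF on-disk encoding:

```python
def snapshot_to_events(prev: dict, curr: dict) -> list:
    """Diff two keyed snapshots into changelog events (illustrative op names)."""
    events = []
    for key, row in curr.items():
        if key not in prev:
            events.append({"op": "append", "key": key, "row": row})
        elif prev[key] != row:
            events.append({"op": "correct", "key": key, "row": row})
    for key in prev.keys() - curr.keys():
        events.append({"op": "retract", "key": key, "row": prev[key]})
    return events

# Yesterday's and today's exports of the same keyed table:
prev = {"a": {"price": 10}, "b": {"price": 20}}
curr = {"a": {"price": 12}, "c": {"price": 30}}

events = snapshot_to_events(prev, curr)
# "a" was corrected, "c" appended, "b" retracted
```

Converting periodic overwrites into explicit events is what lets downstream pipelines process the data incrementally instead of reprocessing the full snapshot.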

Compactions
Soft
Hard
Compactions

Soft compaction combines data files for efficiency while preserving full history. Hard compaction permanently removes old data for compliance (GDPR) or storage optimization, recorded explicitly in metadata.

Replication
Sync
Async
Replication

Dataset synchronization across repositories. Synchronous replication ensures immediate consistency, while asynchronous allows eventual consistency with lower latency and network requirements.

Dataset Layout
Named
Content-Addressable
Dataset Layout

Storage organization supporting both named references (aliases like "repo/dataset") and content-addressable access via cryptographic hashes for immutable dataset retrieval.

Repositories

P2P sharing

Repositories

Storage locations hosting datasets for peer-to-peer sharing. Support simple HTTP-based access or smart bidirectional protocols for efficient synchronization across distributed networks.

Transfer Protocols
HTTP
WS
Direct-to-Storage
Transfer Protocols

Simple HTTP for basic metadata and data fetching, WebSocket for bidirectional smart protocol communication, and direct-to-storage for efficient cloud object storage integration.

Metadata Chain
Semantics
Consistency Rules
Encoding
Metadata Chain

Immutable singly-linked list of blocks capturing complete dataset history—transformations, schemas, watermarks, and hashes. Each block references its predecessor cryptographically, creating auditable provenance.

Dataset Structure
References
Tags
Branches
Data Encoding
Dataset Structure

Organizational framework with symbolic references (head, branches), tags for milestone marking, and data file organization. Supports staged workflows and multi-branch development patterns.

Apache Arrow

In-memory data format

Apache Arrow

Cross-language in-memory columnar format enabling zero-copy data interchange between systems. Used for efficient processing and computing stable logical hashes (arrow-digest) for verification.

Apache Parquet

Columnar storage format

Apache Parquet

Columnar storage format for data at rest. All dataset events stored as Parquet files, providing efficient compression, encoding, and query performance for analytics workloads.

Flatbuffers

Binary serialization

Flatbuffers

Memory-efficient binary serialization for metadata blocks. Enables fast parsing without deserialization overhead and maintains backward compatibility for metadata chain evolution.

Metadata Format
Blocks
Events
JSON-LD
Encoding
Metadata Format

Structured metadata blocks containing dataset events (schema changes, transformations, watermarks). JSON-LD provides semantic extensibility, with YAML for human editing and FlatBuffers for binary storage.

Data Schema
ODF
DDL
Arrow
Data Schema

Schema definition using SQL-like DDL syntax with logical types abstracted from physical layouts. Arrow schema provides runtime representation. Evolution tracked in metadata chain with compatibility validation.

Data Format
Bitemporal
Changelog
Data Format

Bitemporal event streams with system_time and event_time dimensions. Changelog operations (append, retract, correct) enable explicit modifications without losing history, supporting late-arriving data and corrections.
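Projecting such a changelog into current state can be sketched as a replay in system-time order. The key column, column names, and op names here are assumptions made for the example:

```python
def project(changelog, key="city"):
    """Replay changelog records in system_time order into current state."""
    state = {}
    for r in sorted(changelog, key=lambda r: r["system_time"]):
        if r["op"] in ("append", "correct"):
            state[r[key]] = r["temp"]
        elif r["op"] == "retract":
            state.pop(r[key], None)
    return state

changelog = [
    {"op": "append",  "system_time": "2024-01-02", "event_time": "2024-01-01", "city": "A", "temp": 21},
    {"op": "append",  "system_time": "2024-01-02", "event_time": "2024-01-01", "city": "B", "temp": 18},
    # A late-arriving correction: same event_time, recorded later in system_time
    {"op": "correct", "system_time": "2024-01-05", "event_time": "2024-01-01", "city": "A", "temp": 20},
]
assert project(changelog) == {"A": 20, "B": 18}
```

Nothing is overwritten: the original reading for "A" remains in the stream, and the correction is a new record, so both the before and after views stay queryable.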

Identity
DID
PKH
ECDSA
Identity

W3C Decentralized Identifiers (did:odf) derived from cryptographic key pairs enable dataset ownership without central authority. Public Key Hash addresses and ECDSA signatures ensure authenticity.

Permissions
ReBAC
UCAN
JWT
Permissions

Relationship-Based Access Control (ReBAC) for fine-grained permissions, User Controlled Authorization Networks (UCAN) for decentralized delegation, and JWT for traditional token-based auth.

Content Addressing
Multibase
Multihash
SHA3
Content Addressing

Self-describing hash formats using Multibase encoding and Multihash algorithm identification with SHA3-256 cryptography. Enables content-addressable storage and future hash algorithm migration.
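The self-describing layout is easy to see in miniature. This sketch follows the multihash/multibase conventions (sha3-256 is code `0x16` in the multicodec table; `'f'` is the base16 multibase prefix); only the base16 encoding is shown:

```python
import hashlib

SHA3_256_CODE = 0x16  # sha3-256 entry in the multicodec table

def multihash_sha3_256(data: bytes) -> bytes:
    """Multihash layout: <algorithm code> <digest length> <digest>."""
    digest = hashlib.sha3_256(data).digest()
    return bytes([SHA3_256_CODE, len(digest)]) + digest

def multibase_base16(raw: bytes) -> str:
    """Multibase: a leading character ('f' = base16) names the text encoding."""
    return "f" + raw.hex()

ref = multibase_base16(multihash_sha3_256(b"hello"))
# 'f' names the encoding, '16' the algorithm, '20' the 32-byte digest length,
# so any reader can decode the reference without out-of-band knowledge.
assert ref.startswith("f1620")
```

Because the algorithm is encoded in the reference itself, a future migration to a different hash function changes the prefix rather than silently breaking old references.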

W3C DID

Identity

W3C DID

W3C Decentralized Identifier standard providing globally unique, cryptographically verifiable identifiers without central registry. Foundation for dataset ownership and authentication.

Verifiable Credentials

Attestation

W3C Verifiable Credentials

Standard for cryptographically secure, privacy-respecting credentials. Enables trusted attestations about dataset quality, compliance, or provenance without centralized verification.

UCAN

Authorization

UCAN

User Controlled Authorization Networks enable decentralized, delegated authorization with offline capability. Users grant permissions without requiring coordination with central authority.

JWT
JWT

JSON Web Tokens for stateless authentication and authorization. Industry-standard format for securely transmitting claims between parties with digital signature verification.

Encryption
Encryption

End-to-end encryption mechanisms for confidential data processing. Enables secure collaboration on sensitive datasets while maintaining privacy and access control.

Latest RFCs

Explore the latest Request for Comments shaping the future of Open Data Fabric protocol.

View All RFCs