ODF is a set of formats and protocols that enable exchange of data between organizations and unlock global collaboration on data processing in a trustless environment. It's an open specification developed as a community effort.
No. Databases focus on state and transactional processing, while ODF focuses on the history of events and analytical processing. Databases store data in custom formats, while ODF proposes an interoperable format where data can be copied around as files and efficiently queried by multiple data engines, similar to the Open Lakehouse architecture. Databases are products, while ODF is a specification that allows multiple implementations. Think of it as the "HTTPS of data" - a standardized way to package, share, process, and verify data with complete provenance tracking that works across organizational boundaries. It doesn't replace databases but is meant to be used alongside them, e.g. replicating your event stores for BI or maintaining projections in databases for faster queries.
ODF does not replace or compete with any data processing engines - it builds on top of them. It uses engines like Spark, Flink, Arroyo, DataFusion and others as plugins, imposing certain requirements on data format, metadata conventions, and protocols needed to make data processing verifiable and suitable for multi-party collaboration. ODF pipelines can even mix multiple different engines together!
ODF uses Parquet as its underlying data storage format. It shares many concepts with Iceberg, but with a few notable differences. Instead of a logical model of "tables", it models data as ledgers of events with explicit retractions and corrections for auditability. ODF preserves the complete history of changes by default; it is non-destructive. ODF is designed to work with immutable and content-addressable storage systems. It also goes beyond a data format to specify a complete model of how data can be processed while preserving provenance and verifiability.
Yes, ODF is designed to integrate with existing infrastructure and tools. You can ingest data from databases and run it alongside existing data lakehouses. Since ODF uses standard formats like Parquet and Arrow internally, you can also export data back to your existing systems. ODF implementations support standard query protocols like SQL, so your existing data science and BI tools and analytics platforms can query ODF datasets directly. The key benefit is that once data is in ODF format, it gains verifiability, provenance tracking, and collaboration capabilities that your existing infrastructure lacks.
ODF is designed with massive volumes of near-realtime data in mind (e.g. Industrial IoT, finance). It builds on decades of advancements in analytical data management and modern lakehouse architectures. It uses a temporal processing model to allow highly efficient incremental processing of streaming data. Its scaling and performance characteristics should closely match those of the underlying engines, with minimal overhead introduced by metadata ledger upkeep and data hashing. Likewise, ODF is very efficient for data transfer and replication, including direct-to-storage access that fully utilizes your network bandwidth.
ODF is a community effort to create a foundational data exchange mechanism that is not tied to any single vendor. Multiple organizations are contributing resources to this consortium driven by a set of common objectives:
Enable efficient data exchange between organizations without intermediaries. Handle peer-to-peer sharing and federated processing of large volumes of highly dynamic data.
Ensure all data is attributable to its rightful owner no matter how many hands it goes through, and that owners can be held accountable for the veracity of data they provide.
Enable global collaboration on data cleaning, enrichment, and integration in an environment of verifiable trust, where malicious actors are identified and exposed in a system with no central authority.
Allow all derivative data to be reliably traced back to its sources and all transformations to be audited, so that ensuring the validity of data takes minutes, not months.
Unify how privacy and access control work across internal and external data, and federate them across organizations. Create a common foundation for integrating privacy-preserving compute technologies.
Enable high degrees of data reuse by making high-quality and up-to-date data more readily available. Make data a grounding point of decision-making and societal debate over divisive issues.
Allow publishers to see how their data is used, and consumers to influence data design and delivery decisions in a spirit of open-source collaboration.
Provide mechanisms to fully align the incentives of publishers, collaborators, and consumers, and build efficient and equitable supply chains that power a data-driven society.
ODF is an interoperability specification, not an end product.
To experience what it can do for you, try one of the tools that implement the spec.
Mainstream analytical data formats represent data as tables that are edited, overwritten, and routinely lose history - this is not acceptable when data needs to move between organizations. ODF stores data and metadata as cryptographic ledgers. Consumers can see all aspects of dataset evolution, can audit who changed data and when, can use ledger hashes as stable references to datasets at a specific point in time without complex versioning or snapshotting, and can hold publishers forever accountable for the data they provided, as any alteration of history would invalidate the hashes. It keeps all history by default and makes corrections, retractions, compactions, and GDPR erasures explicit.
Reading data from a database and writing it back as a new table is a common ETL practice today, but ODF avoids it. It keeps data processing within the system to ensure complete provenance and auditability of derivative data. Using a temporal processing paradigm, it can efficiently work with dynamic, potentially infinite streams of data. It makes the tradeoff between consistency and latency explicit and configurable, and allows corrections and retractions in source data to automatically propagate through multi-stage pipelines. This highly autonomous, self-correcting nature makes it uniquely suitable for multi-party collaboration.
When someone presents you data as the result of a query or multi-stage processing, ODF allows you to verify that this data can be trusted. It fully automates verification through reproducibility: a result is traced back to trusted sources, and the raw data is used to replay all transformations and compare the hashes of the end results. For private data this approach can be extended with a variety of confidential computing techniques, like TEEs, Zero-Knowledge proofs, and FHE. ODF serves as a foundation for integrating advanced privacy-preserving computing techniques and making them interoperable.
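The reproducibility check can be sketched in a few lines of Python. This is a minimal illustration, not an ODF implementation: the transformation, record layout, and JSON-based hashing are hypothetical stand-ins for a real engine and for ODF's stable logical hashing (arrow-digest).

```python
import hashlib
import json

def transform(records):
    """A deterministic transformation: filter out sub-zero readings, project two columns."""
    return [{"city": r["city"], "temp_c": r["temp_c"]}
            for r in records if r["temp_c"] > 0]

def logical_hash(records):
    """Stable hash over canonically serialized records (stand-in for arrow-digest)."""
    canonical = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha3_256(canonical).hexdigest()

# The publisher runs the transformation and commits the hash of the result.
source = [{"city": "Oslo", "temp_c": -3}, {"city": "Lima", "temp_c": 21}]
claimed_hash = logical_hash(transform(source))

# A consumer replays the same transformation over the same trusted inputs
# and compares hashes; a match proves the published result was not tampered with.
assert logical_hash(transform(source)) == claimed_hash
```

Because the transformation is deterministic, any consumer with access to the trusted inputs can reproduce the exact hash without trusting the party that ran the computation.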
Building blocks of ODF are presented bottom-up, expanding from simple to more complex behavior:
Enables distributed query execution across multiple datasets and repositories with cryptographic proofs of correctness. Query planning optimizes execution paths while maintaining provenance tracking.
Dataset registry and discovery mechanisms for locating data across decentralized networks. Anchoring provides immutable references to dataset states via blockchain or distributed ledgers.
Threshold cryptography enables shared access control where multiple parties must cooperate to grant permissions, eliminating single points of failure in access management.
Query plan representation
Cross-platform query intermediate representation enabling execution of queries across different engines while maintaining portability and reproducibility of transformations.
High-performance data transfer
Apache Arrow Flight SQL protocol for high-performance database access with native Arrow data transfer, enabling efficient query execution and result streaming.
Privacy-preserving techniques
TEEs, FHE, and Zero-Knowledge proofs.
Datasets created through deterministic transformations of inputs. Complete lineage tracking traces data back to sources, while cell-level provenance explains which input values influenced each output.
Cryptographic verification of transformations through reproducibility checks and logical hash validation. Ensures data integrity and enables detection of computation tampering.
Language-agnostic query engine communication via gRPC. Sandboxed OCI containers ensure reproducibility by preventing external network access and enforcing deterministic execution.
Third-generation stream processing with bitemporal modeling distinguishing event time from system time. Watermarks signal event boundaries, windowing aggregates data, and checkpointing enables pause/resume with exactly-once semantics.
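The watermark mechanics can be sketched as a toy Python model. This is not an engine implementation; the tumbling-window policy, window size, and allowed lateness are illustrative assumptions.

```python
from collections import defaultdict

WINDOW = 10    # tumbling window size, in units of event time
LATENESS = 5   # how far the watermark trails the latest event time seen

def process(events):
    """Assign events to tumbling windows and finalize a window's count once the
    watermark passes its end. Events arriving behind the watermark are skipped
    here; a real engine would emit a retraction/correction instead."""
    windows = defaultdict(int)
    watermark = float("-inf")
    emitted = {}
    for event_time, _value in events:
        if event_time <= watermark:
            continue  # too late: its window was already finalized
        watermark = max(watermark, event_time - LATENESS)
        start = event_time - event_time % WINDOW
        windows[start] += 1
        for w in list(windows):
            if w + WINDOW <= watermark:  # watermark passed the window's end
                emitted[w] = windows.pop(w)
    return emitted, windows

closed, still_open = process([(1, "a"), (4, "b"), (12, "c"), (27, "d")])
# event at t=27 advances the watermark to 22, closing windows [0,10) and [10,20)
```

The watermark is what lets an infinite stream produce finite, final results: windows are held open only until the watermark guarantees no more on-time events can arrive for them.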
Ensures all transformations produce identical results when re-executed with the same inputs and engine version. Deterministic execution combined with reproducibility guarantees enables trustless verification.
External data ingestion mechanisms for root datasets. Polling fetches data on schedule, push accepts real-time submissions, and state management tracks ingestion progress for incremental updates.
Three merge strategies handle external data: Append for raw insertion, Ledger for event streams with deduplication, and Snapshot for change data capture converting periodic exports into event logs.
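The Snapshot strategy can be illustrated as a diff between two periodic exports. This is a simplified Python sketch under stated assumptions: the primary-key column is hypothetical, and a correction is emitted as a single event, whereas real ODF changelogs use a paired two-event model for corrections.

```python
def snapshot_to_changelog(prev, curr, key="id"):
    """Diff two periodic exports into an event log:
    new keys -> append, changed rows -> correct, missing keys -> retract."""
    prev_by_key = {r[key]: r for r in prev}
    curr_by_key = {r[key]: r for r in curr}
    events = []
    for k, row in curr_by_key.items():
        if k not in prev_by_key:
            events.append({"op": "append", **row})
        elif row != prev_by_key[k]:
            events.append({"op": "correct", **row})
    for k, row in prev_by_key.items():
        if k not in curr_by_key:
            events.append({"op": "retract", **row})
    return events

yesterday = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]
today     = [{"id": 1, "price": 12}, {"id": 3, "price": 30}]
events = snapshot_to_changelog(yesterday, today)
# one correction (id=1), one append (id=3), one retraction (id=2)
```

This is how a source that only publishes full table dumps can still feed an append-only, history-preserving event stream.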
Soft compaction combines data files for efficiency while preserving full history. Hard compaction permanently removes old data for compliance (GDPR) or storage optimization, recorded explicitly in metadata.
Dataset synchronization across repositories. Synchronous replication ensures immediate consistency, while asynchronous allows eventual consistency with lower latency and network requirements.
Storage organization supporting both named references (aliases like "repo/dataset") and content-addressable access via cryptographic hashes for immutable dataset retrieval.
P2P sharing
Storage locations hosting datasets for peer-to-peer sharing. Support simple HTTP-based access or smart bidirectional protocols for efficient synchronization across distributed networks.
Simple HTTP for basic metadata and data fetching, WebSocket for bidirectional smart protocol communication, and direct-to-storage for efficient cloud object storage integration.
Immutable singly-linked list of blocks capturing complete dataset history—transformations, schemas, watermarks, and hashes. Each block references its predecessor cryptographically, creating auditable provenance.
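The chain's tamper-evidence can be sketched as a toy hash-linked list in Python. The block fields and event kinds loosely mirror ODF metadata events (Seed, SetDataSchema, AddData) but are simplified; real blocks use FlatBuffers and Multihash, not JSON.

```python
import hashlib
import json

def block_hash(block):
    """Content hash of a block over its canonical serialization (illustrative)."""
    return hashlib.sha3_256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, event):
    """Append a block that cryptographically references its predecessor."""
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"prev_block_hash": prev, "event": event})

chain = []
append_block(chain, {"kind": "Seed", "dataset_id": "did:odf:example"})
append_block(chain, {"kind": "SetDataSchema", "fields": ["event_time", "city", "temp_c"]})
append_block(chain, {"kind": "AddData", "num_records": 100})

# The head hash is a stable reference to this exact version of the dataset.
head = block_hash(chain[-1])

# Any alteration of history breaks a link somewhere in the chain:
chain[0]["event"]["dataset_id"] = "did:odf:forged"
assert block_hash(chain[0]) != chain[1]["prev_block_hash"]  # tampering is detectable
```

Because each block commits to its predecessor's hash, rewriting any past event invalidates every later block, which is what makes publishers "forever accountable" for the history they publish.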
Organizational framework with symbolic references (head, branches), tags for milestone marking, and data file organization. Supports staged workflows and multi-branch development patterns.
In-memory data format
Cross-language in-memory columnar format enabling zero-copy data interchange between systems. Used for efficient processing and computing stable logical hashes (arrow-digest) for verification.
Columnar storage format
Columnar storage format for data at rest. All dataset events stored as Parquet files, providing efficient compression, encoding, and query performance for analytics workloads.
Binary serialization
Memory-efficient binary serialization for metadata blocks. Enables fast parsing without deserialization overhead and maintains backward compatibility for metadata chain evolution.
Structured metadata blocks containing dataset events (schema changes, transformations, watermarks). JSON-LD provides semantic extensibility, with YAML for human editing and FlatBuffers for binary storage.
Schema definition using SQL-like DDL syntax with logical types abstracted from physical layouts. Arrow schema provides runtime representation. Evolution tracked in metadata chain with compatibility validation.
Bitemporal event streams with system_time and event_time dimensions. Changelog operations (append, retract, correct) enable explicit modifications without losing history, supporting late-arriving data and corrections.
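A bitemporal changelog can be projected into state "as of" any system time. The sketch below is illustrative Python: the column values and the upsert-by-key projection are assumptions, and retractions (which would delete a key) are noted but not exercised.

```python
records = [
    # system_time: when the record entered the dataset; event_time: when it happened
    {"op": "append",  "system_time": "t1", "event_time": "2024-01-01", "city": "Oslo", "temp_c": -3},
    {"op": "append",  "system_time": "t1", "event_time": "2024-01-01", "city": "Lima", "temp_c": 21},
    # a late correction: same event_time, recorded later, history preserved
    {"op": "correct", "system_time": "t2", "event_time": "2024-01-01", "city": "Oslo", "temp_c": -5},
]

def as_of(records, system_time, key="city"):
    """Project the changelog into current state as seen at a given system time."""
    state = {}
    for r in records:
        if r["system_time"] > system_time:
            break  # records are ordered by system time
        state[r[key]] = r["temp_c"]  # append/correct upsert; a retract would delete
    return state

assert as_of(records, "t1") == {"Oslo": -3, "Lima": 21}
assert as_of(records, "t2") == {"Oslo": -5, "Lima": 21}
```

Nothing is overwritten: both what was believed at t1 and what is believed at t2 remain queryable, which is exactly what tables that update in place cannot offer.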
W3C Decentralized Identifiers (did:odf) derived from cryptographic key pairs enable dataset ownership without central authority. Public Key Hash addresses and ECDSA signatures ensure authenticity.
Relationship-Based Access Control (ReBAC) for fine-grained permissions, User Controlled Authorization Networks (UCAN) for decentralized delegation, and JWT for traditional token-based auth.
Self-describing hash formats using Multibase encoding and Multihash algorithm identification with SHA3-256 cryptography. Enables content-addressable storage and future hash algorithm migration.
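The self-describing format can be illustrated in a few lines of Python. This is a simplified sketch: it assumes sha3-256's single-byte multihash identifier (0x16) and the multibase prefix 'f' for lowercase base16, and glosses over varint encoding (irrelevant here since 0x16 fits in one byte).

```python
import hashlib

SHA3_256_CODE = 0x16  # multihash algorithm identifier for sha3-256

def multihash_sha3_256(data: bytes) -> str:
    """Produce a self-describing hash string: multibase('f') + code + length + digest."""
    digest = hashlib.sha3_256(data).digest()
    self_describing = bytes([SHA3_256_CODE, len(digest)]) + digest
    return "f" + self_describing.hex()  # 'f' = base16 (lowercase) multibase prefix

h = multihash_sha3_256(b"hello")
# 'f' names the base encoding, '16' the algorithm, '20' the 32-byte digest length,
# so any reader can decode the hash without out-of-band knowledge of the algorithm
assert h.startswith("f1620")
```

Embedding the algorithm and length in the hash itself is what makes future migration possible: a newer algorithm just gets a different identifier, and old hashes remain verifiable.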
Identity
W3C Decentralized Identifier standard providing globally unique, cryptographically verifiable identifiers without central registry. Foundation for dataset ownership and authentication.
Attestation
Standard for cryptographically secure, privacy-respecting credentials. Enables trusted attestations about dataset quality, compliance, or provenance without centralized verification.
Authorization
User Controlled Authorization Networks enable decentralized, delegated authorization with offline capability. Users grant permissions without requiring coordination with central authority.
JSON Web Tokens for stateless authentication and authorization. Industry-standard format for securely transmitting claims between parties with digital signature verification.
End-to-end encryption mechanisms for confidential data processing. Enables secure collaboration on sensitive datasets while maintaining privacy and access control.
Explore the latest Request for Comments shaping the future of Open Data Fabric protocol.
Introduces an experimental mechanism for associating large binary objects—such as images, videos, and documents—with ODF datasets. Uses hash-based references to maintain tamper-proof properties while enabling faster dataset access and flexible storage strategies.
Introduces a new human-readable, extensible schema format for ODF datasets. Replaces Arrow flatbuffer schemas with an explicitly defined logical schema supporting rich metadata like column descriptions, logical types, and custom annotations.
Introduces a new `op` system column to standardize how datasets represent data modifications. Implements a two-event changelog stream model to distinguish between regular appends, corrections, and retractions in immutable append-only streams.