Building block of collaborative data economy

ODF is a set of formats and protocols that enable the exchange of data between organizations and unlock global collaboration on data processing in a trustless environment. It's an open specification developed as a community effort.

Open Data Fabric

Introduction

ODF Introduction Video

FAQ

No. Databases focus on state and transactional processing, while ODF focuses on history of events and analytical processing. Databases store data in custom formats, while ODF proposes an interoperable format where data can be copied around as files and efficiently queried by multiple data engines, similar to the Open Lakehouse architecture. Databases are products, while ODF is a specification that allows multiple implementations. Think of it as the "HTTPS of data": a standardized way to package, share, process, and verify data with complete provenance tracking that works across organizational boundaries. It doesn't replace databases but is meant to be used alongside them, e.g. replicating your event stores for BI or maintaining projections in databases for faster queries.

ODF does not replace or compete with any data processing engines - it builds on top of them. It uses engines like Spark, Flink, Arroyo, Datafusion and others as plugins, imposing certain requirements on data format, metadata conventions, and protocols needed to make data processing verifiable and suitable for multi-party collaboration. ODF pipelines can even mix multiple different engines together!

ODF uses Parquet as its underlying data storage format. It shares many concepts with Iceberg, but with a few notable differences. Instead of a logical model of "tables", it models data as ledgers of events with explicit retractions and corrections for auditability. ODF preserves the complete history of changes by default; it is non-destructive. ODF is designed to work with immutable and content-addressable storage systems. And it goes beyond a data format to specify a complete model of how data can be processed while preserving provenance and verifiability.

Yes, ODF is designed to integrate with existing infrastructure and tools. You can ingest data from databases and run it alongside existing data lakehouses. Since ODF uses standard formats like Parquet and Arrow internally, you can also export data back to your existing systems. ODF implementations support standard query protocols like SQL, so your existing data science and BI tools and analytics platforms can query ODF datasets directly. The key benefit is that once data is in ODF format, it gains verifiability, provenance tracking, and collaboration capabilities that your existing infrastructure lacks.

ODF is designed with massive volumes of near-real-time data in mind (e.g. industrial IoT, finance). It builds on decades of advancements in analytical data management and modern lakehouse architectures. It uses a temporal processing model to allow highly efficient incremental processing of streaming data. Its scaling and performance characteristics should closely match those of the underlying engines, with minimal overhead introduced by metadata ledger upkeep and data hashing. Likewise, ODF is very efficient for data transfer and replication, including direct-to-storage access that fully utilizes your network bandwidth.

Consortium Objectives

ODF is a community effort to create a foundational data exchange mechanism that is not tied to any single vendor. Multiple organizations are contributing resources to this consortium driven by a set of common objectives:

1

Sovereign Exchange

Enable efficient data exchange between organizations without intermediaries. Handle peer-to-peer sharing and federated processing of large volumes of highly dynamic data.

2

Ownership & Accountability

Ensure all data is attributable to its rightful owner no matter how many hands it goes through, and that owners can be held accountable for the veracity of data they provide.

3

Collaboration & Trust

Enable global collaboration on data cleaning, enrichment, and integration in an environment of verifiable trust, where malicious actors are identified and exposed in a system with no central authority.

4

Complete Provenance

Allow all derivative data to be reliably traced back to its sources and all transformations to be audited, so that ensuring the validity of data takes minutes, not months.

5

Unified Privacy & Access Control

Unify how privacy and access control work across internal and external data and how they federate across organizations. Create a common foundation for integrating privacy-preserving compute technologies.

6

Quality, Recency, Reuse

Enable high degrees of data reuse by making high-quality and up-to-date data more readily available. Make data a grounding point of decision-making and societal debate over divisive issues.

7

Close Feedback Loops

Allow publishers to see how their data is used, and consumers to influence data design and delivery decisions in a spirit of open-source collaboration.

8

Equitable Economy

Provide mechanisms to fully align the incentives of publishers, collaborators, and consumers and build efficient and equitable supply chains that power data-driven society.

Join Discussion

Members

Implementations

ODF is an interoperability specification, not an end product.
To experience what it can do for you, try one of the tools that implement the spec.

Kamu is the official reference implementation of the ODF protocol. It provides a CLI for local-first data management and pipeline development, as well as a scalable, production-ready server implementation that can be used in cloud and on-premise environments. It serves as a proving ground for many ideas that progressively make their way into the core ODF spec.

Technology Pillars


Ledgerized Data

Mainstream analytical data formats represent data as tables that are edited, overwritten, and routinely lose history. This is not acceptable when data needs to move between organizations. ODF stores data and metadata as cryptographic ledgers. Consumers can see all aspects of dataset evolution, audit who changed data and when, use ledger hashes as stable references to datasets at a specific point in time without complex versioning or snapshotting, and hold publishers forever accountable for the data they provided, since any alteration of history would invalidate the hashes. ODF keeps all history by default and makes corrections, retractions, compactions, and GDPR erasures explicit.
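The tamper-evidence of a hash-linked ledger can be sketched in a few lines. This is a toy illustration, not the actual ODF block format: the `prev_block_hash` field and JSON encoding are assumptions made for the example.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a canonical JSON encoding of a block (illustrative encoding)."""
    return hashlib.sha3_256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, event: dict) -> list:
    """Append a block that references its predecessor by hash."""
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"prev_block_hash": prev, "event": event})
    return chain

def verify(chain: list) -> bool:
    """Recompute hashes front-to-back; any rewrite of history breaks a link."""
    prev = None
    for block in chain:
        if block["prev_block_hash"] != prev:
            return False
        prev = block_hash(block)
    return True

chain = []
append_block(chain, {"kind": "AddData", "rows": 100})
append_block(chain, {"kind": "SetWatermark", "ts": "2024-01-01T00:00:00Z"})
assert verify(chain)

chain[0]["event"]["rows"] = 99  # tamper with history...
assert not verify(chain)        # ...and any consumer can detect it
```

This is why a single head hash serves as a stable reference to the entire dataset history: changing any past block changes every hash downstream of it.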

Read more

Temporal Processing

Reading data from a database and writing it back as a new table is a common ETL practice today, but ODF avoids it. It keeps data processing within the system to ensure complete provenance and auditability of derivative data. Using the temporal processing paradigm, it can efficiently work with dynamic, potentially infinite streams of data. It makes the tradeoff between consistency and latency explicit and configurable, and allows corrections and retractions in source data to automatically propagate through multi-stage pipelines. This highly autonomous, self-correcting nature makes it uniquely suitable for multi-party collaboration.

Read more

Verifiable Computing

When someone presents you data as the result of a query or multi-stage processing, ODF allows you to verify that this data can be trusted. It fully automates verification through reproducibility: a result is traced back to trusted sources, and the raw data is used to replay all transformations and compare the resulting hashes. For private data this approach can be extended with a variety of confidential computing techniques, like TEEs, Zero-Knowledge proofs, and FHE. ODF serves as a foundation for integrating advanced privacy-preserving computing techniques and making them interoperable.
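The replay-and-compare check can be shown in miniature. The transformation, data, and hashing scheme below are all illustrative assumptions, not the ODF engine protocol:

```python
import hashlib
import json

def result_hash(rows) -> str:
    """Stable hash over a canonical encoding of the result (illustrative)."""
    return hashlib.sha3_256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

def transform(rows):
    """A deterministic derivation: keep positive values, order by id."""
    return sorted((r for r in rows if r["value"] > 0), key=lambda r: r["id"])

# A publisher claims this result was derived from a trusted source dataset:
source = [{"id": 2, "value": 5}, {"id": 1, "value": -3}, {"id": 3, "value": 7}]
claimed = transform(source)
claimed_hash = result_hash(claimed)

# A consumer verifies the claim by replaying the same transformation
# against the trusted source and comparing hashes:
assert result_hash(transform(source)) == claimed_hash
```

Because the transformation is deterministic, anyone with access to the sources can re-derive the result; a hash mismatch exposes either a different input or a different computation.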

Read more

Protocol Stack

The building blocks of ODF are presented bottom-up, expanding from simple to more complex behaviors:

Federated Network
Processing Pipelines
Core Format
Federated Querying
Query Proofs
Planning
Federated Querying

Enables distributed query execution across multiple datasets and repositories with cryptographic proofs of correctness. Query planning optimizes execution paths while maintaining provenance tracking.

Discovery
Registry
Anchoring
Discovery

Dataset registry and discovery mechanisms for locating data across decentralized networks. Anchoring provides immutable references to dataset states via blockchain or distributed ledgers.

Decentralized Access Control
Threshold Cryptography
Decentralized Access Control

Threshold cryptography enables shared access control where multiple parties must cooperate to grant permissions, eliminating single points of failure in access management.

Substrait

Query plan representation

Substrait

Cross-platform query intermediate representation enabling execution of queries across different engines while maintaining portability and reproducibility of transformations.

Flight SQL / ADBC

High-performance data transfer

Flight SQL

Apache Arrow Flight SQL protocol for high-performance database access with native Arrow data transfer, enabling efficient query execution and result streaming.

Confidential Computing

Privacy-preserving techniques

Confidential Computing

TEEs, FHE, Zero-Knowledge proofs.

Derivative Datasets
Lineage
Cell-level Provenance
Migrations
Derivative Datasets

Datasets created through deterministic transformations of inputs. Complete lineage tracking traces data back to sources, while cell-level provenance explains which input values influenced each output.

Verification
Verification

Cryptographic verification of transformations through reproducibility checks and logical hash validation. Ensures data integrity and enables detection of computation tampering.

Engine Interface
gRPC
Sandboxing
Engine Interface

Language-agnostic query engine communication via gRPC. Sandboxed OCI containers ensure reproducibility by preventing external network access and enforcing deterministic execution.

Temporal Processing
Event-time Processing
Watermarks
Checkpoints
Streaming SQL
Temporal Processing

Third-generation stream processing with bitemporal modeling distinguishing event time from system time. Watermarks signal event boundaries, windowing aggregates data, and checkpointing enables pause/resume with exactly-once semantics.
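A watermark can be illustrated with a simple heuristic. The bounded-lateness policy below is one common choice in stream processors, not something mandated by ODF:

```python
from datetime import datetime, timedelta

def watermark(observed_event_times, allowed_lateness=timedelta(minutes=5)):
    """Heuristic watermark: no events earlier than this are expected anymore."""
    return max(observed_event_times) - allowed_lateness

times = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 7),
    datetime(2024, 1, 1, 12, 3),  # out of order, but within the lateness bound
]
wm = watermark(times)  # 12:02

def window_closed(window_end: datetime, wm: datetime) -> bool:
    """A tumbling window can be finalized once the watermark passes its end."""
    return window_end <= wm

assert window_closed(datetime(2024, 1, 1, 12, 0), wm)
assert not window_closed(datetime(2024, 1, 1, 12, 5), wm)
```

Distinguishing event time from system time is what makes this work: the watermark advances in event time, while records arrive in system-time order.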

Verifiable Computing
Determinism
Reproducibility
Verifiable Computing

Ensures all transformations produce identical results when re-executed with the same inputs and engine version. Deterministic execution combined with reproducibility guarantees enables trustless verification.

Sources
Polling
Push
State
Sources

External data ingestion mechanisms for root datasets. Polling fetches data on schedule, push accepts real-time submissions, and state management tracks ingestion progress for incremental updates.

Ingestion
Merge Strategies
Ingestion

Three merge strategies handle external data: Append for raw insertion, Ledger for event streams with deduplication, and Snapshot for change data capture converting periodic exports into event logs.
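The Snapshot strategy amounts to diffing consecutive keyed exports into a changelog. A minimal sketch, with the caveat that the operation names and record shape here are illustrative, not the ODF on-disk encoding:

```python
def snapshot_to_events(prev: dict, curr: dict) -> list:
    """Diff two keyed snapshots into changelog events (illustrative op names)."""
    events = []
    for key, row in curr.items():
        if key not in prev:
            events.append({"op": "append", "key": key, "row": row})
        elif prev[key] != row:
            events.append({"op": "correct", "key": key, "row": row})
    for key in prev.keys() - curr.keys():
        events.append({"op": "retract", "key": key, "row": prev[key]})
    return events

# Yesterday's and today's exports of the same keyed table:
prev = {"a": {"price": 10}, "b": {"price": 20}}
curr = {"a": {"price": 12}, "c": {"price": 30}}

events = snapshot_to_events(prev, curr)
# "a" was corrected, "c" appended, "b" retracted
```

Converting periodic overwrites into explicit events is what lets downstream pipelines process the data incrementally instead of reprocessing the full snapshot.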

Compactions
Soft
Hard
Compactions

Soft compaction combines data files for efficiency while preserving full history. Hard compaction permanently removes old data for compliance (GDPR) or storage optimization, recorded explicitly in metadata.

Replication
Sync
Async
Replication

Dataset synchronization across repositories. Synchronous replication ensures immediate consistency, while asynchronous allows eventual consistency with lower latency and network requirements.

Dataset Layout
Named
Content-Addressable
Dataset Layout

Storage organization supporting both named references (aliases like "repo/dataset") and content-addressable access via cryptographic hashes for immutable dataset retrieval.

Repositories

P2P sharing

Repositories

Storage locations hosting datasets for peer-to-peer sharing. Support simple HTTP-based access or smart bidirectional protocols for efficient synchronization across distributed networks.

Transfer Protocols
HTTP
WS
Direct-to-Storage
Transfer Protocols

Simple HTTP for basic metadata and data fetching, WebSocket for bidirectional smart protocol communication, and direct-to-storage for efficient cloud object storage integration.

Metadata Chain
Semantics
Consistency Rules
Encoding
Metadata Chain

Immutable singly-linked list of blocks capturing complete dataset history—transformations, schemas, watermarks, and hashes. Each block references its predecessor cryptographically, creating auditable provenance.

Dataset Structure
References
Tags
Branches
Data Encoding
Dataset Structure

Organizational framework with symbolic references (head, branches), tags for milestone marking, and data file organization. Supports staged workflows and multi-branch development patterns.

Apache Arrow

In-memory data format

Apache Arrow

Cross-language in-memory columnar format enabling zero-copy data interchange between systems. Used for efficient processing and computing stable logical hashes (arrow-digest) for verification.

Apache Parquet

Columnar storage format

Apache Parquet

Columnar storage format for data at rest. All dataset events stored as Parquet files, providing efficient compression, encoding, and query performance for analytics workloads.

Flatbuffers

Binary serialization

Flatbuffers

Memory-efficient binary serialization for metadata blocks. Enables fast parsing without deserialization overhead and maintains backward compatibility for metadata chain evolution.

Metadata Format
Blocks
Events
JSON-LD
Encoding
Metadata Format

Structured metadata blocks containing dataset events (schema changes, transformations, watermarks). JSON-LD provides semantic extensibility, with YAML for human editing and FlatBuffers for binary storage.

Data Schema
ODF
DDL
Arrow
Data Schema

Schema definition using SQL-like DDL syntax with logical types abstracted from physical layouts. Arrow schema provides runtime representation. Evolution tracked in metadata chain with compatibility validation.

Data Format
Bitemporal
Changelog
Data Format

Bitemporal event streams with system_time and event_time dimensions. Changelog operations (append, retract, correct) enable explicit modifications without losing history, supporting late-arriving data and corrections.
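Projecting such a changelog into current state can be sketched as a replay in system-time order. The key column, column names, and op names here are assumptions made for the example:

```python
def project(changelog, key="city"):
    """Replay changelog records in system_time order into current state."""
    state = {}
    for r in sorted(changelog, key=lambda r: r["system_time"]):
        if r["op"] in ("append", "correct"):
            state[r[key]] = r["temp"]
        elif r["op"] == "retract":
            state.pop(r[key], None)
    return state

changelog = [
    {"op": "append",  "system_time": "2024-01-02", "event_time": "2024-01-01", "city": "A", "temp": 21},
    {"op": "append",  "system_time": "2024-01-02", "event_time": "2024-01-01", "city": "B", "temp": 18},
    # A late-arriving correction: same event_time, recorded later in system_time
    {"op": "correct", "system_time": "2024-01-05", "event_time": "2024-01-01", "city": "A", "temp": 20},
]
assert project(changelog) == {"A": 20, "B": 18}
```

Nothing is overwritten: the original reading for "A" remains in the stream, and the correction is a new record, so both the before and after views stay queryable.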

Identity
DID
PKH
ECDSA
Identity

W3C Decentralized Identifiers (did:odf) derived from cryptographic key pairs enable dataset ownership without central authority. Public Key Hash addresses and ECDSA signatures ensure authenticity.

Permissions
ReBAC
UCAN
JWT
Permissions

Relationship-Based Access Control (ReBAC) for fine-grained permissions, User Controlled Authorization Networks (UCAN) for decentralized delegation, and JWT for traditional token-based auth.

Content Addressing
Multibase
Multihash
SHA3
Content Addressing

Self-describing hash formats using Multibase encoding and Multihash algorithm identification with SHA3-256 cryptography. Enables content-addressable storage and future hash algorithm migration.
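The self-describing layout is easy to see in miniature. This sketch follows the multihash/multibase conventions (sha3-256 is code `0x16` in the multicodec table; `'f'` is the base16 multibase prefix); only the base16 encoding is shown:

```python
import hashlib

SHA3_256_CODE = 0x16  # sha3-256 entry in the multicodec table

def multihash_sha3_256(data: bytes) -> bytes:
    """Multihash layout: <algorithm code> <digest length> <digest>."""
    digest = hashlib.sha3_256(data).digest()
    return bytes([SHA3_256_CODE, len(digest)]) + digest

def multibase_base16(raw: bytes) -> str:
    """Multibase: a leading character ('f' = base16) names the text encoding."""
    return "f" + raw.hex()

ref = multibase_base16(multihash_sha3_256(b"hello"))
# 'f' names the encoding, '16' the algorithm, '20' the 32-byte digest length,
# so any reader can decode the reference without out-of-band knowledge.
assert ref.startswith("f1620")
```

Because the algorithm is encoded in the reference itself, a future migration to a different hash function changes the prefix rather than silently breaking old references.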

W3C DID

Identity

W3C DID

W3C Decentralized Identifier standard providing globally unique, cryptographically verifiable identifiers without central registry. Foundation for dataset ownership and authentication.

Verifiable Credentials

Attestation

W3C Verifiable Credentials

Standard for cryptographically secure, privacy-respecting credentials. Enables trusted attestations about dataset quality, compliance, or provenance without centralized verification.

UCAN

Authorization

UCAN

User Controlled Authorization Networks enable decentralized, delegated authorization with offline capability. Users grant permissions without requiring coordination with central authority.

JWT
JWT

JSON Web Tokens for stateless authentication and authorization. Industry-standard format for securely transmitting claims between parties with digital signature verification.

Encryption
Encryption

End-to-end encryption mechanisms for confidential data processing. Enables secure collaboration on sensitive datasets while maintaining privacy and access control.

Latest RFCs

Explore the latest Request for Comments shaping the future of Open Data Fabric protocol.

View All RFCs