Delta Lake Roadmap

Important

The following is the Delta Lake 2022 H2 Roadmap. For the latest updates, comments, and discussions, please refer to the GitHub source.


This is a working issue for folks to provide feedback on the prioritization of Delta Lake work spanning July to December 2022. With the release of Delta Lake 2.0, we wanted to take the opportunity to discuss other vital features for prioritization with the community, based on feedback from the Delta Users Slack, Google Groups, Community AMAs (on the Delta Lake YouTube channel), the Roadmap 2022H2 discussion, and more.

Note
  • Tasks that are crossed out have been completed.
  • We're currently clarifying the Delta Rust roadmap; please refer to this document for more information.

Priority 0


We will focus on these issues and continue to deliver parts (or all) of each issue over the next six months.

| Issue | Category | Task | Description |
|---|---|---|---|
| 256 | Flink | Flink Source | Build Flink source to read Delta tables in batch and streaming jobs |
| 238 | Flink | Flink SQL + Table API + Catalog Support | After Flink Sink and Source, build support for Flink Catalog, SQL, and Table API |
| 411, 410 | Flink | Productionize support for all cloud object stores | Make sure that Flink Sink can write robustly to S3, GCS, and ADLS2 with full transactional guarantees |
| | Rust | Integrate with a common object-store abstraction from the Arrow/Rust ecosystem | This will allow us to provide a more convenient and performant API on the Rust and Python side |
| | Rust | Support V2 writer protocol | Make the PyArrow-based writer function (write_deltalake) support writer protocol V2 and the object stores S3, GCS, and ADLS2 |
| | Rust | Expand write support for cloud object stores | Write to object stores S3, GCS, and ADLS2 from multiple clusters with full transactional guarantees |
| 1257 | Spark | Release Delta 2.1 on Apache Spark 3.3 | Ensure the latest version of Delta Lake works with the latest version of Apache Spark™ |
| 1367 | Spark | Support reading tables with Deletion Vectors | Allow reads on tables that have deletion vectors to mark rows in Parquet files as removed |
| 1242 | Spark | Support time travel SQL syntax | Delta currently supports time travel via Python and Scala APIs. We would like to extend support for the SQL syntax VERSION AS OF and TIMESTAMP AS OF in SELECT statements |
| | Standalone | Extend Delta Standalone for higher protocol versions | Extend Delta Standalone to support logs using higher protocol versions and advanced features like constraints, generated columns, column mapping, etc. |
| | Standalone | Expand support for data skipping in Delta Standalone | Extend the current data skipping to skip files using column stats and more expressions |
| | Website | Updated Delta Lake documentation | Move Delta Lake documentation to the website GitHub repo to allow easier community collaboration |
| | Website | Consolidate all connector documentation | Consolidate docs of all connectors in the website GitHub repo |
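
To make the data skipping item for Delta Standalone more concrete, here is a minimal sketch of the general idea behind skipping files by column stats. It is not Delta Standalone's implementation: the `FileStats` class, the `may_contain` and `files_to_scan` helpers, and the file names are all hypothetical, and the sketch assumes a single integer column with a known minimum and maximum per file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary: min and max of one integer column."""
    path: str
    min_value: int
    max_value: int


def may_contain(stats: FileStats, op: str, literal: int) -> bool:
    """Return False only when no row in the file can satisfy `column <op> literal`."""
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == ">":
        return stats.max_value > literal
    if op == ">=":
        return stats.max_value >= literal
    if op == "<":
        return stats.min_value < literal
    if op == "<=":
        return stats.min_value <= literal
    # Unsupported expression: keep the file, because skipping must never drop matches.
    return True


def files_to_scan(files, op, literal):
    """Keep only the files whose stats cannot rule out a match."""
    return [f.path for f in files if may_contain(f, op, literal)]


# Illustrative (made-up) files with non-overlapping value ranges.
files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

print(files_to_scan(files, ">", 150))
```

Supporting "more expressions" in this toy model means adding branches to `may_contain`; any expression it does not understand falls through to the conservative default of scanning the file.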

Priority 1


We should be able to deliver parts (or all) of each issue over the next six months.

| Issue | Category | Task | Description |
|---|---|---|---|
| 4 | Core | Delta Acceptance Testing (DAT) | With various languages interacting with the Delta protocol (e.g., Delta Standalone, Delta Spark, Delta Rust, Trino, etc.), we propose to have the same reference tables and library of reference tests to ensure all Delta APIs remain in compliance |
| 1347 | Core | Support Bloom filters | Improve query performance by utilizing bloom filters. The approach is TBD due to recent updates to Apache Parquet to support bloom filters |
| 1387 | Core | Enable Delta clone | Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not |
| | Delta connectors | GoLang Delta connector | Support GoLang reading a Delta Lake table natively |
| | Delta connectors | Improve partition filtering in Power BI client | Improved partition filtering using built-in UI filters in Power BI |
| | Delta connectors | Pulsar Source connector | Support Apache Pulsar reading a Delta Lake table natively |
| | Flink | Column stats generation in Flink Sink | Make the Flink Delta sink generate column stats |
| | Presto/Trino | Support higher protocol versions in Presto and Trino | Use Standalone to support higher protocol versions |
| | Rust | Delta Rust API Updates | Update APIs and support more high-level operations on top of delta; this includes better conflict resolution |
| | Rust | Better support for large logs | Better support for handling large Delta logs/snapshots |
| | Sharing Connectors | GoLang Delta Sharing client | Support GoLang client for Delta Sharing |
| | Sharing Connectors | R Delta Sharing client | Support R client for Delta Sharing |
| 1072 | Spark | Support for Identity columns | Create an identity column that will be automatically assigned a unique and statistically increasing (or decreasing if the step is negative) value |
| | Spark | Support querying Change Data Feed (CDF) using SQL queries | To support querying CDF using SQL queries in Apache Spark, we need to allow custom TVFs to be resolved using injected rules |
| 1156 | Spark | Support Auto Compaction | Provide auto compaction functionality to simplify compaction tasks |
| 1198 | Spark | Support Optimize Writes | Optimize Spark to Delta Lake writes |
| 1349 | Spark | Improve semantics of column mapping and Change Data Feed | Improve semantics of how column renames/drops (aka column mapping) interact with CDF and streaming |
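
The difference between deep and shallow clones described in the Delta clone item can be illustrated with a toy in-memory model. This is only a sketch of the concept, not how Delta Lake implements or will implement cloning: `ToyTable`, `shallow_clone`, and `deep_clone` are hypothetical names, and a "table" here is just a list of per-version file listings plus a dictionary standing in for file storage.

```python
from dataclasses import dataclass


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table (not a Delta Lake API)."""
    versions: list  # versions[v] is the list of file names visible at version v
    storage: dict   # file name -> file contents (bytes)


def shallow_clone(source: ToyTable, version: int) -> ToyTable:
    # Copy only the file listing; the clone keeps pointing at the source's storage.
    return ToyTable(versions=[list(source.versions[version])], storage=source.storage)


def deep_clone(source: ToyTable, version: int) -> ToyTable:
    # Copy the file listing and the data those files hold.
    names = list(source.versions[version])
    return ToyTable(versions=[names], storage={n: source.storage[n] for n in names})


source = ToyTable(
    versions=[["a.parquet"], ["a.parquet", "b.parquet"]],
    storage={"a.parquet": b"rows-a", "b.parquet": b"rows-b"},
)

shallow = shallow_clone(source, version=0)
deep = deep_clone(source, version=1)

# Removing data from the source affects the shallow clone but not the deep one.
del source.storage["a.parquet"]
print("a.parquet" in shallow.storage, "a.parquet" in deep.storage)
```

In this model the shallow clone is cheap to create but depends on the source's files continuing to exist, while the deep clone costs a full copy and is independent afterwards.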

Priority 2


Nice to have

| Issue | Category | Task | Description |
|---|---|---|---|
| | Sharing | Share individual partitions | Support sharing individual partitions in Delta Sharing |
| | Sharing Connectors | Rust Delta Sharing client | Support Rust client for Delta Sharing |
| | Sharing Connectors | Starburst/Trino Delta Sharing connector | Support Starburst/Trino client for Delta Sharing |
| | Sharing Connectors | Airflow Delta Sharing connector | Support sharing data from Airflow sensor |

History

  • 2022-08-01: Initial creation
  • 2022-08-02: Delta Sharing updates
  • 2022-08-08: Include Identity columns in the roadmap
  • 2022-09-13: Update issues and add auto compaction, optimize writes, and bloom filters to the roadmap
  • 2022-09-19: Update to include Delta Clone
  • 2022-09-22: Include the working Delta Rust roadmap document