Delta Lake Videos

Watch the latest videos and webinars for the open-source Delta Lake project.

Delta Lake Roadmap 2021 H2

We're starting to get feedback from the Delta Lake community on more integrations per the proposed 2021 H2 roadmap. In this talk, Vini and Denny recap the features, including Spark 3.1 support and Delta Sharing released in 2021 H1, and cover what the community is asking for in the future Delta Lake OSS roadmap. There are callouts for OPTIMIZE, Apache Heron, and Trino CTAS support, as well as the current integration efforts around Apache Flink, PrestoDB, Apache Pulsar, LakeFS, and Nessie. Don't forget the standalone readers and writers and the Rust API optimizations.

Data Reliability for Data Lakes

Building a modern data lake requires dealing with a lot of complexity: querying historical and streaming data simultaneously (lambda architecture), validation to ensure data isn't too messy for data science and machine learning, reprocessing to handle failures, and ensuring ACID-compliant data updates. We created the Delta Lake project, open sourced under the Linux Foundation, to relieve data scientists and data engineers from these complex systems problems and instead enable them to focus on extracting value from data. In this talk, we'll dive into these challenges and how ACID transactions solve them. We'll discuss the patterns that emerge when you can focus on data quality, and the nitty-gritty internals of ACID on Spark that enable this focus.
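
As a taste of what ACID updates on Delta Lake look like in practice, here is a minimal upsert sketch in PySpark. It assumes the delta-spark package is installed; the paths, the event_id join key, and the JSON source are illustrative, not from the talk.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical batch of late-arriving or corrected events.
updates = spark.read.format("json").load("/mnt/raw/event_updates")

events = DeltaTable.forPath(spark, "/mnt/lake/events")

# The whole upsert commits atomically: concurrent readers see either the old
# snapshot or the new one, never a partially applied batch.
(events.alias("t")
    .merge(updates.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```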

SmartSQL Queries powered by Delta Engine on Lakehouse

As a data analyst, have you ever wanted to simply add some machine learning capabilities to your SQL queries? Does your database engine require additional laborious steps to leverage Python or R functionality for your data scientists to work with? Do you feel that your teams are siloed from each other, preventing you from getting the most out of your data? Join us as we work together to build machine learning algorithms into simple functions that our data analysts can use to build smarts into their analytics.
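
For a flavor of the pattern, here is a small sketch of wrapping a scoring function as a Spark UDF so it can be called straight from SQL. The churn_score logic is a stand-in for a real trained model, and the customers table and its columns are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical scoring function standing in for a real trained model.
def churn_score(total_orders, days_since_last_order):
    if total_orders is None or days_since_last_order is None:
        return None
    return min(1.0, days_since_last_order / (total_orders * 30.0 + 1.0))

# Register it so analysts can call it directly from SQL.
spark.udf.register("churn_score", churn_score, DoubleType())

# An analyst can now embed the "ML" function in a plain SQL query.
spark.sql("""
  SELECT customer_id,
         churn_score(total_orders, days_since_last_order) AS risk
  FROM customers
  ORDER BY risk DESC
""").show()
```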

Generating Surrogate Keys for your Data Lakehouse with Spark SQL and Delta Lake

For this tech chat, we will discuss a popular data warehousing fundamental: surrogate keys. As we have discussed in various other Delta Lake tech talks, the reliability brought to data lakes by Delta Lake has driven a resurgence of many data warehousing fundamentals, such as Change Data Capture, in data lakes. Surrogate keys are unique and lack any business context, so they can stand the test of time when joining domain (or dimensional) and fact data. Generating them can be difficult in single-node systems and even more complex in distributed systems. In this session, we will discuss the history and value of surrogate keys and the requirements of a good strategy for implementing this data warehousing fundamental in your Delta Lake.
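
One common strategy is to offset row_number() by the dimension table's current maximum key. A rough sketch, with hypothetical table and column names and the staging schema assumed to match the dimension:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical staging data and dimension table names.
new_customers = spark.table("staging_customers")   # business keys only
dim = spark.table("dim_customer")                  # has a surrogate_key column

# Offset new keys by the current maximum so they stay unique across loads.
max_key = dim.agg(F.coalesce(F.max("surrogate_key"), F.lit(0))).first()[0]

w = Window.orderBy("customer_business_key")
keyed = new_customers.withColumn(
    "surrogate_key", F.row_number().over(w) + F.lit(max_key))

keyed.write.format("delta").mode("append").saveAsTable("dim_customer")
```

Note that an unpartitioned window funnels the new rows through a single task, which is fine for modest staging batches; hash-based keys are a common alternative when that becomes a bottleneck.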

Simplifying Disaster Recovery with Delta Lake

There's a need to develop a recovery process for Delta tables in a disaster recovery (DR) scenario. Cloud multi-region sync is asynchronous, and this type of replication does not guarantee the chronological order of files at the target (DR) region; in some cases, large files arrive later than small files. With Delta Lake, this can leave an incomplete version at the DR site at the point of the break. The assumption is that the primary (production) site is not reachable, so the incomplete version of the Delta Lake table must be identified and fixed. A similar scenario arises with RDBMS replication: those systems rely on their logs to restore the database to a stable version and then run the recovery or reload process. This talk will address this need and look for a solution that can be shared with customers.
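
A simplified way to identify the last usable version at the DR site is to walk the _delta_log and keep the highest commit whose added data files have all arrived. The sketch below is only an illustration of that idea: it assumes the replicated table is reachable on a mounted or local path and ignores checkpoints; a production check would use the cloud object-store SDK.

```python
import json
import os
from glob import glob

# Hypothetical path to the replicated Delta table at the DR site.
table_path = "/mnt/dr/events"
log_dir = os.path.join(table_path, "_delta_log")

def version_is_complete(commit_file):
    """A commit is usable only if every data file it adds has arrived."""
    with open(commit_file) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                data_file = os.path.join(table_path, action["add"]["path"])
                if not os.path.exists(data_file):
                    return False
    return True

commits = sorted(glob(os.path.join(log_dir, "*.json")))
last_good = None
for commit in commits:
    if version_is_complete(commit):
        last_good = int(os.path.basename(commit).split(".")[0])
    else:
        break  # later versions depend on this one, so stop here

print(f"Last fully replicated version: {last_good}")
```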

Real-Time Forecasting at Scale using Delta Lake and Delta Caching

GumGum receives around 30 billion programmatic inventory impressions, amounting to 25 TB of data, each day. An inventory impression is the real estate available to show potential ads on a publisher page. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables account managers to set up successful future campaigns. This talk will highlight the data pipelines and architecture that help the company achieve a forecast response time of less than 30 seconds at this scale.

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Recently, a set of modern table formats, such as Delta Lake, Hudi, and Iceberg, has emerged. Together with the Hive Metastore, these table formats aim to solve problems that have long plagued traditional data lakes, with declared features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption. This talk shares the research we did comparing the key features and designs these table formats hold and the maturity of those features, such as the APIs exposed to end users and how they work with compute engines, and closes with a comprehensive benchmark of transactions, upserts, and massive partitions as a reference for the audience.

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes in a timely manner to external storage for real-time OLAP, such as Delta or Kudu. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether it is easy to build for a variety of databases with little code. This talk shares our practice for simplifying CDC pipelines with Spark Streaming SQL and Delta Lake.
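
A common way to apply such change logs is a foreachBatch MERGE keyed on the source primary key, branching on the operation type. A minimal sketch, with hypothetical table names and an op column of I/U/D; a real pipeline would also deduplicate to the latest change per key within each micro-batch.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Apply one micro-batch of parsed binlog rows to the target Delta table.
def apply_cdc(batch_df, batch_id):
    target = DeltaTable.forName(spark, "orders_silver")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.op = 'D'")
        .whenMatchedUpdateAll(condition="s.op = 'U'")
        .whenNotMatchedInsertAll(condition="s.op IN ('I', 'U')")
        .execute())

# Binlog events are assumed to land first in a raw Delta table.
(spark.readStream.table("orders_cdc_raw")
    .writeStream
    .foreachBatch(apply_cdc)
    .option("checkpointLocation", "/tmp/checkpoints/orders_cdc")
    .start())
```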

Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with Delta Lake

Columbia is a data-driven enterprise, integrating data from all line-of-business systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts. It also includes analyzing product reviews to increase customer satisfaction. In this presentation, we'll walk through how we achieved a 70% reduction in pipeline creation time and reduced ETL workload times from four hours with previous data warehouses to minutes using Azure Databricks, enabling near real-time analytics. We migrated from multiple legacy data warehouses, run by individual lines of business, to a single scalable, reliable, performant data lake on top of Azure and Delta Lake.

Building Data Quality Audit Framework using Delta Lake at Cerner

Cerner needs to know what assets it owns, where they are located, and the status of those assets. A configuration management system is an inventory of IT assets such as servers, network devices, storage arrays, and software licenses. There was a need to bring all the data sources into one place so that Cerner has a single source of truth for configuration, which gave birth to a data platform called Beacon. Bad data quality has significant business costs in time, effort, and accuracy; poor-quality data is often pegged as the source of operational snafus, inaccurate analytics, and ill-conceived business strategies. In our case, because configuration data is largely used to make decisions about security, incident management, cost analysis, and more, gaps in the data caused downstream impact. To handle data quality issues, Databricks and Delta Lake were introduced at the helm of the data pipeline architecture. In this talk we'll describe the journey behind building an end-to-end pipeline, conformed to industry CI/CD standards, from data ingestion and processing to reporting and machine learning, and how Delta Lake plays a vital role in not only catching data issues but making the solution scalable and reusable for other teams. We'll also talk about the challenges we faced along the way and the lessons we learned.
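
One building block such a framework can lean on is Delta Lake's CHECK constraints, which reject bad writes at the source, paired with simple audit queries over existing data. A small sketch, where the beacon.config_items table, its columns, and the constraint are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writes violating this constraint fail instead of silently landing bad rows.
spark.sql("""
  ALTER TABLE beacon.config_items
  ADD CONSTRAINT valid_status CHECK (status IN ('active', 'retired', 'unknown'))
""")

# A lightweight audit query can still report existing gaps for a quality dashboard.
audit = spark.sql("""
  SELECT count_if(asset_id IS NULL) AS missing_asset_id,
         count_if(location IS NULL) AS missing_location
  FROM beacon.config_items
""")
audit.show()
```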

Powering Interactive BI Analytics with Presto and Delta Lake

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity over the last few years in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores. Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in object storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines; in particular, Starburst developed a native Presto integration that leverages Delta-specific performance optimizations. In this talk we show how combining Presto, Spark Streaming, and Delta Lake in one architecture supports highly concurrent and interactive BI analytics. Furthermore, Presto enables query-time correlations between S3-based IoT data, customer data in a legacy Oracle database, and web log data in Elasticsearch.

How Starbucks is Achieving Enterprise Data and ML at Scale

Starbucks makes sure that everything we do is through the lens of humanity - from our commitment to the highest quality coffee in the world, to the way we engage with our customers and communities to do business responsibly. A key aspect of ensuring those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks, which helps make decisions powered by data at tremendous scale. This includes everything from processing data at petabyte scale with governed processes, to deploying platforms at the speed of business, to enabling ML across the enterprise. In this session, Vish Subramanian will detail how Starbucks has built world-class enterprise data platforms to drive world-class customer experiences.

Realizing the Vision of the Data Lakehouse

This keynote by Databricks CEO Ali Ghodsi explains why the open source Delta Lake project takes the industry closer to realizing the full potential of the data lakehouse, including new capabilities within the Databricks Unified Data Analytics platform to significantly accelerate performance. In addition, Ali will announce new open source capabilities to collaboratively run SQL queries against your data lake, build live dashboards, and alert on important changes to make it easier for all data teams to analyze and understand their data.

Building a Better Delta Lake with Talend and Databricks

Since the introduction of Delta Lake last year, the well-tested pattern of building out a bronze, silver, and gold data architecture has proven useful. This session will review how to use Talend Data Fabric to accelerate the development of a Delta Lake using highly productive, scalable, and enterprise-ready data flow tools. Covered in this session are demonstrations of ingesting 'Bronze' data, refining 'Silver' data tables, and performing feature engineering for 'Gold' tables.

Slowly Changing Dimensions (SCD) Type 2

We will discuss a popular online analytical processing (OLAP) fundamental - slowly changing dimensions (SCD) - specifically Type 2. As we have discussed in various other Delta Lake tech talks, the reliability brought to data lakes by Delta Lake has driven a resurgence of many data warehousing fundamentals, such as Change Data Capture, in data lakes. Type 2 SCD within data warehousing allows you to keep track of both historical and current data over time. We will discuss how to apply these concepts to your data lake within the context of market segmentation for a climbing eCommerce site.
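
A common Type 2 pattern with Delta Lake is a two-step flow: close out the current row for every changed key, then append the new versions. A minimal sketch, where the table names, the tracked address attribute, and the is_current/valid_from/valid_to columns are all illustrative:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: one staging row per customer_id, dimension keeps history.
updates = spark.table("staging_customers")
current = spark.table("dim_customer").where("is_current = true")

# Rows that are brand new or whose tracked attribute actually changed.
changed = (updates.alias("s")
    .join(current.alias("t"),
          F.col("s.customer_id") == F.col("t.customer_id"), "left")
    .where("t.customer_id IS NULL OR t.address <> s.address")
    .select("s.*"))

# Step 1: close out the current version of every changed customer.
dim = DeltaTable.forName(spark, "dim_customer")
(dim.alias("t")
    .merge(changed.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "valid_to": "current_date()"})
    .execute())

# Step 2: append the new versions as the current rows.
(changed
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_date())
    .withColumn("valid_to", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("dim_customer"))
```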

Data and AI Talk with Databricks Co-Founder, Matei Zaharia

We are happy to have Matei Zaharia join this month's Data and AI Talk. Matei Zaharia is an assistant professor at Stanford CS, where he works on computer systems and machine learning as part of Stanford DAWN. He is also co-founder and Chief Technologist of Databricks, the data and AI platform startup. During his Ph.D., Matei started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing. He also co-started other widely used data and AI software such as MLflow, Apache Mesos, and Spark Streaming. Join this great session where we will discuss these technologies as well as answer selected questions posted in the comment section.

Addressing GDPR and CCPA Scenarios with Delta Lake and Apache Spark™

Your organization may manage hundreds of terabytes worth of personal information in your cloud. Bringing these datasets into GDPR and CCPA compliance is of paramount importance, but this can be a big challenge, especially for larger datasets stored in data lakes. Learn how you can use Delta Lake, created by Databricks and powered by Apache Spark™, to manage GDPR and CCPA compliance for your data lake. Because Delta Lake adds a transactional layer that provides structured data management on top of your data lake, it can dramatically simplify and accelerate your ability to locate and remove personal information (also known as "personal data") in response to consumer GDPR or CCPA requests without disrupting your data pipelines.
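
In practice, the removal itself can be a couple of lines: a transactional DELETE followed by VACUUM to purge the old data files that time travel would otherwise retain. A minimal sketch with hypothetical paths and identifiers:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and user id for a "right to be forgotten" request.
users = DeltaTable.forPath(spark, "/mnt/lake/users")

# Remove the consumer's records; the change is a single ACID transaction.
users.delete("user_id = '12345-abcde'")

# The deleted rows still exist in older data files kept for time travel.
# VACUUM physically removes files beyond the retention window (here 7 days),
# which is what ultimately erases the personal data from storage.
users.vacuum(168)  # retention in hours
```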

Predictive Maintenance (PdM) on IoT Data for Early Fault Detection w/ Delta Lake

Predictive maintenance (PdM) differs from other routine or time-based maintenance approaches in that it combines various sensor readings with sophisticated analytics over thousands of logged events in near real time, and it promises severalfold improvements in cost savings because tasks are performed only when warranted. The top industries leading the IoT revolution include manufacturing, transportation, utilities, healthcare, consumer electronics, and cars. The global market for this is expected to grow at a CAGR of 28%. PdM plays a key role in Industry 4.0, helping corporations not only reduce unplanned downtime but also improve productivity and safety. The collaborative Data and Analytics platform from Databricks is a great technology fit for these use cases, providing a single unified platform to ingest the sensor data, perform the necessary transformations and exploration, run ML, and generate valuable insights.

Diving into Delta Lake Part 2: Enforcing and Evolving the Schema

As business problems and requirements evolve over time, so too does the structure of your data. With Delta Lake, as the data changes, incorporating new dimensions is easy. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, as well as schema evolution, which enables them to automatically add new columns of rich data when those columns belong. In this webinar, we'll dive into the use of these tools.
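
The knob at the center of this is the mergeSchema write option: leave it off and schema enforcement rejects unexpected columns; turn it on and the table's schema evolves to absorb them. A small sketch with hypothetical paths and an illustrative new column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical events table; the extra "country" column is added for illustration.
events = spark.read.format("delta").load("/mnt/lake/events")
new_batch = events.limit(10).withColumn("country", F.lit("US"))

# Without the option, Delta's schema enforcement rejects the extra column.
# With mergeSchema, the table's schema evolves to include it.
(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/lake/events"))
```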

Simplify and Scale Data Engineering Pipelines with Delta Lake

A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion ("Bronze" tables), transformation/feature engineering ("Silver" tables), and machine learning training or prediction ("Gold" tables). Combined, we refer to these tables as a "multi-hop" architecture. It allows data engineers to build a pipeline that begins with raw data as a "single source of truth" from which everything flows. In this session, we will show how to build a scalable data engineering data pipeline using Delta Lake.
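
As a rough illustration of the multi-hop flow, here is a minimal Structured Streaming sketch that lands raw data in Bronze, cleans it into Silver, and aggregates it into Gold; all paths and the sensor schema are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land raw JSON files as-is.
bronze = (spark.readStream.format("json")
          .schema("device_id STRING, temp DOUBLE, ts TIMESTAMP")
          .load("/mnt/raw/sensors"))
(bronze.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/chk/bronze")
    .start("/mnt/lake/bronze/sensors"))

# Silver: clean and enrich the bronze stream.
silver = (spark.readStream.format("delta").load("/mnt/lake/bronze/sensors")
          .where("temp IS NOT NULL")
          .withColumn("date", F.to_date("ts")))
(silver.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/chk/silver")
    .start("/mnt/lake/silver/sensors"))

# Gold: aggregate per device per day for downstream ML and BI.
gold = (spark.readStream.format("delta").load("/mnt/lake/silver/sensors")
        .groupBy("device_id", "date")
        .agg(F.avg("temp").alias("avg_temp")))
(gold.writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/mnt/chk/gold")
    .start("/mnt/lake/gold/daily_temps"))
```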

Threat Detection and Response at Scale (Dominique Brezinski & Michael Armbrust)

We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove the Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This talk is about the fruit of that collaboration.