Salesforce Engineering | Delta Lake Tech Talk Series

March 2, 2021
Denny Lee

We are happy to announce the Salesforce Engineering Delta Lake Tech Talk Series for March and April 2021.


Part 1: Engagement Activity Delta Lake

Watch Now | March 18th, 2021 10am PDT

In part one, we’ll talk about how we built the engagement activity Delta Lake to support Einstein Analytics for creating powerful reports and dashboards and Sales Cloud Einstein for training machine learning models. At Salesforce, our customers use High Velocity Sales to intelligently convert leads and create new opportunities. To support it, we built the engagement activity platform, which automatically captures and stores user engagement activities using Delta Lake and is one of the key components supporting both products. We will cover how to:
  • Ingest the data
  • Read incrementally
  • Support exactly-once writes across tables
  • Handle mutations with cascading changes
  • Normalize tables in the data lake
For more background in preparation for this session, please refer to Engagement Activity Delta Lake.


Part 2: Boost Delta Lake Performance with Data Skipping and Z-Order

Watch Now | April 1st, 2021 9am PDT

When building a data lake, the partitioning strategy is one of the most critical decisions to make. A poorly optimized partitioning strategy can generate small files and undermine read and write performance. Besides traditional file-based partitioning with partition pruning, Databricks provides another option: Data Skipping and Z-Ordering, with I/O pruning and file compaction. In this talk, we will share how our thinking on partitioning strategy evolved while building the Engagement Delta Lake. Using this real-world use case, we will explain why and how we leverage Data Skipping and Z-Ordering to boost Delta Lake performance.

For more background in preparation for this session, please refer to Boost Delta Lake Performance with Data Skipping and Z-Order.
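To give a feel for the data-skipping idea mentioned above, here is a minimal sketch in plain Python. It is not the Databricks implementation; the `FileStats` class, the `files_to_scan` helper, and the file names are all hypothetical. It only illustrates the general principle: if each file records the minimum and maximum value of a column, a range query can skip every file whose range does not overlap the predicate, and clustering related values into the same files (which is what Z-Ordering aims for) makes those ranges narrow enough to skip most files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max statistics for one column."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lo, hi):
    """Return paths of files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Clustered layout: each file covers a narrow, distinct range of values.
clustered = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]

# Unclustered layout: every file spans almost the whole value range.
unclustered = [
    FileStats("part-0", 0, 297),
    FileStats("part-1", 1, 298),
    FileStats("part-2", 2, 299),
]

# Query: values between 120 and 130.
print(files_to_scan(clustered, 120, 130))    # only one file needs to be read
print(files_to_scan(unclustered, 120, 130))  # no file can be skipped
```

With the clustered layout only one file is read, while the unclustered layout forces a scan of all three, even though both hold the same number of files.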


Part 3: Global Synchronization and Ordering in Delta Lake

Watch Now | April 15th, 2021 9am PDT

One of the great features provided by Delta Lake is ACID transactions. This feature is critical for maintaining data integrity when multiple independent write streams modify the same Delta table. Running this in the real world, we observed frequent Conflicting Commits errors that failed our pipeline. We realized that, while ACID transactions maintain data integrity, there is no mechanism to resolve write conflicts. In this talk, we share a solution that ensures global synchronization and ordering of multiple process streams that perform concurrent writes to a shared Delta Lake. With this mechanism, we greatly improved our pipeline stability by eliminating Conflicting Commits errors while maintaining data integrity.

For more background in preparation for this session, please refer to Global Synchronousness and Ordering in Delta Lake.
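The abstract above does not describe the mechanism itself, so the following is only a toy illustration of the underlying problem, not the solution presented in the talk. It models a table with a single version counter and a deliberately simplified conflict rule (a commit fails if the table changed since the writer read it; real Delta Lake conflict detection is more fine-grained). `ToyTable`, `ConflictingCommit`, and both `run_*` helpers are hypothetical names used for illustration.

```python
class ConflictingCommit(Exception):
    """Raised when a writer commits against a stale table version."""


class ToyTable:
    """Toy stand-in for a table with optimistic concurrency control."""

    def __init__(self):
        self.version = 0
        self.log = []

    def commit(self, read_version, writer):
        # Simplified rule: reject the commit if the table has moved on
        # since this writer took its snapshot.
        if read_version != self.version:
            raise ConflictingCommit(f"{writer} read v{read_version}, table is at v{self.version}")
        self.version += 1
        self.log.append(writer)


def run_unsynchronized(table, writers):
    """All writers snapshot the same version, then commit one after another."""
    snapshots = {w: table.version for w in writers}
    failed = []
    for w in writers:
        try:
            table.commit(snapshots[w], w)
        except ConflictingCommit:
            failed.append(w)
    return failed


def run_serialized(table, writers):
    """Each writer reads and commits in turn, as if holding a global lock."""
    failed = []
    for w in writers:
        try:
            table.commit(table.version, w)
        except ConflictingCommit:
            failed.append(w)
    return failed


print(run_unsynchronized(ToyTable(), ["stream-a", "stream-b"]))  # second writer fails
print(run_serialized(ToyTable(), ["stream-a", "stream-b"]))      # no failures
```

In this toy model, the table stays consistent in both runs, but only the serialized run lets every writer succeed, which is the gap between guaranteeing integrity and resolving conflicts that the abstract points to.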


Part 4: Continuous Integration and Continuous Delivery with Delta Lake

Watch Now | April 29th, 2021 9am PDT

As we build our Engagement Delta Lake on Databricks Workspace, one of the challenges is how to automate the integration testing of our Spark jobs in the CI/CD pipeline. We came up with two designs to tackle the challenge: Namespace Deployment and Scenario Based Testing. In this talk, we will discuss the rationale and implementation of the two designs.


Speakers

Zhidong Ke
Software Engineer PMTS, Salesforce
Zhidong is passionate about designing distributed systems, real-time/batch data processing, and building applications.

Heng Zhang
Software Engineering PMTS, Salesforce
Heng is a software engineer who is interested in and specializes in microservices, distributed systems, and big data.

Panelists

Aaron Zhang
Software Engineering PMTS, Salesforce
Aaron is an experienced software engineering leader whose interests and areas of focus include engineering secure, fault-tolerant, high-volume systems built on microservices.

Yifeng Liu
Software Engineer LMTS, Salesforce
Yifeng is a software engineer with extensive experience in big data processing and distributed systems, and an interest in building high-volume, high-complexity, low-latency data pipelines and frameworks.

Craig Ng
Solution Architect, Databricks

Chris Hoshino-Fish
Sr. Solution Architect, Databricks

Denny Lee
Staff Developer Advocate, Databricks


Join the Delta Lake Community

Communicate with fellow Delta Lake users and contributors, ask questions and share tips.

Project Governance

Delta Lake is an independent open-source project and is not controlled by any single company. To emphasize this, we joined the Delta Lake Project in 2019, which is a sub-project of the Linux Foundation Projects.

 

Within the project, we make decisions based on these rules.

Copyright © 2020 Delta Lake, a Series of LF Projects, LLC. For web site terms of use, trademark policy and other project policies please see https://lfprojects.org.