
How to use Delta Lake generated columns

This blog post describes how to create Delta Lake tables with generated columns, which are a great way to automatically populate columns based on other column values.

You will learn when generated columns are useful in your data workflows.

All the computations in this post are in this notebook in case you’d like to follow along and run the code on your local machine.

Create Delta Lake table with generated columns

In this section, we’ll create a Delta Lake table with id, first_name, last_name, age, and full_name columns.

The id, first_name, last_name, and age columns will be supplied by the user when appending data to the table.

The full_name column will be generated by Delta Lake when data is appended to the table: it simply concatenates the first_name and last_name columns, separated by a space.

Here’s how we can create the empty Delta table with Python:

from delta import DeltaTable

(
    DeltaTable.create(spark)
    .tableName("default.some_people")
    .addColumn("id", "LONG")
    .addColumn("first_name", "STRING")
    .addColumn("last_name", "STRING")
    .addColumn("age", "LONG")
    .addColumn(
        "full_name", "STRING", generatedAlwaysAs="concat(first_name, ' ', last_name)"
    )
    .execute()
)
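
If you prefer SQL, here's a rough equivalent of that table definition using the GENERATED ALWAYS AS syntax. Treat this as a sketch: support for this DDL depends on your Delta Lake version, and the some_people_sql table name is just a placeholder so it doesn't clash with the table we created above.

spark.sql("""
    CREATE TABLE default.some_people_sql (
        id BIGINT,
        first_name STRING,
        last_name STRING,
        age BIGINT,
        full_name STRING GENERATED ALWAYS AS (concat(first_name, ' ', last_name))
    ) USING DELTA
""")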

Let’s show the Delta table and make sure it’s empty:

spark.sql("select * from some_people").show()

+---+----------+---------+---+---------+
| id|first_name|last_name|age|full_name|
+---+----------+---------+---+---------+
+---+----------+---------+---+---------+

Great, now let’s see how to append data to this table and take advantage of the generated column feature.

Insert into Delta Lake table with generated columns

Let’s append data to the Delta Lake table. We’ll append a DataFrame that has id, first_name, last_name, and age columns. This will let us observe how Delta Lake automatically generates the full_name column.

df = spark.createDataFrame(
    [
        (0, "Bob", "Loblaw", 23),
        (1, "Sue", "Grafton", None),
        (2, "Jim", "Carrey", 61),
    ]
).toDF("id", "first_name", "last_name", "age")

df.write.mode("append").format("delta").saveAsTable("some_people")

Now query the table and verify that the full_name column was populated by Delta Lake:

DeltaTable.forName(spark, "some_people").toDF().show()

+---+----------+---------+----+-----------+
| id|first_name|last_name| age|  full_name|
+---+----------+---------+----+-----------+
|  2|       Jim|   Carrey|  61| Jim Carrey|
|  0|       Bob|   Loblaw|  23| Bob Loblaw|
|  1|       Sue|  Grafton|null|Sue Grafton|
+---+----------+---------+----+-----------+

Let’s look at how Delta Lake stores the full_name column in Delta tables and the associated performance/storage implications.

How Delta Lake stores data for generated columns

Delta Lake persists the generated column data in storage. The column isn’t computed on the fly when the data is read. The data is computed and persisted when DataFrames are appended to storage.

Let’s refresh our memory on the high-level structure of a Delta table:

Delta tables consist of data stored in Parquet files and metadata stored in the transaction log.

The generated column data is persisted in the underlying Parquet files, just like the data for the other columns.
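
You can see this for yourself by peeking at the table's storage location. Here's a small sketch for a local setup (paths and file names will differ on your machine) that uses DESCRIBE DETAIL to find the table directory, which contains the Parquet data files alongside the _delta_log folder:

import os

# DESCRIBE DETAIL returns one row of table metadata, including the storage location
location = spark.sql("DESCRIBE DETAIL some_people").collect()[0]["location"]

# list the table directory: Parquet data files plus the _delta_log folder
print(os.listdir(location.replace("file:", "")))

Example output (file names will differ per run):

['_delta_log', 'part-00000-...-c000.snappy.parquet', ...]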

Persisting data in storage adds to storage costs: the generated column is an extra column in your Delta table that you need to pay to store.

Persisting data generally saves on compute costs because you don’t need to compute the full_name value anymore when reading data. You can just read the value directly.

Compute is generally more expensive than storage, so strategically using generated columns to save on compute should save you money overall.

Let’s now turn our attention to how generated columns interact with Delta table schema evolution.

How Delta Lake generated columns work with schema evolution

When Delta Lake schema evolution is enabled, you can append DataFrames that have missing or extra columns to Delta tables; see this blog post for more details.

Once generated columns are involved, certain columns become required and schema evolution doesn't behave as usual. On a Delta table without any generated columns, schema evolution would let you omit the first_name and last_name columns from an appended DataFrame. Here, the full_name column depends on first_name and last_name, so both are required even when schema evolution is enabled.

Let’s see what happens when you try to append a DataFrame without the last_name column to the Delta table.

df = spark.createDataFrame(
    [
        (8, "Liam", 66),
        (9, "Colin", 77),
    ]
).toDF("id", "first_name", "age")

df.write.option("mergeSchema", "true").mode("append").format("delta").saveAsTable(
    "some_people"
)

Here’s the error message you’ll get:

AnalysisException: Column 'last_name' does not exist. Did you mean one of the following? [first_name, age, id]; line 1 pos 24;
'Project [id#2335L AS id#2391L, first_name#2336 AS first_name#2392, age#2337L AS age#2393L, 'concat(first_name#2336,  , 'last_name) AS full_name#2394]
+- Project [_1#2329L AS id#2335L, _2#2330 AS first_name#2336, _3#2331L AS age#2337L]
   +- LogicalRDD [_1#2329L, _2#2330, _3#2331L], false
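
By contrast, omitting a column that the generation expression doesn't reference is fine. For example, an append like the following sketch, which leaves out age, should go through, with age filled in as null and full_name still generated from first_name and last_name. We'll leave it unrun here so the query results later in the post stay the same (the row values are just placeholders):

df = spark.createDataFrame(
    [
        (10, "Ana", "Lopez"),
    ]
).toDF("id", "first_name", "last_name")

df.write.mode("append").format("delta").saveAsTable("some_people")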

Now let’s see what happens if you try to append a DataFrame with values that should be generated.

What happens when users supply columns that should be generated?

Delta Lake will reject the transaction if you try to append values that the Delta table should be generating for you. Let’s try to append a DataFrame that contains the full_name column to the Delta table.

Start by creating the DataFrame:

df = spark.createDataFrame([
    (21, "Curtis", "Jackson", 47, "50 cent"),
    (22, "Eric", "Wright", None, "easy-e"),
]).toDF("id", "first_name", "last_name", "age", "full_name")

Now try to append it to the Delta table:

df.write.mode("append").format("delta").saveAsTable("some_people")

Delta Lake rejects this transaction and provides the following error message:

Py4JJavaError: An error occurred while calling o234.saveAsTable.
: org.apache.spark.sql.delta.schema.DeltaInvariantViolationException: CHECK constraint Generated Column (full_name <=> concat(first_name, ' ', last_name)) violated by row with values:
 - first_name : Eric
 - full_name : easy-e
 - last_name : Wright

Delta Lake rightfully rejects this transaction.
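
Note that the CHECK constraint only fires when the supplied value disagrees with the generation expression. If you supply a full_name that exactly matches concat(first_name, ' ', last_name), the append is accepted. Here's a sketch of such an append (also left unrun so the outputs below stay unchanged):

df = spark.createDataFrame([
    (23, "Curtis", "Jackson", 47, "Curtis Jackson"),
]).toDF("id", "first_name", "last_name", "age", "full_name")

df.write.mode("append").format("delta").saveAsTable("some_people")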

What happens when generated columns depend on columns with null values?

Let’s see what happens when you try to append a DataFrame with first_name or last_name columns that are null to the Delta table.

df = spark.createDataFrame(
    [
        (44, None, "Perkins", 20),
        (55, "Li", None, 30),
    ]
).toDF("id", "first_name", "last_name", "age")

df.write.mode("append").format("delta").saveAsTable("some_people")

View the contents of the Delta table:

spark.sql("select * from some_people").show()

+---+----------+---------+----+-----------+
| id|first_name|last_name| age|  full_name|
+---+----------+---------+----+-----------+
|  2|       Jim|   Carrey|  61| Jim Carrey|
|  0|       Bob|   Loblaw|  23| Bob Loblaw|
|  1|       Sue|  Grafton|null|Sue Grafton|
| 44|      null|  Perkins|  20|       null|
| 55|        Li|     null|  30|       null|
+---+----------+---------+----+-----------+

In this case, full_name is null whenever first_name or last_name is null, which is consistent with how the PySpark concat function handles null inputs.
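
This mirrors how concat behaves outside of Delta Lake too. Here's a quick standalone sketch showing that PySpark's concat returns null whenever any of its inputs is null:

from pyspark.sql.functions import col, concat, lit

spark.createDataFrame(
    [("Sue", "Grafton"), ("Li", None)], ["first_name", "last_name"]
).select(
    concat(col("first_name"), lit(" "), col("last_name")).alias("full_name")
).show()

+-----------+
|  full_name|
+-----------+
|Sue Grafton|
|       null|
+-----------+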

Delta Lake generated columns: Conclusion

Delta Lake allows you to create Delta tables with generated columns that are automatically computed based on other column values and are persisted in storage.

Generated columns are a great way to automatically and consistently populate columns in your Delta table. You don't need to manually add the column to your DataFrames before appending once generated columns are set up.

Generated columns can only be added to Delta tables that are newly created or replaced. You can’t add a generated column to an existing Delta table.

Persisting generated columns in storage can be a great way to save on compute and make your queries run faster. You don’t need to compute values when reading data if the computations have already been run beforehand and are persisted in storage.

Take a look at the documentation on generated columns to learn more about the types of expressions that are supported by generated columns.
