PyData Yerevan 2022

Building a Lakehouse data platform using Delta Lake, PySpark, and Trino
08-13, 14:30–15:10 (Asia/Yerevan), 213W PAB

In this talk I would like to present the concept of the Lakehouse, a novel architecture that combines the capabilities and resolves the problems of the classical Data Warehouse and Data Lake. I will talk about the Delta Lake table format that sits at the core of the Lakehouse. I will demonstrate how Delta Lake integrates with Apache Spark to build data ingestion pipelines, and how it integrates with Apache Trino to provide a fast SQL-based serving layer. Finally, I will bring all these components together to describe how they enable a modern big data platform. This talk will be useful for an intermediate-level audience of data engineers, data analysts, and data scientists.
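
To make the ingestion side concrete, here is a minimal sketch of writing and reading a Delta table with PySpark. It assumes the pyspark and delta-spark packages are installed and uses a local path; the application name, path, and sample data are illustrative, and on a real platform the path would point at object storage such as an s3a:// location.

    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    # Configure a local Spark session with the Delta Lake extensions enabled.
    builder = (
        SparkSession.builder.appName("lakehouse-ingestion-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Append a small batch of events to a Delta table; every write is an ACID
    # transaction recorded in the table's transaction log.
    events = spark.createDataFrame(
        [(1, "signup"), (2, "purchase")], ["user_id", "event_type"]
    )
    events.write.format("delta").mode("append").save("/tmp/lakehouse/events")

    # Read the table back like any other Spark data source.
    spark.read.format("delta").load("/tmp/lakehouse/events").show()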


In this talk I would like to present the concept of the Lakehouse, a novel architecture that combines the capabilities and resolves the problems of the classical Data Warehouse and Data Lake. At the core of the Lakehouse architecture is Delta Lake, an open source table format based on Apache Parquet. Delta Lake provides ACID transactions and enables concurrent reading and writing of arbitrarily large datasets on top of cloud object storage such as AWS S3. The second foundational component of this architecture is Apache Spark, in particular its PySpark API, which allows building both batch and streaming ingestion pipelines. The third piece is Apache Trino (formerly PrestoSQL), which adds a fast SQL-based serving layer and now natively supports Delta Lake tables. Each of the three components described here, Delta Lake, Spark, and Trino, is a mature open source project with a strong community and a roadmap of new features. Built on top of them, the Lakehouse architecture can serve as the foundation of a big data platform for a company of any size.
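
As an illustration of the serving layer, below is a minimal sketch of querying a Delta table from Python through Trino. It assumes a Trino cluster with the Delta Lake connector configured as a catalog named "delta" and the trino Python client installed; the host, user, schema, and table names are hypothetical.

    import trino

    # Connect to a Trino coordinator; host, user, catalog, and schema depend on
    # how the cluster and the Delta Lake connector are set up.
    conn = trino.dbapi.connect(
        host="trino.example.com",
        port=8080,
        user="analyst",
        catalog="delta",      # catalog backed by Trino's Delta Lake connector
        schema="analytics",
    )
    cur = conn.cursor()

    # Query the same Delta table that the PySpark ingestion jobs write.
    cur.execute("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type")
    for event_type, cnt in cur.fetchall():
        print(event_type, cnt)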

This talk is for an intermediate-level audience of data engineers, data analysts, and data scientists, though beginners will also be able to pick up many useful ideas. The main takeaway is the high-level architecture of the Lakehouse, together with a number of details on building PySpark jobs. Prior knowledge of PySpark and the Parquet file format is ideal but not critical; knowing Python should be good enough. Rough time breakdown: 7 minutes to introduce the Lakehouse, 7 minutes to introduce Delta Lake, 7 minutes to show the integration with PySpark, 3 minutes to show the integration with Trino, 3 minutes to wrap up and summarize, and 3 minutes for questions from the audience.


Prior Knowledge Expected: Previous knowledge expected

I am a Software Engineer with 10 years of professional experience in data and backend engineering. During my career I have contributed to the design and development of various data systems, such as data lakes, data meshes, lakehouses, and batch and streaming data pipelines. My interests include software architecture principles, programming patterns, distributed data processing frameworks, distributed storage systems, file/table formats, queueing systems, databases, resource managers, schedulers, metastores, and cloud services. I hold a Specialist degree in Applied Mathematics and Computer Science and a Master's degree in Computer Science. I speak English, German, and Russian.