PyData Yerevan 2022

Exploratory Data Analysis and Feature Engineering with PySpark
08-13, 16:15–16:55 (Asia/Yerevan), 214W PAB

Learn more about handling Big Data with PySpark and why PySpark might become your go-to framework for Exploratory Data Analysis (EDA) and Feature Engineering (FE)! This talk will show the main differences between the Pandas and PySpark frameworks and outline the advantages of performing EDA and FE with PySpark. It will be most beneficial for Data Scientists, Data Analysts, and other attendees who perform data exploration and wrangling. They will learn about handling Big Data with PySpark, its functionalities for EDA and FE, and means of visualising the results. This talk is a good fit for a Beginner to Intermediate level audience with prior experience in Python and SQL.


The objective of the talk is to familiarise a broader audience of data scientists and analysts with the PySpark functionalities that can be used to perform Exploratory Data Analysis and Feature Engineering more efficiently than with Pandas. The talk will start with a general introduction to PySpark (5 min), followed by the main part (15 min) focusing on PySpark characteristics and functionalities. This main part will cover topics such as handling big datasets, exploring and analysing data, and constructing new features. While doing so, it will elaborate on PySpark-specific topics such as the characteristics of the execution model (lazy evaluation and computation DAGs), reading and writing data, and using SparkSQL. Finally, there will be an outline of the differences between Pandas and PySpark as frameworks for performing EDA and FE (5 min). The last few minutes (5 min) will be used for the wrap-up and questions from the audience. The key takeaway of the talk will be a clear picture of why and how to use PySpark for EDA and FE, and a sparked curiosity to try it out.


Prior Knowledge Expected

Previous knowledge expected

I am an Applied Scientist, currently working at Zalando (Berlin, Germany). Throughout my career I have contributed to the design and development of several machine learning projects, spanning from product localisation to optimising onsite marketing campaigns. In the past I also conducted independent research in the areas of natural language processing and statistical modelling. I have a strong interest in building complex systems using Python and Spark, as well as cloud technologies such as AWS. I am a career changer, with a background in humanities and music.