PyData Yerevan 2022

08:30
08:30
60min
Registration
Manookian Hall
08:30
60min
Registration
113W PAB
08:30
60min
Registration
114W PAB
08:30
60min
Registration
213W PAB
08:30
60min
Registration
214W PAB
09:30
09:30
30min
Opening Notes
Manookian Hall
10:00
10:00
60min
Keynote Speech by Fritz Obermeyer
Manookian Hall
11:00
11:00
15min
Coffee Break
113W PAB
11:00
15min
Coffee Break
114W PAB
11:00
15min
Coffee Break
213W PAB
11:00
15min
Coffee Break
214W PAB
11:15
11:15
40min
Accelerated Data Science Libraries
Ashot Vardanian

Everyone knows and uses Pandas, NumPy and NetworkX, but is there something better? Something equally easy to use, but hopefully with more features, or, more importantly, higher performance!

114W PAB
11:15
40min
Grover’s quantum search for data science and why should we care
Tigran Sedrakyan

Among the most prominent achievements of the quantum computing field is an algorithm known as Grover’s quantum search. This talk focuses on Grover’s algorithm and its applications to machine learning routines. Prior knowledge required is a basic understanding of linear algebra and computer science, and familiarity with the concepts of machine learning.
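
As a flavor of the idea (an illustrative sketch, not material from the talk), here is a minimal two-qubit Grover iteration written with Qiskit, assuming the marked item is |11>; for two qubits a single oracle-plus-diffusion step finds the marked state with probability close to one.

    from qiskit import QuantumCircuit
    from qiskit.quantum_info import Statevector

    qc = QuantumCircuit(2)
    qc.h([0, 1])        # uniform superposition over all 4 basis states
    qc.cz(0, 1)         # oracle: phase-flip the marked state |11>
    qc.h([0, 1])        # diffusion operator (inversion about the mean)
    qc.z([0, 1])
    qc.cz(0, 1)
    qc.h([0, 1])

    # After one Grover iteration on 2 qubits, |11> is measured with certainty.
    print(Statevector.from_instruction(qc).probabilities_dict())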

213W PAB
11:15
40min
NVIDIA NeMo: toolkit for conversational AI
Alex Laptev

This talk introduces NeMo: NVIDIA's open-source toolkit for conversational AI that provides a wide collection of models for automatic speech recognition, text-to-speech, natural language processing and neural machine translation.
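
For a taste of the toolkit (a minimal sketch, not from the talk), loading one of NeMo's pretrained English ASR checkpoints and transcribing a local WAV file might look roughly like this; the model name and file path are placeholders, and the exact API may differ between NeMo versions.

    # pip install nemo_toolkit[asr]  -- a heavy dependency, GPU recommended
    import nemo.collections.asr as nemo_asr

    # Download a pretrained English CTC model (model name is illustrative).
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
        model_name="QuartzNet15x5Base-En"
    )

    # Transcribe a local 16 kHz mono WAV file (path is a placeholder).
    print(asr_model.transcribe(["sample.wav"]))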

113W PAB
11:15
90min
Sequential Attention-Based Neural Machine Translation
Hadi Abdi Khojasteh

Sequential models in natural language understanding are widely useful for everything from machine translation to speech recognition. In machine translation, the encoder-decoder architecture, and especially the Transformer, is one of the most prominent branches. Attention has been one of the most influential ideas in deep learning: by adopting it, we can build a model that takes a long sequence of data (for example, the words of a long sentence to be translated), divides it into small parts, and looks at all the others simultaneously when generating the output.
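
To make the "look at the others simultaneously" idea concrete, here is a minimal NumPy sketch (not the speaker's code) of scaled dot-product attention, the building block of the Transformer: every position produces a weighted average of all value vectors, with weights derived from query-key similarity.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: (seq_len, d_k) arrays; returns (seq_len, d_k)."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ V                               # weighted sum of values

    x = np.random.randn(5, 16)     # 5 tokens, 16-dim embeddings (self-attention)
    out = scaled_dot_product_attention(x, x, x)
    print(out.shape)               # (5, 16)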

214W PAB
12:00
12:00
40min
Best practices for coding in ML/DS - Techniques to construct your project
Mher Khachatryan

Many engineers, particularly those in Data Science, do not focus on writing better code that their coworkers will love.

This is bad!

Writing cleaner code and using appropriate tools for experiment logging reduces debugging time and the long-term effort spent on the project. As a result, the code becomes readable and onboarding new engineers onto the project becomes easier.

The target audience is beginner ML/DS practitioners who struggle to write cleaner code, data scientists who are considering adopting better techniques and tools in their projects, and students taking their first steps into the world of ML.

The only necessary knowledge is the ability to read and understand Python code, as all the examples are written in Python. No other technical knowledge is necessary, as the talk is fully introductory and high level.

By the end of the lecture, attendees will have learned about the importance of clean code in an ML project. They will have developed an intuition for writing readable and understandable code, and will have acquired knowledge about the general design of a good codebase and some tools that help engineers log experiments for a cleaner environment.

213W PAB
12:00
40min
Building Data Pipelines on AWS
Rudolf Eremyan

Building data pipelines on AWS, and the hidden costs that can destroy your budget.

114W PAB
12:00
40min
Large scale field delineation
Hrach Asatryan

The talk is about methods for doing large-scale field delineation from aerial imagery. Given the increasing importance of global food supplies, AI in agriculture has become integral to further development in the field. Because modern agriculture operates at the field level, delineating fields is a prerequisite for applying such methods. The talk has no prerequisites and will be a research-oriented, informative talk.

113W PAB
12:45
12:45
60min
Coffee Break
113W PAB
12:45
60min
Coffee Break
114W PAB
12:45
60min
Coffee Break
213W PAB
12:45
60min
Coffee Break
214W PAB
13:45
13:45
40min
A/B testing in production
Sona Hambaryan

Although A/B testing is part of the statistical learning apparatus and rests on a strong mathematical foundation, it remains one of the most frequently violated and misinterpreted techniques in the field. A large share of the violations comes from incorrect experiment setup, which I will cover in practice while taking the business context into account: whether it is a B2B platform or B2C. It would be nice if the audience had hands-on experience with A/B testing; if not, I will still cover it at a high level. The main takeaway for the audience will be an understanding of the pitfalls that lie beneath experiment setup, where a single disregarded use case can invalidate the whole experiment outcome. Time breakdown: 10 min A/B testing basics, 15 min pitfalls, 5 min Q&A.
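
For readers who want a concrete starting point, a basic two-proportion test for a conversion-rate experiment can be run with statsmodels as in the sketch below (illustrative numbers, not from the talk); the pitfalls the speaker covers are precisely the setup decisions that have to be made before a line like this is justified.

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical results: conversions and sample sizes for variants A and B.
    conversions = [420, 480]
    visitors = [10_000, 10_000]

    stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
    print(f"z = {stat:.2f}, p = {p_value:.4f}")  # compare p to the pre-chosen alpha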

113W PAB
13:45
90min
Accelerated Machine Learning Systems End-to-End: From Data Stream to App
Dmitry Mironov

This hands-on tutorial will teach you how to accelerate every component of a machine learning system and improve your team’s productivity at every stage of the ML workflow. You’ll learn how to get started with RAPIDS and the NVIDIA Forest Inference Library, and how to go beyond the basics to get the most out of your accelerated infrastructure. We’ll do all of this in the context of a real-world application that models financial payments fraud and detects it in real time. We’ll show you how:

  • RAPIDS enables you to find better insights into your data more quickly, through accelerated visualization techniques
  • RAPIDS machine learning models can outperform rules-based approaches to detecting payments fraud
  • the NVIDIA Forest Inference Library enables you to accelerate inference of tree models, scoring incoming transactions with high throughput and low latency

Data scientists will experience the high-velocity exploratory workflows enabled by NVIDIA RAPIDS and learn how best to take advantage of GPUs when porting CPU-based pandas and scikit-learn code to run on RAPIDS. Application developers and IT ops professionals will learn more about data science workflows, see how real-world ML systems work, and learn about the myriad benefits of GPU acceleration for these systems and the teams who build them. The tutorial can be delivered both remotely and onsite. Attendees will need a laptop and a stable internet connection, and will be provided with a URL to access the lab environment, so that they can run the tutorial with no prior setup required. Familiarity with standard Python code is desirable.
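
As a taste of the workflow (a simplified sketch assuming a recent RAPIDS release is installed, not the tutorial's actual notebook), cuDF and cuML mirror the pandas and scikit-learn APIs closely:

    import cudf
    from cuml.ensemble import RandomForestClassifier

    # Toy GPU dataframe standing in for transaction features (illustrative columns).
    gdf = cudf.DataFrame({
        "amount":    [12.5, 880.0, 3.2, 9100.0, 45.0, 7600.0],
        "n_prev_tx": [40, 2, 55, 1, 30, 0],
        "is_fraud":  [0, 1, 0, 1, 0, 1],
    })

    X = gdf[["amount", "n_prev_tx"]].astype("float32")
    y = gdf["is_fraud"].astype("int32")

    clf = RandomForestClassifier(n_estimators=100, max_depth=8)
    clf.fit(X, y)                      # trains entirely on the GPU
    print(clf.predict(X))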

214W PAB
13:45
40min
Cifar-10 Exploratory Data Analysis
Anna Shahinyan

Image classification datasets are complicated from the analysis point of view, given the complex structure of images. However, understanding the dataset's high-level descriptors can add debugging facilities and help predict, at an early stage, the quality of the classification model. During this session we will visually analyze one of the challenging benchmark datasets, CIFAR-10.
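
A first look at the dataset's descriptors can be as simple as the sketch below (an illustrative example, not the speaker's analysis): load CIFAR-10 with torchvision and inspect the class balance and per-channel statistics.

    import numpy as np
    from collections import Counter
    from torchvision.datasets import CIFAR10

    train = CIFAR10(root="./data", train=True, download=True)

    # Class balance: CIFAR-10 is perfectly balanced, 5,000 images per class.
    counts = Counter(train.targets)
    print({train.classes[k]: v for k, v in sorted(counts.items())})

    # Per-channel mean and std of the raw pixel values (useful for normalization).
    pixels = train.data.astype(np.float32) / 255.0      # (50000, 32, 32, 3)
    print("mean:", pixels.mean(axis=(0, 1, 2)))
    print("std: ", pixels.std(axis=(0, 1, 2)))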

114W PAB
13:45
40min
Semantic Multimodal Multilingual Similarity engine
Vladimir Orshulevich

Since multimodality became popular, lots of engineer are trying to make domain-universal search. The search engines that will find in image by textual query, the HTML file by piece of audio and so on. So here is our (Unum) approach with a bias towards GPU accelerating inference (for underlying models) and passion to distribute everything.

213W PAB
14:30
14:30
40min
Building a Streaming (E-health) Data Pipeline: When and How?
Zohreh Jafari

In this talk, we discuss streaming and the real-time data stack as a solution for analyzing massive, unbounded data sets that are increasingly common in many modern businesses in different fields and their need for more timely and accurate answers.

A streaming data pipeline flows data continuously from source to destination as it is generated, making it being processed along the way so they are used when the analytics, application or business process requires an updating data flow for an on-time analysis. This analysis can be descriptive like a data dashboard, diagnostic like monitoring logs, predictive like an online fraud detection system or prescriptive like data process in self-driving cars.

This talk is perfect for developers or architects and will be accessible to technical managers and business decision makers—no prior experience with streaming data systems required. The only technical requirement this talk makes is that the audience should feel comfortable with major concepts in data pipelines.

During the talk, examples of e-health data pipelines plus our experience in setting up a data streaming stack help to explicitly of the subject.

113W PAB
14:30
40min
Explainable AI as a Conventional Data Analysis Tool
Maria Sahakyan

The recent surge of interest in Machine Learning (ML) and Artificial Intelligence (AI) has spurred a wide array of models designed to make decisions in a variety of domains, including healthcare [1, 2, 3], financial systems [4, 5, 6, 7], and criminal justice [8, 9, 10], just to name a few. When evaluating alternative models, it may seem natural to prefer those that are more accurate. However, the obsession with accuracy has led to unintended consequences, as developers often strove to achieve greater accuracy at the expense of interpretability by making their models increasingly complicated and harder to understand [11]. This lack of interpretability becomes a serious concern when the model is entrusted with the power to make critical decisions that affect people’s well-being. These concerns have been manifested by the European Union’s recent General Data Protection Regulation, which guarantees a right to explanation, i.e., a right to understand the rationale behind an algorithmic decision that affects individuals negatively [12].

To address these issues, a number of techniques have been proposed to make the decision-making process of AI more understandable to humans. These “Explainable AI” techniques (commonly abbreviated as XAI) are the primary focus of this talk. The talk will be divided into three sections, during which the audience will learn (i) the differences between existing XAI techniques, (ii) the practical implementation of some well-known XAI techniques, and (iii) possible uses of XAI as a conventional data analysis tool.
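
As one concrete example of the kind of technique covered under (ii), the SHAP library attributes a model's prediction to its input features; the sketch below (illustrative, not necessarily one of the methods shown in the talk) explains a random forest trained on a toy dataset.

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer(as_frame=True)
    X, y = data.data, data.target

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # TreeExplainer computes Shapley-value attributions efficiently for tree models.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # Global view: which features drive the model's predictions overall.
    shap.summary_plot(shap_values, X)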

213W PAB
14:30
40min
Near-Duplicate Ad Detection in Online Classified Ad Services
Diyar Mohammadi

Near-Duplicate Ads in online listings usually affect both demand-side and supply-side users’ experiences.

Using machine learning and deep learning, we tried to detect duplicated ads.

Ideally, people attending the talk are familiar with the basics of machine learning and deep learning.

114W PAB
15:15
15:15
15min
Coffee Break
113W PAB
15:15
15min
Coffee Break
114W PAB
15:15
15min
Coffee Break
213W PAB
15:15
15min
Coffee Break
214W PAB
15:30
15:30
40min
Building your own Multiskill AI Assistant with DeepPavlov
Daniel Kornev

-

113W PAB
15:30
40min
Tiling & Parallel Processing of Large Images
Anush Tosunyan

During this session we will review the benefits of processing large imagery in tiles, go over use cases, and look at how the per-tile results are later combined. No previous knowledge is required for the talk.
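
The core idea fits in a few lines; a minimal sketch (illustrative, not the speaker's pipeline) that cuts a large array into tiles, processes them in parallel, and stitches the results back together might look like this:

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def tiles(h, w, size):
        """Yield (row, col) slices covering an h x w image with square tiles."""
        for r in range(0, h, size):
            for c in range(0, w, size):
                yield slice(r, min(r + size, h)), slice(c, min(c + size, w))

    def process(tile):
        """Placeholder per-tile computation (e.g. a model inference call)."""
        return tile.mean(axis=-1)

    if __name__ == "__main__":
        image = np.random.randint(0, 255, (8192, 8192, 3), dtype=np.uint8)
        out = np.zeros(image.shape[:2], dtype=np.float32)
        slices = list(tiles(*image.shape[:2], size=1024))

        # Process tiles in parallel, then stitch the results back together.
        with ProcessPoolExecutor() as pool:
            results = pool.map(process, (image[rs, cs] for rs, cs in slices))
            for (rs, cs), result in zip(slices, results):
                out[rs, cs] = result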

213W PAB
15:30
40min
Use AutoML to create high-quality models
Aleksandr Patrushev

AutoML automates each step of the ML workflow so that it’s easier to use machine learning.

114W PAB
15:30
45min
Visualizing and Analyzing Protein Structures with UCSF Chimera
Christopher Markosian

Proteins are molecules that perform various functions in living cells. All proteins are composed
of the same possible twenty amino acids, but in different combinations. Based on these
combinations, proteins have different three-dimensional structures, which ultimately form the
basis of their unique functions. A protein’s structure can be determined in the laboratory or
predicted with high accuracy using artificial intelligence programs. Understanding the structures
of proteins and the ways they interact in three-dimensional space is critical for human health.
For example, a single amino acid change in a protein can lead to disease such as cancer.
In this Tutorial, participants will learn the basics of the UCSF Chimera software to visualize and
analyze protein structures. This Tutorial will be particularly useful for individuals interested in
bioinformatics and/or structural biology. No background knowledge is necessary, though it may
help to be familiar with the classifications of amino acids (charged, polar, hydrophobic) and
levels of protein structure (primary, secondary, tertiary, quaternary). This Tutorial is code-free,
but the principles learned will be useful for similar Python-based software. Materials will be
distributed via file storage links. By the end of this Tutorial, participants will learn the basic principles
of protein structure, how to retrieve atomic coordinates of an individual protein or protein
complex, and how to examine and analyze protein structures with various features of UCSF
Chimera.

214W PAB
16:15
16:15
40min
Moving Inference to Triton Servers
Marine Palyan

The talk will introduce the audience to Triton Inference Server, the requirements for migrating from regular AWS instances, advantages and benchmarks from our production. This talk is mostly targeted towards Machine Learning Engineers and ML/Ops Engineers, although no previous knowledge is required to attend and understand the topic.

113W PAB
16:15
40min
Streamlit: A faster way to build and share data apps
Karen Javadyan

Poor tooling slows down data science and machine learning projects.

Streamlit is a fast way to build and share data apps. It can turn data scripts into shareable web apps with minimal effort. Let's hear Karen Javadyan introduce Streamlit: the fastest way to build and share data apps as Python scripts.
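
A complete Streamlit app really is just a Python script; a minimal sketch (not the speaker's demo) that you would launch with "streamlit run app.py":

    # app.py -- run with:  streamlit run app.py
    import numpy as np
    import pandas as pd
    import streamlit as st

    st.title("Random walk explorer")

    n_steps = st.slider("Number of steps", min_value=100, max_value=5000, value=1000)
    n_walks = st.slider("Number of walks", min_value=1, max_value=10, value=3)

    walks = pd.DataFrame(
        np.cumsum(np.random.randn(n_steps, n_walks), axis=0),
        columns=[f"walk {i}" for i in range(n_walks)],
    )

    st.line_chart(walks)                 # interactive chart in the browser
    st.dataframe(walks.describe())       # summary table below it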

213W PAB
16:15
40min
Using Python for the PCR design or an easy way to data analysis in life science projects
Ricardas Ralys

Using Python for the PCR design or an easy way to data analysis in life science projects.

114W PAB
16:20
16:20
15min
Decoding Human Physiological Behaviors from Intracranial Field Potentials
Manana Hakobyan

The attempt to decode the human brain using computers is not novel; however, doing it dynamically in an uncontrolled environment with many external confounding factors has been deemed very challenging computationally, and hence is yet to be explored in depth. This study aims to predict human physiological behaviors using machine learning and invasively recorded intracranial field potentials, received through the electrocorticography (ECoG) procedure from the brain surface in an uncontrolled real-life setting. After a rigorous feature-engineering process I showcase that well-defined behaviors such as sleeping, eating and video gaming can be decoded with greater than 0.95 AUCs, while noisier behaviors such as movements, spoken and heard speech are decoded with AUCs higher than 0.80. To ensure that the classification results are reliable, I run a series of experiments with different controls and find that, despite the drop in AUCs, the behaviors are still robustly classified better than random for all of the tests. I also dive deeper into exploring the brain regions which contributed to the high performance of the classification. Not only does this research show that it is possible to classify twelve natural continuous human behaviors with high performance, it also confirms many prior findings in the literature that certain brain-region activities correspond to specific human physiological actions.

214W PAB
17:00
17:00
15min
EENLP: Cross-lingual Eastern European NLP Index
Andrey Manoshin

In our project we present a wide index of existing Eastern European language datasets (90+) and models (60+). Furthermore, to support the evaluation of commonsense reasoning tasks, we compile and publish cross-lingual datasets for five such tasks and provide evaluation results for several existing multilingual models.

113W PAB
17:00
15min
What can your Telegram tell about you? (Answer: Everything)
Hayk Aprikyan

How much has your vocabulary changed over the last year? Who shares the funniest memes with you? And does she find you interesting to chat with? ̶N̶o̶p̶e̶.̶

If you're a Telegram guy, Neplo is that painstakingly data-driven guy who's got answers to these (and hundreds of other) questions, based on your Telegram chat histories.

Still skeptical? Come and see. (John 1:39)

114W PAB
09:00
09:00
60min
Registration
Manookian Hall
09:00
60min
Registration
113W PAB
09:00
60min
Registration
114W PAB
09:00
60min
Registration
213W PAB
09:00
60min
Registration
214W PAB
10:00
10:00
60min
Keynote Speech by Wolfgang Weidinger
Manookian Hall
11:00
11:00
15min
Coffee Break
113W PAB
11:00
15min
Coffee Break
114W PAB
11:00
15min
Coffee Break
213W PAB
11:00
15min
Coffee Break
214W PAB
11:15
11:15
40min
Eating humble Py: From toy problem to real-world solution in predicting Customer Lifetime Value
Katherine Munro

I should have known I was up against it when even my Kaggle solution sucked. I’d been tasked with launching our company’s research efforts into Customer Lifetime Value prediction, so naturally, I turned to that grail of tutorials and toy datasets, and started exploring. Very quickly I learned two things: the go-to CLV dataset was not worth going to, and I really needed some retail domain experts.

213W PAB
11:15
40min
Large Scale Representation Learning In-the-wild
Hadi Abdi Khojasteh

A significant amount of progress is being made today in the field of representation learning. It has been demonstrated that unsupervised techniques can perform as well as, if not better than, fully supervised ones on benchmarks such as image classification, while also demonstrating improvements in label efficiency by multiple orders of magnitude. In this sense, representation learning is now addressing some of the major challenges in deep learning today. It is imperative, however, to understand systematically the nature of the learnt representations and how they relate to the learning objectives.

113W PAB
11:15
40min
The dangers of mindless forecasting
Aghasi Tavadyan

"Prediction is very difficult, especially if it’s about the future!" This phrase is attributed to Niels Bohr, the Nobel laureate in Physics and father of the atomic model. This quote warns about the unreliability of forecasts without proper testing and about constant changes in the initial assumed conditions.

With modern programming languages and convenient packages that provide ready-made modeling solutions, it is often easy to find a model that fits the past data well; perhaps too well! But does the maximization of metrics justify the means? Should the complex structures of predictions be built on the quicksand of noisy data?

This talk is a laid-back discussion that will be useful for an audience from any background, from beginner to advanced. Aghasi Tavadyan is the founder of Tvyal.com, which translates to "data" from Armenian. You can find more info about him at tavadyan.com and tvyal.com.

114W PAB
11:15
40min
The structure and interpretation of ML metadata
Gevorg Soghomonyan

ML metadata connects the different parts of the ML infrastructure into a complete system. This talk is about:
  • how and where the metadata is generated
  • how it is used in modern ML infra stacks
  • how to use the metadata to build systems that enable reproducibility, explainability, and governance over the models
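
As one concrete example of where such metadata comes from (an illustrative sketch with MLflow, which is only one possible tool and not necessarily the stack discussed in the talk), every training run can record its parameters and metrics:

    import mlflow
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    with mlflow.start_run(run_name="rf-baseline"):
        params = {"n_estimators": 200, "max_depth": 5}
        model = RandomForestClassifier(**params, random_state=0)
        score = cross_val_score(model, X, y, cv=5).mean()

        # Metadata captured for later reproducibility, comparison and governance.
        mlflow.log_params(params)
        mlflow.log_metric("cv_accuracy", score)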

214W PAB
12:00
12:00
40min
Active Learning for 3D mesh semantic segmentation
Erik Harutyunyan

The talk is about applications of active learning methods, mainly Monte-Carlo Dropout, to the 3D mesh/point-cloud semantic segmentation task. The topic is particularly interesting for practical applications of Deep Learning models on this type of data, as it gives a working approach for reducing the amount of data needed for training.
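
The mechanism behind Monte-Carlo Dropout is compact: keep dropout active at inference time, run several stochastic forward passes, and treat the spread of the predictions as an uncertainty score for picking the next samples to label. A minimal PyTorch sketch (illustrative, not the speaker's segmentation code):

    import torch
    import torch.nn as nn

    model = nn.Sequential(                      # stand-in for a segmentation network
        nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 4)
    )

    def mc_dropout_predict(model, x, n_passes=20):
        model.train()                           # keep Dropout layers stochastic
        with torch.no_grad():
            probs = torch.stack(
                [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
            )
        mean = probs.mean(dim=0)                # averaged predictive distribution
        entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
        return mean, entropy                    # high entropy -> ask for a label

    x = torch.randn(100, 32)                    # 100 unlabeled samples (toy features)
    mean, entropy = mc_dropout_predict(model, x)
    query = entropy.topk(10).indices            # 10 most uncertain samples to annotate
    print(query)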

114W PAB
12:00
40min
Empirical determinacy of posterior location and scale in Bayesian hierarchical models
Sona Hunanyan

The parameters in a statistical model are not always identified by the data. In Bayesian analysis, this
problem remains unnoticed because of prior assumptions. It is crucial to find out whether the data
determine the marginal posterior parameters. As the famous statistician George Box stated,
“Since all models are wrong the scientist must be alert to what is importantly wrong.”

The R package ed4bhm, which is available on GitHub, allows one to examine the empirical determinacy of
posterior parameters for models fitted with well-known Bayesian techniques.

213W PAB
12:00
40min
Target based sentiment analysis with T5
Liana Minasyan

Classic sentiment analysis analyzes texts, images, emojis, etc. to learn what other people think of a product, service, company, or event. While sentiment analysis can be considered one of the accomplished tasks of Natural Language Processing, its more fine-grained variants such as Target-Based Sentiment Analysis (TSA) or Aspect-Based Sentiment Analysis (ABSA) are not quite the same. In TSA we want to see the sentiment of a given text towards a particular entity (in my case, a person or organization). This task is not yet solved. With the T5 question-answering transformer model it was possible to tackle the task with results 20% higher than the current leaderboards.
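
To illustrate the framing (a rough sketch with a generic pretrained T5 checkpoint, not the speaker's fine-tuned model or prompt format), the target is injected into the input and the model is asked to generate the sentiment label; a checkpoint fine-tuned on TSA data would be needed for meaningful outputs.

    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    text = "The service at Acme Bank was slow, but their mobile app is excellent."
    target = "Acme Bank"   # hypothetical target entity

    # Phrase target-based sentiment analysis as text-to-text question answering.
    prompt = f"question: What is the sentiment towards {target}? context: {text}"
    inputs = tokenizer(prompt, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=5)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))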

214W PAB
12:00
40min
The Explainability Problem: Towards Understanding Artificial Intelligence
Nura Kawa

Explainable Artificial Intelligence (XAI) is crucial for the development of responsible, trustworthy AI. Machine learning models such as deep neural networks can perform highly complex computational tasks at scale, but they do not reveal their decision-making process. This becomes problematic when such models are used to make high-stakes decisions, such as medical diagnoses, which require clear explanations in order to be trusted.

This talk discusses Explainable AI using examples of interest to both machine learning practitioners and non-technical audiences. The talk is not very technical; it does not focus on how to apply an existing method to a particular model. Rather, it discusses the problem of explainability as a whole, namely: what the Explainability Problem is and why it must be solved, how recent academic literature addresses the problem, and how the problem will evolve with new legislation.

To get the most from this talk, the audience should have some familiarity with standard machine learning algorithms. However, no technical background is needed to grasp the key takeaways: the necessity of explainability in machine learning, the challenges of developing explainability methods, and the impact that XAI has on businesses, practitioners and end-users.

113W PAB
12:45
12:45
60min
Coffee Break
113W PAB
12:45
60min
Coffee Break
114W PAB
12:45
60min
Coffee Break
213W PAB
12:45
60min
Coffee Break
214W PAB
13:45
13:45
40min
Bachelor theses in Deep Learning submitted to an Armenian University
Sergey Hayrapetyan

Bachelor theses written in the area of Deep Learning based object detection will be presented. The main focus is on the detection of vehicles captured from above, e.g. in parking lots or satellite imagery. I will present the challenges we encountered and solved in the scope of the Bachelor theses. The goal of this talk is not only to present the results of young undergraduate students, but also to encourage new ones to get involved in the field.

113W PAB
13:45
90min
How to use Pandas efficiently
Hossein Mortazavi

This 90-minute tutorial will demonstrate how to use the Pandas package effectively, so that the audience comes away with a better understanding of the package.
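
A small taste of the kind of advice such a tutorial typically covers (illustrative examples, not necessarily the presenter's material): prefer vectorised operations over row-wise apply, and use appropriate dtypes to cut memory.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price":    np.random.uniform(1, 100, 1_000_000),
        "quantity": np.random.randint(1, 10, 1_000_000),
        "country":  np.random.choice(["AM", "DE", "US"], 1_000_000),
    })

    # Slow: a Python-level function call per row.
    # df["total"] = df.apply(lambda r: r["price"] * r["quantity"], axis=1)

    # Fast: a single vectorised operation over whole columns.
    df["total"] = df["price"] * df["quantity"]

    # Low-cardinality strings are far cheaper as a categorical dtype.
    df["country"] = df["country"].astype("category")
    print(df.memory_usage(deep=True))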

214W PAB
13:45
40min
ML Platform for Insurance Conglomerate
Dmitry Mezhensky

A guide through modern ML platform development for the insurance sector and the challenges around it.

114W PAB
13:45
40min
Using Few Shot Object Detection for Utility Pole detection from Google Street View images.
Mark Hamazaspyan

Traditional methods of detecting and mapping utility poles are manual, time-consuming and costly. Current solutions focus on the detection of T-shaped (cross-arm) poles, and the lack of labeled data makes it difficult to generalize the process to other types of poles. This work aims to use Few Shot Object Detection techniques to overcome the unavailability of data and to create a general pole detection model from just a few labeled images.

213W PAB
14:30
14:30
40min
Building a Lakehouse data platform using Delta Lake, PySpark, and Trino
Viacheslav Inozemtsev

In this talk I would like to present the concept of Lakehouse, which is a novel architecture to resolve problems and combine capabilities of the classical Data Warehouse and Data Lake. I will talk about the Delta Lake table format that resides in the core of Lakehouse. I will demonstrate how Delta Lake integrates with Apache Spark, to build data ingestion pipelines. I will also show how Delta Lake integrates with Apache Trino, to provide a fast SQL-based serving layer. As a result, I will bring all these components together to describe how they enable a modern big data platform. This talk will be useful for an intermediate level audience of data engineers, data analysts, and data scientists.
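
For a flavor of the ingestion side (a minimal sketch assuming the delta-spark package is installed; table paths and columns are placeholders), writing and reading a Delta table from PySpark looks like ordinary DataFrame code plus a couple of session options:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lakehouse-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    events = spark.createDataFrame(
        [(1, "click"), (2, "purchase")], ["user_id", "event"]
    )

    # Append to a Delta table; the transaction log provides ACID guarantees,
    # and engines such as Trino can later query the same table for serving.
    events.write.format("delta").mode("append").save("/tmp/lake/events")

    spark.read.format("delta").load("/tmp/lake/events").show()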

213W PAB
14:30
40min
NetworkX - your unexpected assistant for clustering analysis
Meirav Ben Izhak

Clustering analysis is a common task in data science, but it can sometimes get tedious. In this talk, I will present how functionality from the NetworkX package can assist us in analyzing and presenting the results of clustering analysis. This talk assumes no previous knowledge: a brief reminder of graph basics will be given and NetworkX will be briefly introduced. Join in if you:
  • want to hear about a new suggested usage of a known data structure
  • like to get things done more efficiently when you cluster
  • never heard of NetworkX but would like to
  • all of the above
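
One way NetworkX can sneak into a clustering workflow (a toy sketch, not necessarily the usage the speaker has in mind): build a graph whose edges connect sufficiently similar points and read the clusters off as connected components.

    import networkx as nx
    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

    # Connect every pair of points that are closer than a chosen threshold.
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        if np.linalg.norm(points[i] - points[j]) < 1.0:
            G.add_edge(i, j)

    # Each connected component is one cluster.
    clusters = list(nx.connected_components(G))
    print(len(clusters), [len(c) for c in clusters])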

113W PAB
14:30
40min
PyTorch Geometric for Graph Neural Nets
Dmitry Korobchenko
  • In contrast to classical Deep Learning models (such as MLP, CNN, RNN, Transformers), which are usually applied to tensors and sequences, the Graph Neural Net (GNN) is a special type of Deep Learning model that works with non-Euclidean data structures such as graphs. Examples of graph analysis tasks where a data-driven approach can help include 3D mesh processing, molecular analysis, social graph data mining and potentially any other task where traditional DL methods are inapplicable.
  • PyTorch is an industry-standard Deep Learning framework, which provides a lot of useful DL operations and utilities. PyTorch Geometric is a library built on top of PyTorch, implementing a set of tools to create and train Graph Neural Networks (a minimal sketch follows below).
  • In this talk I will give a very quick and high-level introduction to GNNs and PyTorch Geometric.
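
As the minimal sketch referenced above (illustrative, not from the talk), a two-layer graph convolutional network for node classification on a toy graph looks like this in PyTorch Geometric:

    import torch
    import torch.nn.functional as F
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv

    # A tiny toy graph: 4 nodes, undirected edges stored as directed pairs.
    edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                               [1, 0, 2, 1, 3, 2]], dtype=torch.long)
    x = torch.randn(4, 8)            # 8 input features per node
    y = torch.tensor([0, 0, 1, 1])   # node labels
    data = Data(x=x, edge_index=edge_index, y=y)

    class GCN(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = GCNConv(8, 16)
            self.conv2 = GCNConv(16, 2)

        def forward(self, data):
            h = F.relu(self.conv1(data.x, data.edge_index))
            return self.conv2(h, data.edge_index)

    model = GCN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for epoch in range(100):
        optimizer.zero_grad()
        out = model(data)
        loss = F.cross_entropy(out, data.y)
        loss.backward()
        optimizer.step()
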
114W PAB
15:15
15:15
15min
Coffee Break
113W PAB
15:15
15min
Coffee Break
114W PAB
15:15
15min
Coffee Break
213W PAB
15:15
15min
Coffee Break
214W PAB
15:30
15:30
40min
AI-Powered Solutions for Cybersecurity
Elina Israelyan

Cyberattacks are continuously growing in volume and complexity. They target organizations' systems, networks, and private data, causing financial loss, customer loss, and data leakage. As technology improves, Artificial Intelligence (AI) based solutions help boost cybersecurity. Attend this talk to discover how AI-powered algorithms are used to stay ahead of cyberattacks such as phishing, lookalike domains, or name spoofing.

214W PAB
15:30
40min
BERT Model for Real World Healthcare Data
Artem Terentyuk

Early indication and detection of diseases can provide patients with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning provide a great opportunity to address this unmet need. In this lecture, we introduce a modified BERT: a deep neural sequence transduction model designed for electronic health records (EHR). We will consider the application of this methodology to the task of classifying patients into cohorts reflecting different disease patterns.

113W PAB
15:30
40min
Classical Texture Synthesis and Beyond
Hovhannes Margaryan

Given a small example of a texture as input, the goal of texture synthesis is to generate an output image that is an expanded and smartly tiled version of the given input while maintaining its perceptual information. Texture synthesis methods are categorized into three main types: non-parametric, parametric and procedural. Two of these categories, namely non-parametric and parametric, are discussed during the talk. On the one hand, non-parametric approaches resample pixels or patches from the given source texture; Texture Optimization for Example-based Synthesis and Image Quilting for Texture Synthesis and Transfer are discussed. On the other hand, parametric methods require an explicit definition of a parametric texture model; two parametric methods, namely Texture Synthesis using CNNs and Non-Stationary Texture Synthesis by Adversarial Expansion, are presented. Results of the above-mentioned methods are demonstrated as a conclusion.

213W PAB
15:30
40min
Recommendation Systems in Market Research
Davit Abgaryan

The combined use of recommendation systems, multicriteria optimization, and regression models allowed us to solve the problem of efficient user-survey matching at scale in the process of gathering opinion data.

114W PAB
16:15
16:15
40min
Exploratory Data Analysis and Feature Engineering with PySpark
Yana Khalitova

Learn more about handling Big Data with PySpark and why PySpark might become your go-to framework for Exploratory Data Analysis (EDA) and Feature Engineering (FE)! This talk will show the main differences between the Pandas and PySpark frameworks and outline the advantages of performing EDA and FE with PySpark. It will be most beneficial for Data Scientists, Data Analysts, and other attendees who perform data exploration and wrangling. They will learn about handling Big Data with PySpark, its functionalities for EDA and FE, and means of visualising the results. This talk is a good fit for a Beginner to Intermediate level audience with prior experience in Python and SQL.
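
For attendees who want a preview, a few typical EDA and FE steps in PySpark (an illustrative sketch with placeholder columns and file path, not the talk's dataset) look like this:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("eda-fe-demo").getOrCreate()

    df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # placeholder path

    # Exploratory data analysis: schema, summary statistics, group-level aggregates.
    df.printSchema()
    df.describe("price", "quantity").show()
    df.groupBy("country").agg(F.count("*").alias("rows"),
                              F.avg("price").alias("avg_price")).show()

    # Feature engineering: derived columns and simple missing-value handling.
    df = (df
          .withColumn("revenue", F.col("price") * F.col("quantity"))
          .withColumn("log_revenue", F.log1p("revenue"))
          .fillna({"quantity": 0}))
    df.show(5)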

214W PAB
16:15
40min
How to start critical thinking in Data Science
Arpine Sahakyan

The aim of the presentation is to address issues concerning bias in data, misleading statistics, issues in testing and other matters that are prevalent in the field of data science.

213W PAB
16:15
40min
Modern Data Stack: Optimising and Scaling Data in a tech company
Nacho Aranguren

A new approach to data integration, enabled by DataOps, that saves engineering time and allows engineers and analysts to pursue higher-value activities.

The modern data stack (MDS) is a suite of tools used for data integration. These tools include, in order of how the data flows:

  • a combination of a fully managed ELT pipeline and a custom-managed ETL pipeline
  • a cloud-based columnar warehouse or data lake as a destination
  • a data transformation tool
  • a business intelligence or data visualisation platform

114W PAB
16:15
40min
Scaling Semi-Supervised Production-Grade ASR on 200 Languages
Luka Chkhetiani

Self-supervised pretraining has been wildly successful lately, covering almost every domain: speech, NLP, vision. Networks such as Wav2Vec2, HuBERT, JUST and the like have enabled rapid development of speech-related products. In this talk we're going to go through the end-to-end research and engineering process of a production-grade self-supervised ASR system in the multilingual setting. Covered topics include: compute, data, scalability, and engineering for pretraining and downstream tuning.
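
For context on what the downstream end of such a pipeline looks like, here is a minimal sketch using an open pretrained wav2vec 2.0 checkpoint from Hugging Face (illustrative only; it is not the speaker's multilingual production system):

    import torch
    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # A small English sample dataset; any 16 kHz mono waveform works the same way.
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean",
                      split="validation")
    audio = ds[0]["audio"]["array"]

    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)     # greedy CTC decoding
    print(processor.batch_decode(predicted_ids)[0])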

113W PAB