PyData Yerevan 2022
Everyone knows and uses Pandas, NumPy and NetworkX, but is there something better? Something equally easy to use, but hopefully with more features, or, more importantly, higher performance!
Among the most prominent achievements of the quantum computing field is an algorithm known as Grover’s quantum search. This talk focuses on Grover’s algorithm and its applications to machine learning routines. Prior knowledge required is a basic understanding of linear algebra and computer science, and familiarity with the concepts of machine learning.
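For orientation, the quadratic speedup behind Grover's search can be stated compactly; this is the standard textbook result rather than material specific to the talk:

```latex
% Searching N items of which M are marked: success probability after k
% Grover iterations and the optimal iteration count (standard result).
\sin\theta = \sqrt{M/N}, \qquad
P_{\text{success}}(k) = \sin^2\!\big((2k+1)\,\theta\big), \qquad
k_{\text{opt}} \approx \frac{\pi}{4}\sqrt{\frac{N}{M}}
```

Compared with the O(N) queries that classical unstructured search needs, Grover's algorithm finds a marked item in O(√N) oracle calls.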
This talk introduces NeMo: NVIDIA's open-source toolkit for conversational AI that provides a wide collection of models for automatic speech recognition, text-to-speech, natural language processing and neural machine translation.
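As a taste of the toolkit, here is a minimal sketch of transcribing an audio file with a pretrained NeMo ASR model; the checkpoint name is one of NVIDIA's published English models, the file name is a placeholder, and the exact transcribe signature can vary between NeMo releases:

```python
# Minimal sketch: offline transcription with a pretrained NeMo ASR model.
# Assumes nemo_toolkit[asr] is installed and "sample.wav" exists locally;
# the transcribe() signature may differ slightly between NeMo versions.
import nemo.collections.asr as nemo_asr

# Download a published pretrained English CTC model from NGC.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Run transcription on a local audio file.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```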
Sequential models in natural language understanding are widely useful for everything from machine translation to speech recognition. In machine translation, the encoder-decoder architecture, especially the Transformer, is one of the most prominent branches. The attention mechanism has been one of the most influential ideas in deep learning. By adopting it, we can build a model that takes a long sequence of data (for example, the words of a long sentence to be translated), processes it in small parts while looking at all the other parts simultaneously, and generates the output at the end.
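To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention on a toy sequence; the shapes and data are illustrative and not taken from the talk:

```python
# Minimal sketch of scaled dot-product attention, the core operation that lets
# a Transformer look at every position of a sequence at once.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarities between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the sequence
    return weights @ V                                   # each output is a weighted mix of all positions

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # a toy "sentence" of 5 tokens with 8-dim embeddings
out = scaled_dot_product_attention(x, x, x)  # self-attention: every token attends to all tokens
print(out.shape)                             # (5, 8)
```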
Many engineers, particularly those in Data Science, do not focus on writing better code that their coworkers will love.
This is bad!
Writing cleaner code and using appropriate tools for experiment logging reduce the time spent debugging and the effort spent on the project in the long term. Consequently, the code becomes readable and onboarding new engineers onto the project becomes easier.
The target audience is beginner ML/DS practitioners who struggle to write cleaner code, data scientists who are considering adopting better techniques and tools in their projects, and students who are taking their first steps into the world of ML.
The only necessary knowledge is the ability to read and understand Python code, as all of the examples are written in Python. No other technical background is required, as the talk is fully introductory and stays at a high level.
By the end of the lecture, attendees will have learned about the importance of clean code in an ML project. They will have developed intuition about writing readable and understandable code, and will have acquired knowledge about the general design of a good codebase and some tools that help engineers log experiments for a cleaner environment.
Building Data Pipelines on AWS and the hidden costs that can destroy your budget
The talk is about methods for doing large-scale field delineation from aerial imagery. Given the increasing importance of global food supplies, AI in agriculture has become integral to further development in the field. Since modern agriculture operates at the field level, delineating fields is a prerequisite for applying such methods. The talk has no prerequisites and will be a research-oriented, informative talk.
Although A/B testing is part of the statistical learning apparatus and rests on a strong mathematical foundation, it remains one of the practices in the field that continue to be violated and misinterpreted. A large share of the violations comes from a wrong experiment setup, which I'll try to cover in practice while taking the business setting into consideration: whether it's a B2B platform or B2C. It would be nice if the audience had hands-on experience with A/B testing; if not, I'm still going to cover it at a high level. The main takeaway for the audience will be an understanding of the pitfalls that lie in the experiment setup, where a single disregarded use case can invalidate the whole experiment outcome. Time breakdown: 10 min A/B testing basics, 15 min pitfalls, 5 min Q&A.
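As a concrete illustration of the kind of setup the talk revisits, here is a minimal two-proportion z-test for a conversion-rate experiment; the counts are made up and statsmodels' proportions_ztest is just one possible tool:

```python
# Minimal sketch of a two-proportion z-test for a conversion-rate A/B test,
# with hypothetical counts. A real experiment also needs a pre-registered
# sample size and guard rails against peeking at interim results.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]        # successes in control / treatment (made-up numbers)
visitors = [10_000, 10_000]     # exposures per variant

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# Reject H0 only at the significance level fixed *before* the experiment started.
```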
This hands-on tutorial will teach you how to accelerate every component of a machine learning system and improve your team’s productivity at every stage of the ML workflow. You’ll learn how to get started with RAPIDS and NVIDIA Forest Inference Library, and how to go beyond the basics to get the most out of your accelerated infrastructure. We’ll do all of this in the context of a real-world application that models financial payments fraud and detects it in real time. We’ll show you how:
- RAPIDS enables you to find better insights into your data more quickly, through accelerated visualization techniques
- RAPIDS Machine Learning models can outperform rules-based approaches to detecting payments fraud
- NVIDIA Forest Inference Library enables you to accelerate inference of tree models, scoring incoming transactions with high throughput and low latency
Data scientists will experience the high-velocity exploratory workflows enabled by NVIDIA RAPIDS and learn how to best take advantage of GPUs when porting CPU-based pandas and scikit-learn code to run on RAPIDS. Application developers and IT ops professionals will learn more about data science workflows, see how real-world ML systems work, and learn about the myriad benefits of GPU acceleration for these systems and the teams who build them. The tutorial can be delivered both remotely and onsite. Attendees would need a laptop and a stable internet connection. Attendees will be provided with a URL to access the lab environment, so that they can access and run the tutorial with no prior set-up required. Familiarity with standard Python code is desirable.
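For a flavour of what porting pandas/scikit-learn code to RAPIDS looks like, here is a minimal sketch, assuming an NVIDIA GPU with cuDF and cuML installed; the file and column names are placeholders, not the tutorial's actual fraud dataset:

```python
# Minimal sketch of a pandas/scikit-learn workflow ported to RAPIDS.
# Requires an NVIDIA GPU with cuDF/cuML installed; "transactions.csv" and
# the "is_fraud" column are placeholders for illustration.
import cudf
from cuml.ensemble import RandomForestClassifier

df = cudf.read_csv("transactions.csv")              # drop-in replacement for pandas.read_csv
X = df.drop(columns=["is_fraud"]).astype("float32")  # cuML expects float32 features
y = df["is_fraud"].astype("int32")

clf = RandomForestClassifier(n_estimators=100, max_depth=10)
clf.fit(X, y)                                       # trained entirely on the GPU
print(clf.predict(X.head(5)))
```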
Image classification datasets are complicated to analyze, given the complex structure of images. However, a high-level understanding of a dataset's descriptors can add debugging facilities and help predict, at an early stage, the quality of the classification model. During this session we will visually analyze one of the challenging SOTA benchmark datasets, CIFAR-10.
Since multimodality became popular, many engineers have been trying to build domain-universal search: search engines that will find an image by a textual query, an HTML file by a piece of audio, and so on. Here is our (Unum) approach, with a bias towards GPU-accelerated inference (for the underlying models) and a passion for distributing everything.
In this talk, we discuss streaming and the real-time data stack as a solution for analyzing the massive, unbounded data sets that are increasingly common in modern businesses across many fields, and for meeting their need for more timely and accurate answers.
A streaming data pipeline flows data continuously from source to destination as it is generated, processing it along the way so that it can be used whenever an analytics, application, or business process requires an up-to-date data flow for on-time analysis. This analysis can be descriptive, like a data dashboard; diagnostic, like monitoring logs; predictive, like an online fraud-detection system; or prescriptive, like the data processing in self-driving cars.
This talk is perfect for developers or architects and will be accessible to technical managers and business decision makers—no prior experience with streaming data systems required. The only technical requirement this talk makes is that the audience should feel comfortable with major concepts in data pipelines.
During the talk, examples of e-health data pipelines, together with our experience in setting up a data streaming stack, will help illustrate the subject.
The recent surge of interest in Machine Learning (ML) and Artificial Intelligence (AI) has spurred a wide array of models designed to make decisions in a variety of domains, including healthcare [1, 2, 3], financial systems [4, 5, 6, 7], and criminal justice [8, 9, 10], just to name a few. When evaluating alternative models, it may seem natural to prefer those that are more accurate. However, the obsession with accuracy has led to unintended consequences, as developers often strove to achieve greater accuracy at the expense of interpretability by making their models increasingly complicated and harder to understand [11]. This lack of interpretability becomes a serious concern when the model is entrusted with the power to make critical decisions that affect people’s well-being. These concerns have been manifested by the European Union’s recent General Data Protection Regulation, which guarantees a right to explanation, i.e., a right to understand the rationale behind an algorithmic decision that affects individuals negatively [12]. To address these issues, a number of techniques have been proposed to make the decision-making process of AI more understandable to humans. These “Explainable AI ” techniques (commonly abbreviated as XAI) are the primary focus of this talk. The talk will be divided into three sections, during which the audience will learn (i) the differences between existing XAI techniques, (ii) the practical implementation of some well-known XAI techniques, and (iii) possible uses of XAI as a conventional data analysis tool.
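As a small, hedged example of what a practical XAI technique looks like in code, here is a sketch of computing SHAP values for a tree model on a public scikit-learn dataset; it is illustrative and not necessarily one of the specific techniques implemented in the talk:

```python
# Minimal sketch of one popular XAI technique (SHAP values) applied to a
# gradient-boosted tree regressor on a public dataset; purely illustrative.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)                 # efficient SHAP values for tree ensembles
shap_values = explainer.shap_values(X.iloc[:100])     # shape: (100, n_features)

# Contribution of each feature to the first prediction, relative to the expected value.
print(dict(zip(X.columns, shap_values[0].round(3))))
```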
Near-Duplicate Ads in online listings usually affect both demand-side and supply-side users’ experiences.
Using machine learning and deep learning, we tried to detect duplicated ads.
Ideally, people attending the talk are familiar with the basics of machine learning and deep learning.
During this session we will review the benefits of processing large imagery by tiles, review use cases, and then look at combining the results. No previous knowledge is required for the talk.
AutoML automates each step of the ML workflow so that it’s easier to use machine learning.
Proteins are molecules that perform various functions in living cells. All proteins are composed
of the same possible twenty amino acids, but in different combinations. Based on these
combinations, proteins have different three-dimensional structures, which ultimately form the
basis of their unique functions. A protein’s structure can be determined in the laboratory or
predicted with high accuracy using artificial intelligence programs. Understanding the structures
of proteins and the ways they interact in three-dimensional space is critical for human health.
For example, a single amino acid change in a protein can lead to diseases such as cancer.
In this Tutorial, participants will learn the basics of the UCSF Chimera software to visualize and
analyze protein structures. This Tutorial will be particularly useful for individuals interested in
bioinformatics and/or structural biology. No background knowledge is necessary, though it may
help to be familiar with the classifications of amino acids (charged, polar, hydrophobic) and
levels of protein structure (primary, secondary, tertiary, quaternary). This Tutorial is code-free,
but the principles learned will be useful for similar Python-based software. Materials will be
distributed via file storage links. By the end of this Tutorial, participants will have learned the basic principles of protein structure, how to retrieve the atomic coordinates of an individual protein or protein complex, and how to examine and analyze protein structures with various features of UCSF Chimera.
The talk will introduce the audience to Triton Inference Server, the requirements for migrating from regular AWS instances, and the advantages and benchmarks from our production deployment. This talk is mostly targeted towards Machine Learning Engineers and MLOps Engineers, although no previous knowledge is required to attend and understand the topic.
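For context, querying a model served by Triton typically looks like the sketch below; the model name, tensor names, and server address are placeholders that must match your own deployment's config.pbtxt:

```python
# Minimal sketch of querying a model served by Triton Inference Server over HTTP.
# "my_classifier", "input__0", and "output__0" are placeholders that must match
# the served model's configuration; the server is assumed to run on localhost:8000.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input__0", batch.shape, "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("output__0")

result = client.infer(model_name="my_classifier", inputs=[inp], outputs=[out])
print(result.as_numpy("output__0").shape)
```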
Poor tooling slows down data science and machine learning projects.
Streamlit is a fast way to build and share data apps: it turns data scripts into shareable web apps with minimal effort. Let's hear Karen Javadyan introduce Streamlit, the fastest way to build and share data apps as Python scripts.
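A minimal sketch of what such a data app can look like (save as app.py and run `streamlit run app.py`); the specifics are illustrative rather than taken from the talk:

```python
# Minimal Streamlit data app: upload a CSV, show summary stats, plot a column.
import pandas as pd
import streamlit as st

st.title("Quick CSV explorer")

uploaded = st.file_uploader("Upload a CSV file", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write(df.describe())                                           # summary statistics as a table
    column = st.selectbox("Column to plot", df.select_dtypes("number").columns)
    st.line_chart(df[column])                                         # interactive chart with no plotting code
```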
Using Python for PCR design, or an easy way to do data analysis in life science projects.
The attempt to decode the human brain using computers is not novel; however, doing it dynamically in an uncontrolled environment with many external confounding factors has been deemed very challenging computationally and, hence, has yet to be explored in depth. This study aims to predict human physiological behaviors using machine learning and invasively recorded intracranial field potentials, obtained through the electrocorticography (ECoG) procedure from the brain surface in an uncontrolled, real-life setting. After a rigorous feature engineering process, I showcase that well-defined behaviors such as sleeping, eating, and video gaming can be decoded with AUCs greater than 0.95, and that noisier behaviors such as movements, spoken speech, and heard speech are decoded with AUCs higher than 0.80. To ensure that the classification results are reliable, I run a series of experiments with different controls and find that, despite a drop in AUCs, the behaviors are still robustly classified better than chance in all of the tests. I also dive deeper into exploring the brain regions which contributed to the high classification performance. Not only does this research show that it is possible to classify twelve natural, continuous human behaviors with high performance, it also confirms many prior literature findings which state that certain brain region activities correspond to specific human physiological actions.
In our project we present a wide index of existing Eastern European language datasets (90+) and models (60+). Furthermore, to support the evaluation of commonsense reasoning tasks, we compile and publish cross-lingual datasets for five such tasks and provide evaluation results for several existing multilingual models.
How much has your vocabulary changed over the last year? Who shares the funniest memes with you? And does she find you interesting to chat with? ̶N̶o̶p̶e̶.̶
If you're a Telegram guy, Neplo is that painstakingly data-driven guy who's got answers to these (and hundreds of other) questions, based on your Telegram chat histories.
Still skeptical? Come and see. (John 1:39)
I should have known I was up against it when even my Kaggle solution sucked. I’d been tasked with launching our company’s research efforts into Customer Lifetime Value prediction, so naturally, I turned to that grail of tutorials and toy datasets, and started exploring. Very quickly I learned two things: the go-to CLV dataset was not worth going to, and I really needed some retail domain experts.
A significant amount of progress is being made today in the field of representation learning. It has been demonstrated that unsupervised techniques can perform as well as, if not better than, fully supervised ones on benchmarks such as image classification, while also demonstrating improvements in label efficiency by multiple orders of magnitude. In this sense, representation learning is now addressing some of the major challenges in deep learning today. It is imperative, however, to understand systematically the nature of the learnt representations and how they relate to the learning objectives.
"Prediction is very difficult, especially if it’s about the future!" This phrase is attributed to Niels Bohr, the Nobel laureate in Physics and father of the atomic model. This quote warns about the unreliability of forecasts without proper testing and about constant changes in the initial assumed conditions.
With modern programming languages and convenient packages that provide ready-made modeling solutions, it is often easy to find a model that fits the past data well; perhaps too well! But does the maximization of metrics justify the means? Should the complex structures of predictions be built on the quicksand of noisy data?
This talk is a laid-back discussion that will be useful for audiences of any background, from beginner to advanced. Aghasi Tavadyan is the founder of Tvyal.com, which translates to "data" from Armenian. You can find more info about him on these websites: tavadyan.com, tvyal.com.
ML metadata connects the different parts of the ML infrastructure together into a complete system. This talk covers:
- how and where the metadata is generated;
- how it is used in modern ML infra stacks;
- how to use the metadata to build systems that enable reproducibility, explainability, and governance over the models.
The talk is about applications of active learning methods, mainly Monte Carlo Dropout, to the 3D mesh/point-cloud semantic segmentation task. The topic is particularly interesting for practical applications of Deep Learning models on this type of data, as it gives a working approach for reducing the amount of data needed for training.
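For illustration, the core of Monte Carlo Dropout can be sketched in a few lines of PyTorch; the toy MLP below stands in for the mesh/point-cloud segmentation network discussed in the talk:

```python
# Minimal sketch of Monte Carlo Dropout for active learning: keep dropout active
# at inference time, run several stochastic forward passes, and use the spread of
# the predictions to pick the most uncertain points for labeling.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 4))

def mc_dropout_predict(model, x, n_samples=20):
    model.train()                                   # keeps dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-9).log()).sum(dim=-1)   # predictive entropy per point
    return mean, entropy

x = torch.randn(8, 16)                              # 8 unlabeled points with toy features
mean, uncertainty = mc_dropout_predict(model, x)
query = uncertainty.argsort(descending=True)[:3]    # most uncertain points to label next
print(query)
```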
The parameters in a statistical model are not always identified by the data. In Bayesian analysis, this
problem remains unnoticed because of prior assumptions. It is crucial to find out whether the data
determine the marginal posterior parameters. As the famous mathematician George Box stated,
“Since all models are wrong the scientist must be alert to what is importantly wrong.”
The R package ed4bhm, which is available on GitHub, allows one to examine the empirical determinacy of posterior parameters for models fitted with well-known Bayesian techniques.
Classic sentiment analysis analyzes texts, images, emojis, etc. to learn what people think of a product, service, company, or event. While sentiment analysis can be considered one of the accomplished tasks of Natural Language Processing, its more fine-grained variants, like Target-Based Sentiment Analysis (TSA) or Aspect-Based Sentiment Analysis (ABSA), are not quite the same. In TSA we want to determine the sentiment of a given text towards a particular entity (in my case, a person or organization). This task is still far from solved. By framing it as question answering with the T5 transformer model, it was possible to obtain results 20% higher than the current leaderboards.
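A minimal sketch of framing TSA as text-to-text question answering with T5 via Hugging Face Transformers; the prompt format and the generic t5-base checkpoint are assumptions for illustration, whereas the talk's model was fine-tuned for the task:

```python
# Minimal sketch of target-based sentiment analysis posed as text-to-text QA with T5.
# The prompt format and the off-the-shelf "t5-base" checkpoint are illustrative;
# without task-specific fine-tuning the output will not match the talk's results.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

text = "The new phone from Acme is great, but their support was terrible."
prompt = f"question: What is the sentiment towards Acme support? context: {text}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```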
Explainable Artificial Intelligence (XAI) is crucial for the development of responsible, trustworthy AI. Machine learning models such as deep neural networks can perform highly complex computational tasks at scale, but they do not reveal their decision-making process. This becomes problematic when such models are used to make high-stakes decisions, such as medical diagnoses, which require clear explanations in order to be trusted.
This talk discusses Explainable AI using examples of interest for both machine learning practitioners and non-technical audiences. This talk is not very technical; it does not focus on how to apply an existing method to a particular model. Rather, it discusses the problem of Explainability as a whole, namely: what the Explainability Problem is and why it must be solved, how recent academic literature addresses the problem, and how the problem will evolve with new legislation.
To get the most from this talk, the audience should have some familiarity with standard machine learning algorithms. However, no technical background is needed to grasp the key takeaways: the necessity of explainability in machine learning, the challenges of developing explainability methods, and the impact that XAI has on businesses, practitioners and end-users.
Bachelor theses written in the area of Deep Learning based object detection will be presented. The main focus is on the detection of vehicles captured from above, e.g. in parking lots or from satellites. I will present the challenges we encountered and solved in the scope of the Bachelor theses. The goal of this talk is not only to present the results of young undergraduate students, but also to encourage new ones to get involved in the field.
This 90-minute tutorial will demonstrate how to use the Pandas package effectively, so that the audience understands the package better after viewing the tutorial.
A guide through modern ML platform development for the insurance sector and the challenges around it.
Traditional methods of detecting and mapping utility poles are manual, time-consuming, and costly. Current solutions focus on the detection of T-shaped (cross-arm shaped) poles, and the lack of labeled data makes it difficult to generalize the process to other types of poles. This work aims to use Few-Shot Object Detection techniques to overcome the unavailability of data and to create a general pole detection model from only a few labeled images.
In this talk I would like to present the concept of the Lakehouse, a novel architecture that resolves the problems and combines the capabilities of the classical Data Warehouse and Data Lake. I will talk about the Delta Lake table format that sits at the core of the Lakehouse. I will demonstrate how Delta Lake integrates with Apache Spark to build data ingestion pipelines. I will also show how Delta Lake integrates with Apache Trino to provide a fast SQL-based serving layer. Finally, I will bring all these components together to describe how they enable a modern big data platform. This talk will be useful for an intermediate-level audience of data engineers, data analysts, and data scientists.
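A minimal sketch of the Delta Lake plus Spark piece of that stack: writing and reading a Delta table from PySpark, assuming the delta-spark package is installed and the Delta jars are on the Spark classpath (paths and columns are placeholders):

```python
# Minimal sketch: write and read a Delta Lake table with PySpark.
# Assumes delta-spark is installed and the Delta jars are available to Spark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame([(1, "click"), (2, "purchase")], ["user_id", "event"])
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# The same table can later be read by Spark (or, via a connector, queried from Trino).
spark.read.format("delta").load("/tmp/events_delta").show()
```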
Clustering analysis is a common task in data science, but it can sometimes get tedious. In this talk, I will present how functionality from the NetworkX package can assist us in analyzing and presenting the results of clustering analysis. This talk assumes no previous knowledge: a brief reminder of graph basics will be given and networkx will be briefly presented. Join in if you:
- want to hear about a new suggested usage of a known data structure;
- like to get things done more efficiently when you cluster;
- never heard of networkx but would like to;
- all of the above.
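One possible flavour of the idea, sketched with made-up data: threshold a pairwise similarity matrix into a graph and read the clusters off its connected components (the dataset and threshold are illustrative, not necessarily the talk's method):

```python
# Minimal sketch of pairing clustering with NetworkX: build a graph from strong
# pairwise similarities and treat its connected components as clusters.
import networkx as nx
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
similarity = rbf_kernel(X, gamma=0.1)

G = nx.Graph()
G.add_nodes_from(range(len(X)))
rows, cols = np.where(similarity > 0.6)              # keep only strong similarities
G.add_edges_from((i, j) for i, j in zip(rows, cols) if i < j)

clusters = list(nx.connected_components(G))          # each component is a cluster
print([len(c) for c in clusters])
```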
- In contrast to classical Deep Learning models (such as MLPs, CNNs, RNNs, Transformers), which are usually applied to tensors and sequences, a Graph Neural Network (GNN) is a special type of Deep Learning model that works with non-Euclidean data structures, such as graphs. Examples of graph analysis tasks where a data-driven approach can help include 3D mesh processing, molecular analysis, social graph data mining, and potentially any other task where traditional DL methods are inapplicable.
- PyTorch is an industry-standard Deep Learning framework, which provides a lot of useful DL operations and utilities. PyTorch Geometric is a library built on top of PyTorch, implementing a set of tools to create and train Graph Neural Networks.
- In this talk I will give a very quick and high-level introduction to GNNs and PyTorch Geometric; a minimal code sketch follows below.
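A minimal sketch of a two-layer GCN on a toy graph with PyTorch Geometric; the graph, features, and labels are made up purely for illustration:

```python
# Minimal sketch: node classification on a tiny toy graph with PyTorch Geometric.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# A tiny undirected graph: 4 nodes, edges listed in both directions.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 0],
                           [1, 0, 2, 1, 3, 2, 0, 3]], dtype=torch.long)
x = torch.randn(4, 8)                       # 8-dimensional node features
y = torch.tensor([0, 1, 0, 1])              # node labels
data = Data(x=x, edge_index=edge_index, y=y)

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(8, 16)
        self.conv2 = GCNConv(16, 2)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(50):                         # a few steps of node-classification training
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), data.y)
    loss.backward()
    optimizer.step()
print(model(data).argmax(dim=-1))           # predicted class per node
```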
Cyberattacks are continuously growing in volume and sophistication. They
target organizations' systems, networks, and private data, causing financial loss,
customer loss, and data leakage. As technology improves nowadays, Artificial
Intelligence (AI) based solutions help boost Cybersecurity. Attend this talk to
discover how AI-powered algorithms are used to stay ahead of Cyberattacks such as
Phishing, Lookalike domains, or Name Spoofing.
Early indication and detection of diseases can provide patients with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning provide a great opportunity to address this unmet need. In this lecture, we introduce a modified BERT: a deep neural sequence transduction model designed for electronic health records (EHR). We will consider the application of this methodology to the task of classifying patients into cohorts reflecting different disease patterns.
Given a small example of a texture as input, the goal of texture synthesis is to generate an output image that is an expanded and smartly tiled version of the given input while maintaining its perceptual information. Texture synthesis methods fall into three main categories: non-parametric, parametric, and procedural methods. Two of these categories, namely non-parametric and parametric, are discussed during the talk. On the one hand, non-parametric approaches resample pixels or patches from the given source texture; Texture Optimization for Example-based Synthesis and Image Quilting for Texture Synthesis and Transfer are discussed. On the other hand, parametric methods require an explicit definition of a parametric texture model; two parametric methods, namely Texture Synthesis Using CNNs and Non-Stationary Texture Synthesis by Adversarial Expansion, are presented. Results of the above-mentioned methods are demonstrated as a conclusion.
The combined use of recommendation systems, multicriteria optimization, and regression models allowed us to solve the problem of efficient user-survey matching at scale in the process of gathering opinion data.
Learn more about handling Big Data with PySpark and why PySpark might become your go-to framework for Exploratory Data Analysis (EDA) and Feature Engineering (FE)! This talk will show the main differences between the Pandas and PySpark frameworks and outline the advantages of performing EDA and FE with PySpark. It will be most beneficial for Data Scientists, Data Analysts, and other attendees who perform data exploration and wrangling. They will learn about handling Big Data with PySpark, its functionalities for EDA and FE, and means of visualising the results. This talk is a good fit for a Beginner to Intermediate level audience with prior experience in Python and SQL.
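A small taste of EDA and FE in PySpark, with placeholder file and column names:

```python
# Minimal sketch of typical EDA and feature-engineering steps in PySpark;
# "transactions.csv" and its columns are placeholders for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("eda-demo").getOrCreate()
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# EDA: summary statistics and a group-by that may not fit in memory with pandas.
df.describe("amount").show()
df.groupBy("country").agg(F.count("*").alias("n"),
                          F.avg("amount").alias("avg_amount")).show()

# FE: a derived flag and a log transform, executed lazily and in parallel.
df = (df.withColumn("is_large", (F.col("amount") > 1000).cast("int"))
        .withColumn("log_amount", F.log1p("amount")))
df.select("amount", "is_large", "log_amount").show(5)
```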
The aim of the presentation is to address issues concerning bias in data, misleading statistics, issues in testing and other matters that are prevalent in the field of data science.
A new approach to data integration that DataOps can enable, saving engineering time and allowing engineers and analysts to pursue higher-value activities.
The modern data stack (MDS) is a suite of tools used for data integration. These tools include, in order of how the data flows:
- a combination of fully managed ELT Pipeline + custom managed ETL pipeline
- a cloud-based columnar warehouse or data lake as a destination
- a data transformation tool
- a business intelligence or data visualisation platform.
Self-supervised pretraining has been wildly successful lately, covering almost every domain: speech, NLP, vision. Networks such as Wav2Vec2, HuBERT, JUST, and the like have enabled rapid development of speech-related products. In this talk we're going to go through the end-to-end research and engineering process of production-grade, self-supervised ASR in the multilingual setting. Covered topics include: compute, data, scalability, and engineering for pretraining and downstream tuning.
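As a minimal illustration of putting a self-supervised speech model to work, here is a sketch using a public wav2vec 2.0 checkpoint fine-tuned for English ASR via Hugging Face; this is not the multilingual production system described in the talk:

```python
# Minimal sketch: greedy CTC transcription with a pretrained wav2vec 2.0 checkpoint.
# "sample.wav" is a placeholder and is expected to be 16 kHz mono audio.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("sample.wav")
inputs = processor(waveform.squeeze(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = logits.argmax(dim=-1)
print(processor.batch_decode(ids)[0])        # greedy CTC decoding of the best path
```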