PyData Yerevan 2022

Moving Inference to Triton Servers
08-12, 16:15–16:55 (Asia/Yerevan), 113W PAB

The talk will introduce the audience to Triton Inference Server, the requirements for migrating from regular AWS instances, and the advantages and benchmarks observed in our production system. It is mostly targeted at Machine Learning Engineers and MLOps Engineers, although no previous knowledge is required to attend and understand the topic.


For large amounts of inference on cloud instances, cost and time effectiveness are key. Running predictions on GPU instances becomes costly, even though some resources, such as large RAM or GPUs, are used only during inference, which is only a fraction of the prediction pipeline. Therefore, we decided to allocate resources more intelligently and divide the prediction process into three parts: (in our case, image) pre-processing, primary inference, and post-processing. In the first 10 minutes of my proposed presentation, I will discuss how to separate data pre- and post-processing from the main inference using Triton Inference Server. Next, I will show the infrastructure changes we made to integrate Triton into our system, which will take about 10 minutes. Finally (about 5 minutes), I will present the cost and time efficiency statistics of this change in our production.
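The separation described above can be sketched in Python. This is a minimal illustration, not the speaker's actual pipeline: the input shape, normalization, and the model and tensor names in the commented-out Triton call are all assumptions. The point is that pre- and post-processing are pure CPU work that can run on cheap instances, while only the middle step needs to reach a GPU-backed Triton server.

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """CPU-only step: normalize to [0, 1] and convert HWC -> NCHW.
    Can run on an inexpensive CPU instance; no GPU required."""
    x = image.astype(np.float32) / 255.0
    x = np.transpose(x, (2, 0, 1))          # HWC -> CHW
    return x[np.newaxis, ...]               # add batch dimension

def postprocess(logits: np.ndarray) -> int:
    """CPU-only step: softmax + argmax over class logits."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return int(probs.argmax(axis=-1)[0])

# The inference step is the only one that needs a GPU. With Triton it
# becomes a remote call (sketch only; the URL, model name, and tensor
# names below are hypothetical):
#
#   import tritonclient.http as httpclient
#   client = httpclient.InferenceServerClient(url="triton:8000")
#   inp = httpclient.InferInput("input", batch.shape, "FP32")
#   inp.set_data_from_numpy(batch)
#   result = client.infer("image_classifier", inputs=[inp])
#   logits = result.as_numpy("logits")

if __name__ == "__main__":
    image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
    batch = preprocess(image)
    assert batch.shape == (1, 3, 224, 224)
    fake_logits = np.array([[0.1, 2.5, -0.3]], dtype=np.float32)
    assert postprocess(fake_logits) == 1
```

With this split, only the inference service is scheduled on GPU instances; the surrounding steps scale independently on CPU nodes, which is the source of the cost savings the talk reports on.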


Prior Knowledge Expected

No previous knowledge expected

Marine Palyan works as a Data Scientist at IntelInAir Armenia. Her job mostly involves image processing, data analysis, and Python development. She is also interested in Data Engineering and Cloud Development. As for her education, she is currently a second-year Master's student at Yerevan State University, majoring in Applied Statistics and Data Science, and holds a Bachelor's degree in Computer Science from YSU. In her free time, she likes playing the guitar and enjoys reading science fiction and thriller books.