PyData Yerevan 2022

Building a Streaming (E-health) Data Pipeline: When and How?
08-12, 14:30–15:10 (Asia/Yerevan), 113W PAB

In this talk, we discuss streaming and the real-time data stack as a way to analyze the massive, unbounded data sets that are increasingly common across modern businesses, and to meet their need for more timely and accurate answers.

A streaming data pipeline moves data continuously from source to destination as it is generated and processes it along the way. It is used when an analytics workload, application or business process requires a continuously updated data flow for timely analysis. This analysis can be descriptive, like a data dashboard; diagnostic, like log monitoring; predictive, like an online fraud detection system; or prescriptive, like data processing in self-driving cars.

This talk is aimed at developers and architects and will be accessible to technical managers and business decision makers; no prior experience with streaming data systems is required. The only technical requirement is that the audience feel comfortable with the major concepts in data pipelines.

During the talk, examples of e-health data pipelines and our experience in setting up a data streaming stack help to make the subject explicit.

We first give precise definitions of real-time data, the big data pipeline and message queuing, one of the core concepts in streaming data architecture. We then briefly review the path taken so far, from nightly jobs that processed the previous day’s transactions to the emergence of distributed data processing, particularly Spark, whose consistent API made it possible for many data engineers to have, and demand, higher-level tools for building data pipelines.

Next, we zoom in on the tiers of a data streaming architecture and the purpose of each tier, keeping in mind that these tiers are not as hard and rigid as those we may have seen in other architectures:

  1. Collection tier
  2. Message queuing tier
  3. Analysis tier
  4. Long-term storage tier as history for further analysis
  5. In-memory data store tier
  6. Data access tier or destination
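
As a rough illustration of how the six tiers above relate, here is a minimal single-process sketch using only the Python standard library. Everything in it is an assumption made for illustration and not the stack described in the talk: the event shape, the function names, and the use of `queue.Queue`, a list and a dict as stand-ins for a message broker, a long-term store and an in-memory store.

```python
import queue
from collections import defaultdict

# Hypothetical events: (doctor_id, drug_code). Invented for this sketch.
EVENTS = [("dr_a", "amoxicillin"), ("dr_b", "ibuprofen"), ("dr_a", "ibuprofen")]

message_queue = queue.Queue()        # 2. message queuing tier (stand-in for a broker)
long_term_store = []                 # 4. long-term storage tier (stand-in for a data lake)
in_memory_store = defaultdict(int)   # 5. in-memory data store tier


def collect(events):
    """1. Collection tier: accept events from sources and enqueue them."""
    for event in events:
        message_queue.put(event)


def analyze():
    """3. Analysis tier: consume events, keep history, update running counts."""
    while not message_queue.empty():
        doctor_id, drug_code = message_queue.get()
        long_term_store.append((doctor_id, drug_code))
        in_memory_store[doctor_id] += 1


def prescriptions_for(doctor_id):
    """6. Data access tier: serve the current count to a dashboard or API."""
    return in_memory_store.get(doctor_id, 0)


collect(EVENTS)
analyze()
print(prescriptions_for("dr_a"))  # 2
```

In a real deployment each tier would typically be a separate service, and the analysis tier would run continuously instead of draining the queue once.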

In explaining each tier, several examples and ideas from our real experience in e-health make the theory concrete. For example, doctors or hospital administrators need to monitor a summary of the daily activity in the e-prescribing app or website; in addition, every prescription is checked against all health insurance guidelines so that possible fraud is prevented. Given the volume of data, the need for well-timed actions and the level of complexity, this business offers a collection of good examples.
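
The e-prescribing example could be sketched as a single pass over a stream of prescriptions that both updates a daily summary and checks each record against a guideline. This is only a sketch under stated assumptions: the `Prescription` fields, the `MAX_QUANTITY` rule and the drug codes are hypothetical and do not represent real insurance guidelines or the system discussed in the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    doctor_id: str
    drug_code: str
    quantity: int


# Hypothetical guideline: maximum units allowed per prescription, by drug.
MAX_QUANTITY = {"drug_x": 30, "drug_y": 10}


def violates_guideline(p):
    """Return True if the prescription exceeds the limit for its drug."""
    limit = MAX_QUANTITY.get(p.drug_code)
    return limit is not None and p.quantity > limit


def process(stream):
    """Consume prescriptions one by one, as they would arrive in a stream."""
    daily_counts = Counter()   # descriptive: per-doctor activity summary
    flagged = []               # records held back for review
    for p in stream:
        daily_counts[p.doctor_id] += 1
        if violates_guideline(p):
            flagged.append(p)
    return daily_counts, flagged


stream = [
    Prescription("dr_a", "drug_x", 20),
    Prescription("dr_a", "drug_y", 15),  # exceeds the hypothetical limit of 10
    Prescription("dr_b", "drug_x", 30),  # exactly at the limit, so allowed
]
counts, flagged = process(stream)
print(counts["dr_a"], len(flagged))  # 2 1
```

Doing both jobs in the same pass reflects the point of the example: the summary and the check are driven by the same events as they arrive, instead of by a nightly batch.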

Prior Knowledge Expected

No previous knowledge expected

I studied geometry at university, but programming and numerical solutions have never ceased to fascinate me. For now, I have come to the conclusion that I like using mathematics to obtain more precise answers in data science.

Several years ago, at age 38, I left my job as a lecturer/researcher in a Computer Science department to face more challenges in the data industry.
Although the data industry is inherently interdisciplinary, broad and constantly evolving, and the ecosystem around it is very noisy, I was lucky to begin with classic BI and, step by step, find my way towards big data pipelines.

Currently I am head of the data team at Avihang, a software company that develops solutions and enterprise products, mostly in e-health.