PyData Yerevan 2022

Near-Duplicate Ad Detection in Online Classified Ad Services
08-12, 14:30–15:10 (Asia/Yerevan), 114W PAB

Near-duplicate ads in online listings usually degrade the experience of both demand-side and supply-side users.

We used machine learning and deep learning to detect near-duplicate ads.

Ideally, people attending the talk are familiar with the basics of machine learning and deep learning.


Near-duplicate ads are harmful to online classified ad services in many ways.
Demand-side users face low-quality listings, which increases the time and effort required to find the ads they want. Ordinary supply-side users get fewer views and make less profit from their ads. Finally, duplicate ads may directly decrease the business’s revenue, because posters can skip payments such as ad boosting.

In this talk, we will first discuss the problem definition, metrics, and training data generation.
Then we will talk about modeling, feature engineering, and how the metrics were improved step by step.

For texts, we tried different approaches such as MinHash, CountVectorizer, Bi-LSTM, and transformers.
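
As a rough illustration of the MinHash idea mentioned above, here is a minimal sketch in plain Python. It is not the speaker's implementation: the function names, the word-shingle size `k`, and the number of hash functions `num_perm` are assumptions chosen for the example, and seeded BLAKE2 hashes stand in for random permutations.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) of a lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)
```

Two ad titles that differ by a word or two share most of their shingles, so their signatures agree in most slots. Comparing short fixed-length signatures is much cheaper than comparing full texts, which is why this family of methods is commonly used as a first-pass filter.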

For images, we tried different approaches such as perceptual hashing and CNNs.
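
To give a feel for hash-based image comparison, the sketch below implements average hashing (aHash), a simpler relative of the DCT-based perceptual hash, in plain Python. It is an illustrative assumption rather than the method used in the talk, and it assumes the image has already been converted to grayscale and downscaled to a small grid (for example 8x8) given as nested lists.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set if above the mean.

    `pixels` is a 2D list of grayscale values that has already been
    downscaled (e.g. to 8x8); resizing is outside the scope of this sketch.
    """
    flat = [p for row in pixels for p in row]
    if not flat:
        raise ValueError("pixel grid must not be empty")
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return bin(hash_a ^ hash_b).count("1")
```

Because each bit only records whether a pixel is brighter than the image's own mean, a uniform brightness shift leaves the hash unchanged, and a small Hamming distance between two hashes suggests the images are near-duplicates. The distance threshold is a tuning choice that depends on the data.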

We will also discuss how to apply these approaches to find global duplicate ads and counter spammers.


Prior Knowledge Expected

Previous knowledge expected

Diyar Mohammadi is a Data Scientist at Divar.ir (an online classified ad service in Iran). He has been using various machine learning approaches to develop valuable products and solve complex problems in Divar's business.
He is interested in learning and experimenting with new approaches that expand his machine learning stack and give him a broader view of problems.
Recently, he has been working on multi-modal deep learning models for complex tasks (such as near-duplicate ad detection and real-time ad recommender systems).
He is also currently an undergraduate student at the University of Tehran, studying Computer Engineering.