Thwarting Digital Ad Fraud at Scale: An Open Source Experiment with Anomaly Detection

Ad fraud continues to be a thorn in the side of digital advertisers, as bot traffic and fraudulent human activity falsely inflate ad statistics. Such activity forces unwitting brands to shell out for clicks or impressions that have no hope of reaching – let alone converting – potential customers. A recent study by White Ops and the Association of National Advertisers (ANA) finds that advertisers lost around $5.8 billion to ad fraud last year. On the brighter side, the study also found that efforts to defeat ad fraud – using techniques such as anomaly detection – are now more successful than ever.

In fact, ad fraud would have amounted to advertiser losses of more than $14 billion in 2018 if not for improved anti-fraud initiatives. While the study estimates that digital ad fraud attempts currently represent somewhere in the neighborhood of 20-35% of all ad impressions, the amount that gets through (and which advertisers actually pay for) is significantly smaller due to improved mitigation.

But there remains a long way to go, and one of the most widespread sources of digital ad fraud remains non-human traffic. Increasingly, bad actors enlist botnets to execute fraud by directing an army of connected devices to visit sites and peck away at ads. In other cases, real humans working in click farms are paid to produce clicks and impressions with no intention of considering the ad content. Still other ad fraud scenarios involve non-viewable ads or ads that honest users cannot interact with correctly, such as ads that cannot be closed. Importantly, each of these ad fraud varieties produces anomalous data that anomaly detection systems can recognize in order to impede fraud and protect advertisers’ bottom lines.

All that said, the vast scale of advertising data generated through websites and ad networks adds a monumental data challenge to the task of implementing an ad fraud anomaly detection system, one that requires appropriate computational, scalability, and performance capabilities. Instaclustr, the company I work for, doesn’t sell ad fraud solutions, but we recently completed a purely experimental anomaly detection application that shows how scalable open source technologies might be able to spare advertisers from the costs and harm of ad fraud. To achieve the requisite capabilities while keeping the experimental solution cost-effective for practical usage, our test system used an architecture composed of open source Apache Kafka, Apache Cassandra, and our anomaly detection application. Beyond the performance, scalability, and affordability that Kafka and Cassandra provide, these two open source data technologies are also highly compatible and pair well together.

Our experiment assembles Kafka, Cassandra, and our anomaly detection application in a Lambda architecture, in which Kafka and our streaming data pipeline are the speed layer, and Cassandra acts as the batch and serving layer. In this configuration, Kafka makes it possible to ingest streaming digital ad data in a fast and scalable manner, while taking a “store and forward” approach so that Kafka can serve as a buffer to protect the Cassandra database from being overwhelmed by major data surges. Cassandra’s strength is in storing high-velocity streams of ad metric data in its linearly scalable, write-optimized database. In order to handle automation for provisioning, deploying, and scaling the application, the anomaly detection experiment relies on Kubernetes on AWS EKS.
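To make the data flow above more concrete, here is a minimal sketch in Python. It is not the experiment's actual code: the in-memory deque stands in for a Kafka topic, the dictionary of per-key histories stands in for a Cassandra table, and the z-score rule is an assumed detection method chosen only for illustration. The names InMemoryPipeline, is_anomaly, and "ad-123" are hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


def is_anomaly(value, past, z_threshold=3.0, min_history=10):
    """Assumed rule: flag a value far from the mean of its key's recent history."""
    if len(past) < min_history:
        return False  # not enough history to judge
    mu = mean(past)
    sigma = pstdev(past)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


class InMemoryPipeline:
    """Hypothetical stand-in for the Kafka -> detector -> Cassandra flow."""

    def __init__(self, history_size=50):
        self.buffer = deque()  # stand-in for a Kafka topic (store and forward)
        # Stand-in for a Cassandra table: recent values per key.
        self.history = defaultdict(lambda: deque(maxlen=history_size))

    def produce(self, key, value):
        # Ingest is only an append, so a surge queues up instead of hitting the store.
        self.buffer.append((key, value))

    def consume_one(self):
        # Read stored history for the key, check the new event, then persist it.
        key, value = self.buffer.popleft()
        anomalous = is_anomaly(value, list(self.history[key]))
        self.history[key].append(value)
        return key, value, anomalous


pipeline = InMemoryPipeline()
for clicks in [10, 12, 11, 9, 10, 11, 12, 10, 9, 11, 500]:
    pipeline.produce("ad-123", clicks)

results = []
while pipeline.buffer:
    results.append(pipeline.consume_one())

flagged = [value for _, value, anomalous in results if anomalous]
print(flagged)
```

The point of the sketch is the separation of concerns: producers only append to the buffer, while the consumer drains it at its own pace, which is the role the article assigns to Kafka in front of Cassandra.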

In the end, the experiment was a successful one. The anomaly detection application demonstrated the ability to process 19 billion real-time data events in a day, likely meeting the ad fraud detection needs of even the largest brands. To reach these results, we scaled the application from an initial three Cassandra nodes all the way out to 48. In total, the experiment made use of 574 CPU cores across the Cassandra, Kafka, and Kubernetes clusters. The experimental application proved capable of absorbing a peak of 2.3 million writes per second into Kafka while sustaining 220,000 anomaly checks every second.
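As a quick sanity check on these figures, the daily total follows from the sustained rate. The short Python snippet below only redoes that arithmetic from the numbers quoted above; the per-core figure is a rough average that assumes all 574 cores count equally, which the experiment does not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

checks_per_second = 220_000          # sustained anomaly checks (from the article)
peak_writes_per_second = 2_300_000   # peak writes into Kafka (from the article)
total_cores = 574                    # all clusters combined (from the article)

# Sustained rate extended over a full day.
checks_per_day = checks_per_second * SECONDS_PER_DAY
print(f"{checks_per_day:,} checks per day")  # 19,008,000,000, i.e. about 19 billion

# How much larger the peak ingest rate is than the sustained check rate.
print(f"peak/sustained ratio: {peak_writes_per_second / checks_per_second:.1f}")

# Rough average only; cores are spread across different clusters and roles.
print(f"checks per second per core: {checks_per_second / total_cores:.0f}")
```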

By teaming open source data-layer technologies like Kafka and Cassandra and making the most of the intrinsic benefits each has to offer, this experiment demonstrates a method that advertisers and ad networks could adapt to their own needs – a path to affordable, scalable, high-performance anomaly detection applications that help protect the integrity of the ad metrics they pay good money to achieve.
