It should come as no surprise that Artificial Intelligence (AI) is being applied to almost every imaginable use of data. There is simply so much data at our fingertips that we need machines to do the heavy lifting of analysis before we can apply what we learn and make informed, data-driven decisions.
Nowhere is this truer than in email inbox filtering to combat abuse. Fifteen years ago, spam and phish outnumbered legitimate emails in our inboxes. Mailbox providers protected recipients from malicious emails with a combination of IP blacklists, URL blocklists, keyword lists and whatever other data they had at their disposal. Telltale signs of spammy email might include three $’s in a row or too many pharmaceutical words, and links in emails were filtered through rules that looked for folder names composed entirely of consonants. All of these bits of learning were wrapped into rules and packaged with SpamAssassin as optional modules, helping receivers that employed the technology to filter unwanted and potentially dangerous emails before they reached recipients. It was moderately effective.
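To make the flavor of those static rules concrete, here is a minimal sketch in Python. It is not SpamAssassin’s rule syntax or anyone’s production filter; the keyword list, the threshold, and the function names are all assumptions made up for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list and threshold, chosen only for this illustration.
PHARMA_WORDS = {"viagra", "cialis", "pills", "pharmacy", "prescription"}
MAX_PHARMA_WORDS = 2

# A folder name of three or more letters containing no vowels.
CONSONANT_ONLY = re.compile(r"[b-df-hj-np-tv-z]{3,}")


def has_dollar_run(text):
    """True if the text contains three dollar signs in a row."""
    return "$$$" in text


def count_pharma_words(text):
    """Count how many words in the text appear in the keyword list."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in PHARMA_WORDS)


def has_consonant_only_folder(url):
    """True if any path segment of the URL is made up entirely of consonants."""
    segments = [seg for seg in urlparse(url).path.split("/") if seg]
    return any(CONSONANT_ONLY.fullmatch(seg.lower()) for seg in segments)


def rule_hits(subject, body, urls):
    """Return the names of the static rules that a message trips."""
    text = subject + " " + body
    hits = []
    if has_dollar_run(text):
        hits.append("dollar_run")
    if count_pharma_words(text) > MAX_PHARMA_WORDS:
        hits.append("pharma_words")
    if any(has_consonant_only_folder(url) for url in urls):
        hits.append("consonant_folder")
    return hits
```

Rules like these are easy to write and just as easy to evade: a sender only has to drop one dollar sign or rename a folder, which is a large part of why this approach was only moderately effective.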
The real shift in the battle against phish came when AI and Machine Learning were applied to inbox filtering, using user actions to build engagement frameworks that better determine, on a per-user basis, where email should land. Gmail has done a great job of filtering the vast majority of unwanted mail from the inbox. To achieve this high efficacy rate, they employed TensorFlow to train models that better catch and filter messages. When you stop and consider what phish is (emails designed to fool a recipient into giving up some bit of personal information by spoofing a known brand), you begin to understand how difficult it is to tell the good from the bad.
The duty of protecting recipients is incumbent on both the receiving domains that provide a mailbox and the senders who actively communicate with them. We at Twilio SendGrid understand this responsibility and applied TensorFlow to our Machine Learning algorithm, Phisherman. Phisherman builds on the early anti-abuse systems that struggled with the complex problem of understanding what a piece of phish looks like from a machine standpoint. Instead of relying on static keyword lists, or only identifying domains associated with phishing attacks, Phisherman can process numerous factors to determine the likelihood that a given email is phish rather than a legitimate message.
By using genericized word-to-vector comparisons to identify patterns in large data sets, we’re able to tease apart legitimate mail from that of bad actors attempting to weaponize our platform. Single words or clusters of words are not enough to tell good email from bad. A highly specialized email from a legitimate sender, such as a pharmaceutical company sending a newsletter to its subscribers, could be flagged as bad: a false positive. Understanding the relationships between words, such as ‘queen’ to ‘woman’ or ‘queen’ to ‘king’ (both pairs are related, but in different ways), lets us build sophisticated models for isolating and stopping phish.
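The ‘queen’ example is easier to see with numbers. The sketch below uses tiny hand-made vectors, not embeddings learned by Phisherman or any real model; the three dimensions (roughly royalty, female, male) and every value in them are invented purely to illustrate how cosine similarity compares word vectors.

```python
import math

# Toy three-dimensional vectors invented for illustration.
# The dimensions loosely stand for (royalty, female, male).
VECTORS = {
    "queen": [0.9, 0.9, 0.1],
    "king": [0.9, 0.1, 0.9],
    "woman": [0.1, 0.9, 0.1],
    "man": [0.1, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: closer to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector):
    """Return the vocabulary word whose vector is most similar to `vector`."""
    return max(VECTORS, key=lambda word: cosine(VECTORS[word], vector))


def analogy(a, b, c):
    """Compute the vector a - b + c, for example king - man + woman."""
    return [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]


if __name__ == "__main__":
    for other in ("woman", "king", "man"):
        score = cosine(VECTORS["queen"], VECTORS[other])
        print("queen vs", other, round(score, 3))
    print("king - man + woman is closest to:",
          nearest_word(analogy("king", "man", "woman")))
```

In this toy space ‘queen’ sits close to both ‘woman’ and ‘king’, but for different reasons: it shares the female dimension with one and the royalty dimension with the other. A real model learns hundreds of such dimensions from data rather than having them assigned by hand, which is what lets it separate a pharmaceutical newsletter from a pharmaceutical scam even when the two share vocabulary.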
Phisherman analyzes over 50 billion emails a month in near real-time, and our compliance agents review Phisherman’s results as part of an iterative process of training and improving our Machine Learning model. Bad actors are likewise changing the signature and tone of the emails they attempt to send; keeping pace is therefore only achievable through the careful combination of people and technology capable of analyzing vast data sets.
Twilio SendGrid has processed email on behalf of over 80,000 paying customers, which creates a rich data set from which to train Phisherman to be ever vigilant. Training Machine Learning models requires vast quantities of data, and email is rife with deep metrics. From engagement-based metrics such as clicks and opens to lower-level metrics derived from the SMTP handshake, transit times, hops and authentication results, email is a hotbed of possible insights that can only be unlocked through the careful use of AI and Machine Learning.