I’ve been quiet on this blog for a couple of weeks because I’ve been helping address some of the spam complaints in Outlook.com.
The biggest issue we’ve seen recently is spam from invalid senders. This is an email where the From: address is not RFC compliant, and does one of two things:
- It gets to the inbox, polluting the user experience
- It goes to Junk, but when a user goes to block it, they can’t. Instead, they get an error.
There are all sorts of example malformed From: addresses like:
From: =?UTF-8?B?Q2Fz?==?UTF-8?B?aW5v?= <"<?<SK3OF@alt2.woanager.com>?>">
From: "CVS Reward Giveaway " <"<?<infofzhVqwg@mailin.shawbals.com>?>">
From: "CVS Reward Giveaway " <"<?<infoNKKWCGk@msgin.ejnerpa.com>?>">
From: Family Survival <"<info.rnK3ZXkAUDeI@trot-pulldown.shapesfarm.com>"@-->
The list goes on and on.
We had some issues with some of our feedback loops, but they have been fixed, and most of these types of messages now go to Junk. However, some spam still leaks through because spammers quickly adapt their sending IPs and message content.
Upon investigation, it turns out that mail systems send email in many, many, many different ways. Some include Display Names, some don’t. Some include angle brackets around the email address, some don’t. Some even reverse the email address and Display Name:
From: sender@example.com (John Doe)
I didn’t even know that was a thing.
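To make the variety concrete, here is a small sketch using Python's standard library parser, `email.utils.parseaddr`. This is only an illustration and not how Outlook.com parses mail; it shows that the three shapes described above all carry the same address.

```python
from email.utils import parseaddr

# Three shapes of the same sender, as described above.
headers = [
    "John Doe <sender@example.com>",   # Display Name plus angle brackets
    "sender@example.com",              # bare address, no Display Name
    "sender@example.com (John Doe)",   # address first, name in a trailing comment
]

for value in headers:
    # parseaddr returns a (display_name, address) tuple.
    name, address = parseaddr(value)
    print(f"{value!r} -> name={name!r}, address={address!r}")
```

A lenient parser like this is handy for reading well-formed mail, but it says nothing about whether the address itself is acceptable, which is what the guidelines further down are for.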
So, for the past several days I have been furiously hard at work coming up with some validation of an email address. This is first getting rolled out as a series of rules, and then will be done in code. I’ve borrowed liberally from Wikipedia’s definition of an email address, but not matched it completely. Here are the guidelines (these are not difficult to adhere to, 99+% of valid email does):
A From: address must contain an email address, and the email address must contain a valid localpart@domain.
For the localpart:
- A localpart MUST start with letters A-Z, or digits 0-9
- A localpart MUST end with letters A-Z, digits 0-9, the underscore _, a dash -, a plus sign +, or an equal sign =
- The middle characters may be letters A-Z, digits 0-9, or almost any special character (ampersand, tilde, dashes, semicolons, etc.) except the double quote
- The dot character . may not be repeated sequentially in the localpart
- If including double quotes, they may only be the first and last characters in the localpart, e.g., “sender”@
- The one exception is a phone number, in which case it should have a leading plus sign, e.g., <+15552032212@example.com>
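The localpart rules above can be sketched in a few lines of Python. This is a rough illustration and not the actual Outlook.com rules or code: the function name `is_valid_localpart` is hypothetical, "letters A-Z" is assumed to be case-insensitive, and the exact set of "almost any special character" is my guess.

```python
import re

# Assumed set of characters allowed in the middle of a localpart:
# letters, digits, dot, and common specials, but never the double quote.
_MIDDLE = r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~;-]"

# Starts with a letter or digit; ends with a letter, digit, _, -, + or =.
_PLAIN = re.compile(rf"[A-Za-z0-9](?:{_MIDDLE}*[A-Za-z0-9_+=-])?")

# Phone number exception: a leading plus sign followed by digits.
_PHONE = re.compile(r"\+[0-9]+")


def is_valid_localpart(localpart: str) -> bool:
    """Return True if localpart follows the guidelines listed above."""
    # Double quotes are allowed only as the first and last characters.
    if len(localpart) >= 2 and localpart[0] == '"' and localpart[-1] == '"':
        localpart = localpart[1:-1]
    # The dot may not be repeated sequentially.
    if ".." in localpart:
        return False
    if _PHONE.fullmatch(localpart):
        return True
    return _PLAIN.fullmatch(localpart) is not None


print(is_valid_localpart("sender"))         # ordinary localpart
print(is_valid_localpart("+15552032212"))   # phone number exception
print(is_valid_localpart("<?<SK3OF"))       # taken from the spam samples
```

With these assumptions, the first two calls are accepted and the spam sample is rejected because it starts with an angle bracket.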
For the domain:
- A domain MUST start with letters A-Z, or digits 0-9
- A domain MUST end with letters A-Z, or digits 0-9
- For the middle characters, it may contain letters A-Z, digits 0-9, dot ., or dash -, provided there are not two dots in a row
- Or, it can be an IP address literal, e.g., sender@[123.123.123.123]
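The domain rules can be sketched the same way. Again, this is only an illustration under stated assumptions: `is_valid_domain` is a hypothetical name, letters are treated as case-insensitive, and only IPv4 literals are handled because that is the only literal form shown above.

```python
import ipaddress
import re

# Starts and ends with a letter or digit; letters, digits, dot, dash between.
_DOMAIN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?")


def is_valid_domain(domain: str) -> bool:
    """Return True if domain follows the guidelines listed above."""
    # IP address literal, e.g. [123.123.123.123] (IPv4 only in this sketch).
    if domain.startswith("[") and domain.endswith("]"):
        try:
            ipaddress.IPv4Address(domain[1:-1])
        except ValueError:
            return False
        return True
    # Two dots in a row are not allowed.
    if ".." in domain:
        return False
    return _DOMAIN.fullmatch(domain) is not None


print(is_valid_domain("example.com"))
print(is_valid_domain("[123.123.123.123]"))
print(is_valid_domain('shawbals.com>?>"'))   # tail of one of the spam samples
```

The first two are accepted; the spam sample fails because of the trailing junk after the domain.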
If you’re including a Display Name, it is strongly recommended to do the following:
- Put angle brackets around the email address, From: John Doe <sender@example.com> and not From: “John Doe” sender@example.com
- Put a space between the email address and the Display Name, From: “John Doe” <sender@example.com>, and not From: “John Doe”<sender@example.com>. I cannot emphasize this enough!
- Don’t put a space between the angle brackets and the email address, so not this: From: “John Doe” < sender@example.com>
- Don’t get fancy with double quotes. If using them, put them around the Display Name. Putting them into the email address may cause parsing issues
- Display Names encoded in base 64 are fine
If you don’t, we may not be able to figure out the email address. That may cause problems.
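As a rough sketch of what "being able to figure out the email address" means, the snippet below accepts only the recommended shapes and gives up on everything else. The function name `split_from` and both patterns are my own assumptions for illustration; a real parser has to be considerably more forgiving, and a base64-encoded Display Name is passed through here without being decoded.

```python
import re

# Recommended shape: Display Name, at least one space, then <address>
# with no whitespace inside the angle brackets.
_WITH_NAME = re.compile(r"(?P<name>[^<>]*\S)\s+<(?P<addr>[^<>\s@]+@[^<>\s@]+)>")

# A bare address with no Display Name and no angle brackets.
_BARE = re.compile(r"[^<>\s@]+@[^<>\s@]+")


def split_from(value: str):
    """Return (display_name, address), or None if the shape is not recognized."""
    value = value.strip()
    match = _WITH_NAME.fullmatch(value)
    if match:
        # Drop double quotes wrapped around the Display Name.
        return match.group("name").strip('"'), match.group("addr")
    if _BARE.fullmatch(value):
        return "", value
    return None


print(split_from('"John Doe" <sender@example.com>'))    # recommended
print(split_from('"John Doe"<sender@example.com>'))     # missing space
print(split_from('"John Doe" < sender@example.com>'))   # space inside brackets
print(split_from('"John Doe" sender@example.com'))      # no angle brackets
```

Only the first call yields a name and an address; the other three return None, which is the "may cause problems" case.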