Monday, 16 July 2012

Forensic linguistics for spam detection

We're all getting quite good at spotting email spam now. Our filters are pretty clever and get rid of all the obvious stuff anyway, so that leaves us with just the non-malicious stuff from companies we've bought things from, and the ones that the filter can't quite detect. Most of us don't fall even for these ones, but some people obviously still do, or they wouldn't still be doing the rounds. Almost all of these tell you something about your account, either the email account you're using or some other account that they purport to be from, and ask you to verify your details. By doing this, you're giving your details to some scammer who will then use them for nefarious purposes.

Now I think that this is the chance for linguistics to come into its own, and simultaneously force everyone to learn how to write properly. The plan is twofold:

The first stage is to filter out all emails with spelling, punctuation or grammar errors in them. Obviously we'd have to be careful about what counts as an error, so that people know whether it's OK to write informally or whether they have to write as if it's a formal letter. I think that, given the sort of mistakes that occur in these emails, informal style is fine, as long as it follows all spelling and punctuation conventions and doesn't include grammar mistakes (by which I don't mean splitting infinitives, but things that everyone would agree are wrong, like errors in subject-verb agreement). In this email, for instance, it says 'it' where it means 'its':

That one would already have been filtered out by the unconventional punctuation (capital letters on every word), however. This one, too, would be immediately filtered out due to its poor grasp of correct punctuation (lack of apostrophe on customers', comma splice):

Secondly, the email above has the US spelling 'center'. The filter could detect that it supposedly comes from a UK company and block it because of a mismatch between origin and spelling convention. Likewise, the next one comes from PayPal UK. I don't know if they use UK spellings for their UK branch, but they should, so that we could filter out this one:

That would pretty much remove all spam emails. It's fairly hardline though, so while I think it wouldn't necessarily be a bad thing to force people to write properly, we all know what computers are like, and it would be bound to block genuine emails because of some unusual sentence construction that it couldn't recognise (cf. Word's grammar checker, although that is the worst piece of crap ever unleashed on the world; I think letting linguists at it would massively improve it).
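To make the first stage a bit more concrete, here is a minimal Python sketch of two of the checks described above: capital letters on nearly every word, and US spellings in an email that claims to come from a UK company. Everything in it is an assumption made for illustration: the function names, the tiny US_TO_UK word list, the 0.8 threshold and the claimed_origin label are all hypothetical, and a real filter would need a full dictionary and a proper grammar checker on top.

```python
import re

# Hypothetical, deliberately tiny word list for illustration only.
# A real filter would need a full US/UK spelling dictionary.
US_TO_UK = {
    "center": "centre",
    "color": "colour",
    "favorite": "favourite",
    "organize": "organise",
    "recognize": "recognise",
}

# Matches plain words, keeping contractions such as "don't" in one piece.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def capitalised_ratio(text):
    """Return the share of words that start with a capital letter."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(1 for w in words if w[0].isupper()) / len(words)


def us_spellings(text):
    """Return the US spellings from US_TO_UK found in the text, sorted."""
    words = {w.lower() for w in WORD_RE.findall(text)}
    return sorted(words & set(US_TO_UK))


def first_stage_flags(text, claimed_origin, cap_threshold=0.8):
    """Return a list of reasons to be suspicious of the email (may be empty).

    claimed_origin is an assumed label such as "UK", supplied by the caller.
    """
    flags = []
    if capitalised_ratio(text) >= cap_threshold:
        flags.append("capital letter on nearly every word")
    if claimed_origin == "UK":
        found = us_spellings(text)
        if found:
            flags.append("US spelling in UK email: " + ", ".join(found))
    return flags
```

Given an invented subject line such as "Please Verify Your Account At Our Security Center" and a claimed origin of "UK", first_stage_flags reports both the capitalisation and the US spelling, whereas an ordinary sentence with conventional capitals and no listed US spellings produces no flags at all. Returning reasons rather than a plain yes or no is a design choice: it leaves room to flag an email instead of blocking it outright, which matters given how easily a crude checker misjudges genuine messages.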

If we wanted to be a bit cleverer, well, that's where the forensic linguistics comes in. What if the filter scanned every email and, if it came from someone who has emailed you before (in messages you could previously have confirmed as genuine, perhaps), compared the new one with that person's previous emails to determine whether it is likely to have been written by the same person? Take this common scam, in which you get an email supposedly from a friend, claiming to have been mugged and needing money:

If I got that email it would be very easy for me to know that it didn't come from the friend it's supposed to have come from. It contains all of the elements mentioned above:

  • grammar mistakes (the first sentence needs either 'that' instead of 'about' or an extra 'how'; there is a person mismatch in the coordination 'we got mugged and lost all my...');
  • punctuation errors (comma splice, missing apostrophes);
  • and US English ('Madrid, Spain', where UK usage would not usually include Spain, since it's US usage to identify the larger area where a city is, especially using just a comma rather than 'in'; 'cellphone'; '$$'; 'write me back' instead of 'write back').
Even if my friend did have such terrible writing skills (which they wouldn't), I'd know from the Americanisms that it wasn't them. But I wouldn't need to know, because my new linguist-designed filter would have scanned it, immediately found that it contained several mismatches with the corpus of known emails from this friend and flagged it as spam.
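The comparison against a friend's known emails could be sketched along these lines. This is only a toy, and it swaps in a very simple technique: it builds a profile of character trigrams (three-character sequences) from the known emails and measures the cosine similarity between that profile and the new email. Real forensic authorship analysis uses far richer features. The function names and the 0.5 threshold are hypothetical, and the threshold in particular would have to be tuned on real data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text.

    Runs of whitespace are collapsed to single spaces first.
    """
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two Counters treated as sparse vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_sender(new_email, known_emails, n=3):
    """Compare a new email with the pooled n-gram profile of known emails."""
    profile = Counter()
    for email in known_emails:
        profile.update(char_ngrams(email, n))
    return cosine_similarity(char_ngrams(new_email, n), profile)


def looks_like_sender(new_email, known_emails, threshold=0.5):
    """Return True if the new email is similar enough to the known ones.

    The threshold is an arbitrary placeholder, not a validated value.
    """
    return similarity_to_sender(new_email, known_emails) >= threshold
```

A score near 1 means the new email uses much the same character sequences as the known ones, and a score of 0 means it shares none. Short emails give noisy scores, so a filter built this way would want a decent-sized corpus for each sender before trusting the result.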

Would a side effect of this new system be that scammers get good at writing? Perhaps, but seeing as we're now using forensic linguistic techniques to determine whether two writers are the same person, their emails still wouldn't pass. The question is, why are these emails already so badly written? Surely they'd be more convincing if they were error-free?

No, actually. They'd be more convincing to me, maybe, but then I'd look at the sender's address and the address it links to, and quickly see it wasn't genuine. And for that matter, I'd be suspicious and not click links in such emails anyway. So I'm not really their target audience. The people likely to be duped are also likely not to notice such writing mistakes, so there's no benefit to writing well, as it will go unnoticed. But there might be an advantage in not writing well, in that it screens out people like me, who will not be taken in. This was suggested on Quora:
The obvious giveaways are used as a *pre-qualifier*, to ensure with the least possible effort that the ONLY people who respond to the scammers' initial mass mailings (and therefore have to be brought along individually during the later stages) are the absolutely most gullible, ignorant, susceptible, suckers they can find.

1 comment:

  1. I would never have thought of that final point about the 'pre-qualifier' effect of various characteristic features of the English in spam e-mails. That's fascinating. I always assumed that some of those features that distinguish spam have to do with the text being generated somehow without much human intervention, and/or originating from not so proficient writers of English (perhaps via machine translation). Maybe the fact that such text also has a 'pre-qualifier' effect is a bonus, rather than actually being the spammers' aim all along.