engineers just want to have fun



[Monday, 20 April 2015 23:23]

Spammers like to place their links on legitimate sites, in order to piggyback on their user base and page rank; even managing a site like this one, I have seen my share of junk traffic.

As years went by, spam comments started appearing once in a while; these comments contained pseudo-tags for URL publishing, such as [link][/link] and [url][/url], so I added a check to filter them out.
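
A check of this kind can be sketched in a few lines. This is only an illustration in Python, not the code running on this site: the name `has_pseudo_tags` is hypothetical, and matching the `[url=...]` variant is an assumption that goes beyond the two tag pairs mentioned above.

```python
import re

# Matches [link], [/link], [url], [/url], case-insensitively.
# The optional "=..." part (e.g. [url=http://...]) is an assumption:
# the post only mentions the plain tag pairs.
PSEUDO_TAG = re.compile(r"\[/?(?:link|url)(?:=[^\]]*)?\]", re.IGNORECASE)


def has_pseudo_tags(comment: str) -> bool:
    """Return True if the comment contains a URL pseudo-tag."""
    return PSEUDO_TAG.search(comment) is not None
```

A comment for which `has_pseudo_tags` returns `True` would simply be rejected before it is stored.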

When that stopped being enough, I integrated reCAPTCHA by Google; a while ago it was based on words from scanned books, while lately it has been showing pictures from Street View.
After a couple of happy years with reCAPTCHA, suddenly there was a spread of self-promotional comments by the XRumer system, and recently more standard spam messages... so it seems that Google's defences have been breached, at last.

The interesting part is that the messages have taken a step further: they are not generic "Thank you for this article", "I like your site", or automatically generated (and incoherent) groups of words; they are real messages, taken from who knows where and posted with some link. Moreover, it seems that messages are chosen based on the content of the page; for example, on the post about my RPG graphic engine this comment was left, so relevant that I would have believed it to be legit if I hadn't known otherwise:

Yes, I would try out a game like that! I was thinking of mikang one, but with my projects and projects in planning, I think I'll just have to put it aside for a bit. I'll just take notes in the mean time. I'd like to see what you come up with!I was thinking about mikang it primarily exploration based, and possibly more about learning about the story/background from NPCs rather than just going in a dungeon, fighting enemies and repeat. Would help if I was any good at writing. :<

This was taken from a real comment somewhere on the web, and automatically placed on my post... and so were the other comments that were posted along with it on other pages.

In order to make it more difficult to identify them as spam, a trick was used... take this one for example:

Ciao Azzurra, come vedi faccio prosregsi..,..:) Lo sfondo per RIME BIMBE e8 tenerissimo, non mi stanco di guardarlo. Aspetto con pazienza la stanza dei giochi. Non preoccuparti, vedo che sei molto indaffarata un GRAZIE gigante e un bacio Rita

If you try to google the first part of it you won't find it anywhere... but if you leave out the wrongly spelled word ("prosregsi"), you will find it in different variations. Another example from the same batch of spam messages:

Sir, i can't see my result yet. Its full one day and i am very cuouris abt my result. The website is not working till the time. Plz help us.

If you look for the whole phrase you won't find anything; if you search for bits of it you will find some variations. Why these changes? Because if some words are changed or shuffled, the final message will look like it contains a typo, but it will still appear legitimate, and it will be more difficult to find every connected spam message.
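
In the three samples above, the altered words ("mikang", "prosregsi", "cuouris") keep their first and last letters and only shuffle the letters in between, compared with "making", "progressi" and "curious". Assuming that this is the trick being used, a minimal Python sketch can compare messages in a way that ignores such shuffles. The function names are hypothetical, and this is not something the site actually runs.

```python
import re
from typing import List


def word_signature(word: str) -> str:
    """Key that ignores the order of a word's interior letters."""
    w = word.lower()
    if len(w) <= 3:
        return w  # nothing to shuffle between first and last letter
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


def message_signature(text: str) -> List[str]:
    """One key per word; digits and punctuation are dropped."""
    return [word_signature(w) for w in re.findall(r"[^\W\d_]+", text)]
```

Two comments with the same `message_signature` differ only by letter case, punctuation, or shuffled interior letters, so a scrambled copy can be matched against a previously seen original.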

So I guess that spammers are parsing online comments, or stealing emails from compromised accounts. Then they index them by topic and language, and store them. Finally they crawl sites, identifying input forms and automatically filling them with an appropriate message; this requires beating the reCAPTCHA challenge and hiding the original IP address (proxy? botnet? IP spoofing?).

In order to beef up the defenses against this level of attack, I updated the captcha challenge to the latest version offered by Google: No CAPTCHA reCAPTCHA. Let's see how long it will take spammers to beat it...

Lastly, an experiment: every spam comment posted on the site will be directed to this post, and will be shown in a secure way... So yeah, this site is now a small spam honeypot. For science!

Edit 2015-05-12

So far, only boring samples... I'll raise the bar, and filter out the URL pseudo-tags.

Edit 2015-11-21

Time for some cleaning! I removed spam comments with duplicated sites (159 out of 254 samples = 95 left), and a new filter will bounce spam if it uses a website already present in the dataset.
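
A filter like this can be sketched as follows. It is a hedged Python illustration, not the site's real code: `extract_hosts` and `is_duplicate_site` are hypothetical names, and treating "website" as the host name with any leading `www.` removed is an assumption.

```python
import re
from typing import Set
from urllib.parse import urlparse

# Rough URL matcher; stops at whitespace, brackets and quotes.
URL_RE = re.compile(r"https?://[^\s\[\]<>\"']+", re.IGNORECASE)


def extract_hosts(text: str) -> Set[str]:
    """Collect the host names of all URLs found in the text."""
    hosts = set()
    for url in URL_RE.findall(text):
        # Drop trailing sentence punctuation before parsing.
        host = urlparse(url.rstrip(".,;:!?)")).hostname
        if host:
            hosts.add(host[4:] if host.startswith("www.") else host)
    return hosts


def is_duplicate_site(comment: str, known_hosts: Set[str]) -> bool:
    """True if the comment links to a host already in the dataset."""
    return not extract_hosts(comment).isdisjoint(known_hosts)
```

Here `known_hosts` stands for the set of hosts collected from the samples already stored; a comment that matches is bounced instead of being added.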

Edit 2016-01-05

I added a new rule that will discard comments with:

  • A single capitalized word as name
  • A text which starts with:
    • From 2 to 12 words
    • From 8 to 50 characters
    followed by a URL
This pattern has been derived from the last hundred comments, which were all quite similar. There were 104 comments following this pattern; I deleted 100 of them out of 124 total samples (= 24 left).
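
One possible reading of the rule above, sketched in Python. The details are assumptions rather than the actual implementation: both limits are applied to the text that precedes the first URL, "capitalized word" is taken to mean one uppercase ASCII letter followed by lowercase letters, and the name `matches_spam_pattern` is hypothetical.

```python
import re

NAME_RE = re.compile(r"[A-Z][a-z]+")           # a single capitalized word
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def matches_spam_pattern(name: str, text: str) -> bool:
    """True if the name and text fit the pattern described above."""
    if not NAME_RE.fullmatch(name.strip()):
        return False
    url = URL_RE.search(text)
    if url is None:
        return False
    # Text before the first URL: 2 to 12 words and 8 to 50 characters.
    lead = text[:url.start()].strip()
    return 2 <= len(lead.split()) <= 12 and 8 <= len(lead) <= 50
```

For example, a comment signed "Kelly" with the text "Great post, check this out http://spam.example/x" would be discarded, while the same text signed "John Smith" would pass this particular rule.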



Mauro writes:
Good luck with your honey pot and may the SPAM be with you! "While the new reCAPTCHA API may sound simple, there is a high degree of sophistication behind that modest checkbox. CAPTCHAs have long relied on the inability of robots to solve distorted text. However, our research recently showed that today’s Artificial Intelligence technology can solve even the most difficult variant of distorted text at 99.8% accuracy. Thus distorted text, on its own, is no longer a dependable test."
Tuesday, 21 April 2015 08:12

Giova writes:
I'm sure it will ;) unless they are scared away simply by the presence of the new reCAPTCHA... but that would be an interesting outcome nonetheless
Tuesday, 21 April 2015 23:47

Apart from the samples you can find below, 2135 spam comments have been discarded since 2016-01-05 (roughly 3 per day).