This is a “paper,” which means it’s longer and formatted differently from typical warandcode.com content. It may be a lot for one sitting and is better treated as reference material.

Table of contents

1. Introduction
2. Tools
3. Threat profiling
4. Tactical maneuvers
5. Patterns
6. Examples
7. Conclusion

1. Introduction

If you run an application that’s accessible from the public Internet, there is no longer a question of whether automatons will hit it up. The question is when.

Here come the bots

Have logs of incoming HTTP requests? Then what follows might look familiar —

185.88.103.161 - - [07/Jul/2021:19:12:06 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
154.201.43.251 - - [07/Jul/2021:19:12:07 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
168.81.130.57 - - [07/Jul/2021:19:12:07 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
94.231.218.87 - - [07/Jul/2021:19:12:09 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
170.84.230.163 - - [07/Jul/2021:19:12:09 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
2605:6440:3003:1::2:82b4 - - [07/Jul/2021:19:12:10 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
160.116.245.18 - - [07/Jul/2021:19:12:11 +0000] "POST /api/login HTTP/1.1" 401 17 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"

There are “good,” benign programs scouring the Internet, like Google’s crawlers and other search indexers. We’re going to talk about defending against unwanted or malicious automation and bots.

Specifically, cases where you run a consumer-facing web app that takes usernames, passwords, and payment methods. You won’t just get some probing HTTP requests trying to GET generic stuff. You’ll get full-blown attacks.

There is no universal solution to this. Anti-bot tech is BIG MONEY — its salespeople are eager to scare you, then provide what they’ll deem a cure-all.

Pushy anti-bot vendors

I am not selling you anything and we’ll be vendor-agnostic here.

Really. Don’t @ me over my Google reCAPTCHA blog post, podcast, or conference talk. 😛

Randy hearts Google

Let’s develop a generally applicable anti-automation and botting defense strategy.

Disclaimer that this should not be considered reflective of my employer(’s|s’) views or their security program(s). Rather, this speaks from my own software business Kapotel, putting out consumer-facing web apps and doing e-commerce.

2. Tools

In my HOPE 2020 talk about defending HBO Max’s launch, I covered basic tools or pieces to support your fight against automation.

Save yourself 45 minutes; here’s a more up-to-date take —

  • CDN fronting and/or edge computing
    • Typically the first thing traffic hits
      • May be mixed with “API gateway” and/or “web application firewall”
    • Examples: Netlify, KeyCDN, Fastly + Compute@Edge, AWS CloudFront + Lambda@Edge
  • API gateway
    • Wrapper of integrity around services with a straight logical focus (routing, request handling)
      • May be mixed with “CDN fronting”, “edge computing” and/or “web application firewall”
    • Examples: AWS API Gateway, Apigee, Kong API Gateway
  • Web application firewall (WAF)
    • Wrapper of integrity around services with a security focus
      • May be mixed with “CDN fronting”, “edge computing” and/or “API gateway”
    • Examples: AWS WAF, Signal Sciences, Shape Security
  • In-client bot detection
    • Client-side code executes and generates telemetry data you then evaluate server-side with some closed-box model
      • Typically handed off to a vendor
    • Examples: hCaptcha, Google reCAPTCHA, Akamai Bot Management SDK, Shape Security, GeeTest CAPTCHA, WhiteOps HUMAN Application Integrity
  • Service-based logic
    • Something developed in-house or at least pretty customizable/configurable
      • Typically whatever “origin” is to you
    • Examples: Rate limits or geo-blocking written into your own software
  • Log aggregation
    • However you keep and look at logs from all the other things above
    • Examples: Kibana, Splunk
  • Logging
    • You might’ve inferred this from Log aggregation, but you need sufficient logging to understand when you’re being attacked and how (a quick analysis sketch follows this list)
      • Even Splunk cannot magically produce logs out of thin air
      • Common Log Format level details at a minimum
    • Examples: 7 days of HTTP logs in Common Log Format like Apache can provide, 30 days of HTTP logs as Signal Sciences WAF implicitly provides
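
Speaking of logging, here is a minimal sketch (TypeScript on Node) of the kind of first-pass review this enables. It assumes combined-format access logs like the sample near the top; the filename, endpoint, and threshold are placeholders to tune against your own traffic.

    import * as fs from "fs";
    import * as readline from "readline";

    // Tally 401 responses to POST /api/login per source IP from an access log.
    async function failedLoginsByIp(logPath: string): Promise<Map<string, number>> {
      const counts = new Map<string, number>();
      const attempt = /^(\S+) .*"POST \/api\/login[^"]*" 401 /;
      const rl = readline.createInterface({ input: fs.createReadStream(logPath) });
      for await (const line of rl) {
        const m = attempt.exec(line);
        if (m) counts.set(m[1], (counts.get(m[1]) ?? 0) + 1);
      }
      return counts;
    }

    // The threshold is arbitrary -- tune against your own "normal" traffic.
    failedLoginsByIp("access.log").then((counts) => {
      for (const [ip, n] of counts) {
        if (n > 100) console.log(`${ip}: ${n} failed logins -- investigate`);
      }
    });

Even something this crude surfaces the single-IP and shared-subnet patterns the maneuvers later in this guide act on.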

Here’s how these pieces diagram out for the flow of HTTP traffic when disparate —

Diagram

Just note they won’t always be disparate. For example, Akamai will combine the first 3-4 items depending on how much you pay them.

If you have none of these pieces right now, start with logging.

If you have logging and nothing else, add log aggregation because you’re probably not looking at your logs well enough.

If you’ve got those two things, either you now understand the need for other pieces or you’ve managed to build an application not worth attacking.

3. Threat profiling

What is malicious automation trying to accomplish on the Internet?

  • Login attacks
    • Credential stuffing
    • Bruteforce → increasingly rare though
  • Payment attacks
    • Credit card fraud
    • Gift card bruteforce
    • Promo code bruteforce
  • Destructive or for-ransom attacks
    • Injection
    • Denial of service
  • Content scraping

My take is that the above gets driven mostly by blackhat economics 💰 with “you pissed someone off” as a second place reason.

“Someone” might be a team, state, or other non-singular person entity.

There is an underlying cost to attack in either time/effort or money. This is why “someone wants to watch the world burn” is an even more distant third.

What does the biggest, baddest attacker look like?

  • Tons of network resources
    • High-bandwidth residential proxies to send traffic through whichever geographic origins they want
    • Easily dodge rate limits based on IP address because they have many
  • Tons of “dark matter” resources
    • Whatever feeder content is needed to make attacking happen
      • Fresh credential dumps → login attacks
      • Fresh stolen credit cards or CVV2s → payment attacks
    • Here’s some math because I am more familiar with login attacks than payment attacks (worked numbers follow this list)
      • Assume a 1% success rate for “back of the napkin” math on threat economics
      • It takes 100 requests on average to get into 1 account via credential reuse
        • “Average” also depends on the customer/user profile — i.e. SecurityTrails may have a more security-savvy user base than, say, walmart.com
  • Intimate knowledge of how your application works
    • Not just spraying the whole Internet
    • Might know it better than you or your developers!
  • Inferred knowledge of what “normal” HTTP traffic looks like
    • Generally HTTP traffic follows a normalized wave where it gets heavier at night
      • “Night” being relative to your main service area
        • Especially assume United States Eastern Time when your web app’s in English, a .com, and doesn’t suggest otherwise
      • Send constant attack traffic “low and slow” or follow that wave
    • Mix up headless browsers
      • Requires more resources but can then skip the next 3 bullets
    • Mix up User-Agent strings as appropriate
      • Least sophisticated way to say “here’s what device I am!” 🤡
    • Vary TLS fingerprint as appropriate
    • Vary request header order as appropriate
      • AOL patented this as a blocking technique in 2008
  • Programming skills or resources
    • Script kiddie programs like OpenBullet can only achieve so much advanced logic in their configs
    • To properly capitalize on their knowledge of your application and your traffic, the ideal threat actor runs custom code against you
      • Laziness dictates just how custom it is and/or how infatuated they are with you versus any other targets
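
To make “cost to attack” concrete, here is the back-of-the-napkin arithmetic from that list as a snippet. The 1% success rate comes from above; the proxy and CAPTCHA-solver unit prices are purely assumed for illustration, so plug in numbers from your own research.

    // Back-of-the-napkin credential stuffing economics.
    // Unit prices below are assumptions for illustration only.
    const successRate = 0.01; // 1% of stuffed credentials work (from above)
    const requestsPerTakeover = 1 / successRate; // ~100 requests per account
    const solverCostPerRequest = 0.002; // ~$2 per 1,000 CAPTCHA solves (assumed)
    const proxyCostPerRequest = 0.0005; // residential proxy bandwidth (assumed)

    const costPerAccount =
      requestsPerTakeover * (solverCostPerRequest + proxyCostPerRequest);
    console.log(`~$${costPerAccount.toFixed(2)} per compromised account`); // ~$0.25

If stolen accounts resell for less than that number, the campaign stops making sense; every maneuver in section 4 is about pushing that number up.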

Screenshot of openbullet.store

MBAs, idealists, and security people who aren’t yet battle-scarred and jaded think our goal is to keep this super threat actor out entirely. To stop them from being able to do anything whatsoever.

This is wrong.

The rest of us want to make attacks against our platforms relatively expensive. Just as blackhat economics drive most attack activity, they suggest most threat actors will move on to less costly targets.

This especially holds for e-commerce, SaaS, consumer entertainment. In such fields you might not need to execute 100% on this guide to make attackers go away.

Effective bot defense

However, there are more sophisticated predators than ones trying to resell goods or accounts for small cash, spraying everybody. Remember that “you pissed someone off” category.

In such cases you need to invest more time here. Our plan will still hold, you’ve just got to go further than other security people.

4. Tactical maneuvers

A comprehensive anti-automation and botting strategy will employ all of the maneuvers in this subsection. However, you’ll find that less mature teams do much more of the early stuff and vice versa.

For each category of maneuver we’ll consider individual plays within it, i.e. “Ad hoc blocks” → “Blocking request header orders”

We will also cover which tools from earlier support each case. I should call out that logging and log aggregation apply to all, because those allow you to research, visualize, and understand what goes on across your attack surface.

Supplementary content and links will be provided. Blindly following me is not ideal. Do your own research, trust gut feelings, your mileage may vary, etcetera. 🧠

4.1 Ad hoc blocks

What falls under this category —

  • Blocking static IP addresses, subnets, CIDRs, and/or ASNs
    • Can flow into our Adaptive blocks category with dynamically updated feeds
  • Blocking request header orders
  • Blocking TLS fingerprints or other device-specific generated fingerprints
  • Blocking “WAF patterns” *
    • Depending on your toolset this might be adaptive with automatic updates

Tools that can support this category —

  • CDN fronting and/or edge computing
  • API gateway
  • Web application firewall (WAF)
  • Service-based logic

Supplementary content —

This is Security Operations 101. Manual and reactive defense work. “An IP address is attacking us, block it!”

That’s where this starts. You react to some IP address attacking you by blocking it, and you only know it has attacked you by reading logs.

After some time you might realize all of these baddie IP addresses fall within a shared subnet or Autonomous System Number (ASN). You can start to get ahead by outright blocking those. Your enemies might need to set up with a new hosting provider or proxy list.

In that same train of thought are various fingerprint blocks. These might be the order of incoming request headers, derived from TLS, or some closed-box vendor voodoo. These too are less obvious to attackers so you can get ahead of them to some extent by blocking on these things. You need to be set up with a supportive architecture and toolset though.

For example, when a CDN fronts your backend services, request header order must be preserved through it and then “passed downstream” towards origin. Like you might hash it into a new request header set via the edge, as sketched below.
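
Here is a minimal sketch of that hashing idea, written as Node-style middleware because Node’s raw header list preserves wire order (some Fetch-style edge runtimes normalize header order away, so verify what your edge actually exposes). The header name, hash choice, and empty blocklist are all assumptions.

    import { createHash } from "crypto";
    import * as http from "http";

    // Fingerprint the order of incoming request header names.
    // req.rawHeaders preserves wire order: name, value, name, value, ...
    function headerOrderFingerprint(req: http.IncomingMessage): string {
      const names: string[] = [];
      for (let i = 0; i < req.rawHeaders.length; i += 2) {
        names.push(req.rawHeaders[i].toLowerCase());
      }
      return createHash("sha256").update(names.join("|")).digest("hex");
    }

    // Hypothetical blocklist of fingerprints tied to attack tooling via log research.
    const blockedOrders = new Set<string>();

    http
      .createServer((req, res) => {
        const fp = headerOrderFingerprint(req);
        if (blockedOrders.has(fp)) {
          res.statusCode = 401; // or something less obvious -- see section 4.6
          return res.end();
        }
        res.setHeader("x-HO", fp); // tag it for downstream layers to assess on
        res.end("ok");
      })
      .listen(8080);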

As a random grab bag thing — when you pay for some WAF, chances are it has a bunch of (relatively dumb) attack traffic patterns in there. Like to block incoming HTTP requests with a query param like input=<script>alert(1)</script>.

Is this constructive for a contemporary application security program? Many have wondered. We won’t get into that hot debate here.

I don’t know your application or business like you do. That is worth explicitly saying for this class of maneuver and for the next “adaptive blocks” class too.

Perhaps you run SaaS that ultimately exposes a REST API, totally valid for customers to hit from a datacenter. You wouldn’t want to block Digital Ocean via ASN because then customers can’t hit you via cheap-o droplets.

On the other hand, if you just host consumer content and/or have service region restrictions for whatever you’re doing, then datacenter or known VPN traffic may be undesirable. You might use VPN network lists from GitHub, hook into FireHOL IP or CleanTalk feeds.

The latter suggestion starts to walk into our next section actually.

4.2 Adaptive blocks

What falls under this category —

  • Blocking geographic traffic labels
  • Blocking VPN, proxy, and/or traffic type labels
  • Blocking other dynamically updated feeds
    • Can (again) flow into our Ad hoc blocks section
  • Blocking “WAF patterns” *
    • Depending on your toolset this might be ad hoc and be purely manual
  • Blocking based on anomalous behavior and/or a machine learning model *
    • The latter enters “in-client bot detection” territory but may not explicitly be that

Tools that can support this category —

  • CDN fronting and/or edge computing
  • Web application firewall (WAF)
  • Service-based logic

Supplementary content —

Intentionally similar to the last section, but less reactive in nature.

Let’s say we want to block consumer VPNs and other potential traffic sources that can obscure real customer locations. Again this may or may not be realistic for your business but stay with me.

You could go through your logs to see lots of different user accounts popping out of a single IP address. After some further investigation, you deem it a public proxy, then block it. That would’ve fallen classically into section 4.1.

With adaptive blocks, we hook up to some reliable data source that maintains what VPNs and proxies might be. Perhaps we pay MaxMind for this information — they can label most of the Internet’s network space; we determine which labels are suitable for our application, then block everything else.

Because MaxMind regularly updates, we’d either hook into a live feed/API if provided by our licensing or download updates from them at some periodicity (i.e. once per week).

Perhaps we only serve the United States, with legal obligations of some kind to prohibit access from other countries. This task could also be accomplished with MaxMind or one of their competitors, like GeoComply.

Traffic labeling for geography and nature (i.e. VPN, proxy, datacenter) is often integrated with WAFs like Signal Sciences or edge computing like Fastly already. Check into that before you go paying for anything.

Depending on accuracy requirements to meet your use case, check into free and open resources like FireHOL IP feeds before paying for something like GeoComply too.
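
As a sketch of the free-feed route, here is a TypeScript loader that checks client IPs against a downloaded FireHOL-style .netset file (one IPv4 address or CIDR per line, # for comments). The filename and refresh cadence are assumptions, and IPv6 is left out for brevity.

    import * as fs from "fs";

    // Parse an IPv4 address into an unsigned 32-bit integer.
    function ipToInt(ip: string): number {
      return ip.split(".").reduce((acc, octet) => ((acc << 8) + Number(octet)) >>> 0, 0);
    }

    // Load a FireHOL-style .netset: one IP or CIDR per line, '#' for comments.
    function loadNetset(path: string): Array<[number, number]> {
      const ranges: Array<[number, number]> = [];
      for (const raw of fs.readFileSync(path, "utf8").split("\n")) {
        const entry = raw.trim();
        if (!entry || entry.startsWith("#")) continue;
        const [ip, bits = "32"] = entry.split("/");
        // JS shifts are mod 32, so a /0 needs special casing.
        const mask = bits === "0" ? 0 : (0xffffffff << (32 - Number(bits))) >>> 0;
        ranges.push([(ipToInt(ip) & mask) >>> 0, mask]);
      }
      return ranges;
    }

    function isListed(ip: string, ranges: Array<[number, number]>): boolean {
      const addr = ipToInt(ip);
      return ranges.some(([net, mask]) => ((addr & mask) >>> 0) === net);
    }

    // Re-download the feed at whatever periodicity it allows (e.g. weekly).
    const anonymizers = loadNetset("firehol_anonymous.netset");
    console.log(isListed("203.0.113.7", anonymizers)); // true => block or flag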

Adaptive blocking can take some other forms besides straight labeling IPs or subnets. Your WAF might get regular updates for incoming HTTP patterns to block.

We are not so much focused on that type of automated attack with this plan, and the value of blocking such things is up for debate, but it is adaptive.

You might consider blocking traffic if a user or session — define those how you wish — demonstrates sufficiently aberrant behavior. This could break out into two topics (rate limiting and in-client bot detection) we’ll address as their own maneuvers, though.

So consider other things. I think of this as scrutinizing chains of actions.

There are new vendors popping up in this area to hook into your logs then pump action streams into a magic machine learning model. You might consider setting up your own system of this type, which would yield some cutting-edge blocking tech. Certainly not reactive.

I have not done something like this myself. However, there are some relevant demos out there, like fraud-detection-using-machine-learning from AWS Labs. It’s doable.

4.3 Rate limits

What falls under this category —

  • Blocking an IP address after it exceeds some rate/threshold of action
  • Blocking a specific username/email after it exceeds some rate/threshold of action
  • Blocking a specific payment method after it exceeds some rate/threshold of action
  • Blocking whatever other unique identifiers prove effective once they exceed some rate/threshold of action

Tools that can support this category —

  • CDN fronting and/or edge computing
  • API gateway
  • Web application firewall (WAF)
  • Service-based logic

Supplementary content —

Rate limiting is sort of “dumb” detection of erratic behavior, perfectly possible to implement without any machine learning sorcery whatsoever.

Many different tools allow at least rate limiting based on IP address. For example, if a single IP address submits an unsuccessful payment attempt 10 times within 10 minutes, block any subsequent HTTP requests like that for an hour.

There are lots of settings to break out in that example. You have to define what a failed payment attempt looks like, which probably includes considering request path, request method, and response status code.

That “10 times within 10 minutes” is a threshold I made up for behavior that’d appear illegitimate. You might arbitrarily pick something that seems generous to start, then pare it down based on “normal” user behavior.

Then finally — what defensive action do you take when this line is crossed? My example says just block follow-up payment attempts for an hour. You might block for longer, you might block all traffic or at least some other parts of the attack surface, etcetera.
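
Here is that exact rule as a minimal sliding-window sketch in TypeScript. It is in-memory for illustration only; with multiple service instances you would back it with Redis or lean on your WAF’s native rate limiting instead.

    // Sliding-window sketch of that rule: 10 failed payment attempts per IP
    // within 10 minutes => block that IP's payment attempts for 1 hour.
    const WINDOW_MS = 10 * 60 * 1000;
    const LIMIT = 10;
    const BLOCK_MS = 60 * 60 * 1000;

    const failures = new Map<string, number[]>(); // ip -> failure timestamps
    const blockedUntil = new Map<string, number>(); // ip -> unblock time

    export function isBlocked(ip: string): boolean {
      const until = blockedUntil.get(ip);
      return until !== undefined && Date.now() < until;
    }

    // Call on each failed payment attempt -- you define "failed," probably
    // from request path, request method, and response status code (see above).
    export function recordFailure(ip: string): void {
      const now = Date.now();
      const recent = (failures.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
      recent.push(now);
      failures.set(ip, recent);
      if (recent.length >= LIMIT) {
        blockedUntil.set(ip, now + BLOCK_MS);
        failures.delete(ip);
      }
    }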

Rate limiting based on what an IP address does is most widely supported by different tools as far as I can tell. Being able to consider unique values within HTTP request headers, or especially request bodies, is rarer.

This is why at least some rate limiting might be custom written by you or your development staff. This logic might live within your own middleware or be pure, in-service logic. There you can rate limit on whatever unique things you want but this defense will create origin load at least to start.

Rate limiting is one maneuver that can be hard to do all at the edge. You might have logic live very close to origin, then send block commands out to an edge worker if that’s supported by your setup.

Later in this guide we more fully discuss the pros and cons of having your defenses live in different parts of the stack.

4.4 In-client bot detection

This category hinges entirely on in-client bot detection tooling. Blocking based on the output of a closed-box model which takes in client-based data you’ve collected.

Tools that can support this category —

  • CDN fronting and/or edge computing
  • Web application firewall (WAF)
  • Service-based logic

Supplementary content —

We’re going to run some client-side code to fingerprint our users. You can do this with the open-source FingerprintJS project. On the backend, you expect this information to come along on at least fraud-prone requests.

There you’ll assess this client-derived data using either your own proprietary model — you certainly could rig something up around FingerprintJS if being cheap — or call out to a vendor. In the latter case, what you earlier executed client-side will need to include vendor script.

What does the server-side model give back? A confidence that the telemetry data came from a human and/or a desirable place. Those may not be the same thing in exotic cases where you have expected or “good” bots alongside malicious actors.

Really spitballing here and getting into Blade Runner territory, bear with me.

You can then act on that data. For example, legacy Google reCAPTCHA v3 and older would come with a default threshold recommendation of blocking confidence scores less than 0.5 (50% confident the assessed payload came from a human). You would block traffic that came up short there.
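
For the classic (non-Enterprise) reCAPTCHA v3 flow, that server-side check looks roughly like this sketch using the standard siteverify endpoint. The threshold is the 0.5 default discussed above; reCAPTCHA Enterprise uses a different assessments API, so treat this as illustrative.

    // Verify a classic reCAPTCHA v3 token server-side and apply the 0.5
    // default threshold. Run in log-only mode first, then tune (see below).
    async function humanScore(token: string, secret: string): Promise<number> {
      const resp = await fetch("https://www.google.com/recaptcha/api/siteverify", {
        method: "POST",
        body: new URLSearchParams({ secret, response: token }),
      });
      const data = (await resp.json()) as { success: boolean; score?: number };
      return data.success ? data.score ?? 0 : 0;
    }

    const THRESHOLD = 0.5;

    // In a login handler: the token arrives from the client with the credentials.
    export async function looksHuman(token: string): Promise<boolean> {
      const score = await humanScore(token, process.env.RECAPTCHA_SECRET!);
      return score >= THRESHOLD; // false => block, challenge, or just flag
    }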

How do you block? When do you block? Do you challenge the user? It’s possible to use in-client bot detection completely invisibly, even though folks best know Google reCAPTCHA from its older versions that required clicking at least a checkbox, usually also bikes or traffic lights.

Such invisibility is better for user experience but decreases accuracy to some extent.

I have never split-tested on the same attack surfaces so cannot even say what that extent is.

Do you go right to blocking or start out simply logging results from these bot assessments? How long do you do that for? This then leads into where you put your threshold, spending more effort than just accepting whatever the default recommendation is.

At this point I hope you see how you don’t “just” roll out this in-client bot detection stuff as an easy fix. There are lots of interconnected decisions to be made by you, your team, maybe management, and/or consultants and/or vendor reps who butt their heads in somehow.

Put easiness aside. Let us now consider whether in-client bot detection is an infallible fix. Attackers can (at least try to) skirt it in a number of ways.

They can apply “stealth” techniques like puppeteer-extra-plugin-stealth to headless Chrome/Chromium, Firefox, whatever.

Detecting headless Chromium is hopefully one plus of paying for reCAPTCHA Enterprise. The author of the Puppeteer Extra Stealth project has been quoted as saying it’s “probably impossible” to 100% evade detection. Google wunderkind Paul Irish runs the headless-cat-n-mouse project.

They can also pay humans to generate in-client bot detection payloads (i.e. Anti Captcha, 2Captcha). This may be a totally valid and inescapable way to skirt this security control.

However, if that becomes necessary to attack you, we’ve increased the monetary expense and effort required of your attackers on a per-request basis.

As stated earlier with our threat modeling — that is a realistic goal. Making it harder and more expensive to attack you.

Cause some headaches!

4.5 Bot traps

What we will call “bot traps” are like barriers before or wrappers around in-client bot detection.

The latter thing is usually billed on a per-assessment basis. Your conveying of the client-derived payload to the bot detection vendor is relatively expensive not just in that sense, but also in latency and service load.

These bot traps do not involve client fingerprinting or direct customer interaction. They’re not a replacement for straight-up bot detection because they can be worked around — given enough time — in a way that it cannot.

Tools that can support this category —

  • CDN fronting and/or edge computing
  • Web application firewall (WAF)
  • Service-based logic

Supplementary content —

4.5.1 Missing bot detection payload

For at least certain actions, you now expect incoming requests from the client to include a bot detection payload. If that’s missing, you can assume such traffic is tampered with or unwanted then drop it there.

Most often this telemetry info is sent from client to server via request header (i.e. X-BotDetectionVendorName). Checking request header presence is a simple enough thing to do; you’ve got your choice of dropping traffic from edge logic all the way down to origin. It depends on your architecture and how much visibility you want.

Should the telemetry info be coming in a request body field, that might limit where in the stack you can act on it.

Besides just the header or field being present, depending on your vendor, the format of the payload might be fairly predictable. You can pattern-match on that from your choice of place without straight-up conducting a vendor assessment.
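
A minimal sketch of this trap, assuming the placeholder header name from above; the shape regex is an assumption you would derive from your vendor’s real payloads.

    // Drop traffic that's missing the expected bot detection payload or whose
    // payload doesn't match the vendor's usual shape.
    const PAYLOAD_HEADER = "x-botdetectionvendorname"; // placeholder from above
    const PAYLOAD_SHAPE = /^[A-Za-z0-9+/_-]{100,}={0,2}$/; // assumed base64-ish blob

    export function passesBotTrap(headers: Record<string, string | undefined>): boolean {
      const payload = headers[PAYLOAD_HEADER];
      if (!payload) return false; // tampered with or unwanted -- drop it
      return PAYLOAD_SHAPE.test(payload); // cheap pattern check, no vendor call
    }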

4.5.2 Client Puzzle Protocol

The Client Puzzle Protocol U.S. patent 7197639 expired in 2020 so let’s get wild 😛 implementing a proof-of-work system.

Assumption here that your client-side code is either explicitly obfuscated (i.e. JScrambler) or implicitly obfuscated (i.e. webpack). If you’ve just got vanilla JavaScript that is easy to follow, you can still increase effort-to-attack in this way, but maybe not enough to make it worthwhile.

The basic concept here is to prompt the client to complete a puzzle whose result you then expect server-side. The puzzle output should be unguessable without reversing or deep code review effort. Your typical attacker probably will not bother — that’s the point.

The non-typical attacker who does figure out this control will find it more difficult to write up an OpenBullet config or other “script kiddie enabler” because of the more advanced logic it now necessitates.

An additional positive I should point out with this is support for client device types that in-client bot detection might not support. For example, as of this writing, Google reCAPTCHA Enterprise can support web browsers, Android, or iOS. Targeting more exotic devices? Chances are those can support puzzles, should you feel the need.

“should you feel the need” → proxying or reversing even for Roku is a huge pain in my experience so you might be going overboard

What does this look like in practice? Let us say your client ultimately makes a POST /api/login request. You support web browsers and Roku devices. In the browser, clicking “Sign In” logically jumps right to that POST with in-client bot detection payload included via header.

For Roku, or whatever other devices lack that header, POST /api/login gets a response saying to solve a puzzle. The response includes a seed value and timestamp, let us say, which the client understands to use with its obfuscated, closed-box logic. The result will be included on a retry of the original request.

This second POST /api/login sports 3 request headers ['X-PuzzleSeed', 'X-PuzzleTime', 'X-PuzzleAnswer'] detailing the puzzle matter. Now the login request is processed.

You could even use an edge worker to make this happen. Any POST /api/login requests missing X-BotDetectorPayload or all 3 of those headers get an automatic response from the edge with puzzle seed and timestamp. The closed-box logic from the client is replicated there — when all 3 puzzle headers are present, the worker determines they represent a valid result.

Otherwise no traffic is forwarded on towards origin.
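
Here is a sketch of both halves server-side, assuming the client’s obfuscated logic computes a hash over the seed, the timestamp, and a constant buried in that code. A Hashcash-style variant would instead have the client search for a nonce; either way, verification stays cheap.

    import { createHash, randomBytes } from "crypto";

    // Issue a puzzle: the client feeds seed + time into its obfuscated logic.
    export function issuePuzzle(): { seed: string; time: string } {
      return { seed: randomBytes(16).toString("hex"), time: String(Date.now()) };
    }

    // The "shared constant" stands in for whatever you bury in client code.
    const SHARED = "buried-in-obfuscated-client-code"; // assumption for this sketch
    const MAX_AGE_MS = 5 * 60 * 1000; // stale answers can't be replayed forever

    function expectedAnswer(seed: string, time: string): string {
      return createHash("sha256").update(`${seed}:${time}:${SHARED}`).digest("hex");
    }

    // Run at the edge or at origin against the 3 puzzle headers.
    export function puzzleOk(h: Record<string, string | undefined>): boolean {
      const seed = h["x-puzzleseed"];
      const time = h["x-puzzletime"];
      const answer = h["x-puzzleanswer"];
      if (!seed || !time || !answer) return false; // respond with issuePuzzle()
      if (Date.now() - Number(time) > MAX_AGE_MS) return false;
      return answer === expectedAnswer(seed, time); // timingSafeEqual in production
    }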

Later in this guide we elaborate on pros and cons of enforcing security versus proximity to origin.

Do check out Scott Contini’s Fighting Bots with the Client-Puzzle Protocol (September 2020) for further reading here.

Technically Hashcash [+1][+2] also aligns with this vision.

4.6 Fluffing up your blocks

All of the maneuvers we’ve just covered tend to lead one place — blocking traffic.

So now we know how to get there, and will continue on to consider where you do it and with which appliance. But what should an effective block look like?

This is an art and maneuver of its own. We’ll go from most orthodox to most fluffy.

4.6.1 Obvious HTTP block

The default from any security appliances in your stack, I would bet. You block HTTP traffic with some 4XX response code and an empty body or something else obvious.

Less common but still quite obvious would be returning some other response code that isn’t so clearly bad. However, the response body in those situations tends to give things away (i.e. with an error message).

4.6.2 Tarpit block

Configurable with many tools, a “tarpit” holds open the TCP connection. The result is that whatever unofficial client code is being used against you hangs unless logic has specifically been written for it to time out (i.e. after no response in 10 seconds).

This tarpit period may be indefinite or, with some tools, configurable.

When you have block logic more prone to false positives or negatives plus an indefinite tarpit, the experience for your legitimate customers can be crappy unless their client code is adapted for this. It reflects poorly on you when it seems the backend is unresponsive to desirable users. A good compromise between slowing down attackers and avoiding this situation would be a tarpit of ~1-3 seconds.

Note that maybe this ceases to be a “tarpit” by definition and “slowdown” is more appropriate, you decide.

That can provide appreciable relief when blocking thousands to millions of requests. However, note this is still a pretty noticeable block behavior, especially if you always let the connection hang or add a precise N seconds.

Consider a defined delay period that varies within some range (i.e. ~1-3 seconds again). This makes things less noticeable, and we’ll expound on the role variance can play in block maneuvers shortly.
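
In code, that variance is nearly free. A sketch:

    // Hold blocked traffic for a randomized ~1-3 seconds before responding,
    // so the delay is never a crisp, detectable constant.
    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    export async function slowdown(): Promise<void> {
      await sleep(1000 + Math.random() * 2000);
    }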

4.6.3 Spoof block

This is a step forward from “obvious HTTP block” that can be paired with tarpit/slowdown behavior that yields an HTTP response.

The idea is making your response look as much as possible like an unblocked response.

For example, let’s say you have a login pathway experiencing scripted attacks. Legitimate credential matches yield a 200 OK status while invalid attempts yield 401 Unauthorized. You have rate limiting somewhere in that stack. When a rate limit kicks in to block the client, those responses send back 429 Too Many Requests.

The only party who needs to know you’re blocking someone is you. So in the case above, you would do well to simply change 429 response codes into 401 via edge logic or someplace else that makes sense.

This was a simplified example that discussed HTTP status only. You’ve also got response headers and body to consider. Your approach here should be just like you’re pen-testing a forgotten password flow, trying to ensure username enumeration isn’t an issue.

Scrutinize the whole “normal” negative response then make your blocks look as much as possible like that.
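
As a sketch, combining the varied slowdown with a spoofed response; the status, headers, and JSON body are assumptions, so mirror whatever your real failed-login response returns, down to the error message.

    import * as http from "http";

    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    // Dress a rate-limit block up as a normal failed login: varied slowdown,
    // then the same status, headers, and body your real 401 returns.
    export async function spoofedBlock(res: http.ServerResponse): Promise<void> {
      await sleep(1000 + Math.random() * 2000); // from the tarpit section
      res.writeHead(401, { "content-type": "application/json" });
      res.end(JSON.stringify({ error: "invalid_credentials" })); // assumed body
    }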

When done well, this confuses opposition below a certain level of sophistication and “gets more mileage” out of your block logic.

An attacker may eventually realize they’re blocked by retrying successful pre-block hits (or just known valid data), which negates this trick. They might also spread traffic across network resources with some distributed logic to check any username-password pair multiple times.

Could you whip up some really advanced logic around that? I guess so but it would be effortful relative to other propositions from our guide here.

4.6.4 Random confusing block

Here’s where you mix up all of the above to confuse attackers. Maybe even with additional randomness, giving back unpredictable responses, wacky status codes, varying tarpits, etcetera.

Some vendor products allow for this to some extent — i.e. tarpit 70% of traffic, give one static response to 20%, another to 10%.

4.7 Non-block actions

After detecting aberrant behavior, there is a lot of potential nuance between general logging and full-on blocking.

The following is paraphrased from the talk HUMAN and Google Cloud Platform delivered at Google’s 2021 Cloud Security Summit —

  • Block the input and/or don’t deliver the output
  • Terminate the active user session
  • Lock the user account from additional logins
  • Present the user with an in-session cognitive challenge
  • Challenge the user with an MFA/2FA request
  • Block the IP address that performed the submission
  • Heighten logging associated with the active user session
  • Flag transactions performed by the user for review by your security or fraud personnel
  • Place the user into a “virtual waiting room” and/or introduce a time delay between allowed transactions
  • Adjust the user’s in-app permissions and/or available application features
  • Increment a counter towards some decision you make after collecting more evidence of automation or abuse
  • Notify the end user via email, text message, or otherwise that their account appears to be compromised
  • Request or force user account password reset
  • Limit the number and/or size of transactions that can be performed within the user account or active session

I like that list and think it’s pretty comprehensive. ❤️

Car stopped at a barrier

4.8 Product security

There is no magic vendor-provided pill to get us away from good ol’ fashioned (??) product security features. That is a bitter pill to swallow but it’s true.

Can Google reCAPTCHA Enterprise do leaked password checking a la Have I Been Pwned and slap two-factor authentication on your login too? Technically yes.

Should you implement those things in that way versus building them into your platform a more native way?

Should you be using Magic Link authentication instead of taking username or email addresses plus passwords at all? What about WebAuthn?

I am a huge Magic fan for password-less tech and do not like HYPR even if its founder Bojan and I are both Aspect Security alumni. 🤐

Quandaries like this are why we product security people are employed. And why total compensation is different from that of pen tester consultants, who needn’t think so hard.

Pen testers versus product security meme

Anyway, at best product security features allow you to do a lot less blocking, rate limiting, and in-client bot detection. They do not render any of those completely irrelevant.

At worst you’ve got yourself a bunch of “defense in depth” and successful attacks against you cause less damage than they otherwise would have. Nice.

Start with the multi-factor authentication and defense in depth subsections of OWASP’s credential stuffing cheatsheet.
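
Leaked password checking in particular is cheap to wire up yourself. Here is a sketch against Have I Been Pwned’s k-anonymity range API, where only the first 5 characters of the password’s SHA-1 hash ever leave your server:

    import { createHash } from "crypto";

    // Leaked password check via Have I Been Pwned's Pwned Passwords range API.
    export async function isPwned(password: string): Promise<boolean> {
      const sha1 = createHash("sha1").update(password).digest("hex").toUpperCase();
      const prefix = sha1.slice(0, 5);
      const suffix = sha1.slice(5);
      const resp = await fetch(`https://api.pwnedpasswords.com/range/${prefix}`);
      const body = await resp.text(); // lines of "SUFFIX:COUNT"
      return body.split("\n").some((line) => line.startsWith(suffix));
    }

    // e.g. at signup or password change: if (await isPwned(pw)) reject it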

Continue by reviewing the actual user flows of your application. It might be helpful to — *gasp* — traverse them yourself, signing up from scratch and everything.

5. Patterns

These are some architectural patterns for completing the above maneuvers. Generally, we’ll ascend in sophistication and performance.

5.1 One heavy lifter near origin

My guess is that most organizations start here and never mature beyond it. Depending on your performance requirements and/or traffic, this can be enough.

Let us say you’re using Fastly at the edge and Signal Sciences WAF “wrapping” origin. Fastly acquired Signal Sciences because they can work well together.

By default you get basic WAF attack pattern blocking and rate limit capabilities via Signal Sciences. You can block traffic tagged by it as “datacenter”, and you can do geo-blocking or straight-up block IPs and CIDRs.

Fastly can set some request headers at the edge that Signal Sciences is able to act on. Block on. This requires, of course, that attack traffic is restricted from bypassing Fastly to hit origin straight-up.

  • Request header order
    • Important to “preserve” this from the edge
      • With Fastly VCL you can do the following in sub vcl_recv
        • set req.http.x-HO = original_req.header_list("|");
    • Signal Sciences has no current support to consider request header order if you hit it “naked” as far as I know
      • But you can have a list of blocked header orders then assess on that x-HO set through Fastly
  • Autonomous System Number (ASN)
    • Technically you could break these down into CIDRs via Signal Sciences blocklists
    • Fastly VCL example for sub vcl_recv
      • set req.http.x-ASN = client.as.number;
      • Assess on x-ASN via Signal Sciences
  • Proxy type
    • Fastly VCL example for sub vcl_recv
      • set req.http.x-Proxy-Type = client.geo.proxy_type;
      • Assess on x-Proxy-Type via Signal Sciences
  • Proxy description
    • Fastly VCL example for sub vcl_recv
      • set req.http.x-Proxy-Description = client.geo.proxy_description;
      • Assess on x-Proxy-Description via Signal Sciences

The latter couple bullets on “proxies” draw on data exposed via Fastly’s integration with a traffic-labeling vendor. You might block more obvious traffic anonymization attempts that way, should it fit your threat profile.

In short, we’ve got Signal Sciences WAF doing virtually all of our heavy lifting here. It is dependent on traffic going through Fastly to add supplementary data. This is an “easy” pattern but not the most robust or performant.

5.2 Edge it and forget it

Let us now pretend we’ve got just Akamai at the edge. This has some WAF features via Kona Site Defender that I don’t think Fastly provides.

Now at the edge we can do rate limits. We can block ASNs and header orders and TLS fingerprints. If memory serves, Kona will do cookie-cutter WAF blocking like <script>alert(1)</script> too.

This is all pretty performant because it’s right at the edge. However, you pay a price in visibility. Your metrics and logging from here likely boast an impractical signal-to-noise ratio.

Now you’re either spending a lot of time in the Akamai web console or doing guesswork about which things are blocking what, or even just how many requests are denied. Akamai doesn’t provide a nice API. The web UI is clunky. You can pipe some logging on everything coming through to a SIEM somewhere, but good luck with that.

Something more contemporary like Fastly could also be your heavy-lifting edge. The same stuff we were setting in request headers for Signal Sciences to act on, you just block it with VCL logic instead.

Visibility may still not be ideal without much configuring. Perhaps each block type uses a custom vcl_error code to help you quantify the action via logging later.

Fastly might have WAF pattern-based stuff. It does lack rate-limiting by itself, to my knowledge.

I put this pattern ahead of the last one because, while it’s still a lot of dependence on one appliance, you pick up performance. And with greater effort you can have visibility.

In most cases these blocks will be “out of sight, out of mind” though. Let’s be real.

5.3 Manage at origin, move to edge

Consider the strong points of the previous setups here — “best of both worlds” might be to start blocks at or near origin then “move out” to the edge.

We’ll say here that you have Fastly at the edge, Signal Sciences WAF, and Google reCAPTCHA Enterprise present in your application environment.

reCAPTCHA Enterprise can integrate with a Fastly worker to “move out” block logic as you configure. For example, repeated failures on your reCAPTCHA assessments from a specific IP address or subnet. This gets you nice visibility on the start of some concentrated attack effort followed by performance as blocking moves outward.

Here too, you’d have Signal Sciences starting blocks, which might be fully Splunk-indexed. Then Fastly blocks would perform better under extended or heavy attacking, but perhaps those logs are not so accessible in your setup. You move out via edge worker.

Consider allowlist logic as another example. This might live in Signal Sciences then sync out to Fastly too.

I am not aware of these products integrating so tightly yet.

Nothing is stopping you from rolling your own edge worker then hitting up the Signal Sciences API at some periodicity to update Fastly. Both have had quality APIs even before the latter acquired the former.
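
A sketch of what that roll-your-own sync could look like. The endpoint paths, payloads, and tokens below are placeholders, not the actual Signal Sciences or Fastly APIs; pull the real calls from each vendor’s API documentation.

    // Hypothetical sync loop pulling recent WAF blocks out to an edge blocklist.
    // Endpoints and tokens are placeholders -- NOT the real vendor APIs.
    async function syncBlocksToEdge(): Promise<void> {
      const blocked: string[] = await fetch("https://waf.example/api/blocked-ips", {
        headers: { authorization: `Bearer ${process.env.WAF_TOKEN}` },
      }).then((r) => r.json() as Promise<string[]>);

      await fetch("https://edge.example/api/blocklist", {
        method: "PUT",
        headers: { authorization: `Bearer ${process.env.EDGE_TOKEN}` },
        body: JSON.stringify(blocked),
      });
    }

    // Run at some periodicity; every minute or two keeps the edge current
    // without hammering either API.
    setInterval(() => syncBlocksToEdge().catch(console.error), 60_000);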

6. Examples

Here are some full-blown example walkthroughs to illustrate how all of the above works together. “Teamwork makes the dream work” or whatever. 👺

6.1 Example A

“Loud IPs from favorite cloud platform”

This is perhaps the most softball case for an attack. You start seeing a half million requests per hour towards a login endpoint from a single IP address.

You notice the User-Agent header is static and the credentials attempted are in alphabetical or near-alphabetical order, read in from some list: this is not legit traffic.

A first baby step might be just adding single IPs to a blocklist. However, within 10 minutes of you doing that, you find a different IP address under a nearby subnet or the same ASN continuing on.

In an effort to be somewhat more proactive than single IP blocks, you might graduate to bigger net blocks or actioning the whole ASN. The latter is if you have no business case for, say, Microsoft Azure’s ASN coming in.

6.2 Example B

“Lots of proxies but not scripting skill”

You have an alert set up in case the success rate on your login service descends below some percentage. This alert goes off, and it becomes clear by crunching your logs that a sudden flood of credential stuffing attempts is coming in.

You don’t have any in-client bot detection.

Many IP addresses are in use. You don’t see a clear subnet pattern, and from either already-available ASN info or pulling some IP samples then looking this up, there is a mix of datacenter and residential network space in use.

Thus it does not look like we can do blanket network blocking. You might get some relief by introducing detection of and countermeasures against that non-residential ISP traffic, if your business case allows for it.

The rate at which IP addresses are rotated and/or the rate of requests sent from each does not lend itself to IP-based rate limiting. You do not have anything more advanced in place for your stack, so more thought on that isn’t really warranted.

You notice the User-Agent strings on all this traffic are consistent, which is almost always weird, even if it’s a specific browser and platform version instead of an obvious python-requests.

You have request header order data available. Through back-analysis, it becomes clear that only this malicious traffic presents itself with headers ordered this way. The fingerprint does not overlap with your legitimate user pool.

You block the header order via your WAF, which seems to clear up the issue and stabilize your alerting.

Either the attacker doesn’t figure it out and gives up here or you engage in a longer-term Whack-a-Mole battle.

6.3 Example C

“Manual efforts with heavier-than-normal IP address preference”

You recently expanded into global availability for your online business. This meant dropping geo-blocks that had previously covered regions beyond North America.

Within a short time, one particularly fraud-prone region has slammed you with payment attacks. Your general success rate for raw customer payment attempts tanked from 95% to 25%.

You notice single IP addresses sending many failed payment requests. A given IP will register a new account, then traverse your payment portal, trying a single payment card over and over per account. Eventually they abandon that card and account, might use the same IP for several more cycles, then switch and repeat.

These IP addresses stem from mostly residential network space in this region. 90% residential ASNs, 10% hosting or datacenter. You imagine the threat actor is signed up for residential proxies through Bright Data or something.

You have in-client bot detection in place which has served you well until this threat scenario. You find examples of this activity scoring 90% human confidence scores and better. Request headers vary somewhat but suggest much overlap in individual actors, as the IP address preference might’ve done too.

You have a special anti-fraud product covering payments, but stopping attacks with it is relatively expensive. This product has behavior-based rules for payments, and perhaps you need to evolve that behavior-based approach to stop this attack traffic.

What would be most basic? Use your WAF to rate limit on failed payment attempts per IP address at least. It would be even better to rate limit per tokenized payment card and/or user account but some of that logic might not be doable at the WAF. You’d need to do some amount of in-service rate limiting, at least for the per-account stuff.

What would be more advanced? Well, for example, if your in-client bot detection is Google reCAPTCHA Enterprise then you can take advantage of their complementary add-ons like Account Defender and/or Payment Defender. These take a more behavior-based approach than just reCAPTCHA’s client/device fingerprints.

If you’ve got the engineering talent on-hand for this, arguably nothing stops you from “rolling your own” machine learning around user actions to block certain accounts. A middle ground between that and relying totally on a vendor would be “wrapping” something like Amazon Fraud Detector.

“Hold on, these are manual attacks, why does it fall on my anti-bot team?”

I have never been lucky enough to work someplace with a team — even a single engineer — dedicated entirely to anti-bot.

So this feels like a bigger product security issue. Besides, whoever sits closer to payments will bring this to you just by virtue of it looking automated to begin with. Only after evaluating the in-client bot detection results plus request headers will it become more apparent this is manual.

6.4 Example D

“New bot detection revelation”

Let’s say you cater to millions of average consumers. Not techie-only people.

Before ever putting up your login service, you’d have guessed a “normal” success rate to be 99% for this.

After a week of data rolls in you’re at ~70%. And basically stay there for the next several months. You say, “ah well, I’d never run a big app like this before, maybe 70% is industry-standard.” SHRUG.

Every so often you see some loud and blatant bot attacks. These have coerced you, over time, into adding a fancy, paid in-client bot detection product for your login flow.

This bot detector’s default setting for further traffic action is 50%. You run in logging-only mode for a while but scratch your head.

Like 25% of your traffic has crappy bot detection scores and would’ve been blocked at that threshold.

You take a bunch of samples via whatever HTTP logs or visibility you’ve got. This big traffic chunk is very unsuccessful at logging in. But its User-Agents, request header orders, IP addresses, and ASNs are very mixed up. Maybe a little too perfectly mixed.

Is this legitimate traffic you’d been blocking? Or had you missed well-disguised bot traffic all along, after it “spun up” early on you and never stopped?

You’re lucky enough to work in one of those “move fast and break things!!” companies so management lets you just turn on blocking for the new in-client bot detector. Sure enough, your overall login service success rate shoots up.

Your customer support team doesn’t receive any complaints. There was a lot more bot traffic than you ever realized, simply from blending into the noise.

6.5 Example E

“Bot detection labor farm thwarted by puzzle”

As mentioned earlier, 2Captcha and Anti-Captcha are pretty valid bypasses for commercial bot detection.

A human with a “real” device, browser, and/or mobile app generates a payload, which is then shared with the attacking script to use, for a small fee.

Imagine the same scenario from our last example but you never see low scores because a labor farm is at work.

Can you stop that threat actor? Yes. They’re still writing a script, maybe sharing with others via OpenBullet. Make it really painful to write a script against your site at all.

Earlier we discussed puzzle protocols and expecting simple-but-effortful-to-reverse-engineer earmarks on incoming traffic. That means attackers have to re-tool their software in a way that just looping in CAPTCHA solvers does not require.

In this situation, that extra step may have the threat actor giving up.

6.6 Example F

“Super resourced mega threat actor gives up” ❤️

We’ll throw together a bunch of the previous examples to try ending on a high note here. 🌞

You do a lot of security “on your toes” the year after launching a big consumer web app.

You started off with no in-client bot detection or bot traps. You were missing a ton of basic product security features —

  • Account change notification emails
  • Unusual/suspicious login location emails
  • New device usage emails
  • One-time codes emailed to authorize customers changing their password or personal info
  • Useful list of devices linked to this account and ability to terminate sessions
  • Leaked password checks (i.e. Have I Been Pwned)
  • Other password complexity checks
  • Instant and proactive account resets on aberrant activity
  • Multi-factor authentication

And so on and so forth.

You were very WAF reliant, blocking things there and not moving anything out to the edge in a performant way. Your blocks were obvious to attackers.

Alerts would wake you up in the middle of the night, or holidays, then you’d try to block stuff as best you could. Reactively.

As in our last two examples, you add in-client bot detection and then bot puzzle traps after some widespread, nasty account takeovers. This onslaught also motivates your organization to add those missing product security features.

At this point it’s difficult to attack you or do anything meaningful with successfully compromised accounts.

Then finally, your organization decides to do away with passwords altogether using Magic Link authentication.

Nobody shows up to credential stuff you anymore. Your payments logic has been retooled in such a way that it’s really not practical for testing stolen cards.

All is quiet on the bot protection front. You taste victory.

7. Conclusion

Hopefully this improved on my older HOPE talk in helping you strategize against malicious automation and botting.

Back when I was more or less a full-time pen tester, it never occurred to me what being in-house product security was like on this front. Even after writing popular scrapers of my own.

It’s now something that has consumed much of my thoughts. Spewing those out on the (web) page here has been good.

Anyway — **cue Terminator closing music**

This is pretty good for 2021 but our fight against malicious automation will continue. See you on the battlefield.

Terminator: The Sarah Connor Chronicles poster with relevant anti-bot logos layered over top


Randy Gingeleski - GitHub - gingeleski.com - LinkedIn