Bot Management platforms will tell you their bot mitigation is 100% successful, or sometimes, to be more ‘realistic’ the magic 99.99% success rate is quoted. “Trust us, trust us”, they say, “we’ve got this one.”
Some vendors work the math the other way, and show that their False Positive rate (the proportion of humans mistaken for bots) magically comes out to less than 0.01%, but somehow completely fail to mention their False Negative rate (the proportion of bots mistaken for humans), which is the one figure you really should care about. For a discussion of False Positive and False Negative rates, please see Bot Detection Accuracy here.
Bots as a Service (BaaS) boasts 99.99% success rate
Meanwhile, the latest Bots as a Service (BaaS) platforms boast that they have a 99.99% success rate at avoiding this bot detection in the first place.
They can’t both be right. So what on earth is going on?
Our response is quite simple. Please don’t trust us.
We have a zero trust model for a reason.
Our “playback” feature allows our customers to see exactly which visitors were blocked and why, so they can validate and independently check the results at their SOC, or use a SIEM or other analysis tools they may have.
Headline figures such as 99.99% effectiveness rates, or 0.01% false positive rates, really don't mean anything. Being 99.99% effective for all bot detection across all customers is meaningless to you: that 0.01% could be the really malicious bot that’s currently exfiltrating your data.
Adopting a zero trust model means we offer our customers a systematic way of measuring and validating bot traffic.
One of the ways we do this is to constantly measure our performance against the latest threats.
The Bots as a Service (BaaS) provider threat
We reviewed 10 of the Bots as a Service (BaaS) providers and chose some of the best ones at avoiding bot detection. While we can’t go into each and every one, we chose to use Brightdata, as their platform seems to be robust and claims to be the most effective at avoiding bot detection. Brightdata claims a healthy 99.99% success rate against websites, and not only that, they specifically claim to have the highest success rate in the ‘industry’.
We set up a real live test to see if we could bypass our own bot defenses using Brightdata, in a Red Team versus Blue Team live bot video (for the video, please see here).
You can see in the screenshot the 99.99% success rate claimed by Brightdata. They have a set of templates for hundreds of extremely well-known websites, organised by category: Amazon, LinkedIn, Zara, Hermes, Ikea, Google, Yelp, Trustpilot, Airbnb and many more.
Problems with JavaScript Fingerprinting for Bots
Old school bot detection works primarily with a JavaScript fingerprint that IDs each incoming request. Just like a club bouncer checking drinkers for ID, the fingerprint script runs on each incoming client, takes a snapshot of the platform, and sets an ID for that visitor.
There are four immediate problems with this fingerprinting approach.
- The JavaScript is publicly available and can be reverse engineered. Although the JS is obfuscated, given enough patience it can be decoded to reveal the range of values it requires to pass the fingerprint tests.
- The visitor has to be fingerprinted at least once before the signature rules can be applied. If you simply rotate each visitor after the first visit, you never show up as a repeat visitor in the first place, and just bypass the fingerprint. This requires a very high degree of rotation, and necessitates a large pool of proxies.
- Using real devices that actually have valid fingerprints can again neatly bypass the fingerprint detection. All the associated canvas, mouse and platform checks will pass.
- Instead of hitting the actual site where the fingerprint runs, hitting a cached CDN server will bypass the fingerprint entirely.
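To make the fingerprinting idea concrete, here is a minimal sketch (an illustration only, not any vendor's actual agent) of how a set of collected client attributes might be reduced to a visitor ID. The attribute names and values are invented for the example; the point is that the same device hashes to the same ID, while any rotation or spoofing produces a fresh one.

```python
import hashlib
import json

def fingerprint_id(client_attributes: dict) -> str:
    """Reduce a set of client-side attributes to a stable visitor ID.

    In a real deployment these attributes would be collected by the
    obfuscated JavaScript agent (canvas hash, screen size, platform, fonts,
    etc.); here they are simply passed in as a dict for illustration.
    """
    # Serialise deterministically so identical attributes always hash identically.
    canonical = json.dumps(client_attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two visits from the same device yield the same ID...
visit_1 = {"canvas": "a91f", "screen": "1440x900", "platform": "MacIntel", "fonts": 214}
visit_2 = dict(visit_1)
assert fingerprint_id(visit_1) == fingerprint_id(visit_2)

# ...but a client that rotates or spoofs any attribute gets a fresh ID every time,
# which is exactly why pure fingerprinting struggles against rotation.
spoofed = {**visit_1, "canvas": "0c3e"}
assert fingerprint_id(spoofed) != fingerprint_id(visit_1)
```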
So how does Brightdata avoid bot detection?
Brightdata has a wide range of proxies and a large pool of IP addresses, so it can be set to rotate agents very quickly, making each visit a first-time visit. As we have seen, this is an effective bypass for the fingerprint JS agents: it's going to defeat them every time. This is much easier than JavaScript deobfuscation, and means you don’t have to reverse engineer the fingerprint's expected values. Traditionally, this meant having your own botnet, or acquiring access to one, so that you could launch millions of attacks from a new IP each time. This was expensive, time consuming, and not entirely effective, as botnet IPs are quickly picked up by IP reputation services if they are being used in large-scale attacks over time.
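As a hedged illustration of why rotation defeats repeat-visitor detection, the sketch below shows the general pattern a rotating-proxy client follows: each request goes out through a different exit IP with a fresh session and a randomised user agent, so every hit looks like a first-time visit. The gateway addresses, credentials and user-agent strings are placeholders, not real endpoints or any particular vendor's setup.

```python
import itertools
import random

import requests  # third-party: pip install requests

# Hypothetical rotating-proxy gateways and a small pool of user agents.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.net:8000",
    "http://user:pass@proxy-2.example.net:8000",
    "http://user:pass@proxy-3.example.net:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
]

def fetch_as_first_time_visitor(url: str, proxy: str) -> int:
    """Fetch a URL through a given proxy with no cookies and a random UA."""
    session = requests.Session()  # fresh session: no cookies, no previously issued ID
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    response = session.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    return response.status_code

# Round-robin through the pool so no two consecutive requests share an exit IP.
for proxy in itertools.islice(itertools.cycle(PROXY_POOL), 10):
    print(fetch_as_first_time_visitor("https://example.com/products", proxy))
```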
Proxy Bot Infrastructure as a Service
Brightdata uses many proxy types, allowing you to choose the right combination for your selected target. For example, you can buy a package of mobile or residential ISP proxies, and set the pool large enough that you can rotate the bots each time. As you can see in the screenshots, just select the package that you need, and you’re all set to go. With millions of domestic IPs, or mobile connections routed through the large ASN mobile gateways, it's next to impossible to use IP reputation services to stop these bot attacks.
Mobile Proxies for Bot Attacks
Mobile proxies are amongst the most effective, and also the most expensive, as you can see on the price list. These use real mobiles, organized into click farms, and usually zip-tied to a moving bar that triggers the accelerometer to fool the fingerprint into believing they are being actively used by human users. E-commerce sites often find their most valuable customers shop on mobile devices, and prioritize mobile visitors accordingly. The proxies use real devices, so again, the fingerprint detection will in all likelihood fail. Worse, a bot has now been classified as a human visitor.
Residential IP Proxies for Bot Attacks
A cheaper but still very effective option is residential proxies using real devices. The real device passes the fingerprint checks, and residential IPs can’t be blocked with old school IP reputation without causing many false positives.
Included in the list are data center IPs, which at first seems counter-intuitive. Humans don’t live in data centers, so why have this option? Data center IPs can be used, for example, for an API data mining attack: the API is expecting bots from data centers, and may block residential IPs.
Once the proxies are set according to the target victim's vulnerabilities, the next stage is to deploy the bot scripts.
Bot Scripts
Brightdata has a series of templates to make targeting websites much easier. These are organized by category, as you can see below, and include some of the largest e-commerce and general datasets in the world. The scripts have been customized for each site, for example for Single Page Application (SPA) sites, or other more complex applications, where a simple crawl of each URL isn’t possible. Brightdata also claims to bypass CAPTCHA.
For our testing, we deployed a simple scraping script and just edited the fields to start the scraping process.
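We can't reproduce Brightdata's own templates here, but a hypothetical equivalent of the kind of "edit the fields and go" scraping script we mean looks something like the sketch below. The target URL, CSS selectors and output file are invented placeholders; they are the only fields that would need editing for a new target.

```python
import csv

import requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The only fields that need editing for a new target (values are illustrative).
TARGET_URL = "https://example.com/category/shoes?page={page}"
ITEM_SELECTOR = "div.product-card"
FIELDS = {"name": "h2.title", "price": "span.price"}
OUTPUT_FILE = "scraped_products.csv"

def scrape(pages: int = 3) -> None:
    """Crawl a few listing pages and write the selected fields to a CSV file."""
    with open(OUTPUT_FILE, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(FIELDS))
        writer.writeheader()
        for page in range(1, pages + 1):
            html = requests.get(TARGET_URL.format(page=page), timeout=30).text
            soup = BeautifulSoup(html, "html.parser")
            for item in soup.select(ITEM_SELECTOR):
                row = {}
                for field, selector in FIELDS.items():
                    node = item.select_one(selector)
                    row[field] = node.get_text(strip=True) if node else ""
                writer.writerow(row)

if __name__ == "__main__":
    scrape()
```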
Bot Attack Threat
Armed with our bot script and the proxy infrastructure, we can now launch our bot attack. Although we have picked scraping, the bots can be configured to run any custom script against whatever you like on the victim's infrastructure. Just to recap, the bot attack is now going to bypass the following old school bot detection techniques:
❌ IP Reputation fails with millions of residential and mobile proxies
❌ JS signature fails as the bots rotate each and every time
❌ WAF rate limiting is bypassed by slowing the bots to mimic human visits with a custom script. The bots don’t care, they can go slow and low (see the pacing sketch after this list).
❌ Throwing CAPTCHA on all visits - bots bypass the CAPTCHA.
❌ Issuing a challenge page for every request to further fingerprint the client is going to make the site unusable, and the proxy clients using real devices may well pass the fingerprinting test anyway.
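As a hedged illustration of the "slow and low" point above, rate-limit evasion needs nothing more sophisticated than randomised, human-scale pauses between requests. The threshold and delay numbers below are invented for the example.

```python
import random
import time

import requests  # third-party: pip install requests

# Hypothetical scenario: the target WAF rate-limits at, say, 60 requests/minute per IP.
# Staying far below that just means pausing like a human browsing between pages.
MIN_DELAY_SECONDS = 8
MAX_DELAY_SECONDS = 25

def low_and_slow(urls):
    """Fetch URLs one at a time with random, human-looking pauses in between."""
    for url in urls:
        requests.get(url, timeout=30)
        # Sleep a random interval so the request rate never approaches the
        # per-IP threshold the WAF is watching for.
        time.sleep(random.uniform(MIN_DELAY_SECONDS, MAX_DELAY_SECONDS))
```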
How Does VerifiedVisitors work?
VerifiedVisitors learns from your traffic with our AI platform, so that we can not only help you manage bot threats, but also ensure your customers are prioritized and treated like VIPs, rather than as something less than human, as current CAPTCHA methods do.
The VerifiedVisitors AI platform identifies the visitors and places them into cohorts, as set out in the screenshot. You can clearly see, by risk type, each cohort broken down by actual threats, which are dynamically verified over time. This allows us to trust but verify: for repeat visitors and known good bots, for example, we are able to use the ML to track their behavior over time, to ensure they are legitimate verified visitors that we actually want.
To stop the Brightdata attacks, we then need two dynamic rules to be in place:
- Rule 1 selects the first-time visitor cohort: visitors that we have never seen before. Inevitably this will include human visitors as well as the bots.
- Rule 2 serves a challenge page to just this new visitor cohort, and performs a footprint check of the client to determine if it’s human or bot. At this first-visit stage, these checks allow us to look for the tell-tale signs of the bot platform itself used to launch the bot attacks. We use hundreds of signals to look for these signs. The Bots as a Service platforms have to get every signal value correct - we just have to detect one or two errors and inconsistencies in the platform footprint (see the sketch after this list).
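As a sketch only (the rule structure, signal names and thresholds below are hypothetical, not the actual VerifiedVisitors configuration), the two dynamic rules can be thought of as a small pipeline: route the first-time-visitor cohort to a challenge step, check the client footprint for inconsistencies there, and let everything else pass straight through.

```python
from dataclasses import dataclass

@dataclass
class Visitor:
    ip: str
    seen_before: bool
    footprint_signals: dict  # signal name -> observed value from the challenge page

def footprint_inconsistencies(signals: dict) -> int:
    """Count platform 'tells': signals whose observed value contradicts the claimed device.

    The signal names here are illustrative examples only; one or two mismatches
    are enough to reveal the bot platform.
    """
    expected = {
        "touch_support_matches_ua": True,
        "timezone_matches_ip": True,
        "webdriver_flag_absent": True,
    }
    return sum(1 for name, value in expected.items() if signals.get(name) != value)

def apply_rules(visitor: Visitor) -> str:
    # Rule 1: only the first-time visitor cohort is routed to the challenge.
    if visitor.seen_before:
        return "allow"   # repeat visitors / verified cohorts pass straight through
    # Rule 2: serve the challenge page and check the client footprint.
    if footprint_inconsistencies(visitor.footprint_signals) > 0:
        return "block"   # the BaaS platform got at least one signal wrong
    return "allow"       # looks like a genuine first-time human visitor

print(apply_rules(Visitor("203.0.113.7", False,
                          {"touch_support_matches_ua": True,
                           "timezone_matches_ip": False,
                           "webdriver_flag_absent": True})))  # -> block
```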
This attack type is quite extreme. It’s rare for bot attacks to just send one request and rotate each and every time. However, even in this extreme case, we are able to identify the threat cohort, and then successfully mitigate the attack without affecting the legitimate repeat visitors and regular users of the service.
Benefits of AI Cohort Bot Protection
✅ You treat your customers like VIPs and ensure they are not affected by any rules
✅ Bot traffic can be blocked before it hits the website so you don't suffer any spikes, additional CPU or bandwidth, and the bot simply fails.
✅ The holding page is quick, usually taking 1-2 seconds, and doesn’t require a CAPTCHA or other challenge. It can also be custom designed with messaging, product pages, service updates or other valuable information that you want to give to the client
✅ Filtering out the bots makes it much easier to see the real visitors and understand your analytics to help you convert. For example, every site has a proportion of quick abandonments, usually under 30 seconds. How much of this traffic is bot traffic, and how much is a tell-tale sign that customers really don’t like your website design? Understanding the real verified visitors hitting your site also allows the AI to spot anomalies. For example, a large spike in first-time visitors that never convert, but are simply distributed across the site, crawling pages sequentially over time, is a sure sign that this kind of bot scraping attack is taking place.
How does the Bots as a Service (BaaS) 99.99% success rate stack up?
Now that we’ve shown you the detailed walkthrough of Brightdata, how does its 99.99% rate stack up? Without definitively measuring across a benchmarked set of target endpoints, it's hard to say for sure. As we have seen, it's certainly detectable, but we can also say it would defeat the old school bot detection methods pretty handily, as we’ve shown above. Serving CAPTCHA or a challenge page for every visitor would make the site unusable.
Many companies simply regard their website content as marketing content that has been approved for release to the public. If it’s scraped, then so be it. However, what this fails to take into consideration is the systematic data mining of the entire datasets involved. For example, Airbnb may not worry about web marketing or a few listings being scraped, but the systematic data mining of every single listing in a specific country or region certainly represents a serious threat to its business model and IP.
Bot Detection Accuracy Metrics
The vast majority of bot detection vendors do not have a robust confusion matrix model, as they aren’t using Machine Learning at the heart of their detection model. Their models aren’t looking at the whole picture.
When VerifiedVisitors develops our models, we prioritise minimising false negatives over minimising false positives.
Why? The reason is simple: if we challenge a small additional percentage of humans, we don’t breach our zero trust. Labelling a bot as a human, and creating a false negative, is what we need to avoid at all costs. The false negatives create the real problem, as we’ve breached our zero trust principles and allowed a bot access to our protected space.
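A short worked example, using invented traffic numbers, shows why a headline false positive rate on its own tells you very little, and why we weight false negatives so heavily:

```python
# Invented numbers for illustration: 1,000,000 requests, 50,000 of them bots.
true_positives  = 48_500   # bots correctly challenged or blocked
false_negatives = 1_500    # bots waved through as humans  <-- the figure that hurts
false_positives = 95       # humans incorrectly challenged
true_negatives  = 949_905  # humans correctly allowed

false_positive_rate = false_positives / (false_positives + true_negatives)
false_negative_rate = false_negatives / (false_negatives + true_positives)

print(f"False positive rate: {false_positive_rate:.4%}")  # ~0.0100% -- the headline figure
print(f"False negative rate: {false_negative_rate:.2%}")  # 3.00% -- 1,500 bots let in
```

In this illustration the vendor can truthfully quote a 0.01% false positive rate while still letting 1,500 bots through, which is exactly why the false negative rate is the figure to ask about.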
Trust the Playback Mode
VerifiedVisitors has a playback mode which allows you to set up the rules and cohorts you want, and then verify the results of the AI platform detectors. This allows you to measure the quantifiable effect of the bot mitigation and ensure the quality of the detectors. The only true measure that counts is how effective the bot prevention is on your endpoints. Focusing on a structured and analytical framework for measuring that is what is important. All the rest is marketing fluff. You can see a sample playback table below, which includes all the detailed analytics and detector types, so you can verify the efficacy of the bot detections.
We’ve all seen the various web speed detectors on the Internet. Google in particular has been increasing the importance of page load speed to its overall page rankings, in a simple recognition that slow load speeds kill views.
Many bots target IT infrastructure to understand the full tech stack and all components used. In many cases this can be harmless data. Which webserver, Content Distribution Network (CDN), or e-commerce platform you use isn't exactly a state secret. So what, right? Legitimate commercial services such as BuiltWith then package the data up, allowing sales and marketing teams to precisely target domains with the exact spec and build they have solutions for. It can be helpful to the entire supply chain - sellers get precise targeting, and buyers get solutions that, god forbid, they may actually need. However, on the illegitimate side, you can easily see the opportunity for hackers who can target known infrastructure vulnerabilities. They can launch illegal bots to quickly and easily find compromised versions and weak tech stacks across the web. Of course, this is just another reason to ensure we’re always updating software and have robust version controls in place, but we all know that’s not always the reality. Bots can extract very detailed information, right down to specific releases and versions. Often these generic crawlers will hijack an existing common user agent string, pretending to be a legitimate search or media crawler.
Security companies use bots for looking at web platform stacks, vulnerabilities and software utilisation metrics. These bots can discover a surprising amount of detail on the tech stack. Obviously front-end components, widgets, Content Management platforms etc. are the easiest to detect, but it doesn’t stop there. Infrastructure details, the CDN, web servers, cloud provider, e-commerce platform, operating system and WAF versions are all accessible with the right tools. This whole area is simply a two-edged sword. On the one hand, research tools that assess, find and quantify vulnerabilities are vital and necessary. On the other hand, do you really want this data in the public domain?
Specialised search engines have crawlers that scrape data for a particular vertical market or sector. Typical applications are for company data monitoring, for example amongst SME companies, government sectors, and other vertical markets. Typically, these are very niche players, and won't add much if any traffic, unless in a very niche sector, where they can add value.
If you use tools such as Slack, these operate bots which take meta-tag data and image proxy data to preview the linked webpage directly in the tool. You can safely select all.
Bots are often used by translation services. They look for actual examples of multi-lingual usage to enhance their knowledge base.
Select the monitoring platforms you actually use. Most of the monitoring tools use very light pings to determine status, but some of them can be surprisingly heavy, or can be programmed to check for many resources at very short intervals.
Often these bots gather, cache and display info about your content such as the title, description, and thumbnail image from your pages so that rich content links can be viewed in the social media platform. These are the minor social media platforms. They probably won't generate much traffic, but don't cause harm.
Often these bots gather, cache and display info about your content such as the title, description, and thumbnail image from your pages so that rich content links can be viewed in the social media platform. These are the major social media platforms, including Facebook, and it's safe to select all.
Helpful to preview your content on partner platforms.
Scraping tools and scraping packages have evolved to make life easier for data scraping. There is a wide range of scraping tools available depending on your language preference and skill level, all the way from point-and-click SaaS packages such as Brightdata, to Python libraries, Puppeteer for Node JS, and OpenBullet for the .NET crowd. It’s important to note that some of these platforms have been built to avoid detection, and if the scraper is using a custom script they can easily take out any identifiable signature data from the platform. For example, Brightdata uses a very large database of millions of domestic IPs and actively seeks to avoid detection. VerifiedVisitors recommends blocking any known scraping tools and services, as well as using our automated detectors to prevent scrapers and those services trying to hide their platform origins.
These bot services are useful for displaying thumbnails of your content as a preview. If you are a content provider, or promoting a product or service and want to distribute your content, you will most likely select all.
Pen test tools and vulnerability scanners are an essential part of a cybersecurity program. We recommend you just allow access to the tools you actually use, and block the rest. In our database of pentest tools, we currently have over 50 identified bots in the wild that can put a strain on system resources. Many of these are legitimate players active in promoting best cybersecurity practice. If you are running a set of bespoke pen tests, then you can also use our custom whitelist to ensure just the particular pen test suite you use is allowed.
If you publish content and have RSS or other feeds, these bots are useful and help to distribute and aggregate content. If you don't have any feeds, then don't select them.
Many of these bots will use scrapers to take news content from your site, and display it, often without attribution. However, they can sometimes significantly boost the distribution of news articles, and push traffic. If you want to get your content out there you can select all, or just cherry-pick the top news aggregators, e.g. Google News, Apple News, Medium and LexisNexis.
These bots work to accelerate mobile pages (AMP) with caching to speed up mobile page load. Google is increasingly making page load speed a major factor in SEO rankings, so it is well worth optimising your content for mobile accelerators.
Minor search engines include services such as ask.com, DuckDuckGo, Mojeek, Neeva, and Qwant. Many of these are focused around privacy or offer more specialised search functions. For example, Qwant, unlike many of the major search engines, doesn’t track searches, set cookies, or resell personal data to advertisers and others for enhanced targeting. Most sites tend to include the minor search engines - although they produce a fraction of the traffic of the major search engines, they nevertheless do bring in potential extra customers. They are less likely to be targeted by bot impersonators.
These bots can help boost social media traffic to your site. They often add some type of data enrichment and display the enriched data as a service and supply the new content feeds into a wide range of content partners
The minor international search engines are often territory or region specific. Allowing these bots will be largely determined by the countries your company trades in.
Although it goes without saying that most sites will want to ensure the major search engines are crawling your website effectively, fake bots impersonating the major search engines are very common. Why? Cybercriminals know that you won’t want to block the major search engines, and won’t bother to verify the bot to ensure it’s legitimate. Worse still, you whitelist a malicious impersonator bot. Search crawlers will often crawl your entire site. Bing in particular has a very heavy footprint as it crawls, so widespread crawling is ‘normal’ behaviour from search engines. However, once whitelisted, the malicious bot can hide in this otherwise legitimate bot traffic and simply clone your entire site, and set up a sophisticated bait and switch credential attack on your customer base. It’s free to crawl and steal content, IP, images, prices, or other commercially sensitive data. Verifying search bots is also far from straightforward. Bing helpfully publishes a list of IPs that can be used to authenticate its origins. This list needs to be constantly updated as ranges change over time. Just recently, we also tracked IPs that are not in the list but are nevertheless legitimate. It's very frustrating that Bing can’t sort out the basics of its own validation. VerifiedVisitors authenticates all the major search engine bots, and ensures that the bot is valid. We look at the digital provenance, and also at the actual bot behaviour, to ensure that only the verified search engine bots you want are crawling your site. Each of the search engines has specific guides on how to verify the user agent and bot origination. Although to date it's only Bing that has failed its own verification, the verification data frequently changes, and it's only too easy to whitelist what looks like a legitimate bot without automated tools checking and verifying each crawler for you. Using VerifiedVisitors also gives you detailed information on each search engine and its crawling activity. You can use the search engine panel to see the last crawled dates, requests made and crawl volume, which can be helpful to see how often your site is indexed.
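To show what verifying a claimed Googlebot visit actually involves, here is a minimal sketch of the forward-confirmed reverse DNS check that Google documents: resolve the visiting IP to a hostname, confirm the hostname belongs to googlebot.com or google.com, then resolve that hostname forward again and check it maps back to the original IP. Error handling is kept minimal, the sample IP is illustrative, and, as noted above, some engines also require checking their published IP ranges.

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for a visitor claiming to be Googlebot."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)               # reverse DNS: IP -> hostname
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)   # forward DNS: hostname -> IPs
    except OSError:
        return False
    return ip in forward_ips                                    # must map back to the same IP

# Example: a request arrives with a Googlebot user agent from this IP.
print(is_verified_googlebot("66.249.66.1"))
```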
Looking for broken links and link integrity is a vital part of website health. Crawler bots are used to check for link integrity, and ensure the correct links are maintained over time. Please also see the SEO category, where link checkers are integrated into comprehensive platforms. Most users will allow link checkers. Disabling them may well result in 404s from the links - and actually result in your site losing links and domain authority.
These bots target the HR vacancy / job pages section on your website. Smaller companies don’t tend to advertise new job openings widely, and recruiters try to exploit the potential opportunity by aggregating the data from millions of HR / vacancy / open position bot searches.
For many clients, this boils down to where you trade and your target audience. If you’re not interested in, for example, Chinese or Russian clients, or you actively don’t want international traffic, then it also doesn’t make sense to let Yandex and Baidu crawl. Although you certainly can block bots in robots.txt, remember that this presumes the bots obey your instructions - often they just don’t. VerifiedVisitors actually enforces your instructions and verifies you are blocking the bots you don’t want. There is also a huge number of local international search engines that you’ve probably never heard of.
Copyright protection services use bots to act as enforcement agents, ensuring website content is not in breach of any royalty licences and that sites are not stealing copyrighted content or abusing trademarks. These automated tools can lead to legal cease and desist letters for potential brand and trademark infringement, which need to be responded to, and can therefore cause legal costs to be incurred to defend against the claims. They can also result in copyright claims if, for example, a copyrighted image is used without the correct licence. It’s a complex area, as image licences often vary by image according to the context of its usage. For example, a corporate brand may licence PR images for news and media distribution only - and not for use on commercial websites. You can fall foul of these complex licence agreements by using popular images on the Internet without checking. VerifiedVisitors has identified 36 IP / brand protection bots currently actively crawling the internet. Although we do recommend discretion for these bots, we would strongly recommend you do an internal IP audit and ensure your content isn’t in breach of any brand compliance laws and regulations first. Larger companies with their own content libraries manage the process automatically, but smaller companies struggle with compliance. The widespread use of royalty-free services such as Unsplash helps to alleviate the risk, but it’s only too easy to use a copyrighted image in a blog.
Alexa rankings and other web visitor statistics services can help to determine the legitimacy of your web visitors and provide independent verification of your audience reach, which is useful for advertisers and trading partners. Most web site owners will want to select all of these bots. If you don't want these services ranking your web visitor levels, you can exclude them.
These bots are often looking to scrape personal details of key decision makers, the leadership team, and senior stakeholders to use in marketing-related service aggregations. Most companies will keep their management team and staff pages up-to-date on their own websites, but will exclude any contact information to prevent spamming. The scraped website data is then typically aggregated with yet more scraped data from LinkedIn, contact databases, and a huge variety of other direct marketing data, to enrich both contact data and decision maker profiles for the marketeers.
Specialist image and video search engines include the major Google image and video services. If you have lots of video and image content, it's going to be essential to have these indexed properly for your SEO.
These historical web indexing bots are archiving bots or change recording bots that examine web content over time. Most users will know the Wayback Machine at archive.org, which records changes to your website over time.
Services provided by finance partners to provide monitoring or alerting to financial transactions on your site.
These bots are used by e-mail providers and ISPs to support their individual account offerings. They will perform tasks such as domain verification and ownership checks. Select the appropriate bot if you use the services of any of the listed providers.
These webmaster tools are typically suites of network tools, batch editors, DNS services etc. that may ping your site. You can select the ones you use on your own site, or select all
If you still use cronjob services, or have legacy cron jobs you need to support, you can consider enabling the specific cronjob services you use.
These bots are used by Content Distribution Networks (CDNs) to speed up page serving to desktop or mobile. Safe to select all.
Accessibility bots are bots that specifically check compliance for best accessibility practice. This involves basic checks for dynamic font size and legibility, suitable alt.text for images, keyboard navigation alternatives, support for assistive technology, and ensuring that e.g. CAPTCHA isn’t just constrained to visual and has some multi-media choices. Some bots specifically check against the Web Content Accessibility Guidelines (WCAG) - which is a list of criteria your website or mobile application may need to meet in order to be legally compliant in your country. Although most responsible webmasters at larger companies will always do their best to ensure maximum accessibility, many smaller sites aren’t aware of the issues, and / or don’t have the resources. However, there are many online accessibility checking scanners, that can automate reporting on the compliance process. There are also advocacy services that specifically target non-compliant sites. The bots will collect proof of non-compliance.
AI and digital content enrichment bots supplement structured data sources with additional enrichments from scraped web data and other sources. For example, marketing services offer additional insights into companies and individuals from scraping the company websites and LinkedIn pages, so marketers can offer more 'personalised' outreach. Other bots add contextual information, or more data labelling, allowing the AI algorithms to perform better.
AI Assistants enrich services such as Alexa by using a web crawler to provide additional context to its AI assistant services.
If you are an e-commerce vendor interested in promoting affiliate relationships, then select these; otherwise you can safely exclude them.
If you don't have advertising on your web site, you can safely exclude all of these bots. If you are running ad campaigns, the bots can help to validate your ads and the page context, so your ad inventory can potentially reach a larger audience. Advertising bots check for invalid traffic (IVT) to help alleviate ad fraud.
Now that ChatGPT is busy writing homework, detecting plagiarism is harder than reading Finnegans Wake. These largely academic bots crawl web content, looking at word utilisation, contemporary expressions of new words and phrases, as well as plagiarised content.