Web Site Data Collection
Many bots probe IT infrastructure to map a site’s full tech stack and all the components it uses. In many cases this data is harmless: which web server, Content Distribution Network (CDN), or e-commerce platform you run isn’t exactly a state secret. So what, right? Legitimate commercial services such as BuiltWith package the data up, allowing sales and marketing teams to precisely target the domains whose spec and build match the solutions they sell. This can help the entire supply chain: sellers get precise targeting, and buyers get solutions that, god forbid, they may actually need.

However, on the illegitimate side, the opportunity for hackers is easy to see: they can target known infrastructure vulnerabilities, launching illegal bots to quickly and easily find vulnerable versions and weak tech stacks across the web. Of course, this is just another reason to keep software updated and to have robust version controls in place, but we all know that’s not always the reality. Bots can extract very detailed information, right down to specific releases and versions. Often these generic crawlers will hijack a common user agent string, pretending to be a legitimate search or media crawler.
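To make the point about release-level detail concrete, here is a minimal Python sketch of how a crawler might read product and version numbers from HTTP response headers. The header values, the `fingerprint` function name and the regular expression are illustrative assumptions rather than the method of any vendor listed here; real fingerprinting tools typically combine many more signals.

```python
import re

# Hypothetical response headers; real values vary by site and configuration.
SAMPLE_HEADERS = {
    "Server": "nginx/1.18.0 (Ubuntu)",
    "X-Powered-By": "PHP/7.4.3",
    "Content-Type": "text/html",
}

# Matches "product/version" tokens such as "nginx/1.18.0".
PRODUCT_VERSION = re.compile(r"([A-Za-z][\w.-]*)/(\d+(?:\.\d+)*)")


def fingerprint(headers):
    """Return {product: version} from headers that commonly reveal software."""
    found = {}
    for name in ("Server", "X-Powered-By"):
        value = headers.get(name, "")
        for product, version in PRODUCT_VERSION.findall(value):
            found[product.lower()] = version
    return found


print(fingerprint(SAMPLE_HEADERS))  # {'nginx': '1.18.0', 'php': '7.4.3'}
```

Running the same function against your own site’s headers is a quick way to see what a data collection bot can learn without ever rendering a page.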
Vendor
Bot Service
Recommendation
Description
Dataprovider.com
Dataprovider site explorer
Recommended
Not recommended
ZoomInfo Powered by DiscoverOrg
Datanyze
Recommended
Not recommended
Datanyze is a worldwide leader in technographics. The company uses machine learning and proprietary methodologies to capture the technologies used or implemented by more than 35 million companies globally. It is part of ZoomInfo, a leading B2B Growth Acceleration Platform for sales and marketing teams. Datanyze gathers information about your website technology and business in order to give its customers a more complete picture of your business to support their sales and marketing efforts. It's likely you'll be crawled if one of their customers is interested in your company or market segment.
Yandex
Yandex Webmaster Bot
Recommended
Not recommended
The Yandex.Webmaster indexing robot. This provides data for the Yandex webmaster platform.
Wappalyzer
Wappalyzer
Recommended
Not recommended
Wappalyzer is a cross-platform utility that uncovers the technologies used on websites. It detects content management systems, ecommerce platforms, web frameworks, server software, analytics tools and many others.
Spaziodati
Spaziodati
Recommended
Not recommended
A big data company that helps businesses with Enterprise Data solutions and B2B Lead Generation solutions. They will come looking for publicly available data to support their product offerings.
Similartech
Similartech
Recommended
Not recommended
Similartech crawls your site to add it to their database of websites and the technologies used to build them. They claim to scan more than 30 billion web pages per month and to monitor and analyze over 317 million domains.
SafeDNS
Categorization Crawler
Recommended
Not recommended
SafeDNS offers a wide selection of secure, fast and reliable solutions for content and web filtering. They state that their main reason for collecting web pages is to correctly categorize Internet resources and to develop new technologies and products for SafeDNS.
Nominet
Nominet .UK Domain Registry
Recommended
Not recommended
If you use a .UK domain you may wish to allow this bot from Nominet, who operate the .UK domain registry. This bot regularly collects data on whether .UK domain names resolve, where they are hosted, whether they are used for email and whether a website is in place. As part of this, they collect information about the landing page and the About Us or Contact Us pages of your website so that they can categorize the website type (e.g. blog, parking page etc). They may also perform additional checks, such as whether you have an SSL certificate and whether there is a matching domain name in a different top level domain (e.g. .com), and collect similar information about those domains in order to see how they differ from the .UK domain name. In addition, they collect information on which content management systems (CMS) websites are using, along with version numbers, in an attempt to identify security vulnerabilities. This bot should be well behaved; for example, they restrict the number of times they visit websites which use the same IP address. Any information gathered is used to help Nominet better understand how .UK domains are used by registrants, identify security vulnerabilities and identify changes over time.
Neticle Labs
Neticle
Recommended
Not recommended
Neticle provides solutions to read and analyze website text. Neticle's crawler will make a small series of GET requests to your site to collect content for their various text analysis products, which are typically used by research and comms teams.
Net Systems Research
Net Systems Research
Recommended
Not recommended
Net Systems Research is an independent research organization focusing on a range of topics in internet security. Their crawler is used to survey and analyze real world network systems to better understand and study internet security problems. The crawler will make a small number of requests to your site, spread over a few minutes.
Netcraft
Netcraft
Recommended
Not recommended
Netcraft has explored the internet since 1995 and is a respected authority on the market share of web servers, operating systems, hosting providers, ISPs, encrypted transactions, electronic commerce, scripting languages and content technologies on the internet.
hyScore.io
hyScore.io
Recommended
Not recommended
hyScore.io offers a platform allowing companies to understand and structure text, website, document, picture, audio and video data based on content. Businesses use hyScore.io to fetch and analyse website content using a crawler which behaves similarly to many search engine crawlers. Pages are only ever visited on demand, so if the hyScore.io crawler has visited your site, someone (in your company or external) requested content analysis and insights for a page where the hyScore.io information was either not yet available or needed to be refreshed. For this reason, you will often see a request from the hyScore.io crawler shortly after a user has visited a page. They state that the crawler is engineered to be as friendly as possible, for example by limiting request rates to any specific site and automatically backing away if a site is down, slow or repeatedly returning non-200 (OK) responses. This solution is used by a number of third-party platforms, such as Data Management Platforms (DMP), Demand Side Platforms (DSP) and many others. These are in turn often used by other third-party systems (Adserver, DMP, Brand Safety, Ad Fraud…) as part of the customers’ strategy (Agencies, Brands, Publishers, etc.). Check with your team whether they use hyScore.io, or a platform which uses it, before deciding to allow this bot.
Headline
Headline
Recommended
Not recommended
Headline is a VC investment company based in San Francisco. Their crawler is designed to trawl websites for publicly available information such as home page content, job postings, team pages and location references. They do this to discover interesting companies.
Google
GoogleOther
Recommended
Not recommended
GoogleOther is a new crawler from Google. It's a generic crawler that may be used by Google's internal product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development. The GoogleOther crawler always obeys robots.txt rules.
GetProxi.es
GetProxi.es
Recommended
Not recommended
A Spanish site which uses a bot to check for proxy sites that are active and working.
Censys
Internet Measurement
Recommended
Not recommended
Internet Measurement is operated by Driftnet.io. The purpose of this crawler is to measure services that network owners and operators have publicly exposed. If you don't want this third party crawling your site, do not allow this crawler.
BuiltWith
BuiltWith
Recommended
Not recommended
BuiltWith allows anyone to interrogate a website to find out what technologies are in use on it. Users can access this from the BuiltWith website and via browser plugins, with a freemium offering and paid-for tiers too. If you see this bot, someone in your team, a partner or a potential partner may be checking out your web tech stack!
Babbar
Babbar
Recommended
Not recommended
Babbar crawls the web in order to measure it, calculating helpful metrics (popularity, trust, categorization) along the way. Its goal is to allow its users to estimate the trust, popularity and topic of any website. Babbar helps you find the best media in which to place your ads or links.
1&1
IONOS
Recommended
Not recommended
IONOS Crawler is the web crawler of IONOS. Its job is to constantly crawl the web in order to gather information that allows 1&1 to improve their hosting offering. If you are happy for 1&1 to gather information on your site then please allow this bot.
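
For crawlers that honour robots.txt, such as GoogleOther in the table above, an allow or block decision can be expressed as a robots.txt rule and checked before you deploy it. The sketch below uses Python's standard `urllib.robotparser`; the rules, the `is_allowed` helper, the `SomeOtherBot` name and the `example.com` URL are hypothetical and only illustrate the idea. Keep in mind that robots.txt is advisory: a bot that spoofs its user agent string is unlikely to respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GoogleOther everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GoogleOther
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GoogleOther", "https://example.com/"))   # False
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/"))  # True
```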