In this article, you’ll learn about the Web Crawler in Pulse, what it does, and how it can improve your chatbot’s knowledge by gathering the latest information from your website.
Note: The Web Crawler is available in the Premium, Business, and Enterprise plans.
What is the Web Crawler?
The Web Crawler is a tool that collects content from your website and integrates it into your chatbot’s knowledge. This means that your chatbot will learn from the latest information on your site. Whether it’s product pages, blogs, or FAQs, the Web Crawler helps keep your chatbot accurate and responsive with the most current data.
You can use the Web Crawler to fetch URLs from your site and then crawl through those URLs to gather content. Depending on your plan, there are limits to the number of URLs you can fetch and the number of crawls you can perform each month.
How does the Web Crawler work?
The Web Crawler fetches URLs from your website in several ways:
- You can upload your sitemap to ensure the most complete results.
- Alternatively, you can add a root domain, and the Web Crawler will attempt to find all the URLs across the site.
- You can also add individual URLs manually.
Once the URLs are fetched, you can choose whether to include the content from specific URLs in the chatbot’s knowledge base or exclude certain URLs. The Web Crawler then crawls the selected URLs, gathering relevant content that your chatbot will use to answer questions.
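To give a feel for what URL fetching involves, here is a minimal sketch of how URLs can be discovered from a sitemap. It is only an illustration of the general technique, not Pulse's actual implementation, and the sitemap address is a placeholder.

```python
import xml.etree.ElementTree as ET

import requests


def fetch_sitemap_urls(sitemap_url: str) -> list[str]:
    """Download a sitemap.xml file and return the page URLs it lists."""
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Sitemap entries live in <url><loc>...</loc> elements under the
    # standard sitemap namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


urls = fetch_sitemap_urls("https://www.example.com/sitemap.xml")  # placeholder URL
print(f"Found {len(urls)} URLs")
```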
In this article, you can read how to use the Web Crawler.
How does the Web Crawler send requests to websites?
The Web Crawler is not an ordinary visitor; it has a few clever tricks to stay unnoticed. The requests it sends don't all come from a single IP address. An IP address is like your home address on the internet: just as a letter is delivered to your home address, most internet requests come from one fixed IP address.
But the Web Crawler is smarter. It routes its requests through multiple proxies before they reach a website. A proxy is an intermediary that forwards requests on behalf of the crawler, so the requests appear to come from different addresses, namely the addresses of the proxies. This makes it harder to figure out who the real sender is and helps the crawler stay more anonymous.
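Conceptually, routing requests through a rotating pool of proxies looks something like the sketch below. The proxy addresses are placeholders, and this is only a generic illustration of the technique, not how Pulse is configured internally.

```python
import itertools

import requests

# Hypothetical pool of proxy servers; each request goes out through a
# different one, so the target site sees varying source IP addresses.
PROXY_POOL = itertools.cycle([
    "http://proxy-1.example.net:8080",
    "http://proxy-2.example.net:8080",
    "http://proxy-3.example.net:8080",
])


def fetch_via_proxy(url: str) -> str:
    """Send a request through the next proxy in the rotation."""
    proxy = next(PROXY_POOL)
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    response.raise_for_status()
    return response.text
```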
Then there is the "user agent". A user agent is how a browser identifies itself to a website, for example as Chrome or Firefox. The Web Crawler uses a dynamic user agent, so it effectively tells the website: "Hello there, I'm just a regular visitor looking around!" As a result, the website sees an ordinary visitor rather than an automated scraper.
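A dynamic user agent can be illustrated in the same hedged way. The strings below are ordinary example browser user agents, and the helper is only a sketch of the general idea.

```python
import random

import requests

# Example browser user-agent strings; picking one per request makes the
# crawler identify itself as a regular browser instead of a fixed bot signature.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def fetch_as_browser(url: str) -> str:
    """Send a request with a randomly chosen browser user agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text
```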
What to consider when using the Web Crawler
When using the Web Crawler, it is crucial to minimize the load on the servers of the website being crawled. Sending too many requests in a short period can overload the server and lead to timeouts or errors, such as a 504 status code. This often occurs with websites that have limited server capacity (CPU, memory, or bandwidth). Additionally, large sitemaps with many URLs can quickly strain a server if crawled without restrictions.
To prevent this:
- Limit the number of concurrent requests by crawling URLs in smaller groups. Select a specific number of URLs and click "crawl" to process them incrementally.
- Check whether the server of the target website can handle the load.
- Adhere to any restrictions outlined in the website’s robots.txt file.
By managing server capacity and limiting concurrent requests, you can prevent technical issues on the website being crawled.
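These precautions can also be expressed in code. The sketch below is an assumption-laden illustration: it reuses the hypothetical fetch helper from the earlier sketch, the batch size and pause are example values, and it checks robots.txt before crawling URLs in small groups.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Check whether the site's robots.txt permits crawling this URL."""
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)


def crawl_in_batches(urls: list[str], batch_size: int = 10, pause_seconds: float = 5.0) -> None:
    """Crawl URLs a few at a time so the target server is not overloaded."""
    to_crawl = [u for u in urls if allowed_by_robots(u)]
    for start in range(0, len(to_crawl), batch_size):
        batch = to_crawl[start:start + batch_size]
        for url in batch:
            fetch_as_browser(url)   # fetch helper from the earlier sketch
        time.sleep(pause_seconds)   # give the server time between batches
```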
How many URLs and crawls can I use?
Each plan has a limit on the number of URLs you can fetch and the number of crawls you can perform per month. You can find an overview on our pricing page and below.
- Premium plan: 10,000 crawls per month
- Business plan: 25,000 crawls per month
- Enterprise plan: Custom limits based on your specific needs. Reach out to our support team for tailored options.
One crawl corresponds to crawling one URL.
If you reach your URL or crawl limit for the month, you’ll be notified and given the option to upgrade your plan or wait for the next billing cycle.
If you fetch more URLs than your limit allows, the number of URLs will appear in red. You can then remove some of the fetched URLs before starting the crawling process.