This article explains why a URL might not be crawled by the Web Crawler in Pulse.
The inability to crawl a URL can have various causes, ranging from technical limitations to settings on the target website. Below, we outline the most common reasons and how to resolve them.
What does "unable to crawl a URL" mean?
If a URL cannot be crawled, it means the Web Crawler cannot access the content of that URL. As a result, the information from the page cannot be added to your chatbot's knowledge base.
There can be several reasons why a URL is inaccessible, often related to website settings or technical issues.
Common reasons why a URL cannot be crawled
1. Robots.txt restrictions
Some websites have a file called robots.txt. This file provides instructions to crawlers about which parts of the website they are allowed or not allowed to access.
Solution:
Check the website's robots.txt file to see whether the URL is excluded for crawlers. You can do this by appending /robots.txt to the root domain (e.g., www.website.com/robots.txt).
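If you prefer to check this in code rather than in the browser, the sketch below uses Python's standard urllib.robotparser module; the user-agent string is a placeholder, since the exact name the Web Crawler identifies itself with is not documented here.

```python
# Minimal sketch: check whether robots.txt allows a given URL to be crawled.
# "ExampleCrawler" is a placeholder user agent, not the actual name of the Pulse Web Crawler.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.website.com/robots.txt")
robots.read()  # downloads and parses the robots.txt file

url = "https://www.website.com/some/page"
if robots.can_fetch("ExampleCrawler", url):
    print("robots.txt allows crawling:", url)
else:
    print("robots.txt blocks crawling:", url)
```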
2. Incorrect URL input
A URL cannot be crawled if it is entered incorrectly, for example because of a typo, a missing subdomain, or the use of HTTP instead of HTTPS.
Solution:
Verify that the URL is entered correctly. Try opening the URL in a browser to check if the page is accessible.
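For a quick programmatic sanity check on the URL itself, a small Python sketch using the standard library's urlparse could look like this (the checks shown are illustrative, not an exhaustive validation):

```python
# Minimal sketch: flag common URL-entry mistakes before crawling.
from urllib.parse import urlparse

def basic_url_check(url: str) -> list:
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("missing or unusual scheme (expected https://)")
    elif parsed.scheme == "http":
        problems.append("uses http; try https if the site supports it")
    if not parsed.netloc:
        problems.append("no domain found; check for typos or a missing scheme")
    return problems

print(basic_url_check("www.website.com/page"))           # flagged: no scheme, so no domain either
print(basic_url_check("https://www.website.com/page"))   # [] -> looks fine
```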
3. IP or access restrictions
Websites may block certain IP addresses or restrict access based on geographic location, often as an anti-scraping measure.
Solution:
If you manage the website, temporarily lift the block or restriction while the crawl runs. Using a proxy to bypass IP or geographic restrictions can also help.
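If you go the proxy route and already have a proxy available, routing a request through it looks roughly like the sketch below; the proxy address is a placeholder, and the example uses the third-party requests library, not a Pulse feature.

```python
# Minimal sketch: fetch a page through a proxy with the third-party "requests" library.
# The proxy address is a placeholder; replace it with one you actually control.
import requests

proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get("https://www.website.com/page", proxies=proxies, timeout=10)
print(response.status_code)
```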
4. CAPTCHA or anti-scraping measures
Some websites employ advanced techniques, such as CAPTCHA challenges, to block bots. These mechanisms can prevent the Web Crawler from accessing the content.
Solution:
If a website uses CAPTCHAs, it is unfortunately not possible to crawl the URL automatically. In this case, you can:
- Add the content manually to the chatbot.
- Temporarily disable anti-scraping measures to grant the crawler access.
5. Website server errors
Websites can be temporarily unavailable due to server issues. Common errors include:
- 404 - Not Found: The page no longer exists, or the URL is incorrect.
- 500 - Internal Server Error: The server encountered an issue processing the request.
- 504 - Gateway Timeout: The server took too long to respond.
Solution:
- Check if the website is accessible by opening the URL in a browser.
- Wait for a while and try again later, as temporary server issues often resolve themselves.
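To see which of these status codes a URL actually returns without opening a browser, a quick check with the third-party requests library might look like this:

```python
# Minimal sketch: report the HTTP status code a URL returns.
import requests

url = "https://www.website.com/page"
try:
    response = requests.get(url, timeout=10)
    print(url, "returned status", response.status_code)  # e.g. 200, 404, 500, 504
except requests.exceptions.RequestException as error:
    print(url, "could not be reached:", error)
```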
6. Server limitations on the website
If a server receives too many requests in a short time, it can become overloaded and block the crawler.
Solution:
- Limit the number of simultaneous crawls in Pulse. Avoid crawling all URLs at once and instead crawl them in smaller batches.
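To illustrate the idea of smaller batches, the sketch below fetches URLs a few at a time with a pause in between; the batch size and delay are arbitrary example values, not Pulse settings.

```python
# Minimal sketch: crawl URLs in small batches with a pause in between,
# so the target server is not hit with many requests at once.
import time
import requests

urls = [
    "https://www.website.com/page-1",
    "https://www.website.com/page-2",
    "https://www.website.com/page-3",
]

BATCH_SIZE = 5        # arbitrary example value
PAUSE_SECONDS = 10    # arbitrary example value

for start in range(0, len(urls), BATCH_SIZE):
    for url in urls[start:start + BATCH_SIZE]:
        response = requests.get(url, timeout=10)
        print(url, response.status_code)
    time.sleep(PAUSE_SECONDS)  # give the server a breather between batches
```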
Need help?
If you’re still experiencing issues crawling a URL, don’t hesitate to contact our support team at support@watermelon.ai. We’re happy to assist you!