Want to add web pages to your chatbot’s knowledge? This article explains how you can easily do this using the Web Crawler.
Note: The Web Crawler is only available in the Premium, Business, and Enterprise plans.
In this article, you'll learn what exactly the Web Crawler is and how it works technically.
Besides adding knowledge through instructions, AI searches, or the Document Scraper, you might also want to integrate information from your website into your chatbot. The Web Crawler does exactly that. It first retrieves all URLs you want to crawl, then processes the pages to add their content to your chatbot's knowledge. This simplifies chatbot maintenance by removing the need to manually add data.
1. Accessing the Web Crawler
To get started, open your chatbot and go to ‘Sources’ in Pulse to find the Web Crawler.
2. Adding your website
In the Web Crawler, you can easily add and manage your website’s URLs. There are three ways to add URLs, and you can combine these options:
- Entire sitemap: This is our recommended method, as it provides the most comprehensive list of URLs. Learn more about how to create a sitemap here.
Add the sitemap URL without a trailing slash: for example, https://website.com/sitemap.xml, not https://website.com/sitemap.xml/.
- Fetch URLs by the root domain: The crawler will attempt to discover URLs across the entire site.
- Add individual URLs manually: Use this option when you only want to add the content of specific pages, rather than the entire website.
After you’ve added the URL in the bar, click “Fetch links”. The Web Crawler displays the fetched URLs in a table, along with the time each URL was added. Depending on the number of URLs, fetching may take some time.
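To see why the sitemap format matters, here is a minimal Python sketch of what fetching links from a sitemap boils down to: normalizing the URL (dropping the trailing slash) and reading the `<loc>` entries from the XML. This is illustrative only; these helper names are not part of Watermelon.

```python
import xml.etree.ElementTree as ET

def normalize_sitemap_url(url: str) -> str:
    """Drop a trailing slash: '.../sitemap.xml/' -> '.../sitemap.xml'."""
    return url.rstrip("/")

def extract_urls(sitemap_xml: str) -> list[str]:
    """Collect every <loc> entry from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in ET.fromstring(sitemap_xml).findall("sm:url/sm:loc", ns)]

# A tiny sitemap in the standard sitemaps.org format.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://website.com/</loc></url>
  <url><loc>https://website.com/pricing</loc></url>
</urlset>"""

print(normalize_sitemap_url("https://website.com/sitemap.xml/"))  # https://website.com/sitemap.xml
print(extract_urls(sitemap))  # ['https://website.com/', 'https://website.com/pricing']
```

Each `<loc>` entry becomes one row in the Web Crawler's table, which is why a complete sitemap gives the most comprehensive list of URLs.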
Fetching the URLs is the first step. Once the URLs have been fetched, you can decide for each URL if you want to:
- Include or exclude URLs from crawling (determines whether the page content is added to the chatbot’s knowledge).
- Enable or disable the chatbot’s use of the URL in answers (determines whether the chatbot shares the URL during conversations).
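These two settings are independent for each URL: a page can feed the chatbot's knowledge without its link ever being shared in conversations, and vice versa. A minimal sketch of that per-URL data model (hypothetical names, not Watermelon's internals):

```python
from dataclasses import dataclass

@dataclass
class CrawlerUrl:
    url: str
    include_in_crawl: bool = True   # add this page's content to the chatbot's knowledge
    usable_in_answers: bool = True  # the chatbot may share this URL in conversations

urls = [
    CrawlerUrl("https://website.com/pricing"),
    CrawlerUrl("https://website.com/internal", include_in_crawl=False),
]

# Only URLs marked for inclusion are picked up when crawling starts.
to_crawl = [u.url for u in urls if u.include_in_crawl]
print(to_crawl)  # ['https://website.com/pricing']
```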
When you’ve made these decisions, you can start crawling:
- To crawl all included URLs at once, click “Start Crawling” in the top right.
- To crawl selected URLs, choose the URLs and click “Crawl link” in the menu.
Note: If crawling a root domain or sitemap finishes very quickly (within seconds), it may indicate that only a small part of the website was crawled. This can happen if the site is technically difficult to crawl or not fully accessible. Contact our support team for assistance in such cases.
The Web Crawler cannot retrieve information hidden behind buttons or dropdown menus. Such information must be added manually in the chatbot's instructions.
Crawl statuses
The URLs in the list can have different statuses. Here’s an overview of the statuses and their meanings:
| Status | Meaning |
| --- | --- |
| Crawled | The URL has been added to the chatbot’s knowledge |
| Not crawled | The URL has not yet been crawled |
| Queued* | The URL is waiting to be crawled |
| Excluded | The URL has been excluded from crawling |
How long does crawling take?
Crawling a website can take up to 24 hours, depending on the structure of the site and how easily it can be crawled. During this time, the Web Crawler will attempt to crawl inaccessible URLs up to 50 times. While this process is ongoing, the status of these URLs will be shown as ‘Queued’ (*). If a URL cannot be successfully crawled within 24 hours, the crawl for that URL will fail.
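Conceptually, this retry behaviour resembles a bounded retry loop that stops on the first success, after 50 attempts, or when the 24-hour window closes. A simplified sketch with hypothetical names, not Watermelon's actual code:

```python
import time

MAX_ATTEMPTS = 50
CRAWL_WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour crawl window

def crawl_with_retries(fetch, url, deadline=None):
    """Try to fetch a URL until it succeeds, the attempt budget runs out,
    or the crawl window expires. Returns (status, attempts)."""
    if deadline is None:
        deadline = time.time() + CRAWL_WINDOW_SECONDS
    attempts = 0
    while attempts < MAX_ATTEMPTS and time.time() < deadline:
        attempts += 1
        if fetch(url):              # fetch() returns True on a successful crawl
            return "Crawled", attempts
    return "Failed", attempts       # reported in the summary email

# Simulate a URL that becomes reachable on the third attempt.
responses = iter([False, False, True])
print(crawl_with_retries(lambda url: next(responses), "https://website.com/pricing"))
# ('Crawled', 3)
```

While the loop is still running, the URL would show as ‘Queued’; once either limit is hit without success, the crawl for that URL fails.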
You don’t need to keep the screen open while crawling. The process will continue even if you navigate to another page or log out of Watermelon. Once the crawl is complete, you’ll receive an email summarizing the results, including the number of URLs that failed to crawl.
Learn more about why a URL might not be crawled in this article.
3. Crawling URLs again
When your website’s content changes, you can easily update your chatbot’s knowledge by clicking “Start Crawling” (for all included URLs) or “Crawl link” (for selected URLs). This ensures that new or updated content is integrated into your chatbot’s knowledge base. Be mindful of your crawl limits when choosing to re-crawl all URLs.
4. Deleting a crawled URL
To remove a specific URL from the Web Crawler, click the three dots next to the URL and select Delete. You can also use the multi-select option and click Delete in the menu to delete multiple URLs simultaneously.
Note: Deleting a URL will also erase all the knowledge that the chatbot acquired from that specific URL.
5. Testing the chatbot with website knowledge
Once crawling is complete, you can test your chatbot with the newly acquired knowledge in the Interactive tester. This allows you to see how the chatbot uses the website’s content during conversations.
Important: If the information on your website conflicts with manually added instructions in your chatbot, the chatbot may mix knowledge sources, resulting in inconsistent answers.
Other
- Copying a link: Use the button to the left of the URL to copy it easily. This is helpful when checking the URL in your browser to determine whether it should be included in the chatbot’s knowledge.
- Hovering over long URLs: Hover over a long URL with your mouse to display it in full.
Need help?
If the Web Crawler results are not as expected, contact us at support@watermelon.ai. Our support team is happy to assist you!