How to use the Web Crawler

Note: The Web Crawler is only available on the Premium, Business, and Enterprise plans.

In this article, you'll learn what the Web Crawler is and how it works.

Besides adding knowledge through instructions, AI searches, or the Document Scraper, you might also want to integrate information from your website into your AI Agent. The Web Crawler does exactly that. It first retrieves all URLs you want to crawl, then processes the pages to add their content to your AI Agent's knowledge. This simplifies AI Agent maintenance by removing the need to manually add data.


1. Accessing the Web Crawler

To get started, open your AI Agent and go to Sources to find the Web Crawler.

2. Adding your website

In the Web Crawler, you can easily add and manage your website’s URLs. There are three ways to add URLs, and you can combine these options:

Entire sitemap: This is our recommended method, as it provides the most comprehensive list of URLs. Learn more about how to create a sitemap here.

Enter the sitemap URL without a trailing '/'. For example, use https://website.com/sitemap.xml, not https://website.com/sitemap.xml/

Fetch URLs by the root domain: This will attempt to find URLs across the site.
Add individual URLs manually: Use this option when you only want to add the content of specific pages rather than the entire website.
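To illustrate the sitemap option, here is a minimal Python sketch of what a sitemap contains and why the trailing '/' matters. It is not the Web Crawler's actual implementation. The function names normalize_sitemap_url and urls_from_sitemap are hypothetical, and the sample sitemap is made up for the example. The only external fact it relies on is the standard sitemap XML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def normalize_sitemap_url(url: str) -> str:
    """Strip surrounding whitespace and any trailing '/' from a sitemap URL."""
    return url.strip().rstrip("/")


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Made-up sitemap content for illustration only.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://website.com/</loc></url>
  <url><loc>https://website.com/pricing</loc></url>
</urlset>"""

print(normalize_sitemap_url("https://website.com/sitemap.xml/"))
print(urls_from_sitemap(sample_sitemap))
```

A sitemap lists every page explicitly, which is likely why it tends to give a more complete result than discovering URLs from the root domain.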

After you have added the URL in the bar, click Fetch links. The Web Crawler displays the fetched URLs in a table, which also shows when each URL was added. Depending on the number of URLs, fetching might take some time.

Fetching the URLs is the first step. Once the URLs have been fetched, you can decide for each URL whether you want to:

Include or exclude URLs from crawling (determines whether the page content is added to the AI Agent's knowledge).
Enable or disable the AI Agent's use of the URL in answers (determines whether the AI Agent shares the URL during conversations).
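The two per-URL settings above are independent choices. As a rough mental model, they can be pictured as two flags on each row of the table. The Python sketch below is illustrative only: the class UrlSettings, its field names, the default values, and the rule that only included URLs can be shared are all assumptions, not the product's data model.

```python
from dataclasses import dataclass


@dataclass
class UrlSettings:
    """Hypothetical model of one row in the Web Crawler table."""
    url: str
    include_in_crawl: bool = True   # page content is added to the knowledge
    share_in_answers: bool = True   # the AI Agent may show this URL in answers


def urls_to_crawl(rows: list[UrlSettings]) -> list[str]:
    """URLs whose content would be added to the AI Agent's knowledge."""
    return [row.url for row in rows if row.include_in_crawl]


def shareable_urls(rows: list[UrlSettings]) -> list[str]:
    """Included URLs the AI Agent may share (assumes excluded URLs are never shared)."""
    return [row.url for row in rows if row.include_in_crawl and row.share_in_answers]


rows = [
    UrlSettings("https://website.com/pricing"),
    UrlSettings("https://website.com/careers", include_in_crawl=False),
    UrlSettings("https://website.com/internal-faq", share_in_answers=False),
]

print(urls_to_crawl(rows))
print(shareable_urls(rows))
```

In this example, the internal FAQ page still contributes knowledge, but its link is not shown to visitors.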

When you’ve made these decisions, you can start crawling:

If your website relies on JavaScript to display content, enable the ‘Render JavaScript’ toggle. Keep in mind that pages rendered with JavaScript take longer to crawl, as they require additional processing to display the content.
To crawl all included URLs at once, click Start Crawling in the top right.
To crawl selected URLs, choose the URLs and click Crawl link in the menu.

Note: If crawling a root domain or sitemap finishes very quickly (within seconds), it may indicate that only a small part of the website was crawled. This can happen if the site is technically difficult to crawl or not fully accessible. Contact our support team for assistance in such cases.

The Web Crawler cannot retrieve information hidden behind buttons or dropdown menus. Such information must be added manually in the AI Agent's instructions.

Crawl statuses

The URLs in the list can have different statuses. Here’s an overview of the statuses and their meanings:

Status	Meaning
Crawled	The URL has been added to the AI Agent's knowledge
Not crawled	The URL has not yet been crawled
Queued*	The URL is waiting to be crawled
Excluded	The URL has been excluded from crawling

How long does crawling take?

Crawling a website can take up to 24 hours, depending on the structure of the website and how easy it is to crawl. During this period, the Web Crawler will attempt up to 50 times to crawl URLs that are not immediately accessible. While this is happening, the status of the URL is shown as Queued. If a URL is still not successfully crawled after 24 hours, the crawl for that URL will fail.
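The retry behavior described above can be sketched in Python as follows. This is a simplified illustration, not the Web Crawler's real code. The limits of 50 attempts and 24 hours come from this article. The function name crawl_with_retries, the fetch callback, the wait interval between attempts, and the "Failed" label are assumptions made for the example.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 50                     # from the article
MAX_DURATION_SECONDS = 24 * 60 * 60   # 24 hours, from the article


def crawl_with_retries(
    url: str,
    fetch: Callable[[str], bool],
    wait_seconds: float = 60.0,       # assumed; the real interval is not documented
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try to crawl a URL until it succeeds, attempts run out, or 24 hours pass."""
    deadline = clock() + MAX_DURATION_SECONDS
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if fetch(url):
            return "Crawled"
        # Stop if this was the last attempt or the next one would miss the deadline.
        if attempt == MAX_ATTEMPTS or clock() + wait_seconds > deadline:
            break
        sleep(wait_seconds)           # the URL would show as Queued while waiting
    return "Failed"


# Example: a page that only becomes reachable on the third attempt.
attempts_seen = []


def flaky_fetch(url: str) -> bool:
    attempts_seen.append(url)
    return len(attempts_seen) == 3


print(crawl_with_retries("https://website.com/pricing", flaky_fetch, wait_seconds=0))
```

The sketch also shows why a crawl can appear to stall near the end: a few URLs may still be waiting for their next attempt while all others are done.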

Sometimes, the crawl status may appear stuck between 90% and 100%. This means the Web Crawler is still trying to access a small number of remaining URLs. These URLs may be temporarily unavailable or require additional attempts to be successfully crawled.

You don’t need to keep the page open while the crawl is in progress. The process continues automatically, even if you navigate to another page or log out of Watermelon.

If you don’t want the crawl to continue, you can manually cancel it. The knowledge from all URLs that were successfully crawled up to that point has already been added to the AI Agent’s knowledge base, so you can start working with the available information immediately.

Once the crawl is completed or canceled, you will receive an email with a summary of the results, including the number of URLs that failed to crawl.

Learn more about why a URL might not be crawled in this article.

3. Canceling a crawl

Do you want to stop the crawling process while it’s still running? You can manually cancel it:

Click Cancel during the crawl.
All requests that were already in the queue will still be completed. After that, the process will stop automatically.
The knowledge from all URLs that have been successfully crawled up to that point will be retained and immediately available in your AI Agent.
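The cancel behavior above can be pictured with a small Python sketch: requests that are already queued finish, nothing new is queued, and the results so far are kept. This is a conceptual model under stated assumptions, not the actual implementation. The function crawl_until_cancelled, the batch_size parameter, and the idea that URLs enter the queue in batches are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_until_cancelled(
    urls: Iterable[str],
    batch_size: int,
    is_cancelled: Callable[[], bool],
    crawl_page: Callable[[str], None],
) -> tuple[list[str], list[str]]:
    """Crawl URLs in batches; return (crawled URLs, URLs never started)."""
    pending = deque(urls)
    crawled: list[str] = []
    while pending and not is_cancelled():
        # Move the next batch of URLs into the queue.
        queue = [pending.popleft() for _ in range(min(batch_size, len(pending)))]
        # Queued requests are always completed, even if Cancel is
        # clicked while this batch is still running.
        for url in queue:
            crawl_page(url)
            crawled.append(url)
    return crawled, list(pending)


# Example: Cancel is clicked while the first batch is being crawled.
state = {"cancelled": False}


def crawl_page(url: str) -> None:
    if url.endswith("/a"):
        state["cancelled"] = True


crawled, skipped = crawl_until_cancelled(
    ["https://website.com/a", "https://website.com/b", "https://website.com/c"],
    batch_size=2,
    is_cancelled=lambda: state["cancelled"],
    crawl_page=crawl_page,
)
print(crawled)
print(skipped)
```

In this example the first two pages are still crawled and kept, while the third is never started.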

Note: It is not possible to identify which IP addresses the Web Crawler uses to visit your website.

4. Crawling URLs again

When your website’s content changes, you can easily update your AI Agent's knowledge by clicking Start Crawling (for all included URLs) or Crawl link (for selected URLs). This ensures that new or updated content is integrated into your AI Agent's knowledge base. Be mindful of your crawl limits when choosing to re-crawl all URLs.

5. Deleting a crawled URL

To remove a specific URL from the Web Crawler, click the three dots next to the URL and select Delete. You can also use the multi-select option and click Delete in the menu to delete multiple URLs simultaneously.

Note: Deleting a URL will also erase all the knowledge that the AI Agent acquired from that specific URL.

6. Testing the AI Agent with website knowledge

Once crawling is complete, you can test your AI Agent with the newly acquired knowledge in the Interactive tester. This allows you to see how the AI Agent uses the website’s content during conversations.

If the information on your website conflicts with manually added instructions in your AI Agent, the AI Agent may mix knowledge sources, resulting in inconsistent answers.

Other

Copying a link: Use the button to the left of the URL to copy it easily. This is helpful when checking the URL in your browser to determine whether it should be included in the AI Agent's knowledge.
Hovering over long URLs: Hover over a long URL with your mouse to display the full URL.

Need help?

If the Web Crawler results are not as expected, contact us at support@watermelon.ai. Our support team is happy to assist you!