Site Crawl

Crawl your website to automatically build your AI agent's knowledge base. Access it from Admin > Site Crawl in your admin panel.

Overview

The site crawler visits your website pages, extracts their content, and creates searchable AI embeddings. This lets your AI agent answer questions based on the information on your website.

How It Works

The crawler runs a 3-phase pipeline:

| Phase | Description | Progress |
| --- | --- | --- |
| 1. Crawl | Visits pages and saves HTML | 0-50% |
| 2. Analyze | Detects repeating content (headers, footers, menus) | 50-60% |
| 3. Embed | Generates searchable AI embeddings | 60-100% |
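The phase-to-progress mapping above can be sketched as a small helper. The function and phase names here are illustrative assumptions, not the product's actual API; only the percentage ranges come from the table:

```python
# Map a phase plus its internal completion fraction to overall progress.
# Boundaries follow the table above: Crawl 0-50%, Analyze 50-60%,
# Embed 60-100%. Names are illustrative, not the crawler's real API.

PHASE_RANGES = {
    "crawl": (0, 50),
    "analyze": (50, 60),
    "embed": (60, 100),
}

def overall_progress(phase: str, fraction: float) -> float:
    """Return overall progress (0-100) for a phase that is
    `fraction` (0.0-1.0) of the way through its own work."""
    start, end = PHASE_RANGES[phase]
    return start + (end - start) * fraction
```

For example, halfway through the Analyze phase, `overall_progress("analyze", 0.5)` reports 55.0.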

Crawler Settings

| Setting | Description |
| --- | --- |
| Home Page URL | Starting URL for the crawl |
| Max Pages | Maximum pages to crawl (default: 100) |
| Max Depth | How many links deep to follow (default: 3) |
| Fresh Start | Delete all existing crawled content before crawling |
| Use Browser | Use headless browser for JavaScript-rendered sites |
| Custom Cookies | Add cookies for authentication or session management |
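As a rough sketch, the settings above might map onto a configuration object like this. The field names are assumptions for illustration, not the admin panel's actual schema; only the documented defaults (100 pages, depth 3) come from the table:

```python
# Illustrative crawl configuration mirroring the settings table.
# Field names are assumptions; the defaults for max_pages and
# max_depth are the documented ones.
crawl_settings = {
    "home_page_url": "https://example.com",
    "max_pages": 100,      # default: 100
    "max_depth": 3,        # default: 3
    "fresh_start": False,  # delete existing crawled content first
    "use_browser": False,  # headless browser for JS-rendered sites
    "cookies": {},         # e.g. {"session_id": "..."}
}
```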

Use Browser Option

Enable Use Browser when crawling:

  • Single-page applications (SPAs) built with React, Vue, Angular
  • Sites that load content dynamically with JavaScript
  • Pages requiring client-side rendering

When enabled, the crawler uses Puppeteer (headless Chrome) instead of simple HTTP requests. This is slower but handles JavaScript-rendered content.
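A rough way to tell whether a site needs the browser option: if a plain HTTP fetch returns a near-empty document whose visible content is injected by JavaScript (typical for React/Vue/Angular SPAs with an empty mount point like `<div id="root">`), static HTML alone won't capture the page. The heuristic below is my own sketch, not the crawler's actual detection logic:

```python
import re

def looks_like_spa(html: str) -> bool:
    """Heuristic sketch: the body is mostly an empty mount point
    (e.g. <div id="root"></div> or <div id="app"></div>) plus
    scripts, so content only appears after client-side rendering."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    content = body.group(1) if body else html
    # Drop script tags, then see how much visible text remains.
    stripped = re.sub(r"<script\b.*?</script>", "", content, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", "", stripped).strip()
    has_mount = bool(re.search(r'id=["\'](root|app)["\']', content, re.I))
    return has_mount and len(text) < 50
```

A page that passes this check is a good candidate for enabling Use Browser.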

Custom Cookies

Add custom cookies when crawling:

  • Sites requiring authentication
  • Pages behind login walls
  • Sessions with specific preferences

To add cookies:

  1. Enter the Cookie name (e.g., session_id, auth_token)
  2. Enter the Cookie value
  3. Click the + button to add
  4. Repeat for additional cookies

Cookies persist across all requests during the crawl, including redirects.
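Mechanically, the name/value pairs you enter are sent as a standard `Cookie` request header on each request. A minimal sketch of that encoding (the header format follows RFC 6265; the function name is mine):

```python
def build_cookie_header(cookies: dict[str, str]) -> str:
    """Join name/value pairs into a single Cookie header value,
    as sent on every request during the crawl, redirects included."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, `build_cookie_header({"session_id": "abc123", "auth_token": "xyz"})` yields `"session_id=abc123; auth_token=xyz"`.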

Running a Crawl

  1. Enter or verify your Home Page URL
  2. Adjust Max Pages and Max Depth as needed
  3. Enable Fresh Start only if replacing all content
  4. Enable Use Browser for JavaScript-heavy sites
  5. Add Custom Cookies if authentication is required
  6. Click Start Crawl
  7. Monitor progress through all 3 phases
  8. Review results when complete
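Conceptually, the crawl phase is a breadth-first walk of your site's link graph, capped by Max Pages and Max Depth. A minimal sketch over an in-memory link graph (a real crawler discovers links by fetching HTML over HTTP; the structure here is illustrative):

```python
from collections import deque

def crawl(links: dict[str, list[str]], start: str,
          max_pages: int = 100, max_depth: int = 3) -> list[str]:
    """Breadth-first traversal of a link graph, stopping at the
    page or depth limit, whichever comes first. `links` maps each
    URL to the URLs it links to."""
    visited, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, depth + 1))
    return order
```

Raising Max Depth widens how far from the home page the crawler reaches; Max Pages bounds total work regardless of depth.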

The crawl runs in the background — you can close the page and return later to check progress.

Failed URLs

Some URLs may fail during crawling:

  • Pages blocked by robots.txt
  • Login-required pages
  • Dead links (404 errors)

Failed URLs are displayed during and after the crawl for review.

Best Practices

  1. Crawl first, then fine-tune — Start with a broad crawl, then add documents and FAQs to fill gaps
  2. Use Fresh Start sparingly — Only when completely replacing all crawled content
  3. Check your robots.txt — Make sure it doesn't block the crawler from important pages
  4. Enable Use Browser for SPAs — If your site is built with a JavaScript framework, the browser option captures dynamic content
  5. Monitor failed URLs — Review failures to identify pages that need cookies or browser mode

Related Pages