Site Crawl
Crawl your website to automatically build your AI agent's knowledge base. Access it from Admin > Site Crawl in your admin panel.
Overview
The site crawler visits your website pages, extracts their content, and creates searchable AI embeddings. This lets your AI agent answer questions based on the information on your website.
How It Works
The crawler runs a 3-phase pipeline:
| Phase | Description | Progress |
|---|---|---|
| 1. Crawl | Visits pages and saves HTML | 0-50% |
| 2. Analyze | Detects repeating content (headers, footers, menus) | 50-60% |
| 3. Embed | Generates searchable AI embeddings | 60-100% |
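The progress ranges in the table can be read as a simple mapping from per-phase completion to the overall progress bar. A minimal sketch (the function and phase names are illustrative, not the product's actual code):

```python
# Phase boundaries follow the table above; the helper itself is illustrative.
PHASES = {
    "crawl":   (0, 50),    # visit pages and save HTML
    "analyze": (50, 60),   # detect repeating content
    "embed":   (60, 100),  # generate searchable embeddings
}

def overall_progress(phase: str, fraction: float) -> float:
    """Map completion within one phase (0.0-1.0) to the overall percentage."""
    start, end = PHASES[phase]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * fraction
```

For example, halfway through the Analyze phase corresponds to 55% overall.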
Crawler Settings
| Setting | Description |
|---|---|
| Home Page URL | Starting URL for the crawl |
| Max Pages | Maximum pages to crawl (default: 100) |
| Max Depth | How many links deep to follow (default: 3) |
| Fresh Start | Delete all existing crawled content before crawling |
| Use Browser | Use headless browser for JavaScript-rendered sites |
| Custom Cookies | Add cookies for authentication or session management |
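Taken together, the settings above amount to a small configuration object. A sketch with the defaults from the table (the field names are assumptions for illustration, not the product's API):

```python
from dataclasses import dataclass, field

@dataclass
class CrawlSettings:
    # Field names are illustrative; defaults match the settings table.
    home_page_url: str                 # starting URL for the crawl
    max_pages: int = 100               # maximum pages to crawl
    max_depth: int = 3                 # how many links deep to follow
    fresh_start: bool = False          # delete existing content first
    use_browser: bool = False          # headless browser for JS-rendered sites
    custom_cookies: dict[str, str] = field(default_factory=dict)
```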
Use Browser Option
Enable Use Browser when crawling:
- Single-page applications (SPAs) built with React, Vue, Angular
- Sites that load content dynamically with JavaScript
- Pages requiring client-side rendering
When enabled, the crawler uses Puppeteer (headless Chrome) instead of simple HTTP requests. This is slower but handles JavaScript-rendered content.
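Whether a page needs browser rendering can often be guessed from its raw HTML: SPA shells typically ship a near-empty body with a single JavaScript mount point. A rough heuristic sketch (the marker list is a common-convention assumption, not the crawler's actual detection logic):

```python
import re

# Common SPA mount-point markers (React, Vue, Next.js, Angular). Heuristic only.
SPA_MARKERS = ('id="root"', 'id="app"', 'id="__next"', 'ng-app')

def likely_needs_browser(html: str) -> bool:
    """Guess whether a page is a client-rendered shell needing JS execution."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    visible = re.sub(r"<[^>]+>", "", body.group(1)).strip() if body else ""
    has_marker = any(marker in html for marker in SPA_MARKERS)
    # A known mount point plus a near-empty body suggests client-side rendering.
    return has_marker and len(visible) < 50
```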
Custom Cookies
Add custom cookies when crawling:
- Sites requiring authentication
- Pages behind login walls
- Sessions with specific preferences
To add cookies:
- Enter the Cookie name (e.g., session_id, auth_token)
- Enter the Cookie value
- Click the + button to add
- Repeat for additional cookies
Cookies persist across all requests during the crawl, including redirects.
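In effect, the name/value pairs you enter are serialized into a single Cookie request header that is attached to every request. A minimal sketch of that serialization (header format per RFC 6265; the crawler's internals may differ, and the User-Agent string is illustrative):

```python
def cookie_header(cookies: dict[str, str]) -> str:
    """Serialize name/value pairs into one Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

def build_headers(cookies: dict[str, str]) -> dict[str, str]:
    # The same headers accompany every request, including after redirects.
    headers = {"User-Agent": "site-crawler/1.0"}  # illustrative UA string
    if cookies:
        headers["Cookie"] = cookie_header(cookies)
    return headers
```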
Running a Crawl
- Enter or verify your Home Page URL
- Adjust Max Pages and Max Depth as needed
- Enable Fresh Start only if replacing all content
- Enable Use Browser for JavaScript-heavy sites
- Add Custom Cookies if authentication is required
- Click Start Crawl
- Monitor progress through all 3 phases
- Review results when complete
The crawl runs in the background — you can close the page and return later to check progress.
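The Max Pages and Max Depth settings from the steps above bound a breadth-first traversal of your site's link graph. A minimal sketch over an in-memory graph (the graph structure and function are illustrative, not the crawler's implementation):

```python
from collections import deque

def crawl(graph: dict[str, list[str]], start: str,
          max_pages: int = 100, max_depth: int = 3) -> list[str]:
    """Breadth-first traversal bounded by max_pages and max_depth."""
    visited = [start]
    queue = deque([(start, 0)])  # (url, depth)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # links on this page are too deep to follow
        for link in graph.get(url, []):
            if link not in visited and len(visited) < max_pages:
                visited.append(link)
                queue.append((link, depth + 1))
    return visited
```

With max_depth=2, pages two links from the start are fetched but their outgoing links are not followed; max_pages cuts the crawl off regardless of depth.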
Failed URLs
Some URLs may fail during crawling:
- Pages blocked by robots.txt
- Login-required pages
- Dead links (404 errors)
Failed URLs are displayed during and after the crawl for review.
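The robots.txt case can be checked ahead of a crawl with Python's standard-library parser. A sketch (the rules and user-agent string are illustrative):

```python
from urllib.robotparser import RobotFileParser

def blocked_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules disallow fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from a string
    return not parser.can_fetch(agent, url)
```

Running this against your own robots.txt before a crawl shows which paths the crawler will report as failed.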
Best Practices
- Crawl first, then fine-tune — Start with a broad crawl, then add documents and FAQs to fill gaps
- Use Fresh Start sparingly — Only when completely replacing all crawled content
- Check your robots.txt — Make sure it doesn't block the crawler from important pages
- Enable Use Browser for SPAs — If your site is built with a JavaScript framework, the browser option captures dynamic content
- Monitor failed URLs — Review failures to identify pages that need cookies or browser mode
Related Pages
- AI Documents - Upload PDFs and documents
- AI FAQs - Add FAQs to improve responses
- Crawler Options - Advanced crawler settings
- Improve AI Responses - Tips for better AI answers