Site Crawl

Crawl your website to automatically build your AI agent's knowledge base. Access it from Admin > Site Crawl in your admin panel.

Overview

The site crawler visits your website pages, extracts their content, and creates searchable AI embeddings. This lets your AI agent answer questions based on the information on your website.

How It Works

The crawler runs a 3-phase pipeline:

| Phase | Description | Progress |
| --- | --- | --- |
| 1. Crawl | Visits pages and saves HTML | 0-50% |
| 2. Analyze | Detects repeating content (headers, footers, menus) | 50-60% |
| 3. Embed | Generates searchable AI embeddings | 60-100% |
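The phase-to-progress mapping above can be sketched as a small helper. The function and phase names here are illustrative assumptions, not the product's actual API; only the percentage ranges come from the table:

```python
# Map a phase plus its internal completion fraction to overall progress.
# Boundaries follow the table above: Crawl 0-50%, Analyze 50-60%,
# Embed 60-100%. Names are illustrative, not the crawler's real API.

PHASE_RANGES = {
    "crawl": (0, 50),
    "analyze": (50, 60),
    "embed": (60, 100),
}

def overall_progress(phase: str, fraction: float) -> float:
    """Return overall progress (0-100) for a phase that is
    `fraction` (0.0-1.0) of the way through its own work."""
    start, end = PHASE_RANGES[phase]
    return start + (end - start) * fraction
```

For example, halfway through the Analyze phase, `overall_progress("analyze", 0.5)` reports 55.0.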

Crawler Settings

| Setting | Description |
| --- | --- |
| Home Page URL | Starting URL for the crawl |
| Max Pages | Maximum pages to crawl (default: 100) |
| Max Depth | How many links deep to follow (default: 3) |
| Fresh Start | Delete all existing crawled content before crawling |
| Use Browser | Use headless browser for JavaScript-rendered sites |
| Custom Cookies | Add cookies for authentication or session management |
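As a rough sketch, the settings above might map onto a configuration object like this. The field names are assumptions for illustration, not the admin panel's actual schema; only the documented defaults (100 pages, depth 3) come from the table:

```python
# Illustrative crawl configuration mirroring the settings table.
# Field names are assumptions; the defaults for max_pages and
# max_depth are the documented ones.
crawl_settings = {
    "home_page_url": "https://example.com",
    "max_pages": 100,      # default: 100
    "max_depth": 3,        # default: 3
    "fresh_start": False,  # delete existing crawled content first
    "use_browser": False,  # headless browser for JS-rendered sites
    "cookies": {},         # e.g. {"session_id": "..."}
}
```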

Use Browser Option

Enable Use Browser when crawling:

  • Single-page applications (SPAs) built with React, Vue, Angular
  • Sites that load content dynamically with JavaScript
  • Pages requiring client-side rendering

When enabled, the crawler uses Puppeteer (headless Chrome) instead of simple HTTP requests. This is slower but handles JavaScript-rendered content.
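A rough way to tell whether a site needs the browser option: if a plain HTTP fetch returns a near-empty document whose visible content is injected by JavaScript (typical for React/Vue/Angular SPAs with an empty mount point like `<div id="root">`), static HTML alone won't capture the page. The heuristic below is my own sketch, not the crawler's actual detection logic:

```python
import re

def looks_like_spa(html: str) -> bool:
    """Heuristic sketch: the body is mostly an empty mount point
    (e.g. <div id="root"></div> or <div id="app"></div>) plus
    scripts, so content only appears after client-side rendering."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    content = body.group(1) if body else html
    # Drop script tags, then see how much visible text remains.
    stripped = re.sub(r"<script\b.*?</script>", "", content, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", "", stripped).strip()
    has_mount = bool(re.search(r'id=["\'](root|app)["\']', content, re.I))
    return has_mount and len(text) < 50
```

A page that passes this check is a good candidate for enabling Use Browser.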

Custom Cookies

Add custom cookies when crawling:

  • Sites requiring authentication
  • Pages behind login walls
  • Sessions with specific preferences

To add cookies:

  1. Enter the Cookie name (e.g., session_id, auth_token)
  2. Enter the Cookie value
  3. Click the + button to add
  4. Repeat for additional cookies

Cookies persist across all requests during the crawl, including redirects.
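Mechanically, the name/value pairs you enter are sent as a standard `Cookie` request header on each request. A minimal sketch of that encoding (the header format follows RFC 6265; the function name is mine):

```python
def build_cookie_header(cookies: dict[str, str]) -> str:
    """Join name/value pairs into a single Cookie header value,
    as sent on every request during the crawl, redirects included."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, `build_cookie_header({"session_id": "abc123", "auth_token": "xyz"})` yields `"session_id=abc123; auth_token=xyz"`.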

Running a Crawl

  1. Enter or verify your Home Page URL
  2. Adjust Max Pages and Max Depth as needed
  3. Enable Fresh Start only if replacing all content
  4. Enable Use Browser for JavaScript-heavy sites
  5. Add Custom Cookies if authentication is required
  6. Click Start Crawl
  7. Monitor progress through all 3 phases
  8. Review results when complete
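Conceptually, the crawl phase is a breadth-first walk of your site's link graph, capped by Max Pages and Max Depth. A minimal sketch over an in-memory link graph (a real crawler discovers links by fetching HTML over HTTP; the structure here is illustrative):

```python
from collections import deque

def crawl(links: dict[str, list[str]], start: str,
          max_pages: int = 100, max_depth: int = 3) -> list[str]:
    """Breadth-first traversal of a link graph, stopping at the
    page or depth limit, whichever comes first. `links` maps each
    URL to the URLs it links to."""
    visited, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, depth + 1))
    return order
```

Raising Max Depth widens how far from the home page the crawler reaches; Max Pages bounds total work regardless of depth.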

The crawl runs in the background — you can close the page and return later to check progress.

Failed URLs

Some URLs may fail during crawling:

  • Pages blocked by robots.txt
  • Login-required pages
  • Dead links (404 errors)

Failed URLs are displayed during and after the crawl for review.

Best Practices

  1. Crawl first, then fine-tune — Start with a broad crawl, then add documents and FAQs to fill gaps
  2. Use Fresh Start sparingly — Only when completely replacing all crawled content
  3. Check your robots.txt — Make sure it doesn't block the crawler from important pages
  4. Enable Use Browser for SPAs — If your site is built with a JavaScript framework, the browser option captures dynamic content
  5. Monitor failed URLs — Review failures to identify pages that need cookies or browser mode

Related Pages