Crawler Advanced Options
Advanced Options control how the crawler fetches and processes your website pages. These settings are available in two places:
| Location | Path | When to Use |
|---|---|---|
| Data Uploads | Admin > Uploads > Website Crawler | Full control over crawling with detailed progress |
| Setup Wizard | Admin > Setup Wizard > Website step | Quick setup with AI-assisted configuration |
Both pages share the same underlying settings and backend. Changes made in one are reflected in the other.
Crawler Settings
These core settings appear above the Advanced Options section.
Home Page URL
The starting URL for the crawl. The crawler begins here and follows links to discover other pages on the same domain.
Why: The crawler needs a starting point to discover your site's content. Your home page typically links to all major sections.
How it works:
- The crawler visits this URL first
- It extracts all links from the page
- It follows those links if they are on the same domain
- The process repeats for each discovered page
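For illustration, a single discovery step might look like the sketch below (a hypothetical helper, not the actual crawler code): fetch one page, pull out its links, and keep only those on the same domain.

```typescript
// Hypothetical helper, not the actual crawler code: fetch one page,
// extract its links, and keep only those on the same domain.
async function discoverLinks(pageUrl: string): Promise<string[]> {
  const origin = new URL(pageUrl).origin;          // e.g. "https://example.com"
  const html = await (await fetch(pageUrl)).text();

  const links = new Set<string>();
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    try {
      const link = new URL(match[1], pageUrl);     // resolve relative links
      link.hash = '';                              // drop #fragment anchors
      // A subdomain has a different origin, so it is skipped here.
      if (link.origin === origin) links.add(link.href);
    } catch {
      // ignore hrefs that are not valid URLs
    }
  }
  return [...links];
}
```

Because links are compared by origin, subdomains fall outside the crawl, which is why they are treated as separate domains in the tips below.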
Tips:
- Use your main domain (e.g., https://example.com)
- The crawler only follows links within the same domain
- Subdomains are treated as separate domains
Max Pages
The maximum total number of pages to crawl. Default: 100. Range: 1-1000.
Why: Prevents runaway crawls on large sites. Controls cost and processing time since every page is analyzed and embedded.
How it works: Once the limit is reached, the crawler stops; pages still waiting in the queue are not fetched.
Tips:
- Small sites (under 50 pages): set to 50-100
- Medium sites (100-500 pages): set to 200-500
- Large sites: start with 100 to test, then increase
- This is the total count, not additional pages on top of existing ones
Max Depth
How many links deep to follow from the starting URL. Default: 3. Range: 1-10.
Why: Controls how far the crawler ventures from your home page. Deeper pages are often less relevant (privacy policies, terms of service, archived blog posts).
How it works:
- Depth 0: Only the home page
- Depth 1: Home page + pages linked directly from it
- Depth 2: All of the above + pages linked from depth-1 pages
- And so on
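Putting the two limits together, a rough sketch of the bounded crawl loop (assumed behavior, reusing a link-discovery helper like the `discoverLinks` sketch under Home Page URL above):

```typescript
// Rough sketch of how Max Pages and Max Depth bound the crawl loop (assumed behavior).
async function crawl(homePageUrl: string, maxPages = 100, maxDepth = 3): Promise<void> {
  const queue: { url: string; depth: number }[] = [{ url: homePageUrl, depth: 0 }];
  const visited = new Set<string>();

  while (queue.length > 0 && visited.size < maxPages) {  // Max Pages: stop at the cap
    const { url, depth } = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    // ...fetch, save, and process the page here...

    if (depth >= maxDepth) continue;                     // Max Depth: don't follow links deeper
    for (const link of await discoverLinks(url)) {
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
}
```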
Tips:
- Depth 3 covers most site content
- Increase to 5+ only for deeply nested sites (e.g., documentation with many sub-sections)
- Lower depth produces a more focused knowledge base
Fresh Start
When enabled, deletes all existing crawled pages and embeddings before starting a new crawl.
Why: Ensures your knowledge base contains only current content. Without this, old pages that no longer exist on your site remain in the knowledge base.
When to use:
- Your site has been significantly restructured
- You want to remove outdated content
- Previous crawls included unwanted pages
When NOT to use:
- You want to keep previously uploaded documents and products (Fresh Start only deletes crawled website content)
- You are incrementally adding to your knowledge base
Advanced Options
These settings are hidden behind the Advanced Options toggle.
Use Browser
A checkbox that switches the crawler from simple HTTP requests to a headless browser (Puppeteer/Chrome).
Why: Many modern websites render content with JavaScript. A simple HTTP request only gets the initial HTML, which may be empty or incomplete. A headless browser executes JavaScript and waits for the page to fully render.
When to enable:
- Single-page applications (React, Vue, Angular, Next.js client-side)
- Sites that load content dynamically via JavaScript
- Pages where the HTML source shows loading spinners instead of actual content
When to leave disabled:
- Traditional server-rendered websites (WordPress, static HTML)
- Sites that return complete HTML in the initial response
Trade-offs:
- Enabled: slower (each page takes 2-5 seconds instead of under 1 second), uses more server resources
- Disabled: faster, but may miss JavaScript-rendered content
How it works: The server launches a headless Chrome instance, navigates to each URL, waits for the page to load, then captures the rendered HTML. This is the same technique used by search engine crawlers.
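As an illustration, a headless-browser fetch with Puppeteer looks roughly like this (a sketch; the crawler's actual launch options and timeouts may differ):

```typescript
import puppeteer from 'puppeteer';

// Sketch of rendering a JavaScript-heavy page with headless Chrome.
async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30_000 });
    return await page.content();                  // the fully rendered HTML
  } finally {
    await browser.close();
  }
}
```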
This option is hidden when the Browser Extension is installed, since extension-based crawling handles JavaScript rendering naturally through your browser.
Use Browser Extension
A checkbox that appears only when the XInfer Crawler browser extension is installed.
Why: Some websites block requests from cloud/datacenter IP addresses. Since the server-side crawler runs from a datacenter, these sites return errors or CAPTCHAs. The browser extension fetches pages from your own browser using your residential IP address, bypassing this blocking.
When to enable:
- The server-side crawl fails with many blocked or empty pages
- The target site is known to block datacenter IPs (e.g., university websites, government sites)
How it works:
- The server manages the crawl queue (which URLs to visit, which have been visited)
- Your browser extension fetches each URL using your local network connection
- The extension sends the HTML back to the server
- The server saves the content and extracts links for the next pages
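A simplified sketch of one round trip from the extension's side; the endpoint paths and payload shape here are hypothetical, and in practice the extension renders pages in a browser tab (see Browser Extension below) rather than using a plain fetch:

```typescript
// Sketch of one fetch round trip from the extension's point of view.
// Endpoint paths and payload shape are illustrative, not the real API.
async function handleCrawlJob(serverBase: string): Promise<void> {
  // 1. Ask the server which URL to fetch next (hypothetical endpoint).
  const job: { url: string } | null =
    await (await fetch(`${serverBase}/api/crawl/next`)).json();
  if (!job) return;                                // queue is empty

  // 2. Fetch the page using the user's own browser and residential IP.
  const html = await (await fetch(job.url)).text();

  // 3. Send the HTML back; the server stores it and extracts links.
  await fetch(`${serverBase}/api/crawl/result`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: job.url, html }),
  });
}
```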
Requirements:
- The XInfer Crawler browser extension must be installed in Chrome
- You must keep the admin page open while the crawl runs
- Your browser handles JavaScript rendering automatically
Trade-offs:
- Pro: bypasses IP blocking, handles JavaScript-rendered sites naturally
- Con: requires keeping your browser open, crawl speed depends on your connection
When this option is checked, Use Browser and Custom Cookies are hidden because they are not applicable.
Custom Cookies
Name-value pairs that are sent with every HTTP request during the crawl.
Why: Some websites require authentication or specific session cookies to access content. Without the right cookies, the crawler may see login pages instead of actual content.
When to use:
- Sites behind a login wall
- Intranet or internal sites requiring session tokens
- Sites that show different content based on cookie preferences (e.g., language, region)
How to get cookie values:
- Log into the target website in your browser
- Open browser DevTools (F12) > Application > Cookies
- Find the relevant cookie names and values
- Enter them in the Custom Cookies section
How it works: Before the crawl begins, all custom cookies are loaded into a cookie jar scoped to the target domain. Every HTTP request includes these cookies in the Cookie header. Response Set-Cookie headers are also stored, so session cookies are maintained throughout the crawl.
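A minimal sketch of this flow (simplified; not the crawler's actual code): send the configured name-value pairs with each request and fold any Set-Cookie responses back into the jar.

```typescript
// Simplified cookie jar for a single target domain (sketch, not the crawler's code).
const cookieJar = new Map<string, string>();       // cookie name -> value

function loadCustomCookies(pairs: Record<string, string>): void {
  for (const [name, value] of Object.entries(pairs)) cookieJar.set(name, value);
}

async function fetchWithCookies(url: string): Promise<string> {
  const cookieHeader = [...cookieJar].map(([n, v]) => `${n}=${v}`).join('; ');
  const response = await fetch(url, { headers: { Cookie: cookieHeader } });

  // Keep any session cookies the server sets during the crawl.
  // (getSetCookie() requires a recent Node.js or browser runtime.)
  for (const setCookie of response.headers.getSetCookie()) {
    const [pair] = setCookie.split(';');           // drop attributes like Path/Expires
    const [name, ...rest] = pair.split('=');
    cookieJar.set(name.trim(), rest.join('='));
  }
  return response.text();
}

// Example with hypothetical values copied from DevTools for the target site.
loadCustomCookies({ session_id: 'abc123', locale: 'en' });
```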
Tips:
- Only add cookies that are necessary for accessing content
- Session cookies may expire; start the crawl soon after copying cookie values
- Cookie names are case-sensitive
This option is hidden when Use Browser Extension is enabled because the extension uses your browser's own cookies automatically.
Skip URLs
URL patterns that the crawler should ignore during crawling. Matching URLs are not fetched or added to the crawl queue.
Why: Some pages are not useful for a knowledge base: login pages, user dashboards, admin panels, search result pages, or sections in a different language.
How patterns work:
Each pattern is a path relative to your site's domain. The domain prefix is shown automatically (e.g., https://example.com/). You only enter the path portion.
| Pattern | Matches | Does Not Match |
|---|---|---|
| `admin/*` | /admin/settings, /admin/users | /admin/settings/advanced (only one segment) |
| `admin/**` | /admin/settings, /admin/a/b/c | /admins |
| `**/logout` | /logout, /en/logout, /a/b/logout | /logout/confirm |
| `blog/2023/**` | /blog/2023/post-1, /blog/2023/01/post | /blog/2024/post-1 |
| `*/feed` | /blog/feed, /news/feed | /blog/rss/feed (two segments before) |
Wildcards:
- `*` matches any single path segment (everything between two `/` characters)
- `**` matches any number of path segments (zero or more), at any depth
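A sketch of how such patterns can be compiled into anchored, case-insensitive regular expressions (illustrative; not necessarily the crawler's exact matcher):

```typescript
// Convert a Skip URL pattern into a regular expression (illustrative sketch).
function patternToRegExp(pattern: string): RegExp {
  const source = pattern
    .split(/(\*\*\/|\*\*|\*)/)                       // keep wildcard tokens
    .map((token) => {
      if (token === '**/') return '(?:[^/]+/)*';     // zero or more whole segments
      if (token === '**') return '.*';               // anything, across segments
      if (token === '*') return '[^/]+';             // exactly one path segment
      return token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // escape literal characters
    })
    .join('');
  return new RegExp(`^${source}$`, 'i');             // anchored, case-insensitive
}

function shouldSkip(url: string, patterns: string[]): boolean {
  const path = new URL(url).pathname.replace(/^\//, '');   // path relative to the domain
  return patterns.some((p) => patternToRegExp(p).test(path));
}

// shouldSkip('https://example.com/en/logout', ['**/logout'])            // true
// shouldSkip('https://example.com/admin/settings/advanced', ['admin/*']) // false
```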
Tips:
- Patterns are case-insensitive
- You do not need to include the domain; it is added automatically
- Start with broad patterns and refine if needed
- Common patterns to skip: `**/login`, `**/logout`, `**/search`, `**/admin/**`, `**/wp-admin/**`
Browser Extension
The XInfer Crawler browser extension is a Chrome extension that enables client-side page fetching during crawls.
Installation
Option 1: Chrome Web Store
- Open the XInfer Crawler page on the Chrome Web Store
- Click Add to Chrome
- Click Add extension in the confirmation dialog
- The XInfer Crawler icon appears in your Chrome toolbar
Option 2: From Source
- Clone the repository: `git clone https://github.com/xinferai/xinfer-browser-extension.git`
- Open `chrome://extensions` in Chrome (or your Chromium-based browser)
- Enable Developer mode (toggle in the top-right)
- Click Load unpacked and select the `src` folder inside the cloned directory
- The XInfer Crawler icon should appear in your extensions toolbar
To update from source, pull the latest changes and click the refresh icon on the extension card in chrome://extensions.
How It Works
The extension uses a three-layer design: the admin page communicates with a content script, which relays messages to a background service worker.
- When you open the admin crawler page, the extension announces its presence via a ping/pong handshake
- The Use Browser Extension checkbox appears in Advanced Options
- When a crawl starts with this option enabled:
  - The server queues URLs to crawl
  - The extension opens a new browser tab and navigates to the first URL
  - You can complete any login, CAPTCHA, or consent steps in the tab, then continue from the admin page
  - The extension captures the fully rendered HTML from the tab and sends it to the server
  - For subsequent URLs, the extension navigates the same tab automatically
  - The server extracts links and adds new URLs to the queue
  - When the crawl is complete, the tab is closed
- The server runs analysis and embedding phases
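A minimal sketch of the handshake across those three layers (message names and file layout are assumptions, not the real extension source):

```typescript
// content-script.ts - bridges the admin page and the background service worker.
window.addEventListener('message', (event: MessageEvent) => {
  if (event.source !== window || event.data?.type !== 'XINFER_PING') return;
  // Relay the page's ping to the service worker, then answer the page with a pong.
  chrome.runtime.sendMessage({ type: 'PING' }, () => {
    window.postMessage({ type: 'XINFER_PONG' }, window.location.origin);
  });
});

// background.ts (service worker) - acknowledges pings; tab navigation and
// HTML capture would also be coordinated here.
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
  if (message.type === 'PING') sendResponse({ type: 'PONG' });
  return true; // keep the message channel open for asynchronous responses
});

// admin page - the handshake that reveals the "Use Browser Extension" checkbox.
window.addEventListener('message', (event: MessageEvent) => {
  if (event.data?.type === 'XINFER_PONG') {
    // Extension detected: show the checkbox in Advanced Options.
  }
});
window.postMessage({ type: 'XINFER_PING' }, window.location.origin);
```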
When to Use
Use the extension when the server-side crawler cannot reach the target website:
- Sites that block cloud/datacenter IP addresses
- Sites requiring browser-level authentication (SSO, SAML)
- Sites with aggressive bot detection
Limitations
- You must keep the browser tab open during the crawl
- Crawl speed depends on your internet connection
- The extension only handles the fetch phase; analysis and embedding still run server-side
Related Pages
- Data Uploads - Full crawler interface with document and product uploads
- Setup Wizard - AI-assisted initial configuration
- App Config - Additional crawler and RAG settings