Crawler Advanced Options
Advanced Options control how the crawler fetches and processes your website pages. These settings are available in two places:
| Location | Path | When to Use |
|---|---|---|
| Data Uploads | Admin > Uploads > Website Crawler | Full control over crawling with detailed progress |
| Setup Wizard | Admin > Setup Wizard > Website step | Quick setup with AI-assisted configuration |
Both pages share the same underlying settings and backend. Changes made in one are reflected in the other.
Crawler Settings
These core settings appear above the Advanced Options section.
Home Page URL
The starting URL for the crawl. The crawler begins here and follows links to discover other pages on the same domain.
Why: The crawler needs a starting point to discover your site's content. Your home page typically links to all major sections.
How it works:
- The crawler visits this URL first
- It extracts all links from the page
- It follows those links if they are on the same domain
- The process repeats for each discovered page
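For illustration, a single discovery step might look like the sketch below (a hypothetical helper, not the actual crawler code): fetch one page, pull out its links, and keep only those on the same domain.

```typescript
// Hypothetical helper, not the actual crawler code: fetch one page,
// extract its links, and keep only those on the same domain.
async function discoverLinks(pageUrl: string): Promise<string[]> {
  const origin = new URL(pageUrl).origin;          // e.g. "https://example.com"
  const html = await (await fetch(pageUrl)).text();

  const links = new Set<string>();
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    try {
      const link = new URL(match[1], pageUrl);     // resolve relative links
      link.hash = '';                              // drop #fragment anchors
      // A subdomain has a different origin, so it is skipped here.
      if (link.origin === origin) links.add(link.href);
    } catch {
      // ignore hrefs that are not valid URLs
    }
  }
  return [...links];
}
```

Because links are compared by origin, subdomains fall outside the crawl, which is why they are treated as separate domains in the tips below.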
Tips:
- Use your main domain (e.g., https://example.com)
- The crawler only follows links within the same domain
- Subdomains are treated as separate domains
Max Pages
The maximum total number of pages to crawl. Default: 100. Range: 1-1000.
Why: Prevents runaway crawls on large sites. Controls cost and processing time since every page is analyzed and embedded.
How it works: Once the limit is reached, the crawler stops; pages still waiting in the queue are not fetched.
Tips:
- Small sites (under 50 pages): set to 50-100
- Medium sites (100-500 pages): set to 200-500
- Large sites: start with 100 to test, then increase
- This is the total count, not additional pages on top of existing ones
Max Depth
How many links deep to follow from the starting URL. Default: 3. Range: 1-10.
Why: Controls how far the crawler ventures from your home page. Deeper pages are often less relevant (privacy policies, terms of service, archived blog posts).
How it works:
- Depth 0: Only the home page
- Depth 1: Home page + pages linked directly from it
- Depth 2: All of the above + pages linked from depth-1 pages
- And so on
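Putting the two limits together, a rough sketch of the bounded crawl loop (assumed behavior, reusing a link-discovery helper like the `discoverLinks` sketch under Home Page URL above):

```typescript
// Rough sketch of how Max Pages and Max Depth bound the crawl loop (assumed behavior).
async function crawl(homePageUrl: string, maxPages = 100, maxDepth = 3): Promise<void> {
  const queue: { url: string; depth: number }[] = [{ url: homePageUrl, depth: 0 }];
  const visited = new Set<string>();

  while (queue.length > 0 && visited.size < maxPages) {  // Max Pages: stop at the cap
    const { url, depth } = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    // ...fetch, save, and process the page here...

    if (depth >= maxDepth) continue;                     // Max Depth: don't follow links deeper
    for (const link of await discoverLinks(url)) {
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
}
```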
Tips:
- Depth 3 covers most site content
- Increase to 5+ only for deeply nested sites (e.g., documentation with many sub-sections)
- Lower depth produces a more focused knowledge base
Fresh Start
When enabled, deletes all existing crawled pages and embeddings before starting a new crawl.
Why: Ensures your knowledge base contains only current content. Without this, old pages that no longer exist on your site remain in the knowledge base.
When to use:
- Your site has been significantly restructured
- You want to remove outdated content
- Previous crawls included unwanted pages
When NOT to use:
- You want to keep previously uploaded documents and products (Fresh Start only deletes crawled website content)
- You are incrementally adding to your knowledge base
Advanced Options
These settings are hidden behind the Advanced Options toggle.
Use Browser
A checkbox that switches the crawler from simple HTTP requests to a headless browser (Puppeteer/Chrome).
Why: Many modern websites render content with JavaScript. A simple HTTP request only gets the initial HTML, which may be empty or incomplete. A headless browser executes JavaScript and waits for the page to fully render.
When to enable:
- Single-page applications (React, Vue, Angular, Next.js client-side)
- Sites that load content dynamically via JavaScript
- Pages where the HTML source shows loading spinners instead of actual content
When to leave disabled:
- Traditional server-rendered websites (WordPress, static HTML)
- Sites that return complete HTML in the initial response
Trade-offs:
- Enabled: slower (each page takes 2-5 seconds instead of under 1 second), uses more server resources
- Disabled: faster, but may miss JavaScript-rendered content
How it works: The server launches a headless Chrome instance, navigates to each URL, waits for the page to load, then captures the rendered HTML. This is the same technique used by search engine crawlers.
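As an illustration, a headless-browser fetch with Puppeteer looks roughly like this (a sketch; the crawler's actual launch options and timeouts may differ):

```typescript
import puppeteer from 'puppeteer';

// Sketch of rendering a JavaScript-heavy page with headless Chrome.
async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30_000 });
    return await page.content();                  // the fully rendered HTML
  } finally {
    await browser.close();
  }
}
```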
This option is hidden when the Browser Extension is installed, since extension-based crawling handles JavaScript rendering naturally through your browser.
Use Browser Extension
A checkbox that appears only when the XInfer Crawler browser extension is installed.
Why: Some websites block requests from cloud/datacenter IP addresses. Since the server-side crawler runs from a datacenter, these sites return errors or CAPTCHAs. The browser extension fetches pages from your own browser using your residential IP address, bypassing this blocking.
When to enable:
- The server-side crawl fails with many blocked or empty pages
- The target site is known to block datacenter IPs (e.g., university websites, government sites)
How it works:
- The server manages the crawl queue (which URLs to visit, which have been visited)
- Your browser extension fetches each URL using your local network connection
- The extension sends the HTML back to the server
- The server saves the content and extracts links for the next pages
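A simplified sketch of one round trip from the extension's side; the endpoint paths and payload shape here are hypothetical, and in practice the extension renders pages in a browser tab (see Browser Extension below) rather than using a plain fetch:

```typescript
// Sketch of one fetch round trip from the extension's point of view.
// Endpoint paths and payload shape are illustrative, not the real API.
async function handleCrawlJob(serverBase: string): Promise<void> {
  // 1. Ask the server which URL to fetch next (hypothetical endpoint).
  const job: { url: string } | null =
    await (await fetch(`${serverBase}/api/crawl/next`)).json();
  if (!job) return;                                // queue is empty

  // 2. Fetch the page using the user's own browser and residential IP.
  const html = await (await fetch(job.url)).text();

  // 3. Send the HTML back; the server stores it and extracts links.
  await fetch(`${serverBase}/api/crawl/result`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: job.url, html }),
  });
}
```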
Requirements:
- The XInfer Crawler browser extension must be installed in Chrome
- You must keep the admin page open while the crawl runs
- Your browser handles JavaScript rendering automatically
Trade-offs:
- Pro: bypasses IP blocking, handles JavaScript-rendered sites naturally
- Con: requires keeping your browser open, crawl speed depends on your connection
When this option is checked, Use Browser and Custom Cookies are hidden because they are not applicable.
Custom Cookies
Name-value pairs that are sent with every HTTP request during the crawl.
Why: Some websites require authentication or specific session cookies to access content. Without the right cookies, the crawler may see login pages instead of actual content.
When to use:
- Sites behind a login wall
- Intranet or internal sites requiring session tokens
- Sites that show different content based on cookie preferences (e.g., language, region)
How to get cookie values:
- Log into the target website in your browser
- Open browser DevTools (F12) > Application > Cookies
- Find the relevant cookie names and values
- Enter them in the Custom Cookies section
How it works: Before the crawl begins, all custom cookies are loaded into a cookie jar scoped to the target domain. Every HTTP request includes these cookies in the Cookie header. Response Set-Cookie headers are also stored, so session cookies are maintained throughout the crawl.
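A minimal sketch of this flow (simplified; not the crawler's actual code): send the configured name-value pairs with each request and fold any Set-Cookie responses back into the jar.

```typescript
// Simplified cookie jar for a single target domain (sketch, not the crawler's code).
const cookieJar = new Map<string, string>();       // cookie name -> value

function loadCustomCookies(pairs: Record<string, string>): void {
  for (const [name, value] of Object.entries(pairs)) cookieJar.set(name, value);
}

async function fetchWithCookies(url: string): Promise<string> {
  const cookieHeader = [...cookieJar].map(([n, v]) => `${n}=${v}`).join('; ');
  const response = await fetch(url, { headers: { Cookie: cookieHeader } });

  // Keep any session cookies the server sets during the crawl.
  // (getSetCookie() requires a recent Node.js or browser runtime.)
  for (const setCookie of response.headers.getSetCookie()) {
    const [pair] = setCookie.split(';');           // drop attributes like Path/Expires
    const [name, ...rest] = pair.split('=');
    cookieJar.set(name.trim(), rest.join('='));
  }
  return response.text();
}

// Example with hypothetical values copied from DevTools for the target site.
loadCustomCookies({ session_id: 'abc123', locale: 'en' });
```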
Tips:
- Only add cookies that are necessary for accessing content
- Session cookies may expire; start the crawl soon after copying cookie values
- Cookie names are case-sensitive
This option is hidden when Use Browser Extension is enabled because the extension uses your browser's own cookies automatically.
Skip URLs
URL patterns that the crawler should ignore during crawling. Matching URLs are not fetched or added to the crawl queue.
Why: Some pages are not useful for a knowledge base: login pages, user dashboards, admin panels, search result pages, or sections in a different language.
How patterns work:
Each pattern is a path relative to your site's domain. The domain prefix is shown automatically (e.g., https://example.com/). You only enter the path portion.
| Pattern | Matches | Does Not Match |
|---|---|---|
| `admin/*` | /admin/settings, /admin/users | /admin/settings/advanced (only one segment) |
| `admin/**` | /admin/settings, /admin/a/b/c | /admins |
| `**/logout` | /logout, /en/logout, /a/b/logout | /logout/confirm |
| `blog/2023/**` | /blog/2023/post-1, /blog/2023/01/post | /blog/2024/post-1 |
| `*/feed` | /blog/feed, /news/feed | /blog/rss/feed (two segments before) |
Wildcards:
- `*` matches any single path segment (everything between two `/` characters)
- `**` matches any number of path segments (zero or more), at any depth
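A sketch of how such patterns can be compiled into anchored, case-insensitive regular expressions (illustrative; not necessarily the crawler's exact matcher):

```typescript
// Convert a Skip URL pattern into a regular expression (illustrative sketch).
function patternToRegExp(pattern: string): RegExp {
  const source = pattern
    .split(/(\*\*\/|\*\*|\*)/)                       // keep wildcard tokens
    .map((token) => {
      if (token === '**/') return '(?:[^/]+/)*';     // zero or more whole segments
      if (token === '**') return '.*';               // anything, across segments
      if (token === '*') return '[^/]+';             // exactly one path segment
      return token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // escape literal characters
    })
    .join('');
  return new RegExp(`^${source}$`, 'i');             // anchored, case-insensitive
}

function shouldSkip(url: string, patterns: string[]): boolean {
  const path = new URL(url).pathname.replace(/^\//, '');   // path relative to the domain
  return patterns.some((p) => patternToRegExp(p).test(path));
}

// shouldSkip('https://example.com/en/logout', ['**/logout'])            // true
// shouldSkip('https://example.com/admin/settings/advanced', ['admin/*']) // false
```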
Tips:
- Patterns are case-insensitive
- You do not need to include the domain; it is added automatically
- Start with broad patterns and refine if needed
- Common patterns to skip: `**/login`, `**/logout`, `**/search`, `**/admin/**`, `**/wp-admin/**`
Browser Extension
The XInfer Crawler browser extension is a Chrome extension that enables client-side page fetching during crawls.
Installation
Option 1: Chrome Web Store
- Open the XInfer Crawler page on the Chrome Web Store
- Click Add to Chrome
- Click Add extension in the confirmation dialog
- The XInfer Crawler icon appears in your Chrome toolbar
Option 2: From Source
- Clone the repository: `git clone https://github.com/xinferai/xinfer-browser-extension.git`
- Open `chrome://extensions` in Chrome (or your Chromium-based browser)
- Enable Developer mode (toggle in the top-right)
- Click Load unpacked and select the `src` folder inside the cloned directory
- The XInfer Crawler icon should appear in your extensions toolbar
To update from source, pull the latest changes and click the refresh icon on the extension card in chrome://extensions.
How It Works
The extension uses a three-layer design: the admin page communicates with a content script, which relays messages to a background service worker.
- When you open the admin crawler page, the extension announces its presence via a ping/pong handshake
- The Use Browser Extension checkbox appears in Advanced Options
- When a crawl starts with this option enabled:
  - The server queues URLs to crawl
  - The extension opens a new browser tab and navigates to the first URL
  - You can complete any login, CAPTCHA, or consent steps in the tab, then continue from the admin page
  - The extension captures the fully rendered HTML from the tab and sends it to the server
  - For subsequent URLs, the extension navigates the same tab automatically
  - The server extracts links and adds new URLs to the queue
  - When the crawl is complete, the tab is closed
- The server runs analysis and embedding phases
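A minimal sketch of the handshake across those three layers (message names and file layout are assumptions, not the real extension source):

```typescript
// content-script.ts - bridges the admin page and the background service worker.
window.addEventListener('message', (event: MessageEvent) => {
  if (event.source !== window || event.data?.type !== 'XINFER_PING') return;
  // Relay the page's ping to the service worker, then answer the page with a pong.
  chrome.runtime.sendMessage({ type: 'PING' }, () => {
    window.postMessage({ type: 'XINFER_PONG' }, window.location.origin);
  });
});

// background.ts (service worker) - acknowledges pings; tab navigation and
// HTML capture would also be coordinated here.
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
  if (message.type === 'PING') sendResponse({ type: 'PONG' });
  return true; // keep the message channel open for asynchronous responses
});

// admin page - the handshake that reveals the "Use Browser Extension" checkbox.
window.addEventListener('message', (event: MessageEvent) => {
  if (event.data?.type === 'XINFER_PONG') {
    // Extension detected: show the checkbox in Advanced Options.
  }
});
window.postMessage({ type: 'XINFER_PING' }, window.location.origin);
```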
When to Use
Use the extension when the server-side crawler cannot reach the target website:
- Sites that block cloud/datacenter IP addresses
- Sites requiring browser-level authentication (SSO, SAML)
- Sites with aggressive bot detection
Limitations
- You must keep the browser tab open during the crawl
- Crawl speed depends on your internet connection
- The extension only handles the fetch phase; analysis and embedding still run server-side
Related Pages
- Data Uploads - Full crawler interface with document and product uploads
- Setup Wizard - AI-assisted initial configuration
- App Config - Additional crawler and RAG settings