Documentation

Crawler Browser Extension

The XInfer Crawler is a Chrome browser extension that fetches web pages through your browser session, using the same cookies, login state, and network path you already use while browsing. Instead of trying to reproduce complex website behaviors from a server (which often breaks on sign-in pages, session checks, bot defenses, or region-restricted content), the extension opens URLs in a real browser tab and captures the fully rendered HTML.

Why You Might Need It

The standard crawler runs on cloud servers. Some websites detect that requests come from a datacenter IP address and block them — returning errors, CAPTCHAs, or empty pages instead of actual content.

Common examples of sites that block datacenter traffic:

  • University websites (e.g., .edu domains)
  • Government websites (e.g., .gov domains)
  • Sites with aggressive bot protection (Cloudflare, Akamai)
  • Sites that require browser-level authentication (SSO, SAML)

When this happens, your crawl may complete but with most pages failing or returning no content.

How It Solves the Problem

The extension opens URLs in a real browser tab using your current network connection. To the target website, the requests look like a normal person browsing — because they are. Your residential IP address, browser cookies, and session state are all used naturally.

This is especially useful for sites that require authentication or interactive steps. If a page needs you to sign in, accept consent banners, or complete a CAPTCHA, the tab is opened so you can handle those steps normally. Once access is granted, the extension continues capturing authorized page content automatically.

The server still manages everything else: which URLs to visit, extracting links, saving content, analyzing pages, and generating embeddings. Only the page-fetching step moves to your browser.
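The division of labor described above can be sketched as a pluggable fetch step. This is an illustrative sketch only, not the actual XInfer code; every name in it (crawl_page, browser_fetch, extract_links) is hypothetical:

```python
from typing import Callable

# The fetcher is the only part that changes between crawl modes.
# A standard crawl fetches from the server; an extension crawl asks
# the browser tab for the rendered HTML instead.
Fetcher = Callable[[str], str]

def extract_links(html: str) -> list[str]:
    # Placeholder: a real crawler would properly parse anchor tags.
    return [part.split('"')[0] for part in html.split('href="')[1:]]

def crawl_page(url: str, fetch: Fetcher) -> dict:
    """Everything except fetching stays on the server."""
    html = fetch(url)             # server HTTP client *or* browser tab
    links = extract_links(html)   # link extraction happens server-side
    return {"url": url, "html": html, "links": links}

# Stand-in "browser" fetcher: in the real extension this HTML comes
# from a rendered Chrome tab, with your cookies and session included.
def browser_fetch(url: str) -> str:
    return f'<a href="{url}/about">About</a>'

page = crawl_page("https://example.edu", browser_fetch)
print(page["links"])  # ['https://example.edu/about']
```

Because only the fetch function differs, the rest of the pipeline (saving, analysis, embeddings) behaves identically in both modes.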

When to Use It

Use the extension when:

  • A standard crawl produces many failed or empty pages
  • The target website is known to block cloud/server traffic
  • You see errors like "403 Forbidden" or "Access Denied" in crawl results
  • The site requires you to be logged in through your browser

You do not need the extension for most websites. Try a standard crawl first. If it works, the extension adds no benefit.

Installation

Option 1: Chrome Web Store

  1. Open the XInfer Crawler page on the Chrome Web Store
  2. Click Add to Chrome
  3. Click Add extension in the confirmation dialog
  4. The XInfer Crawler icon appears in your Chrome toolbar

Option 2: From Source

  1. Clone the repository:
    git clone https://github.com/xinferai/xinfer-browser-extension.git
    
  2. Open chrome://extensions in Chrome (or your Chromium-based browser)
  3. Enable Developer mode (toggle in the top-right)
  4. Click Load unpacked and select the src folder inside the cloned directory
  5. The XInfer Crawler icon should appear in your extensions toolbar

To update, pull the latest changes and click the refresh icon on the extension card in chrome://extensions.

To confirm it is working, visit your admin panel. You should see a Use Browser Extension checkbox in the crawler's Advanced Options.

Using the Extension

Once installed, a new option appears in the crawler's Advanced Options on both the Data Uploads page and the Setup Wizard.

  1. Open the crawler page in your admin panel
  2. Expand Advanced Options
  3. Check Use Browser Extension
  4. Configure your crawl settings (URL, max pages, etc.) as usual
  5. Click Start Crawl

The crawl runs in your browser tab. You will see the same progress indicators as a standard crawl — pages found, current URL, and phase progress.

Important: Keep the browser tab open while the crawl is running. If you close the tab, the crawl pauses until you reopen the page.

What Happens During an Extension Crawl

  1. The server creates a queue of URLs to visit, starting with your home page
  2. The extension opens a new browser tab and navigates to the first URL
  3. If the page requires login, a CAPTCHA, or consent, you complete it in the tab as you normally would, then continue from the admin page
  4. The extension captures the fully rendered HTML from the tab and sends it to the server
  5. The server saves the page, extracts links, and adds new URLs to the queue
  6. For each subsequent URL, the extension navigates the same tab, waits for the page to render, and captures HTML automatically
  7. When all pages are crawled (or the max pages limit is reached), the tab is closed
  8. The server runs the analysis and embedding phases (same as a standard crawl)
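The steps above amount to a queue-driven loop. The sketch below illustrates that loop under assumed names (run_crawl, fetch_via_tab, extract_links); it is not the actual server implementation:

```python
from collections import deque

def run_crawl(home_url, fetch_via_tab, extract_links, max_pages=100):
    """Queue-driven crawl: the server manages the queue; the fetch
    step (fetch_via_tab) is the part done in the browser tab."""
    queue = deque([home_url])        # step 1: queue starts with the home page
    seen = {home_url}
    saved = []
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch_via_tab(url)    # steps 2-4: tab navigates, HTML captured
        saved.append((url, html))    # step 5: server saves the page...
        for link in extract_links(html):
            if link not in seen:     # ...and queues URLs it has not seen
                seen.add(link)
                queue.append(link)
    return saved                     # step 7: queue empty or max pages reached

# Tiny in-memory "site" to exercise the loop:
site = {
    "https://example.edu": ["https://example.edu/a", "https://example.edu/b"],
    "https://example.edu/a": ["https://example.edu"],   # already seen
    "https://example.edu/b": [],
}
pages = run_crawl(
    "https://example.edu",
    fetch_via_tab=lambda url: url,             # pretend the HTML is the URL
    extract_links=lambda html: site.get(html, []),
)
print([u for u, _ in pages])  # all three pages, each fetched exactly once
```

The `seen` set is what keeps the crawl from revisiting pages even when sites link back to themselves, and `max_pages` is the same limit you set in the crawl settings.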

Stopping a Crawl

Click the Stop button at any time. The extension stops fetching new pages and the crawl ends. Any pages already fetched are kept in your knowledge base.

Troubleshooting

The "Use Browser Extension" checkbox does not appear

  • Make sure the extension is installed and enabled in Chrome
  • Refresh the admin page
  • Check that you are using Google Chrome or another Chromium-based browser (the extension does not support Firefox or Safari)

The crawl is slow

Extension crawls are limited by your internet connection speed. Each page is fetched one at a time with a short delay between requests. This is normal and expected — it ensures the target website is not overwhelmed.
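The pacing described here can be sketched as a sequential loop with a fixed pause between requests. The function name and delay value are illustrative; the real extension's timing may differ:

```python
import time

def fetch_sequentially(urls, fetch, delay_seconds=1.0):
    """Fetch one page at a time, pausing between requests so the
    target site never sees concurrent traffic from the crawl."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)   # polite pause between pages
        results.append(fetch(url))
    return results

# With N pages and a 1-second delay, a crawl takes at least
# (N - 1) seconds plus the time each page needs to load.
pages = fetch_sequentially(
    ["https://example.gov/1", "https://example.gov/2"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.01,                 # small value for this demo
)
print(len(pages))  # 2
```

This is why total crawl time grows linearly with page count: throughput is bounded by your connection speed plus the per-page delay, not by server capacity.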

Pages still fail during an extension crawl

Some pages may fail for reasons unrelated to IP blocking: the page does not exist (404), requires a specific login, or returns non-HTML content. Check the failed URLs list after the crawl for details.

Related Pages