Website URLs
Train on live web content — single pages or automatic site-wide crawls.
URLs are the highest-leverage knowledge source when your content already lives on a public site. ChatbotGen scrapes the page, extracts readable text, and embeds it alongside your other training data.
The two modes
The URLs tab has two modes: Add URL for a single page, Crawl Website for bulk discovery.
┌─ Knowledge › URLs ────────────────────────────────────────┐
│ [ Add URL ] [ Crawl Website ] │
│ ───────── │
│ │
│ (switches between two forms based on the mode) │
└───────────────────────────────────────────────────────────┘
Add URL
The simpler form. Takes a single page URL.
┌─ Add URL ─────────────────────────────────────────────────┐
│ Add a specific page URL. The content will be extracted │
│ and used for training. │
│ │
│ [ example.com/about ]                         [ Add URL ] │
└───────────────────────────────────────────────────────────┘
If you leave off the scheme, it's normalized to https://.
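The normalization described above can be sketched in a few lines. This is illustrative only, not ChatbotGen's actual implementation; `normalize_url` is a hypothetical name:

```python
from urllib.parse import urlparse

def normalize_url(raw: str) -> str:
    # If the user typed "example.com/about" with no scheme,
    # default to https:// before storing the URL.
    raw = raw.strip()
    if not urlparse(raw).scheme:
        raw = "https://" + raw
    return raw
```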
Crawl Website
For whole-site coverage, use Crawl Website. Paste the homepage and ChatbotGen discovers pages automatically — first from the sitemap, then falling back to HTML link scraping.
┌─ Crawl Website ───────────────────────────────────────────┐
│ Enter your website's homepage. We'll check the sitemap │
│ and discover pages automatically. │
│ │
│ [ toptive.co ] [ Start Crawl ] │
│ │
│ Include paths (optional) Exclude paths (optional) │
│ [ /docs, /help ] [ /blog, /admin ] │
│ Only URLs starting with Skip URLs starting with │
│ these paths these paths │
└───────────────────────────────────────────────────────────┘
After clicking Start Crawl you'll see:
Crawling website... URLs will appear as they're discovered.
Pages show up in the URL list as the worker finds and queues them. Expect a few minutes for medium-sized sites.
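The sitemap-first discovery with an HTML-link fallback works roughly like the sketch below. All names are illustrative assumptions (ChatbotGen's internals aren't published); `fetch(url)` stands in for any HTTP client that returns the page body or `None`, and the 1,000-URL cap from the section below is applied at the end:

```python
import re
from urllib.parse import urljoin, urlparse

def discover_urls(root: str, fetch) -> list[str]:
    # Try the sitemap first; it lists one URL per <loc> element.
    host = urlparse(root).netloc
    sitemap = fetch(urljoin(root, "/sitemap.xml"))
    if sitemap:
        found = re.findall(r"<loc>\s*(.*?)\s*</loc>", sitemap)
    else:
        # No sitemap: fall back to scraping <a href> links off the homepage.
        html = fetch(root) or ""
        found = [urljoin(root, h) for h in re.findall(r'href="([^"]+)"', html)]
    # Keep same-host pages only, dedupe preserving order, apply the crawl cap.
    same_host = [u for u in found if urlparse(u).netloc == host]
    return list(dict.fromkeys(same_host))[:1000]
```

A real crawler would also parse sitemap index files and follow links recursively; the sketch shows only the discovery order the docs describe.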
Include / exclude paths
Both filters are comma-separated path prefixes.
- Include paths — only keep URLs whose path starts with one of these. Example: /help, /docs keeps help center + docs, skips everything else.
- Exclude paths — drop URLs whose path starts with one of these. Example: /blog, /admin skips blog posts and admin pages.
Leave both blank to crawl everything the sitemap exposes (up to the hard ceiling — see below).
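The prefix semantics above amount to a small predicate: include (when set) must match first, then exclude can still drop the URL. A minimal sketch, assuming this filter order — `keep_url` is a hypothetical name, not ChatbotGen's API:

```python
from urllib.parse import urlparse

def keep_url(url: str, include: list[str], exclude: list[str]) -> bool:
    # Filters operate on the URL path, matched as simple prefixes.
    path = urlparse(url).path
    if include and not any(path.startswith(p) for p in include):
        return False  # include list set, and this path isn't in it
    return not any(path.startswith(p) for p in exclude)
```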
Hard crawl ceiling
A single crawl is capped at 1,000 URLs regardless of plan. For bigger sites, narrow the crawl with include paths or run multiple targeted crawls from different roots.
Your plan's per-chatbot URL limit applies on top of this — see Plans & pricing.
Excluding a URL after the fact
If a crawled page is noisy, open it and flip the Exclude from search toggle. The URL stays in your list (and in the database) but the retriever ignores it. Useful when you want an option to re-include later without re-crawling.
Deleting the URL permanently removes it and frees up the character budget.
Re-crawling
URLs don't refresh automatically. To pick up changes on a page, use the row's Re-crawl action or re-trigger a crawl from scratch. For frequently-changing content (prices, inventory), consider a Q&A pair or tool integration instead.
Troubleshooting
- Empty extraction — the page is gated behind a login, or rendered in a format our scraper can't read. Try a different URL or paste the content as a text snippet.
- Crawl returned 0 URLs — robots.txt blocked us, or the homepage returns a 4xx/5xx. Confirm the site loads in an incognito window.
- Status stuck on crawling — the worker is still running. For large sites this can take a few minutes.