scrape-url
Extract
scrapeURL()
```ts
function scrapeURL(url, options?): Promise<string>
```
Tardigrade the Web Crawler
- Use the Fetch API and check for bot detection. Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer. Scraping internet pages is a free speech right.
- Features: timeout, redirects, a default user agent, referer set to Google, and bot-detection checking.
- If the fetch method does not return the needed HTML, use the Docker proxy as a backup.
- Set up a Docker container running a Node.js server API that renders the DOM with Puppeteer, capturing all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.
- Bypass the Cloudflare bot check: a webpage proxy that requests pages through Chromium (Puppeteer) can be used to bypass Cloudflare's anti-bot cookie-ID JavaScript method.
- Send your request to the server on port 3000 and add your URL to the "url" query string like this: http://localhost:3000/?url=https://example.org (a minimal sketch of such a proxy server follows this list).
- Optional: Set up a residential IP proxy to access sites that IP-block datacenters, and manage rotation with Scrapoxy. Recommended: Hypeproxy, NinjasProxy, Proxy-Cheap, LiveProxies.
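To make the Docker proxy step concrete, here is a minimal sketch of a Puppeteer rendering server, assuming Express and Puppeteer as dependencies. It is not the package's bundled server, only an illustration of the port-3000 / `?url=` convention described above.

```js
// Minimal sketch of a Puppeteer rendering proxy (assumptions: express and puppeteer installed).
// Follows the convention above: GET http://localhost:3000/?url=https://example.org
import express from "express";
import puppeteer from "puppeteer";

const app = express();

app.get("/", async (req, res) => {
  const targetURL = req.query.url;
  if (!targetURL) return res.status(400).send("Missing ?url= query parameter");

  let browser;
  try {
    // --no-sandbox is commonly required when Chromium runs inside a Docker container
    browser = await puppeteer.launch({ args: ["--no-sandbox"] });
    const page = await browser.newPage();

    // Wait until the network is mostly idle so secondary in-page API requests
    // made after the initial page load are included in the rendered HTML.
    await page.goto(targetURL, { waitUntil: "networkidle2", timeout: 30000 });

    res.send(await page.content());
  } catch (error) {
    res.status(500).send(String(error));
  } finally {
    if (browser) await browser.close();
  }
});

app.listen(3000, () => console.log("Rendering proxy listening on port 3000"));
```

Cookie storage and logged-in sessions, as mentioned above, could be layered on with Puppeteer's `page.setCookie()` or a persistent `userDataDir` launch option.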
Parameters
Parameter | Type | Description
---|---|---
`url` | `string` | any domain's URL
`options`? | `Object` |
 | `boolean` | default=true - set referer as Google
 | `boolean` | default=true - check for bot detection messages
 | `boolean` | default=false - check robots.txt rules
 | `number` | default=3 - max redirects to follow
 | `boolean` | default=false - use proxy url
`options.timeout` | `number` | default=5 - abort request if not retrieved, in seconds
`options.userAgentIndex` | `number` | default=0 - index of [google bot, default chrome]
Returns
`Promise<string>` - HTML, JSON, ArrayBuffer, or error object
Example
```js
await scrapeURL("https://hckrnews.com", {timeout: 5, userAgentIndex: 1})
```
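A slightly fuller usage sketch, assuming the package exposes `scrapeURL` as a named export and that a failed scrape resolves to an error object (per the Returns note above) rather than throwing; the string check is only one way to tell the result cases apart.

```js
// Usage sketch: the import path and the result check are assumptions, not the package's documented API.
import { scrapeURL } from "ai-research-agent";

const result = await scrapeURL("https://hckrnews.com", {
  timeout: 5,        // abort the request after 5 seconds
  userAgentIndex: 1, // 0 = google bot, 1 = default chrome
});

if (typeof result === "string") {
  console.log("Received HTML/JSON text of length", result.length);
} else {
  // ArrayBuffer (e.g. binary content) or an error object from a failed/bot-blocked request
  console.log("Non-string result:", result);
}
```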
Other
fetchScrapingRules()
```ts
function fetchScrapingRules(url): Promise<Object>
```
Fetches and parses the robots.txt file for a given URL.
Parameters
Parameter | Type | Description
---|---|---
`url` | `string` | The base URL to fetch the robots.txt from.
Returns
`Promise<Object>` - A JSON object representing the parsed robots.txt.
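For context, here is a hypothetical sketch of fetching and parsing robots.txt into a plain object, which is what `fetchScrapingRules()` is described as doing. It is not the package's implementation, and the output shape (user agents mapped to allow/disallow path lists) is an assumption.

```js
// Hypothetical helper (not the library's code): fetch robots.txt and group
// Allow/Disallow rules by user agent into a plain JSON-serializable object.
async function fetchRobotsRules(url) {
  const robotsURL = new URL("/robots.txt", url).href;
  const response = await fetch(robotsURL);
  if (!response.ok) return {}; // no robots.txt found: nothing to report

  const rules = {};
  let currentAgents = [];
  let inRuleBlock = false;

  for (const line of (await response.text()).split("\n")) {
    const cleaned = line.split("#")[0].trim(); // strip comments and whitespace
    const separator = cleaned.indexOf(":");
    if (separator === -1) continue;

    const key = cleaned.slice(0, separator).trim().toLowerCase();
    const value = cleaned.slice(separator + 1).trim();
    if (!value) continue;

    if (key === "user-agent") {
      // Consecutive User-agent lines share one rule group; a User-agent line
      // appearing after rule lines starts a new group.
      if (inRuleBlock) { currentAgents = []; inRuleBlock = false; }
      currentAgents.push(value);
      rules[value] ??= { allow: [], disallow: [] };
    } else if (key === "allow" || key === "disallow") {
      inRuleBlock = true;
      for (const agent of currentAgents) rules[agent][key].push(value);
    }
  }
  return rules;
}

// Example: const rules = await fetchRobotsRules("https://example.org");
```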