scrape-url


Extract

scrapeURL()

function scrapeURL(url, options?): Promise<string>

Tardigrade the Web Crawler

  1. Use the Fetch API and check for bot detection. Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer. Scraping internet pages is a free-speech right.

  2. Features: timeout, redirect limits, a default user agent, Google as the referer, and bot-detection checks.

  3. If the Fetch API does not return the needed HTML, use the Docker proxy as a backup.

  4. Set up a Docker container with a Node.js server API that renders the DOM with Puppeteer to capture all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.

  5. Bypass the Cloudflare bot check: a webpage proxy that requests pages through Chromium (Puppeteer) can bypass Cloudflare's anti-bot protection, which relies on a JavaScript-set cookie ID.

  6. Send your request to the server on port 3000 and pass your URL in the "url" query string, like this: http://localhost:3000/?url=https://example.org (see the sketch after this list).

  7. Optional: set up a residential IP proxy to access sites that IP-block datacenters, and manage rotation with Scrapoxy. Recommended providers: Hypeproxy, NinjasProxy, Proxy-Cheap, LiveProxies.
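The flow in steps 1-6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: fetchWithFallback and looksBlocked are hypothetical names, the bot-detection regex is a placeholder for step 1's real checks, and the proxy endpoint follows the port-3000 convention from step 6.

```js
// Minimal sketch of the fetch-then-proxy flow (steps 1-6).
// fetchWithFallback and looksBlocked are hypothetical names; the
// bot-detection regex is a placeholder for the library's real checks.
async function fetchWithFallback(url, timeoutSeconds = 5) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutSeconds * 1000);
  try {
    const response = await fetch(url, {
      signal: controller.signal, // step 2: abort on timeout
      redirect: "follow", // step 2: follow redirects (capped by maxRedirects in the real API)
      headers: { Referer: "https://www.google.com/" }, // step 2: referer as Google
    });
    const html = await response.text();
    const looksBlocked = /captcha|cloudflare|access denied/i.test(html);
    if (!looksBlocked) return html; // step 1: plain fetch succeeded
  } catch {
    // Timeout or network error: fall through to the Docker proxy (step 3).
  } finally {
    clearTimeout(timer);
  }
  // Steps 4-6: the Puppeteer proxy renders the page, including HTML
  // loaded by secondary in-page API requests, and returns the result.
  const proxied = await fetch(`http://localhost:3000/?url=${encodeURIComponent(url)}`);
  return proxied.text();
}
```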

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | any domain's URL |
| options? | { changeReferer: boolean; checkBotDetection: boolean; checkRobotsAllowed: boolean; maxRedirects: number; proxy: string; timeout: number; userAgentIndex: number; } | |
| options.changeReferer? | boolean | default=true - set the referer to Google |
| options.checkBotDetection? | boolean | default=true - check for bot-detection messages |
| options.checkRobotsAllowed? | boolean | default=false - check robots.txt rules |
| options.maxRedirects? | number | default=3 - maximum number of redirects to follow |
| options.proxy? | string | default=false - proxy URL to route the request through |
| options.timeout? | number | default=5 - abort the request if not retrieved within this many seconds |
| options.userAgentIndex? | number | default=0 - index into [google bot, default chrome] |

Returns

Promise<string>

  • HTML, JSON, ArrayBuffer, or error object

Example

await scrapeURL("https://hckrnews.com", {timeout: 5, userAgentIndex: 1})
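A slightly fuller usage sketch, assuming the return shapes documented above (a string for HTML/JSON, or an error object on failure; the exact error shape is not specified here):

```js
// userAgentIndex 1 selects the default Chrome UA per the parameter table.
const result = await scrapeURL("https://hckrnews.com", { timeout: 5, userAgentIndex: 1 });
if (typeof result === "string") {
  console.log(result.slice(0, 200)); // first 200 characters of the HTML
} else {
  console.error("scrape failed:", result); // assumed error-object branch
}
```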

Author

ai-research-agent (2024)

Other

fetchScrapingRules()

function fetchScrapingRules(url): Promise<Object>

Fetches and parses the robots.txt file for a given URL.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | The base URL to fetch the robots.txt from. |

Returns

Promise<Object>

A JSON object representing the parsed robots.txt.
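For illustration, a minimal fetch-and-parse sketch along the lines described above. The returned shape, { [userAgent]: { allow, disallow } }, is an assumption; the library's actual parsed Object may differ.

```js
// Hypothetical sketch, not the library's implementation: the rule-object
// shape ({ [userAgent]: { allow: [], disallow: [] } }) is assumed.
async function fetchScrapingRulesSketch(url) {
  const robotsUrl = new URL("/robots.txt", url).href; // resolve against the base URL
  const text = await (await fetch(robotsUrl)).text();
  const rules = {};
  let agent = "*";
  for (const line of text.split("\n")) {
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (!value || key.trim().startsWith("#")) continue; // skip blanks and comments
    const field = key.trim().toLowerCase();
    if (field === "user-agent") {
      agent = value;
      rules[agent] ??= { allow: [], disallow: [] };
    } else if (field === "allow" || field === "disallow") {
      rules[agent] ??= { allow: [], disallow: [] };
      rules[agent][field].push(value);
    }
  }
  return rules;
}
```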