scrape-url
Extract
scrapeURL()
function scrapeURL(url: string, options?: object): Promise<string>;
Defined in: extractor/url-to-content/scrape-url.js:44
Tardigrade the Web Crawler

- Use the Fetch API and check for bot detection. Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer. Scraping internet pages is a free speech right.
- Features: timeout, redirects, default user agent, referer set as Google, and bot-detection checking.
- If the fetch method does not get the needed HTML, use the Docker proxy as a backup.
- Set up a Docker container with a Node.js server API that renders the DOM with Puppeteer to get all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.
- Bypass the Cloudflare bot check: a webpage proxy that requests through Chromium (Puppeteer) can be used to bypass Cloudflare's anti-bot cookie id JavaScript method.
- Send your request to the server on port 3000 and add your URL to the "url" query string like this: http://localhost:3000/?url=https://example.org (a minimal sketch of such a server follows this list).
- Optional: Set up a residential IP proxy to access sites that IP-block datacenters, and manage rotation with Scrapoxy. Recommended: Hypeproxy, NinjasProxy, Proxy-Cheap, LiveProxies.
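The Docker proxy described above can be sketched as a small Node.js server. This is a minimal illustration, not the project's actual container code: it assumes `express` and `puppeteer` as dependencies, and only the port (3000) and the `url` query parameter come from these docs.

```js
// Minimal sketch of the Docker proxy server (assumes express + puppeteer;
// only port 3000 and the ?url= query parameter come from these docs).
import express from "express";
import puppeteer from "puppeteer";

const app = express();

app.get("/", async (req, res) => {
  const targetUrl = req.query.url;
  if (!targetUrl) return res.status(400).send("Missing ?url= query parameter");

  let browser;
  try {
    browser = await puppeteer.launch({ args: ["--no-sandbox"] });
    const page = await browser.newPage();
    // networkidle2 waits for secondary in-page API requests to settle,
    // which also gives Cloudflare's JS cookie check time to complete.
    await page.goto(targetUrl, { waitUntil: "networkidle2", timeout: 30000 });
    res.send(await page.content());
  } catch (err) {
    res.status(500).send(String(err));
  } finally {
    if (browser) await browser.close();
  }
});

// Matches the request format above: http://localhost:3000/?url=https://example.org
app.listen(3000);
```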
Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | `string` | any domain's URL |
| `options?` | `object` | |
| | `boolean` | default=true - set referer as Google |
| | `boolean` | default=true - check for bot detection messages |
| | `boolean` | default=false - check robots.txt rules |
| | `number` | default=3 - max redirects to follow |
| | `boolean` | default=false - use proxy URL |
| `options.timeout?` | `number` | default=5 - abort request if not retrieved, in seconds |
| `options.userAgentIndex?` | `number` | default=0 - index of [google bot, default chrome] |
Returns
Promise<string> - HTML, JSON, ArrayBuffer, or error object
Example
await scrapeURL("https://hckrnews.com", {timeout: 5, userAgentIndex: 1})
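A slightly fuller usage sketch: `timeout` and `userAgentIndex` are the only option names confirmed by these docs, and the handling of non-string results follows the Returns note above.

```js
// Usage sketch: `timeout` and `userAgentIndex` are the only option
// names confirmed by these docs; handling of non-string results is
// based on the Returns note above.
const result = await scrapeURL("https://hckrnews.com", {
  timeout: 5,        // abort after 5 seconds
  userAgentIndex: 1, // 1 = default chrome (0 = google bot)
});

if (typeof result === "string") {
  console.log(result.slice(0, 200)); // first 200 chars of HTML/JSON text
} else {
  console.log(result); // ArrayBuffer (binary content) or an error object
}
```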
Other
fetchScrapingRules()
function fetchScrapingRules(url: string): Promise<any>;
Defined in: extractor/url-to-content/scrape-url.js:211
Fetches and parses the robots.txt file for a given URL.
Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | `string` | The base URL to fetch the robots.txt from. |
Returns
Promise<any> - A JSON object representing the parsed robots.txt.
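A usage sketch follows; since the return type is `any`, these docs do not specify the parsed object's shape, so none is assumed here.

```js
// Usage sketch: the docs type the result as `any`, so no particular
// property shape is assumed.
const rules = await fetchScrapingRules("https://example.org");
console.log(rules); // inspect the parsed robots.txt rules
```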
scrapeJINA()
function scrapeJINA(url: string): Promise<string>;
Defined in: extractor/url-to-content/scrape-url.js:132
As a backup, scrape with JINA to get the HTML.
Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | `string` | |
Returns
Promise<string>
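For context, Jina AI's public reader fetches a page by prefixing its URL with https://r.jina.ai/. Whether `scrapeJINA` uses this exact endpoint is an assumption; the docs only say it scrapes with JINA as a backup, and the function name below is hypothetical.

```js
// Sketch of the JINA fallback idea using Jina AI's public reader
// endpoint (https://r.jina.ai/<url>). That scrapeJINA works this way
// is an assumption; `scrapeJINASketch` is a hypothetical name.
async function scrapeJINASketch(url) {
  const res = await fetch("https://r.jina.ai/" + url);
  if (!res.ok) throw new Error("JINA reader failed: " + res.status);
  return await res.text();
}
```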