scrape-url

Extract

scrapeURL()

function scrapeURL(url: string, options?: object): Promise<string>;

Defined in: extractor/url-to-content/scrape-url.js:44

Tardigrade the Web Crawler

  1. Use the Fetch API and check for bot detection. Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer. Scraping internet pages is a free speech right.

  2. Features: timeout, redirect limits, a default user agent, Google as the referer, and bot-detection checking (see the first sketch after this list).

  3. If the Fetch method does not return the needed HTML, use the Docker proxy as a backup.

  4. Set up a Docker container running a Node.js server API that renders the DOM with Puppeteer, capturing all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.

  5. Bypass the Cloudflare bot check: a webpage proxy that requests through Chromium (Puppeteer) can be used to get past Cloudflare's anti-bot JavaScript cookie-ID method (see the server sketch after this list).

  6. Send your request to the server on port 3000 and pass your target URL in the "url" query string, like this: http://localhost:3000/?url=https://example.org

  7. Optional: set up a residential IP proxy to access sites that IP-block datacenters, and manage rotation with Scrapoxy. Recommended providers: Hypeproxy, NinjasProxy, Proxy-Cheap, LiveProxies.
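A minimal sketch of the direct-fetch path in steps 1-2, assuming Node 18+ where fetch and AbortController are global. The user-agent string, referer value, and bot-detection phrases below are illustrative assumptions, not the library's actual values:

```js
// Direct fetch with timeout, Google referer, and a naive bot-detection check.
// Header values and detection phrases are assumptions for illustration.
async function fetchDirect(url, { timeout = 5 } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout * 1000);
  try {
    const res = await fetch(url, {
      redirect: "follow", // standard fetch cannot cap the redirect count itself
      headers: {
        "User-Agent":
          "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        Referer: "https://www.google.com/",
      },
      signal: controller.signal,
    });
    const body = await res.text();
    // Naive bot-detection check: look for common challenge-page phrases.
    if (/verify you are human|enable javascript|captcha/i.test(body))
      throw new Error("bot detection triggered");
    return body;
  } finally {
    clearTimeout(timer);
  }
}
```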
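And a hedged sketch of the Docker-hosted proxy in steps 4-6: a small Node.js HTTP server (run as an ES module) that renders the requested page through Puppeteer, so secondary in-page requests and Cloudflare's JavaScript cookie check can complete before the HTML is returned. The route shape follows step 6; everything else (launch options, wait strategy) is an assumption, not the container's actual code:

```js
import http from "node:http";
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();

http
  .createServer(async (req, res) => {
    const target = new URL(req.url, "http://localhost:3000").searchParams.get("url");
    if (!target) return res.writeHead(400).end("missing ?url=");
    const page = await browser.newPage();
    try {
      // networkidle0 waits for secondary in-page API requests to settle.
      await page.goto(target, { waitUntil: "networkidle0", timeout: 30000 });
      res.writeHead(200, { "Content-Type": "text/html" }).end(await page.content());
    } catch (err) {
      res.writeHead(502).end(String(err));
    } finally {
      await page.close();
    }
  })
  .listen(3000);
```

Query it exactly as in step 6: http://localhost:3000/?url=https://example.org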

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | any domain's URL |
| options? | { changeReferer: boolean; checkBotDetection: boolean; checkRobotsAllowed: boolean; maxRedirects: number; proxy: string; timeout: number; userAgentIndex: number; } | - |
| options.changeReferer? | boolean | default=true - set the referer to Google |
| options.checkBotDetection? | boolean | default=true - check for bot-detection messages |
| options.checkRobotsAllowed? | boolean | default=false - check robots.txt rules |
| options.maxRedirects? | number | default=3 - maximum number of redirects to follow |
| options.proxy? | string | default=false - proxy URL to route the request through |
| options.timeout? | number | default=5 - abort the request if not retrieved within this many seconds |
| options.userAgentIndex? | number | default=0 - index into [google bot, default chrome] |

Returns

Promise<string>

  • HTML, JSON, ArrayBuffer, or an error object

Example

await scrapeURL("https://hckrnews.com", {timeout: 5, userAgentIndex: 1})
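For a fuller call, a hedged example using more of the documented options; the proxy URL format is an assumption, since only its type (string) is documented here:

```js
// Exercises more documented options: longer timeout, more redirects,
// robots.txt checking, and a proxy URL (format assumed for illustration).
await scrapeURL("https://example.org", {
  timeout: 10,
  maxRedirects: 5,
  checkRobotsAllowed: true,
  userAgentIndex: 0, // google bot
  proxy: "http://localhost:3000/?url=", // assumed prefix format
});
```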

Author

ai-research-agent (2024)

Other

fetchScrapingRules()

function fetchScrapingRules(url: string): Promise<any>;

Defined in: extractor/url-to-content/scrape-url.js:211

Fetches and parses the robots.txt file for a given URL.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | The base URL to fetch the robots.txt from. |

Returns

Promise<any>

A JSON object representing the parsed robots.txt.
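A short usage sketch; the shape of the parsed object is not documented on this page, so the example only retrieves and logs it:

```js
// Fetch and inspect a site's parsed robots.txt rules. Accessing specific
// properties of the returned object would be an assumption, so it is only logged.
const rules = await fetchScrapingRules("https://example.org");
console.log(rules);
```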


scrapeJINA()

function scrapeJINA(url: string): Promise<string>;

Defined in: extractor/url-to-content/scrape-url.js:132

As a backup, scrape with JINA to get the HTML.

Parameters

| Parameter | Type |
| --- | --- |
| url | string |

Returns

Promise<string>
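For illustration, a minimal sketch of how such a JINA fallback can work; the use of JINA's public r.jina.ai Reader endpoint is an assumption, and the library's actual implementation may differ:

```js
// Assumed approach: JINA's public Reader endpoint returns extracted page
// content when the target URL is appended after https://r.jina.ai/.
async function scrapeJINASketch(url) {
  const res = await fetch("https://r.jina.ai/" + url);
  if (!res.ok) throw new Error("JINA request failed: " + res.status);
  return res.text();
}
```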