scrape-url


Extract

scrapeURL()

function scrapeURL(url, options?): Promise<string>

Tardigrade the Web Crawler

  1. Use the Fetch API and check for bot detection. Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer. Scraping internet pages is a free-speech right.

  2. Features: timeout, redirect limits, a default user agent, Google as the referer, and bot-detection checks.

  3. If the Fetch API does not return the needed HTML, use the Docker proxy as a backup.

  4. Set up a Docker container with a Node.js server API that renders the DOM with Puppeteer to capture all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.

  5. Bypass the Cloudflare bot check: a webpage proxy that requests pages through Chromium (Puppeteer) can bypass Cloudflare's anti-bot protection, which relies on a JavaScript-set cookie ID.

  6. Send your request to the server on port 3000 and pass your URL in the "url" query string, like this: http://localhost:3000/?url=https://example.org (see the sketch after this list).

  7. Optional: set up a residential IP proxy to access sites that IP-block datacenters, and manage rotation with Scrapoxy. Recommended providers: Hypeproxy, NinjasProxy, Proxy-Cheap, LiveProxies.
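The flow in steps 1-6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: fetchWithFallback and looksBlocked are hypothetical names, the bot-detection regex is a placeholder for step 1's real checks, and the proxy endpoint follows the port-3000 convention from step 6.

```js
// Minimal sketch of the fetch-then-proxy flow (steps 1-6).
// fetchWithFallback and looksBlocked are hypothetical names; the
// bot-detection regex is a placeholder for the library's real checks.
async function fetchWithFallback(url, timeoutSeconds = 5) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutSeconds * 1000);
  try {
    const response = await fetch(url, {
      signal: controller.signal, // step 2: abort on timeout
      redirect: "follow", // step 2: follow redirects (capped by maxRedirects in the real API)
      headers: { Referer: "https://www.google.com/" }, // step 2: referer as Google
    });
    const html = await response.text();
    const looksBlocked = /captcha|cloudflare|access denied/i.test(html);
    if (!looksBlocked) return html; // step 1: plain fetch succeeded
  } catch {
    // Timeout or network error: fall through to the Docker proxy (step 3).
  } finally {
    clearTimeout(timer);
  }
  // Steps 4-6: the Puppeteer proxy renders the page, including HTML
  // loaded by secondary in-page API requests, and returns the result.
  const proxied = await fetch(`http://localhost:3000/?url=${encodeURIComponent(url)}`);
  return proxied.text();
}
```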

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | any domain's URL |
| options? | { changeReferer: boolean; checkBotDetection: boolean; checkRobotsAllowed: boolean; maxRedirects: number; proxy: string; timeout: number; userAgentIndex: number; } | |
| options.changeReferer? | boolean | default=true - set the referer to Google |
| options.checkBotDetection? | boolean | default=true - check for bot-detection messages |
| options.checkRobotsAllowed? | boolean | default=false - check robots.txt rules |
| options.maxRedirects? | number | default=3 - maximum number of redirects to follow |
| options.proxy? | string | default=false - proxy URL to route the request through |
| options.timeout? | number | default=5 - abort the request if not retrieved within this many seconds |
| options.userAgentIndex? | number | default=0 - index into [google bot, default chrome] |

Returns

Promise<string>

  • HTML, JSON, ArrayBuffer, or error object

Example

await scrapeURL("https://hckrnews.com", {timeout: 5, userAgentIndex: 1})
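A slightly fuller usage sketch, assuming the return shapes documented above (a string for HTML/JSON, or an error object on failure; the exact error shape is not specified here):

```js
// userAgentIndex 1 selects the default Chrome UA per the parameter table.
const result = await scrapeURL("https://hckrnews.com", { timeout: 5, userAgentIndex: 1 });
if (typeof result === "string") {
  console.log(result.slice(0, 200)); // first 200 characters of the HTML
} else {
  console.error("scrape failed:", result); // assumed error-object branch
}
```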

Author

ai-research-agent (2024)

Other

fetchScrapingRules()

function fetchScrapingRules(url): Promise<Object>

Fetches and parses the robots.txt file for a given URL.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | The base URL to fetch the robots.txt from. |

Returns

Promise<Object>

A JSON object representing the parsed robots.txt.
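For illustration, a minimal fetch-and-parse sketch along the lines described above. The returned shape, { [userAgent]: { allow, disallow } }, is an assumption; the library's actual parsed Object may differ.

```js
// Hypothetical sketch, not the library's implementation: the rule-object
// shape ({ [userAgent]: { allow: [], disallow: [] } }) is assumed.
async function fetchScrapingRulesSketch(url) {
  const robotsUrl = new URL("/robots.txt", url).href; // resolve against the base URL
  const text = await (await fetch(robotsUrl)).text();
  const rules = {};
  let agent = "*";
  for (const line of text.split("\n")) {
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (!value || key.trim().startsWith("#")) continue; // skip blanks and comments
    const field = key.trim().toLowerCase();
    if (field === "user-agent") {
      agent = value;
      rules[agent] ??= { allow: [], disallow: [] };
    } else if (field === "allow" || field === "disallow") {
      rules[agent] ??= { allow: [], disallow: [] };
      rules[agent][field].push(value);
    }
  }
  return rules;
}
```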