Function scrapeURL

  • Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer response.
    Features: timeout, redirect following, a default user agent, Google as the referer, and bot-detection checking.
    Scraping internet pages is a free speech right globally.

    1. A Docker container runs a Node.js server API that renders the DOM with Puppeteer, capturing all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.
    2. Bypass the Cloudflare bot check: a webpage proxy that requests pages through Chromium (Puppeteer), which can get past Cloudflare's anti-bot JavaScript cookie challenge.
    3. Send your request to the server on port 3000, passing the target URL in the "url" query string, like this: http://localhost:3000/?url=https://example.org
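    Step 3 can be sketched as a small helper that percent-encodes the target URL before appending it to the proxy's "url" query parameter; the `proxyBase` default here is an assumption matching the example above:

    ```typescript
    // Build the proxy request URL. Percent-encoding the target keeps its own
    // query string from being mistaken for extra parameters on the proxy URL.
    function proxyURL(target: string, proxyBase = "http://localhost:3000"): string {
      return `${proxyBase}/?url=${encodeURIComponent(target)}`;
    }

    // proxyURL("https://example.org")
    // → "http://localhost:3000/?url=https%3A%2F%2Fexample.org"
    ```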

    Parameters

    • url: string

      any domain's URL

    • Optional options: {
          timeout: number;
          maxRedirects: number;
          checkBotDetection: boolean;
          changeReferer: boolean;
          userAgentIndex: number;
          useCORSProxy: boolean;
          urlProxy: string;
      } = {}
      • timeout: number

        default=5 - abort the request if no response is retrieved, in seconds

      • maxRedirects: number

        default=3 - max redirects to follow

      • checkBotDetection: boolean

        default=true - check the response for bot-detection messages

      • changeReferer: boolean

        default=true - set the Referer header to Google

      • userAgentIndex: number

        default=0 - index into the user-agent list: 0 = Googlebot, 1 = default Chrome

      • useCORSProxy: boolean

        default=false - route through corsproxy.io (works roughly 60% of the time; frontend JS only)

      • urlProxy: string

        default: none - proxy URL to route the request through
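    Collected in one place, the defaults listed above look like this; `urlProxy` is shown as an empty string here since no default proxy URL is documented (an assumption):

    ```typescript
    // Defaults for scrapeURL options, as documented above.
    const defaultOptions = {
      timeout: 5,              // seconds before the request is aborted
      maxRedirects: 3,         // max redirects to follow
      checkBotDetection: true, // scan response for bot-detection messages
      changeReferer: true,     // send Google as the Referer
      userAgentIndex: 0,       // 0 = Googlebot UA, 1 = default Chrome UA
      useCORSProxy: false,     // corsproxy.io fallback (frontend JS)
      urlProxy: "",            // empty string: no proxy (assumption)
    };

    // Caller options override the defaults via object spread.
    const merged = { ...defaultOptions, timeout: 10, userAgentIndex: 1 };
    ```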

    Returns Promise<any>

    • HTML, JSON, ArrayBuffer, or an error object

      Example:
      await scrapeURL("https://hckrnews.com", { timeout: 5, userAgentIndex: 1 })
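    A minimal sketch of the core request path, assuming Node 18+ `fetch` with an `AbortController` timeout; the real function also handles the redirect limit, bot-detection checks, and the proxy options. The user-agent strings below are illustrative placeholders, not the library's actual values:

    ```typescript
    // Hypothetical user-agent list: index 0 = Googlebot, index 1 = default Chrome.
    const USER_AGENTS = [
      "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ];

    async function scrapeURLSketch(
      url: string,
      { timeout = 5, changeReferer = true, userAgentIndex = 0 } = {},
    ): Promise<any> {
      // Abort the request if it exceeds the timeout (documented default: 5 s).
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeout * 1000);
      try {
        const res = await fetch(url, {
          redirect: "follow",
          signal: controller.signal,
          headers: {
            "User-Agent": USER_AGENTS[userAgentIndex],
            ...(changeReferer ? { Referer: "https://www.google.com/" } : {}),
          },
        });
        // Dispatch on Content-Type, mirroring the documented return types.
        const type = res.headers.get("content-type") ?? "";
        if (type.includes("application/json")) return res.json();
        if (type.startsWith("text/")) return res.text();
        return res.arrayBuffer(); // binary responses
      } finally {
        clearTimeout(timer); // always clear the abort timer
      }
    }
    ```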
    

    Gulakov, A. (2024)