Function scrapeURL

    1. Use the Fetch API to scrape any domain's URL and get its HTML, JSON, or ArrayBuffer, checking the response for bot detection.
      The author's position is that scraping publicly accessible web pages is a free-speech right.

    2. Features: request timeout, redirect limit, default user agent (UA), Google as referer, and bot-detection checking.

    3. If the fetch method does not return the needed HTML, use the Docker proxy as a backup.

    4. Set up a Docker container running a NodeJS server API that renders the page with Puppeteer, so the returned HTML includes content loaded by secondary in-page API requests after the initial page request. The container also supports user login and cookie storage.

    5. Bypass the Cloudflare bot check: the webpage proxy sends the request through Chromium (Puppeteer), which can be used to bypass the Cloudflare anti-bot check using the cookie ID JavaScript method.

    6. Send your request to the server on port 3000 and pass your URL in the "url" query string, like this: http://localhost:3000/?url=https://example.org
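Steps 1 to 3 and 6 can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: `looksLikeBotCheck`, `buildProxyUrl`, `fetchWithFallback`, and the marker phrases are hypothetical names and values chosen for this example. Only port 3000 and the "url" query string come from the description above.

```typescript
// Hypothetical marker phrases; the library's real list is not shown in this document.
const BOT_MARKERS: string[] = ["just a moment", "verify you are human", "captcha"];

/** Heuristic: true when the HTML looks like an anti-bot interstitial page. */
function looksLikeBotCheck(html: string): boolean {
  const lower = html.toLowerCase();
  return BOT_MARKERS.some((marker) => lower.includes(marker));
}

/** Builds the proxy request URL, percent-encoding the target so its own query string survives. */
function buildProxyUrl(target: string, proxyBase: string = "http://localhost:3000/"): string {
  const proxy = new URL(proxyBase);
  proxy.searchParams.set("url", target);
  return proxy.toString();
}

// Minimal shape of fetch used here, so a stub can be injected for testing.
type FetchLike = (
  input: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

/** Tries a direct fetch first, then falls back to the Docker proxy (assumed to be running). */
async function fetchWithFallback(
  url: string,
  timeoutSeconds: number = 5,
  fetchImpl: FetchLike = fetch,
): Promise<string> {
  const attempt = async (target: string): Promise<string> => {
    const controller = new AbortController();
    // Abort the request if it has not completed within the timeout.
    const timer = setTimeout(() => controller.abort(), timeoutSeconds * 1000);
    try {
      const response = await fetchImpl(target, { signal: controller.signal });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.text();
    } finally {
      clearTimeout(timer);
    }
  };

  try {
    const html = await attempt(url);
    if (!looksLikeBotCheck(html)) return html;
  } catch {
    // Direct fetch failed or timed out; fall through to the proxy.
  }
  return attempt(buildProxyUrl(url));
}
```

Encoding the target with `searchParams.set` matters when the scraped URL carries its own query string, which would otherwise be parsed as parameters of the proxy request.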

    Parameters

    • url: string

      any domain's URL

    • Optional options: {
          timeout: number;
          maxRedirects: number;
          checkBotDetection: number;
          changeReferer: number;
          userAgentIndex: number;
          useCORSProxy: number;
          proxy: string;
          checkRobotsAllowed: boolean;
      } = {}
      • timeout: number

        default=5 - seconds to wait before aborting the request if nothing has been retrieved

      • maxRedirects: number

        default=3 - max redirects to follow

      • checkBotDetection: number

        default=true - check for bot detection messages

      • changeReferer: number

        default=true - set the referer to Google

      • userAgentIndex: number

        default=0 - index into the user agent list [Googlebot, default Chrome]

      • useCORSProxy: number

        default=false - use corsproxy.io, which works for roughly 60% of requests (in frontend JS)

      • proxy: string

        default=false - proxy URL to route the request through

      • checkRobotsAllowed: boolean

        default=false - check robots.txt rules
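One way the documented defaults might be applied is sketched below. The names `ScrapeOptions`, `resolveOptions`, and `buildHeaders` are hypothetical, the flags are typed as booleans on the assumption that the documented `true`/`false` defaults reflect their intended type, and the Chrome user agent string is only illustrative.

```typescript
// Assumed option shape; flag types follow the documented defaults, not the declared signature.
interface ScrapeOptions {
  timeout: number; // seconds
  maxRedirects: number;
  checkBotDetection: boolean;
  changeReferer: boolean;
  userAgentIndex: number;
  useCORSProxy: boolean;
  proxy: string | false;
  checkRobotsAllowed: boolean;
}

// Default values as listed in the parameter descriptions.
const DEFAULT_OPTIONS: ScrapeOptions = {
  timeout: 5,
  maxRedirects: 3,
  checkBotDetection: true,
  changeReferer: true,
  userAgentIndex: 0,
  useCORSProxy: false,
  proxy: false,
  checkRobotsAllowed: false,
};

// Index 0 is Googlebot, index 1 a desktop Chrome string (illustrative version number).
const GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const CHROME_UA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";
const USER_AGENTS: string[] = [GOOGLEBOT_UA, CHROME_UA];

/** Merges caller overrides over the defaults; omitted keys keep their default value. */
function resolveOptions(overrides: Partial<ScrapeOptions> = {}): ScrapeOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

/** Derives request headers from the resolved options. */
function buildHeaders(options: ScrapeOptions): Record<string, string> {
  const headers: Record<string, string> = {
    // Fall back to the first entry when the index is out of range.
    "User-Agent": USER_AGENTS[options.userAgentIndex] ?? GOOGLEBOT_UA,
  };
  if (options.changeReferer) {
    headers["Referer"] = "https://www.google.com/";
  }
  return headers;
}
```

For example, `buildHeaders(resolveOptions({ userAgentIndex: 1 }))` would send the Chrome string together with the Google referer, while every other option keeps its default.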

    Returns Promise<any>

    • HTML, JSON, arraybuffer, or error object

    Example

    await scrapeURL("https://hckrnews.com", {timeout: 5, userAgentIndex: 1})
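Because the promise resolves to `any`, callers may want to narrow the result before using it. The sketch below shows one possible approach. `describeResult` is a hypothetical helper, and the assumption that an error object carries an `error` property is not confirmed by this page.

```typescript
/** Classifies a scrapeURL-style result: HTML string, binary buffer, error object, or parsed JSON. */
function describeResult(result: unknown): string {
  if (typeof result === "string") {
    return `html (${result.length} chars)`;
  }
  if (result instanceof ArrayBuffer) {
    return `binary (${result.byteLength} bytes)`;
  }
  // Assumption: error objects expose an "error" property; the real shape is not documented here.
  if (typeof result === "object" && result !== null && "error" in result) {
    return `error: ${String((result as { error: unknown }).error)}`;
  }
  return "json";
}
```

With a helper like this, the result of the call above could be passed through `describeResult` before deciding whether to parse it as HTML or to retry.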
    

    ai-research-agent (2024)