Function scrapeURL

  • Scrape any domain's URL to get its HTML, JSON, or ArrayBuffer response.
    Features: timeout, redirect following, a default user agent, Google as the referer, and bot-detection checking.
    Scraping internet pages is a free speech right globally.

    1. A Docker container runs a Node.js server API that renders the DOM with Puppeteer, capturing all HTML loaded by secondary in-page API requests after the initial page request, including user login and cookie storage.
    2. Bypass the Cloudflare bot check: a webpage proxy that requests pages through Chromium (Puppeteer), which can get past Cloudflare's anti-bot JavaScript cookie challenge.
    3. Send your request to the server on port 3000, passing the target URL in the "url" query string, like this: http://localhost:3000/?url=https://example.org
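    Step 3 can be sketched as a small helper that percent-encodes the target URL before appending it to the proxy's "url" query parameter; the `proxyBase` default here is an assumption matching the example above:

    ```typescript
    // Build the proxy request URL. Percent-encoding the target keeps its own
    // query string from being mistaken for extra parameters on the proxy URL.
    function proxyURL(target: string, proxyBase = "http://localhost:3000"): string {
      return `${proxyBase}/?url=${encodeURIComponent(target)}`;
    }

    // proxyURL("https://example.org")
    // → "http://localhost:3000/?url=https%3A%2F%2Fexample.org"
    ```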

    Parameters

    • url: string

      any domain's URL

    • Optional options: {
          timeout: number;
          maxRedirects: number;
          checkBotDetection: boolean;
          changeReferer: boolean;
          userAgentIndex: number;
          useCORSProxy: boolean;
          urlProxy: string;
      } = {}
      • timeout: number

        default=5 - abort the request if no response is retrieved, in seconds

      • maxRedirects: number

        default=3 - max redirects to follow

      • checkBotDetection: boolean

        default=true - check the response for bot-detection messages

      • changeReferer: boolean

        default=true - set the Referer header to Google

      • userAgentIndex: number

        default=0 - index into the user-agent list: 0 = Googlebot, 1 = default Chrome

      • useCORSProxy: boolean

        default=false - route through corsproxy.io (works roughly 60% of the time; frontend JS only)

      • urlProxy: string

        default: none - proxy URL to route the request through
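    Collected in one place, the defaults listed above look like this; `urlProxy` is shown as an empty string here since no default proxy URL is documented (an assumption):

    ```typescript
    // Defaults for scrapeURL options, as documented above.
    const defaultOptions = {
      timeout: 5,              // seconds before the request is aborted
      maxRedirects: 3,         // max redirects to follow
      checkBotDetection: true, // scan response for bot-detection messages
      changeReferer: true,     // send Google as the Referer
      userAgentIndex: 0,       // 0 = Googlebot UA, 1 = default Chrome UA
      useCORSProxy: false,     // corsproxy.io fallback (frontend JS)
      urlProxy: "",            // empty string: no proxy (assumption)
    };

    // Caller options override the defaults via object spread.
    const merged = { ...defaultOptions, timeout: 10, userAgentIndex: 1 };
    ```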

    Returns Promise<any>

    • HTML, JSON, ArrayBuffer, or an error object

      Example:
      await scrapeURL("https://hckrnews.com", { timeout: 5, userAgentIndex: 1 })
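    A minimal sketch of the core request path, assuming Node 18+ `fetch` with an `AbortController` timeout; the real function also handles the redirect limit, bot-detection checks, and the proxy options. The user-agent strings below are illustrative placeholders, not the library's actual values:

    ```typescript
    // Hypothetical user-agent list: index 0 = Googlebot, index 1 = default Chrome.
    const USER_AGENTS = [
      "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ];

    async function scrapeURLSketch(
      url: string,
      { timeout = 5, changeReferer = true, userAgentIndex = 0 } = {},
    ): Promise<any> {
      // Abort the request if it exceeds the timeout (documented default: 5 s).
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeout * 1000);
      try {
        const res = await fetch(url, {
          redirect: "follow",
          signal: controller.signal,
          headers: {
            "User-Agent": USER_AGENTS[userAgentIndex],
            ...(changeReferer ? { Referer: "https://www.google.com/" } : {}),
          },
        });
        // Dispatch on Content-Type, mirroring the documented return types.
        const type = res.headers.get("content-type") ?? "";
        if (type.includes("application/json")) return res.json();
        if (type.startsWith("text/")) return res.text();
        return res.arrayBuffer(); // binary responses
      } finally {
        clearTimeout(timer); // always clear the abort timer
      }
    }
    ```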
    

    Gulakov, A. (2024)