Function extractMainContentFromHTML

  • Extracts the main content of an HTML document using regex patterns: it cleans the HTML, scores candidate nodes on content indicators (such as paragraphs and id/class names), selects the top-scoring candidate, extracts it, and cleans up the extracted content.

    1. Define regular expressions:

      • Various regex patterns are defined to identify content and non-content areas.
    2. Define helper functions:

      • normalizeSpaces: Normalizes whitespace in a string.
      • stripTags: Removes all HTML tags from a string.
      • getTextLength: Calculates the length of text after stripping tags.
      • calculateLinkDensity: Calculates the ratio of link text to total text.
    3. Clean HTML:

      • Remove unlikely candidates (e.g., ads, sidebars) from the HTML.
    4. Define scoring function:

      • scoreNode: Assigns a score to an HTML node based on content and attributes.
      • Increases score for positive indicators (e.g., article, body, content in tag or id/class names).
      • Decreases score for negative indicators (e.g., hidden, footer, sidebar in tag or id/class names).
      • Adds to score based on paragraph tags and text length.
    5. Find and score candidate nodes:

      • Identify potential content nodes in the cleaned HTML.
      • Score each node using the scoreNode function.
    6. Select top candidate:

      • Sort candidates by score and select the highest-scoring node.
    7. Extract content:

      • Use regex to extract content around the top candidate node.
    8. Clean up extracted content:

      • Remove script and style tags and their contents.
      • Process anchor tags based on content density.
      • Keep only specific HTML tags (a, p, img, h1-h6, ul, ol, li).
      • Remove excess whitespace from the final content.
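
The steps above can be sketched as follows. This is a minimal illustration and not the function's actual source: the regex patterns, the score weights (25, one point per paragraph, up to 3 for text length), and the helpers pickTopCandidate and cleanUp are assumptions made for the example. Only the helper names from step 2 and scoreNode come from the description.

```javascript
// Hypothetical indicator patterns; the real regexes are not shown in this document.
const POSITIVE = /article|body|content/i;
const NEGATIVE = /hidden|footer|sidebar/i;
const ALLOWED = new Set(["a", "p", "img", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li"]);

// Collapse runs of whitespace into single spaces and trim the ends.
function normalizeSpaces(str) {
  return str.replace(/\s+/g, " ").trim();
}

// Remove anything that looks like an HTML tag.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Length of the visible text once tags and extra whitespace are gone.
function getTextLength(html) {
  return normalizeSpaces(stripTags(html)).length;
}

// Ratio of text inside <a> elements to all text (0 when there is no text).
function calculateLinkDensity(html) {
  const total = getTextLength(html);
  if (total === 0) return 0;
  const links = html.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [];
  const linkLength = links.reduce((sum, a) => sum + getTextLength(a), 0);
  return linkLength / total;
}

// Score a candidate from its opening tag and inner HTML (weights are assumed).
function scoreNode(openTag, innerHtml) {
  let score = 0;
  if (POSITIVE.test(openTag)) score += 25;
  if (NEGATIVE.test(openTag)) score -= 25;
  score += (innerHtml.match(/<p\b/gi) || []).length;
  score += Math.min(Math.floor(getTextLength(innerHtml) / 100), 3);
  return score;
}

// Hypothetical helper: score every candidate and return the highest-scoring one.
function pickTopCandidate(candidates) {
  const scored = candidates.map((c) => ({ ...c, score: scoreNode(c.openTag, c.innerHtml) }));
  scored.sort((a, b) => b.score - a.score);
  return scored[0];
}

// Hypothetical helper for step 8: drop script/style blocks, keep only allowed tags.
function cleanUp(html) {
  const noScripts = html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "");
  const kept = noScripts.replace(/<\/?([a-z][a-z0-9]*)\b[^>]*>/gi, (tag, name) =>
    ALLOWED.has(name.toLowerCase()) ? tag : ""
  );
  return normalizeSpaces(kept);
}
```

Regex-based tag handling like this is fragile on malformed markup, which is one reason the thresholds in the options below exist to reject weak candidates.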

    Article Extraction Benchmark

    Parameters

    • html: any
    • Optional options: {
          minContentLength: number;
          minScore: number;
          minTextLength: number;
          retryLength: number;
      } = {}
      • minContentLength: number

        default=140 - Minimum length of content to be considered valid

      • minScore: number

        default=20 - Minimum score for content to be considered valid

      • minTextLength: number

        default=25 - Minimum length of text to be considered valid

      • retryLength: number

        default=250 - Length to retry content extraction if initial attempt fails
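
As an illustration of how these defaults might be applied, the sketch below merges caller-supplied options over the documented default values. The names DEFAULT_OPTIONS and resolveOptions are hypothetical and are not part of the documented API.

```javascript
// Default values as listed in the parameter documentation above.
const DEFAULT_OPTIONS = {
  minContentLength: 140,
  minScore: 20,
  minTextLength: 25,
  retryLength: 250,
};

// Hypothetical helper: fill in any option the caller left out.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}
```

Under this assumption, a call such as resolveOptions({ minScore: 30 }) would raise only the score threshold and leave the other three values at their defaults.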

    Returns Element

    Extracted HTML element of main content such as article body

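    // Fetch the page HTML, then extract its main content.
    // Top-level await requires an ES module or an enclosing async function.
    const url = "https://www.nytimes.com/2024/08/28/business/telegram-ceo-pavel-durov-charged.html";
    const html = await (await fetch(url)).text();
    const articleContent = extractMainContentFromHTML(html);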

    Based on Mozilla Readability (2015), Arc90 (2010)