extract-content-readability

Extract

extractMainContentFromHTML()

function extractMainContentFromHTML(html, options?): Element

HTML-to-Main-Content Extractor #1

The function extracts the main content of an HTML page in several stages: it defines regex patterns, cleans the HTML, scores nodes on content indicators such as paragraphs and id/class names, selects the top-scoring candidate, extracts it, and cleans up the content around it.

  1. Define regular expressions:

    • Various regex patterns are defined to identify content and non-content areas.
  2. Define helper functions:

    • normalizeSpaces: Normalizes whitespace in a string.
    • stripTags: Removes all HTML tags from a string.
    • getTextLength: Calculates the length of text after stripping tags.
    • calculateLinkDensity: Calculates the ratio of link text to total text.
  3. Clean HTML:

    • Remove unlikely candidates (e.g., ads, sidebars) from the HTML.
  4. Define scoring function:

    • scoreNode: Assigns a score to an HTML node based on content and attributes.
    • Increases score for positive indicators (e.g., article, body, content in tag or id/class names).
    • Decreases score for negative indicators (e.g., hidden, footer, sidebar in tag or id/class names).
    • Adds to score based on paragraph tags and text length.
  5. Find and score candidate nodes:

    • Identify potential content nodes in the cleaned HTML.
    • Score each node using the scoreNode function.
  6. Select top candidate:

    • Sort candidates by score and select the highest-scoring node.
  7. Extract content:

    • Use regex to extract content around the top candidate node.
  8. Clean up extracted content:

    • Remove script and style tags and their contents.
    • Process anchor tags based on content density.
    • Keep only specific HTML tags (a, p, img, h1-h6, ul, ol, li).
    • Remove excess whitespace from the final content.
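
The helper and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the library's source: the regex patterns, the score weights (25, 5, and the cap of 3), the `{ attrs, innerHTML }` candidate shape, and `selectTopCandidate` are all assumptions made for the example. Only the helper names `normalizeSpaces`, `stripTags`, `getTextLength`, `calculateLinkDensity`, and `scoreNode` come from the description.

```javascript
// Illustrative patterns only (assumed); the library's real regexes are not listed on this page.
const POSITIVE = /article|body|content|entry|main|post|text/i;
const NEGATIVE = /hidden|footer|sidebar|comment|nav|promo|banner/i;

// Collapse runs of whitespace into single spaces and trim both ends.
const normalizeSpaces = (str) => str.replace(/\s+/g, " ").trim();

// Remove anything that looks like an HTML tag.
const stripTags = (html) => html.replace(/<[^>]*>/g, "");

// Length of the text left after stripping tags and normalizing whitespace.
const getTextLength = (html) => normalizeSpaces(stripTags(html)).length;

// Ratio of text inside <a> elements to all text in the fragment (0 when empty).
function calculateLinkDensity(html) {
  const total = getTextLength(html);
  if (total === 0) return 0;
  const links = html.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [];
  const linkLength = links.reduce((sum, link) => sum + getTextLength(link), 0);
  return linkLength / total;
}

// Score one candidate, given its id/class string and inner HTML (assumed shape).
function scoreNode({ attrs, innerHTML }) {
  let score = 0;
  if (POSITIVE.test(attrs)) score += 25; // assumed weight
  if (NEGATIVE.test(attrs)) score -= 25; // assumed weight
  const paragraphs = innerHTML.match(/<p\b/gi) || [];
  score += paragraphs.length * 5; // reward paragraph tags
  score += Math.min(Math.floor(getTextLength(innerHTML) / 100), 3); // reward text length, capped
  return score;
}

// Hypothetical helper: score every candidate and return the highest-scoring one.
function selectTopCandidate(candidates) {
  return candidates
    .map((node) => ({ ...node, score: scoreNode(node) }))
    .sort((a, b) => b.score - a.score)[0];
}
```

With weights like these, a long block whose id or class contains "article" outranks a short block marked "sidebar". A candidate's link density can then be checked to decide how its anchor tags are handled during cleanup.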

Article Extraction Benchmark

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| html | any | |
| options? | { minContentLength: number; minScore: number; minTextLength: number; retryLength: number; } | |
| options.minContentLength? | number | default=140 - Minimum length of content to be considered valid |
| options.minScore? | number | default=20 - Minimum score for content to be considered valid |
| options.minTextLength? | number | default=25 - Minimum length of text to be considered valid |
| options.retryLength? | number | default=250 - Length to retry content extraction if initial attempt fails |
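
As a rough sketch of how the optional fields might combine with the documented defaults: the `DEFAULT_OPTIONS` constant and the `resolveOptions` helper below are hypothetical names introduced for illustration, and the library may merge options differently. Only the field names and default values come from the parameter list above.

```javascript
// Default values as documented for each options field.
const DEFAULT_OPTIONS = Object.freeze({
  minContentLength: 140,
  minScore: 20,
  minTextLength: 25,
  retryLength: 250,
});

// Hypothetical helper: overlay caller-supplied fields on the documented defaults.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Only minScore is overridden; the other three fields keep their defaults.
const resolved = resolveOptions({ minScore: 30 });
console.log(resolved);
```

In practice a caller would pass the partial object straight to the function, for example `extractMainContentFromHTML(html, { minScore: 30 })`.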

Returns

Element

Extracted HTML element of main content such as article body

Example

// Fetch the article HTML, then extract its main content element.
const url = "https://www.nytimes.com/2024/08/28/business/telegram-ceo-pavel-durov-charged.html";
const html = await (await fetch(url)).text();
const articleContent = extractMainContentFromHTML(html);

Author

ai-research-agent (2024), based on Mozilla Readability (2015) and Arc90 (2010)