extract-content-readability

Extract

extractMainContentFromHTML()

function extractMainContentFromHTML(html: any, options?: object): Element;

Defined in: extractor/html-to-content/extract-content/extract-content-readability.js:60

HTML-to-Main-Content Extractor #1

Extracts the main content of an HTML document. The function cleans the HTML with regex patterns, scores candidate nodes on content indicators such as paragraphs and id/class names, selects the top-scoring candidate, extracts it, and cleans up the content around it.

  1. Define regular expressions:

    • Various regex patterns are defined to identify content and non-content areas.
  2. Define helper functions:

    • normalizeSpaces: Normalizes whitespace in a string.
    • stripTags: Removes all HTML tags from a string.
    • getTextLength: Calculates the length of text after stripping tags.
    • calculateLinkDensity: Calculates the ratio of link text to total text.
  3. Clean HTML:

    • Remove unlikely candidates (e.g., ads, sidebars) from the HTML.
  4. Define scoring function:

    • scoreNode: Assigns a score to an HTML node based on its content and attributes.
    • Increases the score for positive indicators in the id or class name (e.g., article, body, content).
    • Decreases the score for negative indicators in the id or class name (e.g., hidden, footer, sidebar).
    • Adds to the score based on paragraph tags and text length.
  5. Find and score candidate nodes:

    • Identify potential content nodes in the cleaned HTML.
    • Score each node using the scoreNode function.
  6. Select top candidate:

    • Sort candidates by score and select the highest-scoring node.
  7. Extract content:

    • Use regex to extract content around the top candidate node.
  8. Clean up extracted content:

    • Remove script and style tags and their contents.
    • Process anchor tags based on content density.
    • Keep only specific HTML tags (a, p, img, h1-h6, ul, ol, li).
    • Remove excess whitespace from the final content.
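
The helper functions named in step 2 could look roughly like the sketch below. The names come from the list above, but the bodies are assumptions written for illustration and are not taken from the library's source. Regex-based tag stripping is approximate and should not be treated as a sanitizer for untrusted input.

```javascript
// Hypothetical string-based helpers; the bodies are assumptions,
// not the library's actual implementation.

// Collapse runs of whitespace into a single space and trim both ends.
function normalizeSpaces(str) {
  return str.replace(/\s+/g, " ").trim();
}

// Remove anything that looks like an HTML tag (approximate).
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Length of the visible text once tags are stripped and spaces normalized.
function getTextLength(html) {
  return normalizeSpaces(stripTags(html)).length;
}

// Ratio of the text inside <a>...</a> to all text; 0 for empty input.
function calculateLinkDensity(html) {
  const total = getTextLength(html);
  if (total === 0) return 0;
  const links = html.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [];
  const linkLength = links.reduce((sum, a) => sum + getTextLength(a), 0);
  return linkLength / total;
}

const sample = '<p>Read <a href="/more">more</a> here</p>';
console.log(getTextLength(sample)); // 14
console.log(calculateLinkDensity(sample)); // 4 link characters out of 14
```

A block whose text is mostly links (navigation, related-article lists) gets a density close to 1, which is why this ratio is useful for separating article text from page chrome.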

Article Extraction Benchmark

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| html | any | |
| options? | { minContentLength: number; minScore: number; minTextLength: number; retryLength: number; } | |
| options.minContentLength? | number | default=140 - Minimum length of content to be considered valid |
| options.minScore? | number | default=20 - Minimum score for content to be considered valid |
| options.minTextLength? | number | default=25 - Minimum length of text to be considered valid |
| options.retryLength? | number | default=250 - Length to retry content extraction if initial attempt fails |
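
One common way to apply defaults like these is to spread caller-supplied options over a defaults object. The sketch below uses the documented default values. `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names, and whether the library merges options this way is an assumption.

```javascript
// Documented defaults; the constant name is hypothetical.
const DEFAULT_OPTIONS = Object.freeze({
  minContentLength: 140,
  minScore: 20,
  minTextLength: 25,
  retryLength: 250,
});

// Hypothetical helper: caller values override defaults key by key.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

const opts = resolveOptions({ minScore: 30 });
console.log(opts.minScore, opts.retryLength); // 30 250
```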

Returns

Element

Extracted HTML element of main content such as article body

Example

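// Fetch an article page and extract its main content.
// Top-level await requires an ES module or an enclosing async function.
const url = "https://www.nytimes.com/2024/08/28/business/telegram-ceo-pavel-durov-charged.html";
const html = await (await fetch(url)).text();
const articleContent = extractMainContentFromHTML(html);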

Author

ai-research-agent (2024), based on Mozilla Readability (2015)

Other

classWeight()

function classWeight(
  elem: Element,
  positiveRe: RegExp,
  negativeRe: RegExp): number;

Defined in: extractor/html-to-content/extract-content/extract-content-readability.js:249

Calculates the weight of an element based on its class and id attributes.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| elem | Element | The element to calculate weight for |
| positiveRe | RegExp | Regular expression for positive indicators |
| negativeRe | RegExp | Regular expression for negative indicators |

Returns

number

The calculated weight
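
A minimal sketch of how such a weight might be computed is shown below. The ±25 step mirrors the class-weight logic in Mozilla Readability. Whether this library uses the same value is an assumption, as are the sample regular expressions. Plain objects stand in for DOM elements so the snippet runs outside a browser.

```javascript
// Step size borrowed from Mozilla Readability; an assumption for this library.
const WEIGHT = 25;

function classWeight(elem, positiveRe, negativeRe) {
  let weight = 0;
  // Check the class attribute and the id attribute independently.
  for (const value of [elem.className, elem.id]) {
    if (typeof value !== "string" || value === "") continue;
    if (negativeRe.test(value)) weight -= WEIGHT;
    if (positiveRe.test(value)) weight += WEIGHT;
  }
  return weight;
}

// Illustrative patterns only; avoid the `g` flag, which makes test() stateful.
const positiveRe = /article|body|content/i;
const negativeRe = /hidden|footer|sidebar/i;

console.log(classWeight({ className: "article-body", id: "main" }, positiveRe, negativeRe)); // 25
console.log(classWeight({ className: "sidebar", id: "footer" }, positiveRe, negativeRe)); // -50
```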


getLinkDensity()

function getLinkDensity(elem: Element): number;

Defined in: extractor/html-to-content/extract-content/extract-content-readability.js:229

Calculates the link density of an element.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| elem | Element | The element to calculate link density for |

Returns

number

The link density (ratio of link text length to total text length)
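
For an `Element`, the ratio described above can be sketched with standard DOM properties (`textContent`, `getElementsByTagName`). The body is an assumption about how the calculation might be done, and the `fakeElem` object is a hypothetical stand-in so the example runs without a DOM.

```javascript
// Sketch only: link text length divided by total text length.
function getLinkDensity(elem) {
  const textLength = elem.textContent.length;
  if (textLength === 0) return 0; // avoid dividing by zero on empty nodes
  let linkLength = 0;
  for (const link of elem.getElementsByTagName("a")) {
    linkLength += link.textContent.length;
  }
  return linkLength / textLength;
}

// Hypothetical stand-in for a DOM element containing one link.
const fakeElem = {
  textContent: "Read more here",
  getElementsByTagName: (tag) => (tag === "a" ? [{ textContent: "more" }] : []),
};

console.log(getLinkDensity(fakeElem)); // 4 link characters out of 14
```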


sanitize()

function sanitize(
  node: Element,
  candidates: any,
  videoRe: RegExp,
  positiveRe: RegExp,
  negativeRe: RegExp,
  minTextLength: number): Element;

Defined in: extractor/html-to-content/extract-content/extract-content-readability.js:317

Sanitizes the content by removing unwanted elements and cleaning remaining elements.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| node | Element | The node to sanitize |
| candidates | any | Object containing scored candidates |
| videoRe | RegExp | Regular expression for video URLs |
| positiveRe | RegExp | Regular expression for positive indicators |
| negativeRe | RegExp | Regular expression for negative indicators |
| minTextLength | number | Minimum text length to consider |

Returns

Element

The sanitized node


scoreNode()

function scoreNode(
  elem: Element,
  positiveRe: RegExp,
  negativeRe: RegExp): any;

Defined in: extractor/html-to-content/extract-content/extract-content-readability.js:270

Scores a node based on its tag name and attributes.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| elem | Element | The element to score |
| positiveRe | RegExp | Regular expression for positive indicators |
| negativeRe | RegExp | Regular expression for negative indicators |

Returns

any

An object containing the score and the element
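
The sketch below illustrates one way tag-and-attribute scoring and top-candidate selection might fit together. The tag bonuses and the ±25 attribute weight follow Mozilla Readability and are assumptions for this library, as are the `score` and `elem` property names and the sample patterns. Plain objects stand in for DOM elements.

```javascript
// Tag bonuses modeled on Mozilla Readability; assumptions for this library.
const TAG_SCORES = {
  DIV: 5, PRE: 3, TD: 3, BLOCKQUOTE: 3,
  ADDRESS: -3, OL: -3, UL: -3, DL: -3, DD: -3, DT: -3, LI: -3, FORM: -3,
  H1: -5, H2: -5, H3: -5, H4: -5, H5: -5, H6: -5, TH: -5,
};

function scoreNode(elem, positiveRe, negativeRe) {
  // Start from the tag bonus, then adjust by class and id matches.
  let score = TAG_SCORES[elem.tagName.toUpperCase()] || 0;
  for (const value of [elem.className, elem.id]) {
    if (!value) continue;
    if (negativeRe.test(value)) score -= 25;
    if (positiveRe.test(value)) score += 25;
  }
  return { score, elem };
}

// Illustrative patterns and stand-in nodes.
const positiveRe = /article|body|content/i;
const negativeRe = /hidden|footer|sidebar/i;
const nodes = [
  { tagName: "DIV", className: "sidebar", id: "" },
  { tagName: "DIV", className: "article-content", id: "" },
  { tagName: "UL", className: "", id: "nav" },
];

// Score every candidate, sort descending, and take the best one.
const ranked = nodes
  .map((n) => scoreNode(n, positiveRe, negativeRe))
  .sort((a, b) => b.score - a.score);
const top = ranked[0];

console.log(top.score, top.elem.className); // 30 article-content
```

A caller could then compare `top.score` against a threshold such as the documented `minScore` default of 20 before accepting the candidate.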