extract-content-readability

Extract

extractMainContentFromHTML()

function extractMainContentFromHTML(html: any, options?: object): Element;

Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:60

HTML-to-Main-Content Extractor #1

The function extracts the main content of an HTML page in several steps: it defines regex patterns, cleans the HTML, scores nodes on content indicators such as paragraph tags and id/class names, selects the top-scoring candidate, extracts it, and cleans up the content around it.

Define regular expressions:
- Various regex patterns are defined to identify content and non-content areas.
Define helper functions:
- normalizeSpaces: Normalizes whitespace in a string.
- stripTags: Removes all HTML tags from a string.
- getTextLength: Calculates the length of text after stripping tags.
- calculateLinkDensity: Calculates the ratio of link text to total text.
Clean HTML:
- Remove unlikely candidates (e.g., ads, sidebars) from the HTML.
Define scoring function:
- scoreNode: Assigns a score to an HTML node based on content and attributes.
  - Increases score for positive indicators (e.g., article, body, content tags).
  - Decreases score for negative indicators (e.g., hidden, footer, sidebar tags).
  - Adds to score based on paragraph tags and text length.
Find and score candidate nodes:
- Identify potential content nodes in the cleaned HTML.
- Score each node using the scoreNode function.
Select top candidate:
- Sort candidates by score and select the highest-scoring node.
Extract content:
- Use regex to extract content around the top candidate node.
Clean up extracted content:
- Remove script and style tags and their contents.
- Process anchor tags based on content density.
- Keep only specific HTML tags (a, p, img, h1-h6, ul, ol, li).
- Remove excess whitespace from the final content.
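
The helper and cleanup steps above can be sketched on plain strings. This is a minimal sketch, not the package's source: the helper names come from the outline, but every regex and the `cleanExtracted` function are illustrative assumptions.

```javascript
// Hypothetical string-based versions of the helpers named in the outline.
// The regexes are assumptions for illustration, not the package's patterns.

// Collapse runs of whitespace into single spaces and trim both ends.
function normalizeSpaces(str) {
  return str.replace(/\s+/g, " ").trim();
}

// Remove anything that looks like an HTML tag.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Length of the visible text once tags and extra whitespace are gone.
function getTextLength(html) {
  return normalizeSpaces(stripTags(html)).length;
}

// Tags kept by the final cleanup step (a, p, img, h1-h6, ul, ol, li).
const ALLOWED_TAG = /^(a|p|img|h[1-6]|ul|ol|li)$/i;

// Hypothetical cleanup: drop script/style blocks with their contents,
// strip every tag that is not on the allow-list, then tidy whitespace.
function cleanExtracted(html) {
  const withoutScripts = html.replace(
    /<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi,
    ""
  );
  const filtered = withoutScripts.replace(
    /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
    (tag, name) => (ALLOWED_TAG.test(name) ? tag : "")
  );
  return normalizeSpaces(filtered);
}

const textLength = getTextLength("<p>Hello   <b>world</b></p>");
const cleaned = cleanExtracted(
  "<div><p>Hi <span>there</span></p><script>x()</script></div>"
);
```

Regex-based tag handling like this is fragile on malformed markup; it is shown only to make the order of the steps concrete.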

Article Extraction Benchmark

Parameters

Parameter	Type	Description
`html`	`any`	HTML of the page to extract from (the example below passes a string)
`options?`	{ `minContentLength`: `number`; `minScore`: `number`; `minTextLength`: `number`; `retryLength`: `number`; }	Optional extraction settings
`options.minContentLength?`	`number`	default=140 - Minimum length of content to be considered valid
`options.minScore?`	`number`	default=20 - Minimum score for content to be considered valid
`options.minTextLength?`	`number`	default=25 - Minimum length of text to be considered valid
`options.retryLength?`	`number`	default=250 - Length to retry content extraction if initial attempt fails
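
As a rough illustration of how the documented defaults could combine with caller-supplied values, the sketch below overlays a partial options object on the defaults from the table. `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names; the package may merge options differently.

```javascript
// Defaults copied from the parameter table above.
const DEFAULT_OPTIONS = {
  minContentLength: 140,
  minScore: 20,
  minTextLength: 25,
  retryLength: 250,
};

// Hypothetical helper: caller-supplied values override the defaults.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Only minScore is overridden; the other three keep their defaults.
const opts = resolveOptions({ minScore: 30 });
```

In practice the partial object would be passed straight to the extractor, for example `extractMainContentFromHTML(html, { minScore: 30 })`.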

Returns

Element

Extracted HTML element of main content such as article body

Example

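// Fetch the page HTML (top-level await requires an ES module or an async function).
const url = "https://www.nytimes.com/2024/08/28/business/telegram-ceo-pavel-durov-charged.html";
const html = await (await fetch(url)).text();
// Returns the Element holding the main article content.
const articleContent = extractMainContentFromHTML(html);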

Author

ai-research-agent (2024). Based on Mozilla Readability (2015).

Other

getLinkDensity()

function getLinkDensity(elem: Element): number;

Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:229

Calculates the link density of an element.

Parameters

Parameter	Type	Description
`elem`	`Element`	The element to calculate link density for

Returns

number

The link density (ratio of link text length to total text length)
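
The ratio described above can be sketched as follows. This is an assumption-laden illustration rather than the package's code: `getLinkDensitySketch` is a hypothetical name, and `fakeElem` is a minimal stand-in for a DOM `Element` so the snippet runs without a browser.

```javascript
// Sketch: link density = characters inside <a> descendants / all characters.
function getLinkDensitySketch(elem) {
  const totalLength = (elem.textContent || "").trim().length;
  if (totalLength === 0) return 0; // avoid dividing by zero on empty nodes
  let linkLength = 0;
  for (const link of Array.from(elem.getElementsByTagName("a"))) {
    linkLength += (link.textContent || "").trim().length;
  }
  return linkLength / totalLength;
}

// Stand-in element: 20 characters of text, 9 of them inside one link.
const fakeElem = {
  textContent: "Read more about cats",
  getElementsByTagName: (tag) =>
    tag === "a" ? [{ textContent: "Read more" }] : [],
};

const density = getLinkDensitySketch(fakeElem);
```

A value near 1 suggests a navigation block made mostly of links, while a value near 0 suggests body text.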

classWeight()

function classWeight(
   elem: Element, 
   positiveRe: RegExp, 
   negativeRe: RegExp): number;

Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:249

Calculates the weight of an element based on its class and id attributes.

Parameters

Parameter	Type	Description
`elem`	`Element`	The element to calculate weight for
`positiveRe`	`RegExp`	Regular expression for positive indicators
`negativeRe`	`RegExp`	Regular expression for negative indicators

Returns

number

The calculated weight
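
One way such a weight could be computed is shown below. The step of 25 per matching attribute follows Mozilla Readability's class-weight convention; whether this package uses the same value is an assumption, as are the example regexes and the name `classWeightSketch`.

```javascript
// Sketch: add or subtract a fixed amount for each of class and id,
// depending on which indicator regex matches. Pass regexes without the
// "g" flag, since RegExp.prototype.test is stateful on global regexes.
function classWeightSketch(elem, positiveRe, negativeRe) {
  let weight = 0;
  for (const value of [elem.className, elem.id]) {
    if (typeof value !== "string" || value === "") continue;
    if (negativeRe.test(value)) weight -= 25;
    if (positiveRe.test(value)) weight += 25;
  }
  return weight;
}

// Illustrative patterns built from the indicators named in the overview.
const positiveRe = /article|body|content/i;
const negativeRe = /hidden|footer|sidebar/i;

// Plain objects stand in for DOM elements here.
const articleWeight = classWeightSketch(
  { className: "article-content", id: "main" }, positiveRe, negativeRe
);
const sidebarWeight = classWeightSketch(
  { className: "sidebar", id: "footer-nav" }, positiveRe, negativeRe
);
```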

scoreNode()

function scoreNode(
   elem: Element, 
   positiveRe: RegExp, 
   negativeRe: RegExp): any;

Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:270

Scores a node based on its tag name and attributes.

Parameters

Parameter	Type	Description
`elem`	`Element`	The element to score
`positiveRe`	`RegExp`	Regular expression for positive indicators
`negativeRe`	`RegExp`	Regular expression for negative indicators

Returns

any

An object containing the score and the element
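
A possible shape for this scoring is sketched below. The per-tag values mirror those Mozilla Readability assigns when it initializes a node; the property names `score` and `elem`, the inline `attrWeight` helper, and the function name `scoreNodeSketch` are assumptions, not the package's confirmed API.

```javascript
// Compact class/id weight so the sketch is self-contained (assumed +/-25).
function attrWeight(elem, positiveRe, negativeRe) {
  let weight = 0;
  for (const value of [elem.className, elem.id]) {
    if (!value) continue;
    if (negativeRe.test(value)) weight -= 25;
    if (positiveRe.test(value)) weight += 25;
  }
  return weight;
}

// Sketch: base score from the tag name, then adjust by class/id weight.
function scoreNodeSketch(elem, positiveRe, negativeRe) {
  let score = 0;
  switch ((elem.tagName || "").toUpperCase()) {
    case "DIV":
      score += 5;
      break;
    case "PRE": case "TD": case "BLOCKQUOTE":
      score += 3;
      break;
    case "ADDRESS": case "OL": case "UL": case "DL":
    case "DD": case "DT": case "LI": case "FORM":
      score -= 3;
      break;
    case "H1": case "H2": case "H3": case "H4":
    case "H5": case "H6": case "TH":
      score -= 5;
      break;
  }
  score += attrWeight(elem, positiveRe, negativeRe);
  return { score, elem };
}

const pos = /article|body|content/i;
const neg = /hidden|footer|sidebar/i;
const divResult = scoreNodeSketch(
  { tagName: "DIV", className: "article-body", id: "" }, pos, neg
);
const headingResult = scoreNodeSketch(
  { tagName: "H2", className: "sidebar", id: "" }, pos, neg
);
```

Returning the element alongside its score lets a caller sort the candidates by score and still reach the winning node.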

sanitize()

function sanitize(
   node: Element, 
   candidates: any, 
   videoRe: RegExp, 
   positiveRe: RegExp, 
   negativeRe: RegExp, 
   minTextLength: number): Element;

Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:317

Sanitizes the content by removing unwanted elements and cleaning remaining elements.

Parameters

Parameter	Type	Description
`node`	`Element`	The node to sanitize
`candidates`	`any`	Object containing scored candidates
`videoRe`	`RegExp`	Regular expression for video URLs
`positiveRe`	`RegExp`	Regular expression for positive indicators
`negativeRe`	`RegExp`	Regular expression for negative indicators
`minTextLength`	`number`	Minimum text length to consider

Returns

Element

The sanitized node
