extract-content-readability
Extract
extractMainContentFromHTML()
function extractMainContentFromHTML(html: any, options?: object): Element;
Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:60
HTML-to-Main-Content Extractor #1
Extracts the main content of an HTML page in several steps: it defines regex patterns, removes unlikely candidates from the HTML, scores nodes on content indicators such as paragraph tags and id/class names, selects the top-scoring candidate, extracts it, and cleans up the extracted content.
- Define regular expressions:
  - Various regex patterns are defined to identify content and non-content areas.
- Define helper functions:
  - normalizeSpaces: Normalizes whitespace in a string.
  - stripTags: Removes all HTML tags from a string.
  - getTextLength: Calculates the length of text after stripping tags.
  - calculateLinkDensity: Calculates the ratio of link text to total text.
- Clean HTML:
  - Remove unlikely candidates (e.g., ads, sidebars) from the HTML.
- Define scoring function:
  - scoreNode: Assigns a score to an HTML node based on content and attributes.
  - Increases score for positive indicators (e.g., article, body, content tags).
  - Decreases score for negative indicators (e.g., hidden, footer, sidebar tags).
  - Adds to score based on paragraph tags and text length.
- Find and score candidate nodes:
  - Identify potential content nodes in the cleaned HTML.
  - Score each node using the scoreNode function.
- Select top candidate:
  - Sort candidates by score and select the highest-scoring node.
- Extract content:
  - Use regex to extract content around the top candidate node.
- Clean up extracted content:
  - Remove script and style tags and their contents.
  - Process anchor tags based on content density.
  - Keep only specific HTML tags (a, p, img, h1-h6, ul, ol, li).
  - Remove excess whitespace from the final content.
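The helper functions and the final clean-up step listed above can be sketched with plain string operations. This is a minimal illustration, not the module's actual source: the function bodies and regexes are assumptions, `cleanExtractedContent` and `ALLOWED_TAGS` are hypothetical names, and the density-based handling of anchor tags is left out.

```javascript
// Hypothetical sketches of the helpers named above; the real source may differ.

// Collapse runs of whitespace into single spaces and trim both ends.
function normalizeSpaces(str) {
  return str.replace(/\s+/g, " ").trim();
}

// Remove anything that looks like an HTML tag.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Length of the visible text once tags and extra whitespace are gone.
function getTextLength(html) {
  return normalizeSpaces(stripTags(html)).length;
}

// Ratio of text inside <a>...</a> to all text; 0 when there is no text.
function calculateLinkDensity(html) {
  const total = getTextLength(html);
  if (total === 0) return 0;
  const anchors = html.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [];
  const linkLength = anchors.reduce((sum, a) => sum + getTextLength(a), 0);
  return linkLength / total;
}

// Tags kept by the clean-up step (a, p, img, h1-h6, ul, ol, li).
const ALLOWED_TAGS = /^(a|p|img|h[1-6]|ul|ol|li)$/i;

// Hypothetical clean-up pass: drop script/style blocks with their contents,
// strip every tag that is not on the allow list, then normalize whitespace.
function cleanExtractedContent(html) {
  const withoutScripts = html.replace(
    /<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi,
    ""
  );
  const onlyAllowed = withoutScripts.replace(
    /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
    (tag, name) => (ALLOWED_TAGS.test(name) ? tag : "")
  );
  return normalizeSpaces(onlyAllowed);
}
```

Regex-based tag handling like this is fragile on malformed markup, so it is best read as a way to follow the steps rather than as a replacement for a real HTML parser.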
 
 
Parameters
| Parameter | Type | Description |
|---|---|---|
| html | any | ‐ |
| options? | object | Extraction options, listed in the rows below |
| | | default=140 - Minimum length of content to be considered valid |
| | | default=20 - Minimum score for content to be considered valid |
| | | default=25 - Minimum length of text to be considered valid |
| | | default=250 - Length to retry content extraction if initial attempt fails |
Returns
Element
Extracted HTML element of main content such as article body
Example
// Top-level await requires an ES module or an async function.
const url = "https://www.nytimes.com/2024/08/28/business/telegram-ceo-pavel-durov-charged.html";
const html = await (await fetch(url)).text();
const articleContent = extractMainContentFromHTML(html);
Author
ai-research-agent (2024) Based on Mozilla Readability (2015)
Other
getLinkDensity()
function getLinkDensity(elem: Element): number;
Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:229
Calculates the link density of an element.
Parameters
| Parameter | Type | Description |
|---|---|---|
| elem | Element | The element to calculate link density for |
Returns
number
The link density (ratio of link text length to total text length)
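As an illustration of the ratio described here, the following sketch computes link density for any object that exposes `textContent` and `getElementsByTagName`, as DOM elements do. The body is an assumption rather than the module's source, and `fakeElem` is a hypothetical stand-in so the example runs without a DOM.

```javascript
// Sketch only: divides the length of text inside <a> descendants
// by the length of all text in the element.
function getLinkDensity(elem) {
  const textLength = (elem.textContent || "").trim().length;
  if (textLength === 0) return 0; // avoid dividing by zero
  let linkLength = 0;
  for (const link of Array.from(elem.getElementsByTagName("a"))) {
    linkLength += (link.textContent || "").trim().length;
  }
  return linkLength / textLength;
}

// Hypothetical stand-in for a DOM element, used only for this demo.
const fakeElem = {
  textContent: "Read more about cats", // 20 characters in total
  getElementsByTagName: (tag) =>
    tag === "a" ? [{ textContent: "Read more" }] : [], // 9 link characters
};

const density = getLinkDensity(fakeElem); // 9 / 20 = 0.45
```

A value close to 1 suggests a navigation or link list rather than article text, which is why extractors in this family tend to penalize high-density nodes.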
classWeight()
function classWeight(
   elem: Element, 
   positiveRe: RegExp, 
   negativeRe: RegExp): number;
Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:249
Calculates the weight of an element based on its class and id attributes.
Parameters
| Parameter | Type | Description |
|---|---|---|
| elem | Element | The element to calculate weight for |
| positiveRe | RegExp | Regular expression for positive indicators |
| negativeRe | RegExp | Regular expression for negative indicators |
Returns
number
The calculated weight
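One plausible shape for this calculation is sketched below. The step of 25 points per match mirrors the class weighting used in Mozilla Readability and is an assumption here, as are the two example regexes; the module's own values may differ.

```javascript
// Sketch: add or subtract a fixed weight for each of class and id,
// depending on which indicator pattern the attribute value matches.
function classWeight(elem, positiveRe, negativeRe) {
  let weight = 0;
  for (const value of [elem.className, elem.id]) {
    if (typeof value !== "string" || value === "") continue;
    if (negativeRe.test(value)) weight -= 25; // assumed penalty
    if (positiveRe.test(value)) weight += 25; // assumed bonus
  }
  return weight;
}

// Hypothetical indicator patterns. They deliberately omit the "g" flag,
// because a global regex keeps state between test() calls.
const positiveRe = /article|body|content|entry|main|post|text/i;
const negativeRe = /comment|footer|sidebar|sponsor|hidden|widget/i;

const good = classWeight({ className: "post-content", id: "" }, positiveRe, negativeRe); // 25
const bad = classWeight({ className: "footer", id: "comments" }, positiveRe, negativeRe); // -50
const mixed = classWeight({ className: "article-body", id: "sidebar" }, positiveRe, negativeRe); // 0
```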
scoreNode()
function scoreNode(
   elem: Element, 
   positiveRe: RegExp, 
   negativeRe: RegExp): any;
Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:270
Scores a node based on its tag name and attributes.
Parameters
| Parameter | Type | Description |
|---|---|---|
| elem | Element | The element to score |
| positiveRe | RegExp | Regular expression for positive indicators |
| negativeRe | RegExp | Regular expression for negative indicators |
Returns
any
An object containing the score and the element
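To make the tag-and-attribute scoring concrete, here is a hedged sketch. The per-tag values loosely follow Mozilla Readability's node initialization, the `{ score, elem }` property names are a guess at the returned object's shape, and the compact `classWeight` is included only to keep the example self-contained. None of this is confirmed by the module's source.

```javascript
// Assumed base scores by tag name, loosely following Mozilla Readability.
const TAG_SCORES = {
  DIV: 5,
  PRE: 3, TD: 3, BLOCKQUOTE: 3,
  ADDRESS: -3, OL: -3, UL: -3, DL: -3, DD: -3, DT: -3, LI: -3, FORM: -3,
  H1: -5, H2: -5, H3: -5, H4: -5, H5: -5, H6: -5, TH: -5,
};

// Compact class/id weighting (assumed step of 25 per match).
function classWeight(elem, positiveRe, negativeRe) {
  let weight = 0;
  for (const value of [elem.className, elem.id]) {
    if (!value) continue;
    if (negativeRe.test(value)) weight -= 25;
    if (positiveRe.test(value)) weight += 25;
  }
  return weight;
}

// Sketch: base score from the tag name plus the class/id weight.
function scoreNode(elem, positiveRe, negativeRe) {
  const base = TAG_SCORES[elem.tagName.toUpperCase()] || 0;
  return { score: base + classWeight(elem, positiveRe, negativeRe), elem };
}

// Hypothetical element-like objects and indicator patterns for the demo.
const positiveRe = /article|body|content|main/i;
const negativeRe = /footer|sidebar|hidden/i;
const candidates = [
  scoreNode({ tagName: "ul", className: "sidebar", id: "" }, positiveRe, negativeRe), // -3 - 25 = -28
  scoreNode({ tagName: "DIV", className: "article", id: "" }, positiveRe, negativeRe), // 5 + 25 = 30
];

// Selecting the top candidate: sort a copy by descending score.
const topCandidate = [...candidates].sort((a, b) => b.score - a.score)[0];
```

Returning the element alongside its score lets the caller sort candidates and still reach the winning node without a second lookup.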
sanitize()
function sanitize(
   node: Element, 
   candidates: any, 
   videoRe: RegExp, 
   positiveRe: RegExp, 
   negativeRe: RegExp, 
   minTextLength: number): Element;
Defined in: packages/ai-research-agent/src/extractor/html-to-content/extract-content/extract-content-readability.js:317
Sanitizes the content by removing unwanted elements and cleaning remaining elements.
Parameters
| Parameter | Type | Description |
|---|---|---|
| node | Element | The node to sanitize |
| candidates | any | Object containing scored candidates |
| videoRe | RegExp | Regular expression for video URLs |
| positiveRe | RegExp | Regular expression for positive indicators |
| negativeRe | RegExp | Regular expression for negative indicators |
| minTextLength | number | Minimum text length to consider |
Returns
Element
The sanitized node