extract-content-readability
Extract
extractMainContentFromHTML()
function extractMainContentFromHTML(html, options?): Element
HTML-to-Main-Content Extractor #1
The function extracts the main content from an HTML string using regex patterns: it cleans the HTML, scores nodes on content indicators such as paragraph count and id/class names, selects the top-scoring candidate, extracts it, and cleans up the content around it.
- Define regular expressions:
  - Various regex patterns are defined to identify content and non-content areas.
- Define helper functions:
  - normalizeSpaces: Normalizes whitespace in a string.
  - stripTags: Removes all HTML tags from a string.
  - getTextLength: Calculates the length of text after stripping tags.
  - calculateLinkDensity: Calculates the ratio of link text to total text.
- Clean HTML:
  - Remove unlikely candidates (e.g., ads, sidebars) from the HTML.
- Define scoring function:
  - scoreNode: Assigns a score to an HTML node based on content and attributes.
    - Increases score for positive indicators (e.g., article, body, content tags).
    - Decreases score for negative indicators (e.g., hidden, footer, sidebar tags).
    - Adds to score based on paragraph tags and text length.
- Find and score candidate nodes:
  - Identify potential content nodes in the cleaned HTML.
  - Score each node using the scoreNode function.
- Select top candidate:
  - Sort candidates by score and select the highest-scoring node.
- Extract content:
  - Use regex to extract content around the top candidate node.
- Clean up extracted content:
  - Remove script and style tags and their contents.
  - Process anchor tags based on content density.
  - Keep only specific HTML tags (a, p, img, h1-h6, ul, ol, li).
  - Remove excess whitespace from the final content.
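
The steps above can be sketched in a few dozen lines of JavaScript. This is a minimal illustration, not the library's actual source: the helper names come from the list above, but every regex, the score weights, and the functions `pickTopCandidate` and `cleanContent` are assumptions made for the example. The non-greedy block pattern also assumes candidate blocks are not nested inside blocks of the same tag name.

```javascript
// Illustrative sketch only: patterns and weights are assumptions,
// not the values used by extractMainContentFromHTML.
const POSITIVE = /article|body|content|entry|main|post|text/i;
const NEGATIVE = /hidden|footer|sidebar|comment|banner|promo|nav/i;

// Collapse runs of whitespace into single spaces and trim the ends.
const normalizeSpaces = (str) => str.replace(/\s+/g, " ").trim();

// Remove anything that looks like an HTML tag.
const stripTags = (html) => html.replace(/<[^>]*>/g, "");

// Length of the visible text once tags and extra whitespace are gone.
const getTextLength = (html) => normalizeSpaces(stripTags(html)).length;

// Ratio of text inside <a> elements to all text (0 when there is no text).
function calculateLinkDensity(html) {
  const total = getTextLength(html);
  if (total === 0) return 0;
  const links = html.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [];
  const linkLength = links.reduce((sum, a) => sum + getTextLength(a), 0);
  return linkLength / total;
}

// Score one candidate from its opening-tag attributes and inner HTML.
function scoreNode(attributes, innerHTML) {
  let score = 0;
  if (POSITIVE.test(attributes)) score += 25;
  if (NEGATIVE.test(attributes)) score -= 25;
  const paragraphs = innerHTML.match(/<p\b/gi) || [];
  score += paragraphs.length * 3;
  score += Math.min(Math.floor(getTextLength(innerHTML) / 100), 3);
  // Scale the score down for link-heavy blocks such as menus.
  return score * (1 - calculateLinkDensity(innerHTML));
}

// Find <div>/<article>/<section> blocks, score each, return the best one.
function pickTopCandidate(html) {
  const blockPattern = /<(div|article|section)\b([^>]*)>([\s\S]*?)<\/\1>/gi;
  const candidates = [];
  for (const [, , attributes, innerHTML] of html.matchAll(blockPattern)) {
    candidates.push({ innerHTML, score: scoreNode(attributes, innerHTML) });
  }
  candidates.sort((a, b) => b.score - a.score);
  return candidates[0] || null;
}

// Drop scripts and styles, keep only whitelisted tags, tidy whitespace.
function cleanContent(html) {
  const noScripts = html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "");
  const allowed = /^(a|p|img|h[1-6]|ul|ol|li)$/i;
  const kept = noScripts.replace(/<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
    (tag, name) => (allowed.test(name) ? tag : ""));
  return normalizeSpaces(kept);
}
```

In this sketch a block whose class names match the positive pattern and that contains several paragraphs outranks a link-only sidebar, which is the behavior the scoring step describes. A regex-only approach stays dependency-free but cannot handle arbitrarily nested markup the way a DOM parser would.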
Parameters
| Parameter | Type | Description |
|---|---|---|
| `html` | `string` | ‐ |
| `options?` | `object` | |
| | `number` | default=140 - Minimum length of content to be considered valid |
| | `number` | default=20 - Minimum score for content to be considered valid |
| | `number` | default=25 - Minimum length of text to be considered valid |
| | `number` | default=250 - Length to retry content extraction if initial attempt fails |
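
One way such defaults might be merged with caller-supplied options is sketched below. The option names are not shown in the table above, so `minContentLength`, `minScore`, `minTextLength`, and `retryLength` are hypothetical placeholders, as are the `resolveOptions` and `isValidCandidate` helpers and the `score`/`textLength` fields. Only the default values (140, 20, 25, 250) come from this page.

```javascript
// Hypothetical option names: the documented table lists only defaults and
// descriptions, so these identifiers are placeholders for illustration.
const DEFAULT_OPTIONS = Object.freeze({
  minContentLength: 140, // minimum length of content to be considered valid
  minScore: 20,          // minimum score for content to be considered valid
  minTextLength: 25,     // minimum length of text to be considered valid
  retryLength: 250,      // length used when retrying a failed extraction
});

// Caller-supplied values override the defaults; missing keys fall back.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Assumed validity check: a candidate must meet both thresholds.
function isValidCandidate(candidate, options) {
  const { minContentLength, minScore } = resolveOptions(options);
  return candidate.score >= minScore && candidate.textLength >= minContentLength;
}
```

With this shape, passing `{ minScore: 30 }` would tighten only the score threshold and leave the other three defaults untouched.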
Returns
Element
Extracted HTML element of main content such as article body
Example
const url = "https://www.nytimes.com/2024/08/28/business/telegram-ceo-pavel-durov-charged.html";
const html = await (await fetch(url)).text();
const articleContent = extractMainContentFromHTML(html);
Author
ai-research-agent (2024) Based on Mozilla Readability (2015), Arc90 (2010)