Skip to main content

extract-content-mercury

ai-research-agent / extractor/html-to-content/extract-content/extract-content-mercury

Extract

extractMainContentFromHTML2()

function extractMainContentFromHTML2(html, opts?): string

HTML-to-Main-Content Extractor #2

  1. The algorithm starts by loading the HTML content using linkedom, a lightweight DOM parser for Node.js.
  2. It then applies a series of cleaning and scoring techniques to identify the main content of the page, starting with stripping unlikely candidates (e.g., elements with class names like "comment" or "sidebar").
  3. The HTML is converted into a series of paragraph elements, which are then scored based on various factors such as text length, number of commas, and the presence of certain class names or IDs.
  4. The algorithm assigns scores to parent and grandparent elements based on the scores of their children, with parents receiving the full score and grandparents receiving half.
  5. After scoring, the algorithm finds the top candidate element by selecting the node with the highest score.
  6. The top candidate's siblings are then examined to see if they should be included in the main content, based on their scores and other factors like link density.
  7. The algorithm then cleans the selected content by removing unnecessary tags, attributes, and empty elements.
  8. It also handles special cases like cleaning up header tags, images, and other potentially irrelevant content.
  9. Throughout the process, the algorithm uses various regular expressions and scoring heuristics to identify positive and negative indicators of content relevance.
  10. Finally, the cleaned and extracted content is returned as an HTML string, representing the main body of the article or webpage.

Article Extraction Benchmark

Parameters

ParameterTypeDescription

html

string

The HTML content to extract from.

opts?

{ cleanConditionally: boolean; stripUnlikelyCandidates: boolean; weightNodes: boolean; }

The options for content extraction.

opts.cleanConditionally?

boolean

default=true - Clean the node to remove superfluous content like forms, ads, etc. Initially, pass in the most restrictive options which will return the highest quality content. On each failure, retry with slightly more lax options.

opts.stripUnlikelyCandidates?

boolean

default=true - Remove elements that match non-article- like criteria first (e.g., elements with a classname of "comment").

opts.weightNodes?

boolean

default=true - Modify an element's score based on certain classNames or IDs (e.g., subtract if a node has a className of 'comment', add if a node has an ID of 'entry-content').

Returns

string

The extracted content as an HTML string, or null if extraction fails.

Author

ai-research-agent (2024) Based on Postlight Mercury Parser (2017-)

Example

var url =  "https://en.wikipedia.org/wiki/David_Hilbert"
var html = await (await fetch(url)).text();
var content = extractMainContentFromHTML(html);
console.log(content); // HTML content of main article body