html-to-content

Functions

extractContentAndCite()

function extractContentAndCite(documentOrHTML, options): Object

Extracts the main content and citation information from a document or HTML string

Parameters

Parameter | Type | Description

documentOrHTML

string | object

The document or HTML string to extract content from

options

{ formatting: boolean; images: boolean; links: boolean; url: string; useExtractor2: boolean; }

Optional configuration options

options.formatting

boolean

default=true - Whether to preserve formatting in the extracted content

options.images

boolean

default=true - Whether to include images in the extracted content

options.links

boolean

default=true - Whether to include links in the extracted content

options.url

string

The URL of the original document, if available, used as the base for resolving relative URLs into absolute ones

options.useExtractor2

boolean

default=false - Selects which extractor runs first: false uses Mozilla Readability, true uses Postlight Mercury. If the first extractor returns fewer than 200 characters, the other extractor is used as a fallback
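The extractor selection and 200-character fallback described for options.useExtractor2 can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the `Extractor` type, the `extractMain` function, and the stand-in extractors are hypothetical names. The behavior when the alternate extractor also returns a short result is an assumption (here, the alternate's output is returned as-is).

```typescript
// Hypothetical extractor signature: takes an HTML string, returns extracted content.
type Extractor = (html: string) => string;

// Threshold taken from the parameter description: fewer than 200 characters
// from the first extractor triggers the alternate.
const MIN_CONTENT_LENGTH = 200;

function extractMain(
  html: string,
  readability: Extractor,
  mercury: Extractor,
  useExtractor2: boolean = false
): string {
  // false -> Readability first, true -> Mercury first.
  const [primary, alternate] = useExtractor2
    ? [mercury, readability]
    : [readability, mercury];

  const first = primary(html);
  if (first.length >= MIN_CONTENT_LENGTH) {
    return first;
  }
  // Assumption: the alternate's result is returned even if it is also short.
  return alternate(html);
}

// Stand-in extractors for demonstration only (not the real Readability or Mercury).
const shortExtractor: Extractor = () => "too short";
const longExtractor: Extractor = (html) => html.repeat(50);

// The first extractor yields 9 characters, so the alternate is used.
const demo = extractMain("<p>hello</p>", shortExtractor, longExtractor);
console.log(demo.length); // 600
```

Passing the extractors in as arguments keeps the ordering rule separate from any particular parsing library, which makes the rule easy to test in isolation.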

Returns

Object

The extracted content and citation information
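To make the documented defaults and the role of options.url concrete, here is a small sketch. It is not taken from the library's source: `ExtractOptions`, `ResolvedOptions`, `withDefaults`, and `toAbsolute` are hypothetical names, and treating every option as optional is an assumption based on the "Optional configuration options" description. URL resolution uses the standard `URL` constructor.

```typescript
// Option shape as listed in the parameter table; optionality is an assumption.
interface ExtractOptions {
  formatting?: boolean;
  images?: boolean;
  links?: boolean;
  url?: string;
  useExtractor2?: boolean;
}

interface ResolvedOptions {
  formatting: boolean;
  images: boolean;
  links: boolean;
  url: string | undefined;
  useExtractor2: boolean;
}

// Applies the defaults stated in the table: formatting, images, links = true;
// useExtractor2 = false. No default is documented for url.
function withDefaults(options: ExtractOptions = {}): ResolvedOptions {
  return {
    formatting: options.formatting ?? true,
    images: options.images ?? true,
    links: options.links ?? true,
    url: options.url,
    useExtractor2: options.useExtractor2 ?? false,
  };
}

// Resolves a possibly relative href against the document URL.
// Returns the href unchanged when no base is given or parsing fails.
function toAbsolute(href: string, baseUrl?: string): string {
  if (!baseUrl) {
    return href;
  }
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return href;
  }
}

const resolved = withDefaults({ images: false, url: "https://example.com/posts/1" });
console.log(resolved.formatting, resolved.images); // true false
console.log(toAbsolute("/img/a.png", resolved.url)); // https://example.com/img/a.png
```

Without a base URL, relative links and image sources in the extracted HTML cannot be resolved, which is why supplying options.url is useful when the input is a raw HTML string.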

Author

ai-research-agent (2024)

Interfaces

ExtractedContent

Properties

author
author: string;

The author's name

author_cite
author_cite: string;

The full citation for the author

author_short
author_short: string;

A shortened version of the author's name

date
date: string;

The publication date

html
html: string;

The extracted main content in HTML format

source
source: string;

The source of the content

title
title: string;

The title of the content
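The properties above can be written out as a TypeScript interface, together with a small helper that assembles a one-line reference from them. The interface mirrors the documented fields. The `formatReference` helper, its output layout, and the sample values (including the formats of `author_cite`, `author_short`, and `date`) are hypothetical and only illustrate how the fields might be consumed.

```typescript
// Fields and types as documented for ExtractedContent.
interface ExtractedContent {
  author: string;
  author_cite: string;
  author_short: string;
  date: string;
  html: string;
  source: string;
  title: string;
}

// Hypothetical helper: builds "Author (date). Title. Source",
// skipping any part that is empty.
function formatReference(content: ExtractedContent): string {
  const authorPart = content.author_cite || content.author;
  const datePart = content.date ? ` (${content.date})` : "";
  const head = `${authorPart}${datePart}`.trim();
  return [head, content.title, content.source]
    .filter((part) => part.length > 0)
    .join(". ");
}

// Sample values are made up for illustration.
const sample: ExtractedContent = {
  author: "Jane Doe",
  author_cite: "Doe, Jane",
  author_short: "Doe",
  date: "2024-05-01",
  html: "<p>Example body</p>",
  source: "Example News",
  title: "Example Title",
};

console.log(formatReference(sample));
// Doe, Jane (2024-05-01). Example Title. Example News
```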