Splits document text into semantic chunks based on various textual and structural elements.
This function tokenizes the input text with regular expressions, recognizing a wide range of
semantic elements and structural patterns commonly found in documents. It identifies and
separates the following elements:
Headings (Setext-style, Markdown, and HTML-style)
Citations (e.g., [1])
List items (bulleted, numbered, lettered, or task lists, including nested up to three levels)
Block quotes (including nested quotes and citations, up to three levels)
Code blocks (fenced, indented, or HTML pre/code tags)
Tables (Markdown, grid tables, and HTML tables)
Horizontal rules (Markdown and HTML hr tag)
Standalone lines or phrases (including single-line blocks and HTML elements)
Sentences or phrases ending with punctuation (including ellipsis and Unicode punctuation)
Quoted text, parenthetical phrases, or bracketed content
Paragraphs
HTML-like tags and their content (including self-closing tags and attributes)
LaTeX-style math expressions (inline and block)
Any remaining content (fallback)
The function applies a length constraint to each element type to keep chunk sizes reasonable.
It also handles nested structures and special cases such as code blocks and math expressions.
const text = "# Heading\n\nThis is a paragraph.\n\n- List item 1\n- List item 2\n\n";
const chunks = splitTextSemanticChars(text);
console.log(chunks);
// Output: ['# Heading', 'This is a paragraph.', '- List item 1', '- List item 2']
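To make the description above more concrete, here is a minimal sketch of how a regex-based splitter of this kind might combine ordered alternatives with per-element length caps. It is not the actual pattern used by splitTextSemanticChars: the names sketchPattern and sketchSplit, the three MAX_* limits, and the choice to cover only headings, bulleted list items, sentences, and a fallback are all assumptions made for illustration.

```typescript
// Hypothetical sketch, NOT the library's real pattern. It covers only a few
// of the element types listed above, and the caps are assumed values.
const MAX_HEADING = 100;   // assumed cap, in characters
const MAX_LIST_ITEM = 200; // assumed cap, in characters
const MAX_SENTENCE = 300;  // assumed cap, in characters

// Alternatives are tried left to right at each position, so the more
// specific structures (headings, list items) win over the generic ones.
const sketchPattern = new RegExp(
  [
    `^#{1,6} [^\\n]{1,${MAX_HEADING}}`,    // Markdown heading
    `^[-*+] [^\\n]{1,${MAX_LIST_ITEM}}`,   // bulleted list item
    `[^\\n.!?]{1,${MAX_SENTENCE}}[.!?]`,   // sentence ending in punctuation
    `[^\\n]{1,${MAX_SENTENCE}}`,           // fallback: remainder of the line
  ].join("|"),
  "gm", // g: collect every match; m: let ^ match at the start of each line
);

// Returns the matched chunks in document order, trimmed, without empties.
function sketchSplit(text: string): string[] {
  return (text.match(sketchPattern) ?? [])
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}
```

Under these assumptions, sketchSplit applied to the example text above yields the same four chunks. An element longer than its cap is cut at the cap, and the remainder is picked up by a later alternative: a heading of "# " followed by 150 letters comes back as one 102-character chunk and one 50-character chunk. Bounded quantifiers such as {1,300} are what keep any single match from growing without limit.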
Sentence RAG Benchmarks