text-to-chunks
splitTextSemanticChars()
function splitTextSemanticChars(text, options?): string[]
Split Text by Semantic Characters
Splits document text into semantic chunks based on various textual and structural elements like HTML, markdown, and paragraphs.
This function performs a comprehensive tokenization of the input text, considering a wide range of semantic elements and structural patterns commonly found in documents. It uses regular expressions to identify and separate the following elements:
- Headings (Setext-style, Markdown, and HTML-style)
- Citations (e.g., [1])
- List items (bulleted, numbered, lettered, or task lists, including nested up to three levels)
- Block quotes (including nested quotes and citations, up to three levels)
- Code blocks (fenced, indented, or HTML pre/code tags)
- Tables (Markdown, grid tables, and HTML tables)
- Horizontal rules (Markdown and HTML hr tag)
- Standalone lines or phrases (including single-line blocks and HTML elements)
- Sentences or phrases ending with punctuation (including ellipsis and Unicode punctuation)
- Quoted text, parenthetical phrases, or bracketed content
- Paragraphs
- HTML-like tags and their content (including self-closing tags and attributes)
- LaTeX-style math expressions (inline and block)
- Any remaining content (fallback)
The function applies various length constraints to each type of element to ensure reasonable chunk sizes. It also handles nested structures and special cases like code blocks and math expressions.
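The general technique described above, one regular expression whose alternatives are ordered by priority and capped in length, can be illustrated with a much smaller pattern. This is only a sketch: `SKETCH_PATTERN`, `sketchSplit`, and the 200/300-character limits are hypothetical and are not taken from the library, which covers far more element types.

```javascript
// Hypothetical, heavily simplified sketch: NOT the library's actual pattern.
// Alternatives are tried left to right at each position, so earlier ones win.
const SKETCH_PATTERN = new RegExp(
  [
    "^#{1,6} [^\\n]{1,200}$",                   // Markdown heading (assumed 200-char cap)
    "^[ \\t]*(?:[-*+]|\\d+\\.) [^\\n]{1,200}$", // bulleted or numbered list item
    "[^\\n]{1,300}?[.!?](?=\\s|$)",             // sentence ending in punctuation
    "[^\\n]{1,300}",                            // fallback: rest of the line, capped
  ].join("|"),
  "gm"
);

function sketchSplit(text) {
  // With the g flag, match() returns every match, or null when there is none.
  return (text.match(SKETCH_PATTERN) || [])
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}

console.log(sketchSplit("# Heading\n\nFirst sentence. Second one!\n\n- List item 1\n"));
// [ '# Heading', 'First sentence.', 'Second one!', '- List item 1' ]
```

Because the heading and list alternatives come first, a line that starts with `#` or `-` is kept whole instead of being cut at its first period; the bounded quantifiers are what keep any single chunk from growing without limit.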
Parameters
Parameter | Type | Description |
---|---|---|
`text` | `string` | The input text to be split into semantic chunks. |
`options?` |  | Optional configuration options (currently unused). |
Returns
string[]
An array of text chunks, each representing a semantic unit of the document.
Example
const text = "# Heading\n\nThis is a paragraph.\n\n- List item 1\n- List item 2\n\n";
const chunks = splitTextSemanticChars(text);
console.log(chunks);
// Output: ['# Heading', 'This is a paragraph.', '- List item 1', '- List item 2']
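Since the result is a plain `string[]`, it can be post-processed with ordinary array code. The sketch below greedily merges adjacent chunks up to a character budget, which may be useful when the semantic units are smaller than needed downstream. `packChunks`, its parameters, and the 30-character budget are hypothetical and not part of the library; the input array is copied from the output shown above.

```javascript
// Hypothetical helper (not part of the library): greedily merge adjacent
// chunks until adding the next one would exceed maxChars.
function packChunks(chunks, maxChars, separator = "\n") {
  const packed = [];
  let current = "";
  for (const chunk of chunks) {
    const candidate = current ? current + separator + chunk : chunk;
    if (current && candidate.length > maxChars) {
      // The next chunk does not fit: close the current group and start a new one.
      packed.push(current);
      current = chunk;
    } else {
      current = candidate;
    }
  }
  if (current) packed.push(current);
  return packed;
}

// Chunks as listed in the documented example output.
const exampleChunks = ["# Heading", "This is a paragraph.", "- List item 1", "- List item 2"];
console.log(packChunks(exampleChunks, 30));
// [ '# Heading\nThis is a paragraph.', '- List item 1\n- List item 2' ]
```

A single chunk longer than `maxChars` is passed through unchanged in this sketch; it is never split further.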