Skip to main content

text-to-chunks

ai-research-agent / tokenize/text-to-chunks

Topics

splitTextSemanticChars()

function splitTextSemanticChars(text, options?): string[]

Split Text by Semantic Characters

Splits document text into semantic chunks based on various textual and structural elements like HTML, markdown, and paragraphs.

This function performs a comprehensive tokenization of the input text, considering a wide range of semantic elements and structural patterns commonly found in documents. It uses regular expressions to identify and separate the following elements:

  1. Headings (Setext-style, Markdown, and HTML-style)
  2. Citations (e.g., [1])
  3. List items (bulleted, numbered, lettered, or task lists, including nested up to three levels)
  4. Block quotes (including nested quotes and citations, up to three levels)
  5. Code blocks (fenced, indented, or HTML pre/code tags)
  6. Tables (Markdown, grid tables, and HTML tables)
  7. Horizontal rules (Markdown and HTML hr tag)
  8. Standalone lines or phrases (including single-line blocks and HTML elements)
  9. Sentences or phrases ending with punctuation (including ellipsis and Unicode punctuation)
  10. Quoted text, parenthetical phrases, or bracketed content
  11. Paragraphs
  12. HTML-like tags and their content (including self-closing tags and attributes)
  13. LaTeX-style math expressions (inline and block)
  14. Any remaining content (fallback)

The function applies various length constraints to each type of element to ensure reasonable chunk sizes. It also handles nested structures and special cases like code blocks and math expressions.

Sentence RAG Benchmarks

Parameters

ParameterTypeDescription

text

string

The input text to be split into semantic chunks.

options?

Object

Optional configuration options (currently unused).

Returns

string[]

An array of text chunks, each representing a semantic unit of the document.

Author

Jina AI (2024)

Example

const text = "# Heading\n\nThis is a paragraph.\n\n- List item 1\n- List item 2\n\n";
const chunks = splitTextSemanticChars(text);
console.log(chunks);
// Output: ['# Heading', 'This is a paragraph.', '- List item 1', '- List item 2']