# compile-topic-model
## Other
### weightWikiWordSpecificity()

`function weightWikiWordSpecificity(query): number`
Find domain-specific unique words for a single doc with the BM25 formula,
using Wikipedia term frequencies as the common-words corpus.

All words in English Wikipedia are sorted by the number of pages they appear
in, yielding 325K words found on at least 32 wiki pages. Each word is 3 to 23
characters of Latin alphanumerics (a-z, 0-9), punctuation (. -), or diacritics
(é, ï), with numbers and foreign-language terms filtered out.
- Total terms (frequency >= 32): 324,896
- File size (JSON, frequency >= 32): 4 MB
- Total articles (Wiki-en-2020): 5,989,879
Galkin, M., & Malykh, V. (2020). Wikipedia TF-IDF Dataset Release (v1.0). Zenodo. https://doi.org/10.5281/zenodo.3631674. Code: https://github.com/SmartDataAnalytics/Wikipedia_TF_IDF_Dataset
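A minimal sketch of the idea, assuming the dataset is loaded as a map from word to Wikipedia page frequency and that a plain IDF term weight is used; the library's exact BM25 variant and tokenization may differ:

```ts
// Hypothetical helper, not the library's actual implementation.
const TOTAL_WIKI_ARTICLES = 5_989_879; // Wiki-en-2020, from the dataset stats
const MIN_PAGE_FREQUENCY = 32;         // rarest terms kept in the 325K-word list

function weightWikiWordSpecificitySketch(
  query: string,
  wikiPageFrequency: Map<string, number>
): number {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;

  let total = 0;
  for (const word of words) {
    // Unknown words fall back to the minimum frequency, i.e. maximally specific.
    const df = wikiPageFrequency.get(word) ?? MIN_PAGE_FREQUENCY;
    // Plain IDF: ln(N / df) ranges from ~0 for ubiquitous words like "the"
    // up to ln(5,989,879 / 32) ≈ 12.1, matching the documented 0-12 range.
    total += Math.log(TOTAL_WIKI_ARTICLES / df);
  }
  return total / words.length; // average specificity across the phrase
}
```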
#### Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `query` | `string` | phrase to search wiki-idf for each word |
#### Returns

`number`

score for term specificity, roughly 0 to 12
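A usage sketch; the import path is assumed from the module path, and the scores shown are illustrative, not actual output:

```ts
// Assumed import path, based on the module path (illustrative).
import { weightWikiWordSpecificity } from "ai-research-agent/datasets/compile-topic-model";

// Illustrative values only; real output depends on the bundled frequency data.
weightWikiWordSpecificity("mitochondrial phosphorylation"); // high: rare, domain-specific words
weightWikiWordSpecificity("people time year");              // low: common across Wikipedia
```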
## Topics
### compileTopicModel()

`function compileTopicModel(options?): Promise<void>`
Compile a topic-phrases model from a dictionary and Wikipedia page titles.

Search and outline a research base using Wikipedia's 100K popular pages as the
core topic-phrases graph for LLM Research Agents. Most documents online (and,
by extension, thinking in the collective consciousness) revolve around core
topic phrases linked as a graph. If all the available docs are nodes, the
links in the graph can be Wiki page entities extracted from each doc and
mappings of dictionary phrases to their wiki pages. These can serve as topic
labels, keywords, and suggestions for LLM follow-up questions. Documents can
be linked in a graph (sketched after this list) with:

1. wiki page entity recognition
2. frequent keyphrases
3. HTML links and research paper references
4. keyphrases to query in global web search
5. site-specific recommendations

These can lay the foundation for LLM Research Agents to fully grok, summarize,
and outline a research base.
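One possible shape for such a graph, sketched in TypeScript; the type and field names here (`DocNode`, `TopicEdge`, and so on) are illustrative assumptions, not the library's actual types:

```ts
// Hypothetical graph types for documents linked by topic phrases.
interface DocNode {
  id: string;
  url: string;
  title: string;
}

type LinkKind =
  | "wiki-entity"          // 1. wiki page entity recognition
  | "keyphrase"            // 2. frequent keyphrases
  | "html-link"            // 3. HTML links / research paper references
  | "web-search-query"     // 4. keyphrases to query in global web search
  | "site-recommendation"; // 5. site-specific recommendations

interface TopicEdge {
  from: string;  // DocNode.id
  to: string;    // DocNode.id or a Wikipedia page title
  kind: LinkKind;
  label: string; // topic label, keyword, or follow-up question seed
}
```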
240K total words & phrases; the first 117K entries are single words or phrase
first-words, so every token can be checked against them. 100K Wikipedia page
titles and links cover Wikipedia's most popular pages, each with a
domain-specificity score and which letters should be capitalized.
84K words and 67K phrases come from the Open English WordNet dictionary
lexicon, an updated and better-maintained version of WordNet: multiple
definitions per term, 120K definitions, and 45 concept categories.
JSON prefix trie, arranged by sorting words and phrases for lookup by first
word: tokenize the text by word, then check whether each token starts a phrase
based on the trie entries, to extract phrases from a text. Prefix trie lookups
are O(1) per token (instead of looping through the index for each lookup),
which makes it the best-suited data structure for this task; see the sketch
below.
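A minimal sketch of the first-word lookup described above, assuming a trie shape of first word to phrase continuations; the model's actual JSON layout and the library's tokenizer may differ:

```ts
// Hypothetical trie shape: first word -> list of phrase continuations.
// An empty continuation means the first word is a complete entry by itself.
type PhraseTrie = Record<string, string[][]>;

const trie: PhraseTrie = {
  neural: [[], ["network"], ["machine", "translation"]],
  topic: [["model"]],
};

function extractPhrases(text: string, trie: PhraseTrie): string[] {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const found: string[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const continuations = trie[tokens[i]]; // O(1) hash lookup per token
    if (!continuations) continue;
    for (const rest of continuations) {
      // A phrase matches if the following tokens complete a known entry.
      if (rest.every((word, j) => tokens[i + 1 + j] === word)) {
        found.push([tokens[i], ...rest].join(" "));
      }
    }
  }
  return found;
}

// extractPhrases("Neural machine translation is a topic model example", trie)
// -> ["neural", "neural machine translation", "topic model"]
```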
#### Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `options`? | `object` | |
| | `boolean` | include line breaks in JSON output for debugging |
| | `boolean` | true to add wiki page titles, false for dictionary only |
| | `number` | max synonyms per term |
| | `number` | min length of term to include |
| | `boolean` | sort the first words by a first-two-letters trie, needed for autocomplete after 2 letters typed |
#### Returns

`Promise<void>`
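A usage sketch; the import path is assumed from the module path and may differ:

```ts
// Assumed import path, based on the module path datasets/compile-topic-model.
import { compileTopicModel } from "ai-research-agent/datasets/compile-topic-model";

// Called with no options, so the defaults apply (options? is optional
// in the signature); the compiled model is written as a side effect.
await compileTopicModel();
```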