term-frequency-import
Functions
importTermFrequency()
function importTermFrequency(): object
Downloads, decompresses, parses, and processes the
Wikipedia term frequency dataset compiled by SmartDataAnalytics
in 2020, which contains term frequencies for English Wikipedia articles.
Words are sorted by the number of pages they appear in. The result keeps about
325K words that each appear in at least 32 Wikipedia pages and are 3 to 23 characters
long, made up of Latin alphanumerics like a-z and 0-9, punctuation like . and -, and
diacritics like é and ï. Numbers and foreign-language terms are filtered out.
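The filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not the function's actual implementation: the names `keepTerm`, `filterTerms`, `TERM_PATTERN`, and `NUMERIC_ONLY` are hypothetical, the exact character ranges are an assumption, and the input is assumed to be an array of `[term, pageCount]` pairs.

```javascript
// Hypothetical sketch of the filter; names and character ranges are assumptions.
const MIN_PAGES = 32;

// 3 to 23 characters: ASCII letters and digits, '.', '-', and accented
// Latin letters (Latin-1 Supplement and Latin Extended-A ranges).
const TERM_PATTERN = /^[a-z0-9.\-\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F]{3,23}$/i;

// Terms made only of digits and punctuation are treated as numbers.
const NUMERIC_ONLY = /^[0-9.\-]+$/;

function keepTerm(term, pageCount) {
  return (
    pageCount >= MIN_PAGES &&
    TERM_PATTERN.test(term) &&
    !NUMERIC_ONLY.test(term)
  );
}

// entries: array of [term, pageCount] pairs (assumed shape).
// Returns the kept entries sorted by page count, highest first.
function filterTerms(entries) {
  return entries
    .filter(([term, pageCount]) => keepTerm(term, pageCount))
    .sort((a, b) => b[1] - a[1]);
}
```

A character-class check like this rejects non-Latin scripts, but it cannot tell a foreign word written in Latin letters from an English one, so the real pipeline may use additional rules.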
Total Terms (frequency>=32): 324,896
Filesize (JSON, frequency>=32): 4 MB
Total Articles (Wiki-en-2020): 5,989,879
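As an illustration of how these figures might be used, the total article count and a term's page count are enough to compute a classic inverse document frequency, ln(N / df). The sketch below is an assumption about downstream use, not something this function is documented to do; `inverseDocumentFrequency` and `TOTAL_ARTICLES` are hypothetical names.

```javascript
// Total Articles (Wiki-en-2020), taken from the statistics above.
const TOTAL_ARTICLES = 5989879;

// Hypothetical helper: classic IDF = ln(totalArticles / pageCount).
// pageCount is the number of articles that contain the term.
function inverseDocumentFrequency(pageCount, totalArticles = TOTAL_ARTICLES) {
  if (!(pageCount > 0) || pageCount > totalArticles) {
    throw new RangeError("pageCount must be between 1 and totalArticles");
  }
  return Math.log(totalArticles / pageCount);
}
```

Under this formula, a term at the 32-page cutoff gets the highest score in the dataset, and a term that appears in every article would score 0.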
Returns
object
Author
Galkin, M., Malykh, V. (2020). Wikipedia TF-IDF Dataset Release (v1.0). Zenodo. https://doi.org/10.5281/zenodo.3631674 https://github.com/SmartDataAnalytics/Wikipedia_TF_IDF_Dataset