NCHLT Siswati RoBERTa language model
Title | NCHLT Siswati RoBERTa language model |
Description | Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences for Siswati text. |
Contact name | Roald Eiselen |
Contact email | Roald.Eiselen@nwu.ac.za |
Publisher(s) | North-West University; Centre for Text Technology (CTexT) |
License | Creative Commons Attribution 4.0 International (CC-BY 4.0) |
Language(s) | Siswati |
Author(s) | Roald Eiselen |
Contributor | Rico Koen; Albertus Kruger; Jacques van Heerden |
URI | https://hdl.handle.net/20.500.12185/614 |
Media type | Text |
Type | Modules |
Media category | Language model |
Format extent | Training data: Paragraphs: 299,112; Token count: 4,436,576; Vocab size: 30,000; Embedding dimensions: 768; |
Format size | 236.02 MB (zipped) |
Project | NCHLT Text IV |
Software requirements | Python |
Source | Web; Government Documents |
ISO639 code | ss |
Submit date | 2023-07-26T15:11:27Z; 2023-05-01 |
Date available | 2023-07-26T15:11:27Z; 2023-05-01 |
Date created | 2023-05-01 |
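The Description field names two uses for the model: masked-token prediction and text embeddings. The sketch below illustrates both under assumptions that this record does not confirm: that the unzipped archive is a directory in Hugging Face `transformers` format, and that the `torch` and `transformers` packages are installed. The names `model_dir`, `mean_pool`, `embed_text` and `fill_mask` are hypothetical and chosen for illustration only. Mean pooling is one common way to reduce per-token vectors to a single vector; the record does not say which method the authors intend.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of the tokens whose mask value is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_text(text: str, model_dir: str) -> List[float]:
    """Return one vector for `text`, using the model stored in `model_dir`.

    ASSUMPTION: `model_dir` is the unzipped download and is loadable by
    Hugging Face transformers. The record itself only lists "Python".
    """
    # Imported lazily so that mean_pool works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Shape of hidden: (number of tokens, hidden size)
        hidden = model(**inputs).last_hidden_state[0]
    return mean_pool(hidden.tolist(), inputs["attention_mask"][0].tolist())


def fill_mask(text: str, model_dir: str, top_k: int = 5):
    """Return the top_k predictions for the single mask token in `text`.

    ASSUMPTION: same directory layout as in embed_text.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_dir, tokenizer=model_dir)
    return unmasker(text, top_k=top_k)
```

If those assumptions hold, `embed_text` should return a vector whose length matches the embedding dimension listed under Format extent, and `fill_mask` expects the input text to contain the tokenizer's mask token exactly once (the token string can be read from the loaded tokenizer's `mask_token` attribute).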
This item appears in the following Collection(s)
- Resource Catalogue [335]
A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.