NCHLT Siswati Text Corpora

Title	NCHLT Siswati Text Corpora
Description	Collection of source text documents, genre-classified text documents, raw corpus, clean corpus, lexicon, frequency list and named-entity lists developed during the NCHLT Text project.
Contact name	Martin Puttkammer
Contact email	Martin.Puttkammer@nwu.ac.za
Publisher(s)	North-West University; Centre for Text Technology (CTexT)
License	Creative Commons Attribution 2.5 South Africa License: http://creativecommons.org/licenses/by/2.5/za/legalcode
Language(s)	Siswati
Author(s)	Martin Puttkammer; Martin Schlemmer; Wikus Pienaar; Ruan Bekker
Citation	Eiselen, E.R. & Puttkammer, M.J. 2014. Developing text resources for ten South African languages. (In Proceedings of the 9th International Conference on Language Resources and Evaluation, Reykjavik, Iceland. p. 3698-3703)
URI	https://hdl.handle.net/20.500.12185/348
ISLRN	093-210-851-959-9
Media type	Text
Type	Data
Media category	Monolingual text corpora: Unannotated
Format extent	8.80 MB
Version	1
Format medium	Text; UTF-8
Project	NCHLT Text
Source	Based on documents from the South African government domain crawled from gov.za websites and collected from various language units.
Stratum	Details provided in documentation.
Primary collection	Resource Catalogue
Secondary collection	Resource Index
ISO639 code	ssw
Submit date	2018-02-05T20:25:51Z; 2018-03-05T17:47:17Z
Date available	2018-02-05T20:25:51Z; 2018-03-05T17:47:17Z
Date created	2014-05-30

Resource Catalogue [335]
A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
Resource Index [386]
A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.