Autshumato English-Xitsonga Parallel Corpora

Title	Autshumato English-Xitsonga Parallel Corpora
Description	Aligned English-Xitsonga parallel corpus. The data is provided as two separate UTF-8 text files, with each segment on a new line.
Contact name	Sunny Gent
Contact email	sunny.gent@nwu.ac.za
Publisher(s)	North-West University; Centre for Text Technology (CTexT)
License	Creative Commons Attribution 2.5 South Africa License: http://creativecommons.org/licenses/by/2.5/za/legalcode
Language(s)	English; Xitsonga
Author(s)	Wikus Pienaar; Wildrich Fourie; Cindy McKellar
Citation	McKellar, C.A. 2014. An English-Xitsonga SMT system for the government domain. (In: Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa).
URI	https://hdl.handle.net/20.500.12185/406
Media type	Text
Type	Data
Media category	Multilingual text corpora: Aligned
Format extent	14.44 MB
Version	1
Format size	450,000 bilingual segments. 3,461,089 English words (excluding punctuation and numbers). 4,328,407 Xitsonga words (excluding punctuation and numbers).
Format medium	Text; UTF-8
Project	Autshumato
Source	Based on documents from the South African government domain crawled from gov.za websites and collected from various language units.
Stratum	Details provided in documentation.
Database	Multilingual Text Corpora: Aligned
Primary collection	Resource Catalogue
Secondary collection	Resource Index
ISO639 code	eng; tso
Submit date	2018-02-05T20:20:44Z; 2018-03-05T17:49:39Z
Date available	2018-02-05T20:20:44Z; 2018-03-05T17:49:39Z
Date created	2014-12-11
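
The description above states that the corpus ships as two UTF-8 text files with one segment per line, so segment N in the English file corresponds to segment N in the Xitsonga file. A minimal sketch of reading the two files as aligned pairs follows. The function name `read_parallel` and any file names passed to it are hypothetical; the record does not state the actual file names in the download.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) segment pairs from two line-aligned UTF-8 files.

    Assumes line N of src_path is the translation of line N of tgt_path,
    as the corpus description suggests. The paths are supplied by the caller.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which lets us detect
        # a length mismatch instead of silently dropping segments.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(
                    f"files differ in length at line {line_no}"
                )
            yield s.rstrip("\n"), t.rstrip("\n")
```

Calling `list(read_parallel(english_file, xitsonga_file))` with the two downloaded files should give one pair per bilingual segment. Raising an error on a length mismatch is a design choice made here: a missing line would otherwise shift every later pair out of alignment.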

Resource Catalogue [335]
A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
Resource Index [386]
A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.