Autshumato English-Setswana Parallel Corpora

Cindy McKellar

Title	Autshumato English-Setswana Parallel Corpora
Description	Aligned English-Setswana parallel corpus. The set contains data translated by professional translators, data sourced as translated file pairs from translators, and data obtained from government websites and documents. The data is provided as six separate UTF-8 text files, with each aligned sentence pair on a new line.
Contact name	Sunny Gent
Contact email	sunny.gent@nwu.ac.za
Publisher(s)	North-West University; Centre for Text Technology (CTexT)
License	Creative Commons Attribution 2.5 South Africa License: http://creativecommons.org/licenses/by/2.5/za/legalcode
Language(s)	English; Setswana
Author(s)	Cindy McKellar
Contributor	Roald Eiselen; Wikus Pienaar
URI	https://hdl.handle.net/20.500.12185/404
ISLRN	379-219-829-093-2
Media type	Text
Type	Data
Media category	Multilingual text corpora: Aligned
Format extent	9.02 Mb (zipped)
Version	1
Format size	159 000 bilingual segments; 2 037 173 English words (excluding punctuation and numbers); 2 596 023 Setswana words (excluding punctuation and numbers).
Format medium	Text; UTF-8
Project	Autshumato
Source	Based on documents from the South African government domain crawled from gov.za websites and collected from various language units.
Stratum	Details provided in documentation.
Database	Multilingual Text Corpora: Aligned
Primary collection	Resource Catalogue
Secondary collection	Resource Index
ISO639 code	eng; tsn
Submit date	2018-02-05T20:22:42Z; 2018-03-05T17:49:36Z
Date available	2018-02-05T20:22:42Z; 2018-03-05T17:49:36Z
Date created	2016-10-28
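
The description says the corpus is line-aligned, but this record does not state how the six files are organised. The following is a minimal sketch that assumes each language sits in its own UTF-8 file and that line i of the English file corresponds to line i of the Setswana file. The function name `read_aligned_pairs` and both path parameters are hypothetical; check the documentation shipped with the archive for the actual file names and layout.

```python
from itertools import zip_longest


def read_aligned_pairs(english_path, setswana_path):
    """Yield (english, setswana) sentence pairs from two line-aligned files.

    Assumption (not confirmed by the record): one sentence per line, with
    matching line numbers across the two files.
    """
    with open(english_path, encoding="utf-8") as en_file, \
         open(setswana_path, encoding="utf-8") as tn_file:
        pairs = zip_longest(en_file, tn_file)
        for line_no, (en_line, tn_line) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, which signals
            # that the two files are not the same length.
            if en_line is None or tn_line is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield en_line.rstrip("\n"), tn_line.rstrip("\n")
```

Raising on a length mismatch, rather than silently truncating, keeps a misaligned pair of files from producing shifted sentence pairs.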

Files in this item

Name: autshumato_english-setswana_pa ...
Size: 9.020Mb
Format: application/zip
MD5: 1a6ff8721900cd903ad55391873454d0
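
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. This is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_record` are illustrative, and the local path of the downloaded zip is whatever you saved it as, since the full file name is truncated in this record.

```python
import hashlib

# MD5 digest as listed in this item record.
EXPECTED_MD5 = "1a6ff8721900cd903ad55391873454d0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.strip().lower()
```

Calling `matches_record` with the path of the downloaded zip returns `True` only when the local file has the digest given in the record.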

This item appears in the following Collection(s)

Resource Catalogue [335]
A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
Resource Index [386]
A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.
