Afrikaans text unit identification data

Puttkammer, Martin

Title	Afrikaans text unit identification data
Description	This dataset was developed during a master's degree and used in the development of a text unit identifier capable of tagging sentences, named entities, words, abbreviations and punctuation in Afrikaans text. The dataset consists of 39,762 tokens, containing 3,294 named entities in 1,581 sentences. The data was manually annotated by the author and verified by an independent linguist according to the tagset developed during the same study. Details on the annotation and tagset used are available in the publication mentioned above in (2). The data is also presented in CoNLL-2002 format (Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Available at: https://www.aclweb.org/anthology/W02-2024).
Contact name	Martin Puttkammer
Contact email	martin.puttkammer@nwu.ac.za
Publisher(s)	Centre for Text Technology, North-West University
License	Creative Commons Attribution 4.0 International: https://creativecommons.org/licenses/by/4.0/
Language(s)	Afrikaans
Author(s)	Puttkammer, Martin
Subject	Afrikaans, Tokenisation, Sentence recognition, Named-entity recognition, sentence, named-entity, word, token
Citation	Puttkammer, M.J. 2006. Outomatiese Afrikaanse tekseenheididentifisering. Potchefstroom: North-West University. (Dissertation - MA).
URI	https://hdl.handle.net/20.500.12185/507
Media type	Text
Media category	Monolingual text corpus: annotated
Format extent	39,762 tokens
Version	1.0
Format size	195,324 bytes (zipped)
Format medium	N/A
Primary collection	Resource Catalogue
Secondary collection	Resource Index
ISO639 code	afr
Submit date	2019-04-15T14:05:43Z
Date available	2019-04-15T14:05:43Z
Date created	2006
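
The Description notes that the data is also presented in CoNLL-2002 format. As a rough illustration only, the sketch below reads text laid out in the general CoNLL style: one token per line, whitespace-separated columns with the tag in the last column, and a blank line between sentences. The function name read_conll, the sample sentences and the tags B-PER, B-LOC and O are hypothetical; the actual column layout and tagset of the files in this item are documented in the dissertation and may differ.

```python
def read_conll(lines):
    """Group CoNLL-style token lines into sentences.

    Assumes one token per line, whitespace-separated columns with the
    tag in the last column, and a blank line between sentences.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        # The final sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; the tags are placeholders, not the dataset's tagset.
sample = [
    "Martin B-PER",
    "woon O",
    "in O",
    "Potchefstroom B-LOC",
    ". O",
    "",
    "Dit O",
    "reën O",
    ". O",
]

sentences = read_conll(sample)
token_count = sum(len(sentence) for sentence in sentences)
print(len(sentences), "sentences,", token_count, "tokens")
```

Run over the full dataset, counts of this kind could be compared with the 39,762 tokens and 1,581 sentences stated in the Description, provided the files follow the assumed layout.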

Files in this item

Name:: TEI_v1.0.rar
Size:: 190.7Kb
Format:: application/rar
MD5:: 1d6f160b603b7a87fe390cda5f8908ed
Description:: All files zipped
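
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. A minimal sketch using Python's standard hashlib module follows; the local path TEI_v1.0.rar and the helper names md5_of_file and matches_listing are assumptions made for illustration.

```python
import hashlib

# MD5 value copied from the file listing above.
EXPECTED_MD5 = "1d6f160b603b7a87fe390cda5f8908ed"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import os

    archive = "TEI_v1.0.rar"  # assumed local path of the downloaded file
    if os.path.exists(archive):
        print("checksum OK" if matches_listing(archive) else "checksum mismatch")
```

MD5 is suitable here only as a check against accidental corruption during download, not as a security guarantee.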

This item appears in the following Collection(s)

Resource Catalogue [335]
A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
Resource Index [386]
A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.
