Afrikaans text unit identification data
License agreement
By downloading this resource I accept and agree to the terms of use and the associated license conditions under which the resource is distributed.
Download
MD5: 1d6f160b603b7a87fe390cda5f8908ed
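The published checksum can be used to confirm that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the idea of passing in a local file path are assumptions for illustration, since this page does not state the name of the downloaded archive.

```python
import hashlib

# MD5 value published on this page.
EXPECTED_MD5 = "1d6f160b603b7a87fe390cda5f8908ed"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives do not have to
    fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file's MD5 digest equals the published value."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `matches_published_checksum` with the path of the downloaded file returns `False` if the file was truncated or altered in transit.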
Author(s)
Puttkammer, Martin
Metadata
Description
This dataset was developed during a master's degree and used in the development of a text unit identifier capable of tagging sentences, named entities, words, abbreviations and punctuation in Afrikaans text.
The dataset consists of 39,762 tokens in 1,581 sentences, including 3,294 named entities. The data was manually annotated by the author and verified by an independent linguist according to the tagset developed during the same study. Details on the annotation and tagset used are available in the publication mentioned above in (2). The data is also presented in CoNLL-2002 format (Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Available at: https://www.aclweb.org/anthology/W02-2024).
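As a rough illustration of how CoNLL-2002-style data is typically read, the sketch below assumes one token per line, whitespace-separated columns with the token first and the tag last, and a blank line between sentences. The function name `read_conll`, the sample sentences and the tags `B-PER`, `B-LOC` and `O` are hypothetical; the actual tagset of this dataset is the one defined in the author's study and may differ.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last
    column; blank lines mark sentence boundaries.
    """
    sentences = []
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; the tags are illustrative, not the dataset's tagset.
sample = """Martin B-PER
woon O
in O
Potchefstroom B-LOC
. O

Hy O
werk O
. O
"""

sentences = read_conll(sample.splitlines())
token_count = sum(len(sentence) for sentence in sentences)
entity_count = sum(
    1 for sentence in sentences for _, tag in sentence if tag.startswith("B-")
)
```

Counting sentences, tokens and entity-initial tags in this way gives a quick consistency check against the totals reported above, provided the file really follows the assumed layout and uses `B-` prefixes for entity starts.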
Contact person
Martin Puttkammer
Contact person's e-mail address
martin.puttkammer@nwu.ac.za
Publisher(s)
Centre for Text Technology, North-West University