Corpus of multilingual code-switched soap opera speech
The corpus comprises 26.9 hours of annotated multilingual speech containing examples of code-switching in isiZulu, isiXhosa, Setswana, Sesotho and English. The speech was obtained from South African soap operas. Code-switching between English and one of the Bantu languages is by far the most prevalent in the data. Switches between the Bantu languages themselves also occur, although they are not very common. An initial attempt to align the audio extracted from soap opera episodes with the corresponding scripts revealed that the actors very often ad-lib. The speech, and the code-switching it contains, can therefore be considered spontaneous. The full archive containing all data is available at https://www.dropbox.com/s/dkv65o8sb3rcfv7/5lang.zip?dl=0
Thomas Niesler
trn@sun.ac.za
Stellenbosch University
Research only.
English; isiXhosa; isiZulu; Setswana; Sesotho
van der Westhuizen, Ewald; Niesler, Thomas
code-switching, spontaneous speech, South African languages, isiZulu, isiXhosa, Setswana, Sesotho
E. van der Westhuizen and T.R. Niesler, "A first South African corpus of multilingual code-switched soap opera speech," in Proc. LREC, 2018, pp. 2854–2859; A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen and T.R. Niesler, "Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages," in Proc. LREC, 2020, pp. 3468–3474.
https://hdl.handle.net/20.500.12185/545
Speech
Annotated multilingual speech corpus
26.9 hours of annotated multilingual code-switched soap opera speech
1.0
4.25 GB
N/A
"A multilingual corpus of code-switched South African speech", carried out on behalf of the Department of Arts and Culture of the Government of South Africa
2021-08-20T14:54:15Z
2021-08-20T14:54:15Z
2020-02-28


This item appears in the following Collection(s)

  • Resource Catalogue [335]
    A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
