NCHLT Optical Character Recognition for South African Languages

Martin Puttkammer; Justin Hocking; Roald Eiselen

Download

nchlt_optical_character_recognition.zip (103.8 MB)

MD5: 024ec2bec53429e3aa221b4a916cd7e5
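
The checksum above can be used to confirm that the archive downloaded intact. The following is a minimal Python sketch, assuming the archive has been saved locally under its published file name. The helper names `md5_of_file` and `verify_download` are illustrative and are not part of the resource.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "024ec2bec53429e3aa221b4a916cd7e5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("nchlt_optical_character_recognition.zip")` should return `True` for an intact copy, assuming the file sits in the current working directory. A `False` result suggests an incomplete or corrupted download.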

URI

https://hdl.handle.net/20.500.12185/322

Collections

Resource Catalogue [338]
Resource Index [387]

Author(s)

Martin Puttkammer

Justin Hocking

Roald Eiselen

Description

An OCR (optical character recognition) system is an application that converts scanned paper documents into editable, searchable text. The engine analyses the structure of the document image and divides the page into elements such as blocks of text, tables and images. Within these blocks it identifies character image patterns and, for each one, advances several hypotheses about which character it may represent. These hypotheses are combined into alternative character-, word- and line-level readings, each with an associated probability. The set of hypotheses is then searched for the most likely combination of characters, words and lines, which yields a textual representation of the image.
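
The final search step can be illustrated with a small Python sketch. It is not the NCHLT engine's implementation. It is a generic beam search over per-character hypotheses, and it assumes each character position comes with a list of candidate characters and non-zero probabilities. The function name `best_transcription`, the optional `bigram_logp` scoring hook (a stand-in for a language model) and the toy probabilities are all hypothetical.

```python
import math


def best_transcription(hypotheses, beam_width=3, bigram_logp=None):
    """Beam search over per-position character hypotheses.

    hypotheses: list of positions, each a list of (char, probability) pairs
                with probability > 0.
    bigram_logp: optional function (prev_char, char) -> log-probability
                 adjustment, standing in for a language model.
    Returns the best (text, cumulative log-probability) pair.
    """
    beams = [("", 0.0)]  # (text so far, cumulative log-probability)
    for candidates in hypotheses:
        expanded = []
        for text, score in beams:
            for char, prob in candidates:
                new_score = score + math.log(prob)
                if bigram_logp is not None and text:
                    new_score += bigram_logp(text[-1], char)
                expanded.append((text + char, new_score))
        # Keep only the highest-scoring partial transcriptions.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


# Toy hypotheses for a three-character image region (made-up values).
toy = [
    [("c", 0.6), ("e", 0.4)],
    [("a", 0.7), ("o", 0.3)],
    [("t", 0.9), ("l", 0.1)],
]
text, logp = best_transcription(toy)
```

With the toy values, the search keeps the three best partial readings at each position and ends by selecting the reading whose combined probability is highest. A real engine would apply the same idea at word and line level and would typically weigh in lexicon or language-model evidence, which the `bigram_logp` hook only gestures at.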

Contact person

Martin Puttkammer

Contact person's e-mail address

Martin.Puttkammer@nwu.ac.za

Publisher(s)

North-West University

Centre for Text Technology (CTexT)

License

Creative Commons Attribution 3.0 Unported License (CC BY 3.0)

Verification status

Level 0