CS345, Machine Learning
Prof. Alvarez
Text Recognition
Handwritten text recognition is an interesting and challenging domain
to which machine learning techniques can be applied. Recognizing
free-form text is quite difficult; good performance requires the
consideration of syntactical information as well as visual data.
When only visual data is considered, character recognition
(classification of individual character images) is a good test case.
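As a minimal sketch of what classifying an individual character image means, the example below labels a query image with the label of its closest stored example (a nearest-neighbor rule, chosen here only for brevity). The 3x3 binary "images", the labels, and the function names are all hypothetical toy values, not data from any real database.

```python
def squared_distance(a, b):
    """Sum of squared pixel differences between two equal-length images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_label(query, examples):
    """Return the label of the stored example closest to the query image."""
    best_pixels, best_label = min(
        examples, key=lambda example: squared_distance(query, example[0])
    )
    return best_label


# Toy 3x3 images flattened row by row (1 = ink, 0 = background).
examples = [
    ([0, 1, 0,
      0, 1, 0,
      0, 1, 0], "1"),
    ([1, 1, 1,
      0, 0, 1,
      0, 0, 1], "7"),
]

# A "1" with one extra ink pixel in the bottom-right corner.
query = [0, 1, 0,
         0, 1, 0,
         0, 1, 1]

predicted = nearest_neighbor_label(query, examples)
print(predicted)
```

Real character images are much larger and far more variable, so a pixel-by-pixel comparison like this is only a starting point.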
NIST Handwritten Digits Database
NIST has compiled a database of handwritten numerals (0-9).
Some sample images that I extracted from the NIST database
appear below.
The image files are originally in a binary format; extracting them
from the database is not difficult but requires care regarding details
of machine representation such as big-endian vs. little-endian byte order.
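The byte-order issue can be illustrated with a short sketch. It assumes the IDX layout used for the MNIST image files: a 16-byte header of four big-endian unsigned 32-bit integers (magic number 2051, image count, rows, columns), followed by one unsigned byte per pixel. The function name parse_idx_images and the tiny in-memory buffer are hypothetical, for illustration only; other NIST distributions may use a different layout.

```python
import struct


def parse_idx_images(data):
    """Parse an IDX image buffer (assumed layout) into (rows, cols, images)."""
    # '>' forces big-endian decoding regardless of the host machine;
    # 'IIII' reads four unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 0x00000803:  # 2051, the assumed magic number for image files
        raise ValueError("unexpected magic number: %d" % magic)
    size = rows * cols
    images = []
    for i in range(count):
        start = 16 + i * size
        images.append(data[start:start + size])  # one byte per pixel
    return rows, cols, images


# Build a tiny synthetic buffer: two 2x2 images (not real NIST data).
header = struct.pack(">IIII", 0x00000803, 2, 2, 2)
pixels = bytes([0, 255, 255, 0, 10, 20, 30, 40])
rows, cols, images = parse_idx_images(header + pixels)

# Decoding the count field with the wrong byte order gives a nonsense value.
wrong_count = struct.unpack("<I", header[4:8])[0]
print(len(images), wrong_count)
```

Stating the byte order explicitly in the format string, rather than relying on the machine's native order, keeps the extraction portable across architectures.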
Notice that
although the NIST images are scaled and centered in the image frame,
there are significant variations among the different images
for a given class (digit).
For example, the digit '7' appears in both crossed and uncrossed variants,
and the vertical slant varies quite a bit among instances of the digit '1'.
Such variations make automated recognition a non-trivial task.
The results of extensive experimentation with the NIST dataset,
obtained via a variety of machine learning techniques, appear
in the following paper. We will discuss some of these techniques
in this course.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
"Gradient-based learning applied to document recognition."
Proceedings of the IEEE,
86(11):2278-2324, November 1998.
A subset of the NIST dataset, known as MNIST, with 60,000 training instances,
downloads of the above paper, and error rates for several
classifiers are all available at:
http://yann.lecun.com/exdb/mnist/.