Machine Learning Approaches to Text Segmentation


Department of Computer Science and Engineering, Shiraz University


Two machine learning approaches to text segmentation are introduced. The first is based on inductive learning in the form of a decision tree, and the second uses the Naive Bayes technique. Training data for both the decision tree and the Naive Bayes Classifier (NBC) are generated from a wide variety of compound text-image documents, which include both machine-printed and handwritten text in different fonts and sizes. Eighteen Discrete Cosine Transform (DCT) coefficients are used as the main features for distinguishing text from images. The trained decision tree and the NBC are tested on unseen documents, and both give very promising results, although the latter method is more accurate and computationally faster. Finally, the results of the proposed approaches are compared with those of a wavelet-based approach, and both methods presented in this paper are shown to be more effective.
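
The pipeline summarized above (DCT coefficients as features, then a Naive Bayes decision between text and image) can be illustrated with a minimal sketch. Everything specific in it is an assumption rather than the method of this paper: the 8 x 8 block size, the choice of the first 18 coefficients in zigzag order, the Gaussian form of the Naive Bayes model, the variance floor, and the synthetic stroke and ramp blocks that stand in for real training data.

```python
import math

N = 8  # block size; an assumption, the abstract does not state it


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block (list of lists of numbers)."""
    def alpha(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / N)

    def basis(i, k):
        return math.cos((2 * i + 1) * k * math.pi / (2 * N))

    return [[alpha(u) * alpha(v) * sum(block[x][y] * basis(x, u) * basis(y, v)
                                       for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def zigzag(n):
    """(row, col) index pairs of an n x n grid in JPEG-style zigzag order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def dct_features(block, k=18):
    """First k DCT coefficients in zigzag order (the selection rule is assumed)."""
    coeffs = dct2(block)
    return [coeffs[r][c] for r, c in zigzag(N)[:k]]


def fit_gaussian_nb(X, y, var_floor=1.0):
    """Per-class log prior, feature means, and floored feature variances."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + var_floor
                     for col, m in zip(zip(*rows), means)]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model


def predict(model, x):
    """Label with the highest Gaussian Naive Bayes log posterior."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=log_posterior)


# Synthetic stand-ins for real training blocks (purely illustrative).
def stroke_block(lit_columns):
    """Text-like block: full-contrast vertical strokes."""
    return [[255 if y in lit_columns else 0 for y in range(N)] for _ in range(N)]


def ramp_block(offset):
    """Image-like block: a smooth diagonal intensity ramp."""
    return [[offset + 4 * (x + y) for y in range(N)] for x in range(N)]


text_blocks = [stroke_block({0, 2, 4, 6}), stroke_block({1, 3, 5, 7}),
               stroke_block({0, 1, 4, 5})]
image_blocks = [ramp_block(10), ramp_block(60), ramp_block(120)]
X = [dct_features(b) for b in text_blocks + image_blocks]
y = ["text"] * 3 + ["image"] * 3
model = fit_gaussian_nb(X, y)

print(predict(model, dct_features(stroke_block({0, 1, 2, 3}))))  # text
print(predict(model, dct_features(ramp_block(190))))             # image
```

The toy blocks are separated almost trivially, so the sketch shows only the mechanics of feature extraction and classification; it says nothing about the accuracy obtained on real compound documents.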