Machine Learning Approaches to Text Segmentation

Author

M.M. Haji

Department of Computer Science and Engineering, Shiraz University

Abstract

Two machine learning approaches to text segmentation are introduced. The first is based on inductive learning in the form of a decision tree, and the second uses the Naive Bayes technique. Training data for both the decision tree and the Naive Bayes Classifier (NBC) are generated from a wide variety of compound text-image documents, which include both machine-printed and handwritten text in different fonts and sizes. Eighteen Discrete Cosine Transform (DCT) coefficients are used as the main features for distinguishing text from images. The trained decision tree and the NBC are tested on unseen documents and give very promising results, although the latter method is more accurate and computationally faster. Finally, the results of the proposed approaches are compared with those of a wavelet-based approach, and both methods presented in this paper are shown to be more effective.
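
As a rough illustration of the idea summarized in the abstract, the sketch below extracts 18 low-frequency DCT coefficients from an image block and classifies the block with a Gaussian Naive Bayes model. It is not the author's implementation. The 8 x 8 block size, the zigzag selection of coefficients, the use of coefficient magnitudes, the Gaussian likelihood, the variance floor, and the synthetic `stripe_block` and `smooth_block` helpers are all assumptions made for this example; the abstract specifies none of them.

```python
import math

N = 8            # block size: an assumption, not stated in the abstract
NUM_COEFFS = 18  # number of DCT coefficients kept as features
VAR_FLOOR = 1.0  # assumed floor that keeps variances away from zero


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / N)

    def basis(i, k):
        return math.cos((2 * i + 1) * k * math.pi / (2 * N))

    return [[alpha(u) * alpha(v) * sum(block[x][y] * basis(x, u) * basis(y, v)
                                       for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def low_frequency_order():
    """Zigzag order over (row, col) cells, lowest spatial frequencies first."""
    cells = [(r, c) for r in range(N) for c in range(N)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def features(block):
    """Magnitudes of the first NUM_COEFFS coefficients in zigzag order."""
    coeffs = dct2(block)
    return [abs(coeffs[r][c]) for r, c in low_frequency_order()[:NUM_COEFFS]]


def fit_gaussian_nb(samples, labels):
    """Estimate a log prior and per-feature mean and variance for each class."""
    model = {}
    for label in sorted(set(labels)):
        rows = [s for s, l in zip(samples, labels) if l == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [max(sum((v - m) ** 2 for v in col) / len(rows), VAR_FLOOR)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (math.log(len(rows) / len(samples)), means, variances)
    return model


def predict(model, x):
    """Return the class label with the highest log posterior for features x."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances))
    return max(model, key=log_posterior)


def stripe_block(period, offset, high):
    """Hypothetical stand-in for a text block: sharp vertical strokes."""
    return [[high if ((y + offset) // period) % 2 == 0 else 0.0
             for y in range(N)] for _ in range(N)]


def smooth_block(base, slope):
    """Hypothetical stand-in for an image block: a gentle intensity ramp."""
    return [[base + slope * (x + y) for y in range(N)] for x in range(N)]


# Tiny synthetic training set; real training data would come from documents.
train_blocks = [stripe_block(1, 0, 255.0), stripe_block(2, 0, 200.0),
                stripe_block(1, 1, 230.0), stripe_block(2, 1, 180.0),
                smooth_block(50.0, 1.0), smooth_block(100.0, 2.0),
                smooth_block(150.0, 3.0), smooth_block(200.0, 0.5)]
train_labels = ["text"] * 4 + ["image"] * 4
model = fit_gaussian_nb([features(b) for b in train_blocks], train_labels)

# Classify two blocks that were not part of the training set.
print(predict(model, features(stripe_block(1, 0, 210.0))))
print(predict(model, features(smooth_block(120.0, 1.5))))
```

Sharp strokes put energy into higher-order coefficients that a smooth ramp leaves near zero, which is why even this toy model can separate the two synthetic classes. A decision tree could be trained on the same 18-element feature vectors.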

Volume 13, Issue 4 - Serial Number 4
Transactions on Computer Science & Engineering and Electrical Engineering (D)
October 2006

Receive Date: 03 February 2007
Revise Date:
Accept Date: 30 December 2006
