Classification of Heterogeneous Data Using Logistic Regression Trees
- As the use of data grows across various fields with advances in related technologies, data mining, which extracts valuable information from data, is increasingly important. Classification is one of the most common data mining tasks. This thesis focuses on two main challenges in classifier learning. The first is heterogeneity in data. In real-world classification problems, it is not unusual to encounter data with mixed types of variables, and the majority of existing classifiers have difficulty handling categorical and numerical variables simultaneously. Moreover, heterogeneous class-separating patterns emerge when the data are generated from multiple sources or when the variables have a complex structure of interactions. The second is the trade-off between accuracy and interpretability. In general, the best classification accuracy is achieved by “black-box” models that are not interpretable; however, model interpretability is often the key to success in research areas that require comprehensive human understanding.
Logistic regression trees address both problems. A logistic regression tree is a hybrid approach that combines a decision tree with logistic regression: it recursively partitions the input space of the data and fits a logistic regression model at each leaf node. The tree structure not only handles categorical variables naturally by using them for splits, but also allows the stratified models to be optimized for subpopulations with heterogeneous class-separating patterns. Moreover, the final model, a set of logistic regression models, is intuitively interpretable.
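The idea above can be illustrated with a minimal sketch (not the thesis algorithm): split once on a categorical variable, then fit a separate logistic regression model per branch. The data-generating setup here is invented for illustration; the class-separating pattern flips sign between the two groups, so a single pooled logistic model fails while the stratified "tree" succeeds.

```python
# Minimal logistic-regression-tree sketch: one categorical split,
# one logistic model per leaf. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, size=n)   # categorical variable used for the split
x = rng.normal(size=(n, 1))          # numerical predictor
# Heterogeneous class-separating patterns: the sign of the effect flips by group.
y = ((2 * group - 1) * x[:, 0] + rng.normal(scale=0.3, size=n) > 0).astype(int)

# "Tree": a single split on `group`, a logistic model at each leaf.
leaves = {g: LogisticRegression().fit(x[group == g], y[group == g]) for g in (0, 1)}

def predict(group_i, x_i):
    """Route an observation to its leaf and predict with that leaf's model."""
    return leaves[group_i].predict(x_i.reshape(1, -1))[0]

# A single pooled logistic model cannot capture the flipped pattern.
pooled = LogisticRegression().fit(x, y)
tree_acc = np.mean([predict(g, xi) == yi for g, xi, yi in zip(group, x, y)])
pooled_acc = pooled.score(x, y)
```

On this toy data the stratified model is far more accurate than the pooled one, which hovers near chance because the effect of `x` cancels across groups.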
This research proposes a novel method for constructing logistic regression trees for the classification of heterogeneous data. The proposed method consists of two key algorithms. First, a boosting algorithm is proposed that efficiently learns sparse logistic regression models in an incremental manner; an extended version of least angle regression is incorporated into the LogitBoost algorithm for variable selection. Second, a split selection algorithm for the logistic regression tree is proposed. It uses a split evaluation measure that looks ahead to the classification accuracy of the child models without actually fitting them, so an intermediate node can efficiently compare candidate splits and select the best one without an exhaustive search. Experimental results on simulated and real datasets demonstrate the usefulness of the proposed methods. This research thus provides a computationally efficient, accurate, and intuitively interpretable classifier for heterogeneous data, helping users make the best possible use of the data encountered in a variety of research areas and industries.
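The "compare candidate splits without fitting child models" idea can be sketched with a stand-in measure. The thesis's actual evaluation measure is not reproduced here; this illustration scores each candidate binary split by the children's class-proportion log-loss, a closed-form proxy that requires no per-child model fitting. The data and function names are hypothetical.

```python
# Illustrative split selection with a closed-form proxy score
# (not the thesis's measure): no logistic model is fitted per child.
import numpy as np

def proxy_score(y):
    """Log-loss of predicting the child's class proportion for every point."""
    p = np.clip(y.mean(), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def best_split(cat, y):
    """Pick the binary split {level} vs. rest with the lowest weighted proxy loss."""
    scores = {}
    for level in np.unique(cat):
        left, right = y[cat == level], y[cat != level]
        w = len(left) / len(y)
        scores[level] = w * proxy_score(left) + (1 - w) * proxy_score(right)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(1)
cat = rng.integers(0, 3, size=300)                         # categorical predictor
y = (cat == 2).astype(int) ^ (rng.random(300) < 0.1)       # level 2 drives the class
y = y.astype(int)
split, scores = best_split(cat, y)
```

Because level 2 is the informative one, splitting it off yields nearly pure children and the lowest proxy loss, so `best_split` selects it without ever fitting a model.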