Feature selection for classification of single amino acid variations

Genetic variations that change amino acid sequences can alter the structure and function of proteins. Not all such variations have phenotypic effects, so classifiers that distinguish disease-causing variations from neutral ones are needed to prioritize variants for experimental study. A large number of features associated with variations can be extracted, but many of them do not contribute to classification; instead they increase computational time and may even degrade classification performance. Feature selection filters out irrelevant and redundant features from an input feature set to obtain a feature subset that can induce a model with higher performance.

A total of 615 features describing the physicochemical and biochemical properties of amino acids were collected from the AAindex database. Four feature selection techniques were applied to select the most relevant features for classifying variations: Least Absolute Shrinkage and Selection Operator (LASSO), random forest, Random Forest Artificial Contrast with Ensembles (RF-ACE), and Area Under the ROC Curve of Random Forest (AUCRF). The classification abilities of the feature subsets selected by the different approaches were compared. Seven features that can represent the 615 input features were selected. The selected feature subset requires less computational time and has slightly better classification ability than the whole feature set.
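As a rough illustration of how an L1 penalty of the LASSO kind selects features, the following sketch fits an L1-penalised logistic regression by proximal gradient descent and keeps the features whose weights remain nonzero. It is not the code used in this work: the data are synthetic rather than drawn from AAindex, logistic loss is assumed because the task is binary classification, and the function names (`standardize`, `soft_threshold`, `fit_l1_logistic`) and the values of `lam`, `step` and `n_iter` are hypothetical choices made only for this example.

```python
import math
import random
import statistics


def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col) or 1.0  # guard against constant columns
        scaled.append([(v - mu) / sd for v in col])
    return scaled


def soft_threshold(value, threshold):
    """Shrink a value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(columns, labels, lam=0.1, step=0.3, n_iter=300):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    columns: list of feature columns (each a list of floats), already scaled.
    labels:  list of 0.0 / 1.0 class labels.
    lam:     penalty strength (assumed value; normally tuned by cross-validation).
    """
    n, d = len(labels), len(columns)
    weights, bias = [0.0] * d, 0.0
    for _ in range(n_iter):
        # Residual = predicted probability minus label, for every sample.
        residuals = []
        for i in range(n):
            z = bias + sum(weights[j] * columns[j][i] for j in range(d))
            residuals.append(1.0 / (1.0 + math.exp(-z)) - labels[i])
        # The intercept is not penalised: plain gradient step.
        bias -= step * sum(residuals) / n
        # Weights: gradient step followed by soft-thresholding.
        for j in range(d):
            grad = sum(r * x for r, x in zip(residuals, columns[j])) / n
            weights[j] = soft_threshold(weights[j] - step * grad, step * lam)
    return weights, bias


# Synthetic stand-in data: only features 0 and 1 influence the label.
random.seed(0)
n_samples, n_features = 400, 6
columns = [[random.gauss(0.0, 1.0) for _ in range(n_samples)]
           for _ in range(n_features)]
labels = [
    1.0 if columns[0][i] + columns[1][i] + 0.5 * random.gauss(0.0, 1.0) > 0 else 0.0
    for i in range(n_samples)
]

weights, bias = fit_l1_logistic(standardize(columns), labels)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("selected feature indices:", selected)
```

The features are standardised first because an L1 penalty is sensitive to feature scale. A larger `lam` drives more weights to exactly zero and so yields a smaller subset; in practice the penalty would be chosen by cross-validation rather than fixed by hand.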

Feature selection is an effective machine learning tool for reducing the number of features and thereby the computational time. Applying feature selection can also improve model performance.

Keywords: Feature selection, variation classification, amino acid features