Performance comparison of partial least squares-related variable selection methods for quantitative structure retention relationships modelling of retention times in reversed-phase liquid chromatography
Introduction
Quantitative structure–retention relationships (QSRRs) are statistically derived models that relate quantities characterizing the molecular structure of analytes (molecular descriptors) to their chromatographic retention parameters. QSRR models have found a wide range of applications, such as prediction of retention times, investigation of separation mechanisms, and classification of chromatographic columns [1], [2]. The goal of QSRR modelling is to establish a trend in the descriptor values that parallels the trend in the retention parameters. When retention prediction is of concern, the model must be sufficiently general to give accurate predictions of retention times for unseen compounds, that is, compounds that were not used to generate the model [3].
QSRR models can be built from a small set of molecular descriptors with well-understood physicochemical properties [4], [5], [6], [7]. The quality of QSRR models developed in this way depends strongly on a priori knowledge of the retention mechanism and on the availability of molecular descriptors relevant to that mechanism. Alternatively, models can be derived by generating a large pool of descriptors with molecular modelling software and then extracting a subset of the most significant descriptors from this pool by means of a suitable mathematical variable selection method. The predictive power of the models resulting from this approach depends primarily on the efficiency of the variable selection method employed [8]. One difficulty in the descriptor selection phase of QSRR modelling is that, although increasing the number of descriptors in a model often improves the fit to the training dataset used to generate it, removing redundant and uninformative variables reduces the risk of over-fitting and chance correlation and can therefore improve prediction performance for the external test set [9], [10], [11]. However, care should be taken in such elimination, since it has been shown that high co-linearity among variables does not necessarily imply an absence of feature complementarity, and that variables that are useless by themselves can be useful together [3].
Partial least squares (PLS) is a linear multiple regression method frequently used in chemometrics and multivariate calibration studies. Unlike multiple linear regression (MLR), PLS can handle a large number of variables even in the presence of co-linearity, redundancy and noise in both the independent (x) and dependent (y) variables [2], [10], [12]. In fact, correlation in the x-variable matrix (X) is even regarded as a useful form of duplicate measurement [10]. PLS can also deal with datasets having more variables than samples, such as those commonly encountered in QSRR studies. In mathematical terms, PLS summarizes the variation in the X matrix in a small set of orthogonal, linear latent variables (LVs) by maximizing the covariance between the X matrix and the response variable y [8], [13]. The complexity of the model is controlled by optimizing the number of LVs, so that over-fitting can be minimized. A further benefit of PLS is that it can be readily integrated into different mathematical algorithms for variable selection. PLS-related variable selection methods, such as genetic algorithm-PLS (GA-PLS), uninformative variable elimination (UVE) and iterative stepwise elimination of variables by PLS (ISE-PLS), have previously been applied in a wide range of contexts, including QSRR modelling [1], [8], [14], [15], [16], [17]. More recently, other approaches have been developed to address the shortcomings of the earlier methods, including competitive adaptive reweighted sampling (CARS) [18], [19], iteratively retaining informative variables (IRIV) [20], VISSA [21] and autoPLS [22]. While these have been used successfully in a range of applications, such as spectroscopic data analysis and prediction of biological activities, their utility for variable selection in QSRR is yet to be demonstrated.
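The role of the number of LVs as the complexity control in PLS can be illustrated with a minimal sketch. It assumes a single response (PLS1) fitted with the NIPALS algorithm in NumPy and a simple k-fold cross-validation; the function names `pls1_fit` and `cv_rmse`, the synthetic data and all numerical settings are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a single-response PLS model (NIPALS) and return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean              # mean-centred working copies
    n_features = X.shape[1]
    W = np.zeros((n_features, n_lv))             # weight vectors
    P = np.zeros((n_features, n_lv))             # X loadings
    q = np.zeros(n_lv)                           # y loadings
    k = 0
    for a in range(n_lv):
        w = Xr.T @ yr                            # direction of maximum covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                         # nothing left to explain
            break
        w = w / norm
        t = Xr @ w                               # scores of the current latent variable
        tt = t @ t
        p = Xr.T @ t / tt
        q[a] = yr @ t / tt
        Xr = Xr - np.outer(t, p)                 # deflate X
        yr = yr - q[a] * t                       # deflate y
        W[:, a], P[:, a] = w, p
        k += 1
    if k == 0:
        return np.zeros(n_features), y_mean
    W, P, q = W[:, :k], P[:, :k], q[:k]
    coef = W @ np.linalg.solve(P.T @ W, q)       # regression vector in original X space
    return coef, y_mean - x_mean @ coef

def cv_rmse(X, y, n_lv, n_folds=5, seed=0):
    """Root-mean-square error of k-fold cross-validation for a given number of LVs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        coef, b0 = pls1_fit(X[train_idx], y[train_idx], n_lv)
        sq_err += np.sum((X[test_idx] @ coef + b0 - y[test_idx]) ** 2)
    return np.sqrt(sq_err / len(y))

# Synthetic example (assumption): 40 samples, 100 collinear variables, 3 underlying factors.
rng = np.random.default_rng(1)
scores = rng.normal(size=(40, 3)) * np.array([3.0, 1.0, 1.0])
loadings = rng.normal(size=(3, 100))
X = scores @ loadings + 0.05 * rng.normal(size=(40, 100))
y = scores @ np.array([0.2, 2.0, -1.0]) + 0.05 * rng.normal(size=40)

rmse_by_lv = {a: cv_rmse(X, y, a) for a in range(1, 9)}
best_n_lv = min(rmse_by_lv, key=rmse_by_lv.get)
print(best_n_lv, rmse_by_lv[best_n_lv])
```

In this sketch the number of LVs is chosen as the value giving the lowest cross-validated error, which is one common way of limiting over-fitting when there are more variables than samples; other selection criteria are equally possible.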
The aim of the present study was to explore the applicability of some of the recently introduced variable selection algorithms and to compare their performance with that of more conventional algorithms in a typical QSRR study, using an example dataset of 86 suspected sports doping species separated by gradient reversed-phase liquid chromatography (RPLC). Predictive PLS models with different numbers of molecular descriptors were derived and subsequently characterized using several measures of prediction accuracy, model validity and applicability.
GA-PLS
In this work, the implementation of the GA-PLS algorithm proposed by Leardi was employed [23]. In brief, the algorithm starts by randomly creating a pool of chromosomes. Each chromosome encodes a random subset of variables as a binary string, in which the presence or absence of a variable is denoted by a value of one or zero, respectively. While the length of all chromosomes is the same and is equal to the total number of variables, the maximum number of variables presenting in
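The binary chromosome encoding described above can be sketched as follows. This is only an illustration of the initial random pool, not the implementation of [23]: the function names, the pool size, the number of variables and the cap on selected variables per chromosome (`max_selected`) are all hypothetical values chosen for the example.

```python
import random

def random_chromosome(n_variables, max_selected, rng):
    """Binary string of length n_variables with between 1 and max_selected ones."""
    n_selected = rng.randint(1, max_selected)            # inclusive bounds
    selected = set(rng.sample(range(n_variables), n_selected))
    return [1 if i in selected else 0 for i in range(n_variables)]

def initial_population(n_chromosomes, n_variables, max_selected, seed=0):
    """Randomly created pool of chromosomes (hypothetical settings)."""
    rng = random.Random(seed)
    return [random_chromosome(n_variables, max_selected, rng)
            for _ in range(n_chromosomes)]

def decode(chromosome):
    """Indices of the variables encoded as present (value one)."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]

# Assumed example sizes: 30 chromosomes, 200 descriptors, at most 15 selected each.
population = initial_population(n_chromosomes=30, n_variables=200, max_selected=15)
print(decode(population[0]))
```

Each decoded index list would then define the descriptor subset passed to a PLS model whose cross-validated performance serves as the fitness of that chromosome.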
Dataset
The dataset used in this study was obtained from the recent work of Miller and co-workers [6], and consisted of the retention times of 86 suspected sports doping-related compounds included in the London 2012 Olympic and Paralympic Games. Accordingly, UHPLC retention data were collected by running a linear gradient of water/acetonitrile (both containing 0.3% formic acid) through a Waters Acquity BEH-C18 column (2.1 mm × 50 mm, 1.7 μm) with high-resolution mass spectrometry detection. Under these
Preparation of the dataset
The true predictive power of a QSRR model can be assessed primarily by using the model to predict the retention times of external test set compounds which have not been used in the model development phase [8], [29]. In this study, the same test set as used by Miller et al. [6] was used for the external validation of PLS models. In selecting the test compounds, structural diversity and a uniform distribution of their retention times across the 10 min runtime were taken into account [6].
Results
Conclusions
QSRR modelling using a predefined subset of molecular descriptors limits the scope of retention prediction to situations where the retention mechanism is fully understood and a priori knowledge about the relevant descriptors is available. This is rarely the case for more complex chromatographic modes, such as hydrophilic interaction liquid chromatography (HILIC) or mixed-mode chromatography, which highlights the need for a suitable descriptor selection method prior to QSRR modelling.
The
Acknowledgement
The authors acknowledge the Australian Research Council for the financial support of this research by an ARC Linkage Projects grant (LP120200700).
References (38)
- et al., Modelling of UPLC behaviour of acylcarnitines by quantitative structure–retention relationships, J. Pharm. Biomed. Anal. (2014)
- et al., Evaluating the performances of quantitative structure–retention relationship models with different sets of molecular descriptors and databases for high-performance liquid chromatography predictions, J. Chromatogr. A (2009)
- et al., Multivariate linear QSPR/QSAR models: rigorous evaluation of variable selection for PLS, Comput. Struct. Biotechnol. J. (2013)
- et al., Modelling the quality of enantiomeric separations based on molecular descriptors, Chemom. Intell. Lab. Syst. (2006)
- et al., A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies, Chemom. Intell. Lab. Syst. (2005)
- et al., Prediction of liquid chromatographic retention times of steroids by three-dimensional structure descriptors and partial least squares modeling, Chemom. Intell. Lab. Syst. (1998)
- et al., Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Anal. Chim. Acta (2009)
- et al., A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration, Anal. Chim. Acta (2014)
- et al., Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemom. Intell. Lab. Syst. (1998)
- et al., Performance of some variable selection methods when multicollinearity is present, Chemom. Intell. Lab. Syst. (2005)
- PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst.
- Towards better understanding of feature-selection or reduction techniques for Quantitative Structure–Activity Relationship models, Trends Anal. Chem.
- Beware of q2!, J. Mol. Graph. Model.
- Modeling of retention behaviors of most frequent components of essential oils in polar and non-polar stationary phases, J. Sep. Sci.
- Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction, Chem. Cent. J.
- QSRR prediction of chromatographic retention of ethynyl-substituted PAH from semiempirically computed solute descriptors, Anal. Chem.
- Prediction of chromatographic retention time in high-resolution anti-doping screening data using artificial neural networks, Anal. Chem.
- Quantitative structure retention relationship models in an analytical Quality by Design framework: simultaneously accounting for compound properties, mobile-phase conditions, and stationary-phase properties, Ind. Eng. Chem. Res.
- Retention prediction of peptides based on uninformative variable elimination by partial least squares, J. Proteome Res.