Performance comparison of partial least squares-related variable selection methods for quantitative structure retention relationships modelling of retention times in reversed-phase liquid chromatography

doi:10.1016/j.chroma.2015.10.099

Journal of Chromatography A

Volume 1424, 11 December 2015, Pages 69-76

https://doi.org/10.1016/j.chroma.2015.10.099

Highlights

• The relative performance of six variable selection methods in QSRR was compared.
• All methods demonstrated very small demands on computational time and effort.
• Models from selected descriptors outperformed the PLS model derived from all descriptors.
• Combining variable selection methods improves the performance of the resulting models.
• Frequently selected descriptors were found relevant to the RPLC retention mechanism.

Abstract

The relative performance of six multivariate data analysis methods derived from or combined with partial least squares (PLS) has been compared in the context of quantitative structure–retention relationships (QSRR). These methods include genetic algorithm (GA)-PLS, Monte Carlo uninformative variable elimination (MC-UVE), competitive adaptive reweighted sampling (CARS), iteratively retaining informative variables (IRIV), variable iterative space shrinkage approach (VISSA) and PLS with automated backward selection of predictors (autoPLS). A set of 825 molecular descriptors was computed for 86 suspected sports doping compounds and used for predicting their gradient retention times in reversed-phase liquid chromatography (RPLC). The correlation between the molecular descriptors selected by each technique and the retention time was established using the PLS method. All models derived from a selected subset of descriptors outperformed the reference PLS model derived from all descriptors, with very small demands on computational time and effort. A performance comparison indicated great diversity among these methods in the number of molecular descriptors selected as most relevant, ranging from 28 for CARS to 263 for MC-UVE. While VISSA provided the lowest degree of over-fitting for the training set, CARS demonstrated the best compromise between the prediction accuracy and the number of selected descriptors, with a prediction error as low as 46 s for the external test set. Only ten descriptors were found to be common to all models, and the characteristics of these descriptors were representative of the retention mechanism in RPLC.

Introduction

Quantitative structure–retention relationships (QSRRs) are statistically derived models relating the quantities characterizing the molecular structure of the analytes (molecular descriptors) to their chromatographic retention parameters. QSRR models have found a wide range of applications, such as prediction of the retention time, investigation of the separation mechanisms, and classification of chromatographic columns [1], [2]. The goal of QSRR modelling is to establish a trend in the descriptor values that parallels the trend in retention parameters. When retention prediction is of concern, the model must be sufficiently general to give accurate predictions of retention times for unseen compounds that have not been used for the generation of the model [3].

QSRR models can be established from a small set of molecular descriptors with well-understood physicochemical properties [4], [5], [6], [7]. The quality of QSRR models developed using this approach depends strongly on a priori knowledge about the retention mechanism and also on the availability of suitable molecular descriptors which are most relevant to this retention mechanism. Alternatively, models can be derived by generating a large pool of descriptors with molecular modelling software, followed by the extraction of a subset of the most significant molecular descriptors from this pool by means of suitable mathematical variable selection methods. Clearly, the predictive power of the models resulting from this approach depends primarily on the efficiency of the variable selection method employed [8]. One of the difficulties in the descriptor selection phase of QSRR modelling is that increasing the number of descriptors incorporated into a particular model often improves the model fit for the training dataset used to generate the model, whereas reducing the number of redundant and uninformative variables can avoid the risk of over-fitting and chance correlation, and therefore lead to improved prediction performance for the external test set [9], [10], [11]. However, care should be taken in such elimination, since it has been shown that high co-linearity among variables does not necessarily imply an absence of feature complementarity, and that variables that are useless by themselves can be useful together [3].

PLS is a linear, multiple regression method frequently used in chemometrics and multivariate calibration studies. Unlike multiple linear regression (MLR), PLS is particularly useful in handling a great number of variables even in the presence of co-linearity, redundancy and noise in both independent (x) and dependent (y) variables [2], [10], [12]. In fact, correlation in the x-variable matrix (X) is even considered as a useful duplicate measurement [10]. PLS can also deal with datasets having more variables than samples, such as those commonly dealt with in QSRR studies. In mathematical terms, PLS summarizes the variation in the X matrix into a small set of orthogonal, linear latent variables (LVs) by maximizing the covariance between the X matrix and the response variable y [8], [13]. The complexity of the model is controlled by optimizing the number of LVs, so that over-fitting can be minimized. Among the benefits that PLS has to offer is its innate ability for integration into different mathematical algorithms for variable selection. PLS-related variable selection methods, such as GA-PLS, UVE and iterative stepwise elimination of variables by PLS (ISE-PLS), have previously been applied in a wide range of contexts, including QSRR modelling [1], [8], [14], [15], [16], [17]. More recently, other approaches have been developed to address the shortcomings of the earlier methods; these include CARS [18], [19], IRIV [20], VISSA [21] and autoPLS [22]. While they have been successfully used in a range of applications, such as spectroscopy data analysis and prediction of biological activities, their utility for variable selection in QSRR is yet to be demonstrated.
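The way PLS builds latent variables can be illustrated with a minimal sketch of single-response PLS (PLS1) using the classical NIPALS algorithm. This is not the implementation used in the study; the function names `pls1_fit` and `pls1_predict`, the synthetic data and the chosen numbers of LVs are assumptions made purely for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a single-response PLS model with NIPALS (illustrative sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean            # mean-centre X and y
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xk.T @ yk                          # weight: direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                             # scores of the current latent variable
        tt = t @ t
        p = Xk.T @ t / tt                      # X loadings
        c = yk @ t / tt                        # y loading
        Xk = Xk - np.outer(t, p)               # deflate X so the next LV is orthogonal
        yk = yk - c * t                        # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)        # regression vector in descriptor space
    return x_mean, y_mean, b

def pls1_predict(model, X):
    """Predict responses for new rows of X from a fitted model."""
    x_mean, y_mean, b = model
    return (X - x_mean) @ b + y_mean

# Synthetic example (hypothetical data): more variables than samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=20)
for n_lv in (1, 3, 5):
    fit_rmse = np.sqrt(np.mean((pls1_predict(pls1_fit(X, y, n_lv), X) - y) ** 2))
    print(n_lv, "LVs, training RMSE:", round(float(fit_rmse), 3))
```

The training error shrinks with every added LV, which is why the number of LVs is normally chosen by cross-validation or an external test set rather than by the fit to the training data.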

The aim of the present study was to explore the applicability of some of the recently introduced variable selection algorithms and to compare their performance with more conventional algorithms for a typical QSRR study incorporating an example dataset of 86 suspected sports doping species separated using gradient RPLC. Predictive PLS models with different numbers of molecular descriptors were derived and subsequently characterized using several measures of prediction accuracy, model validity and applicability.

Section snippets

GA-PLS

In this work, the implementation of the GA-PLS algorithm proposed by Leardi was employed [23]. In brief, the algorithm starts with randomly creating a pool of chromosomes. Each chromosome encodes a random subset of variables by a binary string representation, where the presence or absence of a variable is defined by a value of one or zero, respectively. While the length of all chromosomes is the same and is equal to the total number of variables, the maximum number of variables presenting in
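The binary chromosome encoding described above can be sketched as follows. This is only an illustration, not Leardi's implementation: because the source sentence is cut off, the cap on the number of variables switched on in each chromosome (`max_selected`) is an assumed reading, and the population size of 30 and the cap of 30 are hypothetical values. Only the total of 825 descriptors comes from the study.

```python
import random

def random_chromosome(n_variables, max_selected, rng):
    """Binary list of length n_variables with between 1 and max_selected ones."""
    n_on = rng.randint(1, max_selected)                 # how many variables to switch on
    on = set(rng.sample(range(n_variables), n_on))      # which variables are present
    return [1 if i in on else 0 for i in range(n_variables)]

def initial_population(pop_size, n_variables, max_selected, seed=0):
    """Randomly created pool of chromosomes, all of the same length."""
    rng = random.Random(seed)
    return [random_chromosome(n_variables, max_selected, rng)
            for _ in range(pop_size)]

def selected_indices(chromosome):
    """Indices of the variables encoded as present (value 1)."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]

# Hypothetical settings: 30 chromosomes, at most 30 of the 825 descriptors each.
population = initial_population(pop_size=30, n_variables=825, max_selected=30)
print(len(population), "chromosomes; first one selects",
      len(selected_indices(population[0])), "descriptors")
```

In a full GA-PLS run each chromosome would then be scored by the cross-validated performance of a PLS model built on its selected descriptors; that step is omitted here.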

Dataset

The dataset used in this study was obtained from the recent work of Miller and co-workers [6], and consisted of the retention times of 86 suspected sports doping-related compounds included in the London 2012 Olympic and Paralympic Games. Accordingly, UHPLC retention data were collected by running a linear gradient of water/acetonitrile (both containing 0.3% formic acid) through a Waters Acquity BEH-C₁₈ column (2.1 mm × 50 mm, 1.7 μm) with high-resolution mass spectrometry detection. Under these

Preparation of the dataset

The true predictive power of a QSRR model can be assessed primarily by using the model to predict the retention times of external test set compounds which have not been used in the model development phase [8], [29]. In this study, the test set used by Miller et al. [6] was adopted for the external validation of the PLS models. In selecting the test compounds, structural diversity and a uniform distribution of their retention times across the 10 min runtime were taken into account [6].
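One simple way to obtain test compounds spread evenly over the run, and to score predictions for them, is sketched below. It is not the procedure of Miller et al.: it ignores structural diversity, and the function names, the one-compound-per-window rule and the example retention times are assumptions for illustration only.

```python
import math

def pick_spread_test_set(retention_min, n_test, runtime_min=10.0):
    """Pick at most one compound from each of n_test equally wide retention
    windows, choosing the one closest to the window centre."""
    width = runtime_min / n_test
    chosen = []
    for k in range(n_test):
        lo, hi = k * width, (k + 1) * width
        centre = (lo + hi) / 2.0
        in_window = [i for i, t in enumerate(retention_min) if lo <= t < hi]
        if in_window:                                    # empty windows are skipped
            chosen.append(min(in_window,
                              key=lambda i: abs(retention_min[i] - centre)))
    return chosen

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))

# Hypothetical retention times in minutes for nine compounds.
rt = [0.5, 1.2, 2.6, 3.1, 4.9, 5.5, 7.4, 8.0, 9.9]
test_idx = pick_spread_test_set(rt, n_test=5)
train_idx = [i for i in range(len(rt)) if i not in test_idx]
print("test compounds:", test_idx, "training compounds:", train_idx)
```

The remaining compounds form the training set, and an error measure such as `rmse` would be evaluated only on the held-out compounds after the model has been built.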

Conclusions

QSRR modelling using a predefined subset of molecular descriptors limits the scope of retention prediction to situations where the retention mechanism is fully understood and a priori knowledge about descriptors is available. This is rarely the case for more complex chromatographic modes, such as hydrophilic interaction liquid chromatography (HILIC) or mixed-mode chromatography, which highlights the need for a suitable descriptor selection method prior to QSRR modelling.

The

Acknowledgement

The authors acknowledge the Australian Research Council for the financial support of this research by an ARC Linkage Projects grant (LP120200700).

References (38)

A.A. D’Archivio et al. Modelling of UPLC behaviour of acylcarnitines by quantitative structure–retention relationships. J. Pharm. Biomed. Anal. (2014)
C. Wang et al. Evaluating the performances of quantitative structure–retention relationship models with different sets of molecular descriptors and databases for high-performance liquid chromatography predictions. J. Chromatogr. A (2009)
K. Varmuza et al. Multivariate linear QSPR/QSAR models: rigorous evaluation of variable selection for PLS. Comput. Struct. Biotechnol. J. (2013)
S. Caetano et al. Modelling the quality of enantiomeric separations based on molecular descriptors. Chemom. Intell. Lab. Syst. (2006)
T. Hancock et al. A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemom. Intell. Lab. Syst. (2005)
L.I. Nord et al. Prediction of liquid chromatographic retention times of steroids by three-dimensional structure descriptors and partial least squares modeling. Chemom. Intell. Lab. Syst. (1998)
H. Li et al. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta (2009)
Y.H. Yun et al. A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. Anal. Chim. Acta (2014)
R. Leardi et al. Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemom. Intell. Lab. Syst. (1998)
I.-G. Chong et al. Performance of some variable selection methods when multicollinearity is present. Chemom. Intell. Lab. Syst. (2005)
S. Wold et al. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. (2001)
M. Goodarzi et al. Towards better understanding of feature-selection or reduction techniques for Quantitative Structure–Activity Relationship models. Trends Anal. Chem. (2013)
A. Golbraikh et al. Beware of q²! J. Mol. Graph. Model. (2002)
M. Jalali-Heravi et al. Modeling of retention behaviors of most frequent components of essential oils in polar and non-polar stationary phases. J. Sep. Sci. (2011)
N.M. O’Boyle et al. Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction. Chem. Cent. J. (2008)
E.B. Ledesma et al. QSRR prediction of chromatographic retention of ethynyl-substituted PAH from semiempirically computed solute descriptors. Anal. Chem. (2000)
T.H. Miller et al. Prediction of chromatographic retention time in high-resolution anti-doping screening data using artificial neural networks. Anal. Chem. (2013)
K. Muteki et al. Quantitative structure retention relationship models in an analytical Quality by Design framework: simultaneously accounting for compound properties, mobile-phase conditions, and stationary-phase properties. Ind. Eng. Chem. Res. (2013)
R. Put et al. Retention prediction of peptides based on uninformative variable elimination by partial least squares. J. Proteome Res. (2006)

