Elsevier

Journal of Chromatography A

Volume 1424, 11 December 2015, Pages 69-76

Performance comparison of partial least squares-related variable selection methods for quantitative structure retention relationships modelling of retention times in reversed-phase liquid chromatography

https://doi.org/10.1016/j.chroma.2015.10.099

Highlights

  • The relative performance of six variable selection methods in QSRR was compared.

  • All methods placed very small demands on computational time and effort.

  • Models from selected descriptors outperformed PLS model derived from all descriptors.

  • Combining variable selection methods improves the performance of the resulting models.

  • Frequently selected descriptors were found relevant to the RPLC retention mechanism.

Abstract

The relative performance of six multivariate data analysis methods derived from or combined with partial least squares (PLS) has been compared in the context of quantitative structure–retention relationships (QSRR). These methods are genetic algorithm (GA)-PLS, Monte Carlo uninformative variable elimination (MC-UVE), competitive adaptive reweighted sampling (CARS), iteratively retaining informative variables (IRIV), variable iterative space shrinkage approach (VISSA) and PLS with automated backward selection of predictors (autoPLS). A set of 825 molecular descriptors was computed for 86 suspected sports doping compounds and used for predicting their gradient retention times in reversed-phase liquid chromatography (RPLC). The correlation between the molecular descriptors selected by each technique and the retention time was established using the PLS method. All models derived from a selected subset of descriptors outperformed the reference PLS model derived from all descriptors, with very small demands on computational time and effort. A performance comparison indicated great diversity among these methods in the number of molecular descriptors selected as most relevant, ranging from 28 for CARS to 263 for MC-UVE. While VISSA provided the lowest degree of over-fitting for the training set, CARS demonstrated the best compromise between prediction accuracy and the number of selected descriptors, with a prediction error as low as 46 s for the external test set. Only ten descriptors were found to be common to all models, and the characteristics of these descriptors are representative of the retention mechanism in RPLC.

Introduction

Quantitative structure–retention relationships (QSRRs) are statistically derived models relating the quantities characterizing the molecular structure of the analytes (molecular descriptors) with their chromatographic retention parameters. QSRR models have found a wide range of applications, such as prediction of the retention time, investigation of the separation mechanisms, and classification of chromatographic columns [1], [2]. The goal of QSRR modelling is to establish a trend in the descriptor values, which parallels the trend in retention parameters. When retention prediction is of concern, the model must be sufficiently general to give accurate prediction of retention times for unseen compounds that have not been used for the generation of the model [3].

QSRR models can be built from a small set of molecular descriptors with well-understood physicochemical properties [4], [5], [6], [7]. The quality of models developed in this way depends strongly on a priori knowledge of the retention mechanism and on the availability of suitable molecular descriptors that are most relevant to it. Alternatively, models can be derived by generating a large pool of descriptors with molecular modelling software and then extracting a subset of the most significant descriptors by means of a suitable mathematical variable selection method. Clearly, the predictive power of the resulting models depends primarily on the efficiency of the variable selection method employed [8]. One difficulty in the descriptor selection phase is that, although adding descriptors to a model often improves the fit for the training dataset used to generate it, removing redundant and uninformative variables reduces the risk of over-fitting and chance correlation and can therefore improve prediction performance for the external test set [9], [10], [11]. However, care should be taken in such elimination, since it has been shown that high co-linearity among variables does not necessarily imply an absence of feature complementarity, and variables that are useless by themselves can be useful together [3].
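The contrast between training fit and external prediction described above can be made concrete with a short sketch. It assumes that errors are summarised as a root-mean-square error, which is a common choice rather than something stated in this excerpt; the function names are hypothetical and the retention times used with them would be the reader's own.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted retention times."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_versus_prediction(train_obs, train_pred, test_obs, test_pred):
    """Return (training error, external test error) for one model.

    A test error much larger than the training error is a typical
    symptom of over-fitting or chance correlation.
    """
    return rmse(train_obs, train_pred), rmse(test_obs, test_pred)
```

Comparing the two returned values for models built on different descriptor subsets shows whether adding descriptors improves only the training fit or also the prediction for unseen compounds.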

PLS is a linear multiple regression method frequently used in chemometrics and multivariate calibration studies. Unlike multiple linear regression (MLR), PLS is particularly useful in handling a great number of variables even in the presence of co-linearity, redundancy and noise in both independent (x) and dependent (y) variables [2], [10], [12]. In fact, correlation in the x-variable matrix (X) is even considered a useful duplicate measurement [10]. PLS can also deal with datasets that have more variables than samples, such as those commonly encountered in QSRR studies. In mathematical terms, PLS summarizes the variation in the X matrix in a small set of orthogonal, linear latent variables (LVs) by maximizing the covariance between the X matrix and the response variable y [8], [13]. The complexity of the model is controlled by optimizing the number of LVs, so that over-fitting can be minimized. A further benefit of PLS is that it integrates readily into different mathematical algorithms for variable selection. PLS-related variable selection methods, such as GA-PLS, UVE and iterative stepwise elimination of variables by PLS (ISE-PLS), have previously been applied in a wide range of contexts, including QSRR modelling [1], [8], [14], [15], [16], [17]. More recently, other approaches have been developed to address the shortcomings of the earlier methods; these include CARS [18], [19], IRIV [20], VISSA [21] and autoPLS [22]. While they have been used successfully in a range of applications, such as spectroscopy data analysis and prediction of biological activities, their utility for variable selection in QSRR is yet to be demonstrated.
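As a rough illustration of how latent variables are extracted and how their number can be tuned, the following is a minimal sketch of single-response PLS (PLS1) in the NIPALS style, with leave-one-out cross-validation as one possible way of choosing the number of LVs. It is not the implementation used in the study; the function names, the plain-list data layout and the choice of leave-one-out validation are all assumptions made for the example.

```python
import math


def pls1_fit(X, y, n_components):
    """Fit PLS1 on row-major X (list of rows) and response y."""
    n, n_vars = len(X), len(X[0])
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centred X
    f = [v - y_mean for v in y]                              # centred y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(row[j] * fi for row, fi in zip(E, f)) for j in range(n_vars)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left in X that covaries with y
            break
        w = [v / norm for v in w]
        t = [sum(a * b for a, b in zip(row, w)) for row in E]  # scores
        tt = sum(v * v for v in t)
        p = [sum(row[j] * ti for row, ti in zip(E, t)) / tt
             for j in range(n_vars)]                           # X loadings
        q = sum(fi * ti for fi, ti in zip(f, t)) / tt          # y loading
        # Deflate X and y so that the next LV is orthogonal to this one.
        E = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "components": components}


def pls1_predict(model, x):
    """Predict the response for one sample by replaying the deflation steps."""
    e = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in model["components"]:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [v - t * pj for v, pj in zip(e, p)]
    return y_hat


def loo_rmse(X, y, n_components):
    """Leave-one-out cross-validated RMSE for a given number of LVs."""
    total = 0.0
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        total += (pls1_predict(model, X[i]) - y[i]) ** 2
    return math.sqrt(total / len(X))
```

Calling `loo_rmse` for an increasing number of LVs and keeping the value that gives the lowest cross-validated error is one simple way to control model complexity as described above.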

The aim of the present study was to explore the applicability of some of the recently introduced variable selection algorithms and to compare their performance with that of more conventional algorithms in a typical QSRR study, using an example dataset of 86 suspected sports doping species separated by gradient RPLC. Predictive PLS models with different numbers of molecular descriptors were derived and subsequently characterized using several measures of prediction accuracy, model validity and applicability.

GA-PLS

In this work, the implementation of the GA-PLS algorithm proposed by Leardi was employed [23]. In brief, the algorithm starts with randomly creating a pool of chromosomes. Each chromosome encodes a random subset of variables by a binary string representation, where the presence or absence of a variable is defined by a value of one or zero, respectively. While the length of all chromosomes is the same and is equal to the total number of variables, the maximum number of variables presenting in
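The binary encoding described above can be sketched as follows. This illustrates only the chromosome representation and the random initial pool, not Leardi's implementation; the population size and the cap on selected variables per chromosome (`max_selected`) are hypothetical parameters, since the excerpt is cut off before giving the actual values.

```python
import random


def random_chromosome(n_variables, max_selected, rng):
    """Binary string of length n_variables with 1..max_selected ones."""
    n_on = rng.randint(1, max_selected)
    on = set(rng.sample(range(n_variables), n_on))
    # 1 marks a variable as present in the subset, 0 as absent.
    return [1 if j in on else 0 for j in range(n_variables)]


def initial_population(pop_size, n_variables, max_selected, seed=0):
    """Randomly create the starting pool of chromosomes."""
    rng = random.Random(seed)
    return [random_chromosome(n_variables, max_selected, rng)
            for _ in range(pop_size)]


def selected_indices(chromosome):
    """Decode a chromosome into the column indices of the chosen variables."""
    return [j for j, bit in enumerate(chromosome) if bit == 1]
```

For example, `initial_population(30, 825, 30)` would create 30 chromosomes over the 825 descriptors mentioned in the abstract, each selecting at most 30 of them; both values of 30 are placeholders chosen for the example.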

Dataset

The dataset used in this study was obtained from the recent work of Miller and co-workers [6], and consisted of the retention times of 86 suspected sports doping-related compounds included in the London 2012 Olympic and Paralympic Games. Accordingly, UHPLC retention data were collected by running a linear gradient of water/acetonitrile (both containing 0.3% formic acid) through a Waters Acquity BEH-C18 column (2.1 mm × 50 mm, 1.7 μm) with high-resolution mass spectrometry detection. Under these

Preparation of the dataset

The true predictive power of a QSRR model can be assessed primarily by using the model to predict the retention times of external test set compounds which have not been used in the model development phase [8], [29]. In this study, the same test set as used by Miller et al. [6] was used for the external validation of PLS models. In selecting the test compounds, structural diversity and a uniform distribution of their retention times across the 10 min runtime were taken into account [6].

Results

Conclusions

QSRR modelling using a predefined subset of molecular descriptors limits the scope of retention prediction to the situations where the retention mechanism is fully understood and a priori knowledge about descriptors is available. In fact, this is rarely the case for more complex chromatographic modes, such as hydrophilic interaction liquid chromatography (HILIC) or mixed-mode chromatography, thus highlighting the need for a suitable descriptor selection method prior to the QSRR modelling.

The

Acknowledgement

The authors acknowledge the Australian Research Council for the financial support of this research by an ARC Linkage Projects grant (LP120200700).

References (38)

  • S. Wold et al. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. (2001)
  • M. Goodarzi et al. Towards better understanding of feature-selection or reduction techniques for Quantitative Structure–Activity Relationship models. Trends Anal. Chem. (2013)
  • A. Golbraikh et al. Beware of q2! J. Mol. Graph. Model. (2002)
  • M. Jalali-Heravi et al. Modeling of retention behaviors of most frequent components of essential oils in polar and non-polar stationary phases. J. Sep. Sci. (2011)
  • N.M. O’Boyle et al. Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction. Chem. Cent. J. (2008)
  • E.B. Ledesma et al. QSRR prediction of chromatographic retention of ethynyl-substituted PAH from semiempirically computed solute descriptors. Anal. Chem. (2000)
  • T.H. Miller et al. Prediction of chromatographic retention time in high-resolution anti-doping screening data using artificial neural networks. Anal. Chem. (2013)
  • K. Muteki et al. Quantitative structure retention relationship models in an analytical Quality by Design framework: simultaneously accounting for compound properties, mobile-phase conditions, and stationary-phase properties. Ind. Eng. Chem. Res. (2013)
  • R. Put et al. Retention prediction of peptides based on uninformative variable elimination by partial least squares. J. Proteome Res. (2006)