Introduction

Cost-utility analyses (CUA) require preference-based measures (PBMs) of outcome. Traditionally PBMs, so-called multi-attribute utility instruments (MAUIs), have been generic. The mostly commonly employed generic PBM is the EQ-5D [1], a measure which the National Institute for Health and Care Excellence (NICE) actively encourages [2]. While the use of the same measure across a range of diseases and conditions increases comparability (what NICE refers to as a need for consistency) when informing decisions, there have been criticisms that these generic measures are not sensitive to certain disease-specific characteristics [35]. This is not withstanding the fact that such PBMs have been found to be sensitive to health issues that the instrument does not explicitly ask about. For example, the EQ-5D has been found to be sensitive to a range of clinical features in patients with Parkinson’s disease including hallucinations [6]. Sensitivity is therefore a grey area. While using PBMs in some diseases may mean that important clinical and patient quality of life changes are missed entirely, in other disease areas it may be that effects are found, but the magnitude of these is underestimated. That is, it is not a simple question of whether PBMs are sensitive, but whether they are sensitive enough?

There are a number of ways in which health economists can introduce disease-specific sensitivity to the assessment of outcomes in a CUA [7]. Often they utilise mapping algorithms which estimate the relationship between a condition-specific non-preference-based measure and a generic PBM [8, 9]. A more resource intensive alternative is to elicit preferences from patients (or the general public) for condition-specific vignettes describing a health state, that is to use preference elicitation techniques like time trade-off, standard gamble or a discrete choice experiment within the study population [10, 11]. A third alternative is the use of bolt-ons to existing generic instruments, like adding vision impairment or hearing impairment to the EQ-5D [12]. Bolt-ons are thought to improve a generic instrument’s content validity for a particular condition, but also retain a core instrument that is comparable across conditions. A further option which is growing in popularity is to develop condition-specific preference-based measures (CSPBMs). CSPBMs can be developed from first principles (determine dimensions that are important to a patient/sufferer, design the instrument, undertake a valuation study and produce a set of tariffs) [13], or one could modify (in many instances this means reduce) an existing non-preference-based measure and undertake a valuation study [14]. As these non-preference-based measures have already been developed for the condition, arguably they have already been assessed for validity and sensitivity. An additional benefit of using existing non-preference-based measures is that clinicians can get information on quality of life and disease dimensions of interest to them, while health economists are able to estimate health state values for use in a CUA without the need to administer an additional outcome measure.

Despite the apparent benefits of CSPBMs, their use is limited. If CSPBMs are to be more widely adopted, then evidence of their performance is required. This does not negate that many health technology appraisal (HTA) agencies have an explicit preference for generic instruments, so it may be that even in the face of compelling evidence of the benefits of CSPBMs implementation will be restrained. Leaving this debate aside (interested readers are referred to Versteegh et al. [5] and Brazier and Tsuchiya [15]), this paper—using oncology as a case study—seeks to assess the validity, responsiveness and sensitivity of a cancer-specific preference-based measure, the EORTC-8D, relative to a generic PBM, the EQ-5D-3L.

Methods

Data

Cancer 2015 is a large-scale prospective longitudinal population-based molecular cohort study [16]. It enrols newly diagnosed/treatment naïve cancer patients irrespective of the tumour site (except leukaemia) and at all stages of disease. Recruitment is staged, and phase 1 (2011–2014) targeted the enrolment of 1000 patients from five hospitals in Victoria, Australia. It aims to test and implement a new model of cancer diagnosis and treatment with a specific focus on integrating molecular pathology into routine cancer diagnosis [17], whereby all tumours are genotyped and actionable mutations identified so to inform cancer diagnosis, prognosis and treatment at an individual level. The new model is one where personalised treatment plans, specifically precision medicines guided by genomic testing, would be offered to patients.

Patients consent to have their tumour biopsy and blood screened using next-generation sequencing (NGS) [18]. A baseline questionnaire collects information on patient socio-demographics and patient and familial history. Clinical records including pathology results are drawn upon to gather information on tumour site and stage and treatment intentions (including changing intentions over time). Patients are also asked to complete three patient-reported outcome measures (PROMs); see “Instruments” section for further details. PROMs are repeated at 6- and 12-month follow-up (for those with advanced disease the first follow-up point was at 3 months).

Instruments

The European Organization for Research and Treatment of Cancer Quality of Life Questionnaire C30 (EORTC QLQ-C30) is a non-preference-based health-related quality of life (HRQoL) measure which is frequently employed in cancer clinical trials. It is one of a suite of EORTC instruments and the C30 is regarded as a ‘generic’ cancer measure (e.g. EORTC QLQ-BR23 is specific to breast cancer, while EORTC QLQ-MY20 is for myeloma) [19]. It includes 30 questions which feed into nine multi-item scales: five functional scales (physical, role, cognitive, emotional and social functioning); three symptom scales (fatigue, pain, and nausea and vomiting); and a global health status/quality of life scale. Six single-item scales mainly for symptoms are also included (dyspnoea, insomnia, appetite loss, constipation, diarrhoea and financial difficulties). Cancer 2015 included version 3.0, the most recent version, which is recommended by the EORTC Quality of Life Group. The instrument is scored so that it provides summary scores (between 0 and 100) for a patient’s functioning, symptoms and global quality of life, where a higher score represents a higher (‘better’) level of functioning or quality of life or a higher (‘worse’) level of symptoms.

The EORTC-8D has eight dimensions (physical functioning, role functioning, pain, emotional functioning, social functioning, nausea, fatigue and sleep disturbance, and constipation and diarrhoea) each with 4 levels except physical functioning which has 5 levels. The instrument was derived using Rasch and factor analysis from the EORTC QLQ-C30 [20] with the EORTC-8D drawing on 10 questions from a possible 30. There are 81,920 unique health states in the EORTC-8D which were valued using a time trade-off approach in a sample of the general population from the north of England. The resulting values range from 0.292 to 1.00, on the full health–dead 1–0 scale.

The EQ-5D-3L (previous known as the EQ-5D) has five dimensions (mobility, self-care, usual activities, pain/discomfort and anxiety/depression) each with three levels such that there are 243 health states [21]. The EQ-5D-3L was also valued using a time trade-off approach in the UK. Other country valuations exist, some of which use other valuation techniques (including a discrete choice experiment (DCE) in Australia [22]); however, this analysis ignores any cross-country differences, and undertakes the comparison using UK tariffs employing the UK tariff (MVH-A1 algorithm) for the EQ-5D-3L [23]. The scoring range for the EQ-5D-3L in the UK is from −0.594 to 1.00, i.e. it includes states that are worse than dead (<0).

Cancer 2015 also included the EORTC-8D as an instrument in its own right (i.e. both the 30-item EORTC QLQ-C30 and the 10-item EORTC-8D were administered meaning that 10 items were duplicated in the questionnaire), in contrast to deriving EORTC-8D values indirectly from EORTC QLQ-C30 responses. Although the EORTC-8D has not been validated (e.g. psychometrically assessed) as a standalone instrument, it was deemed interesting to compare the responses of the standalone EORTC-8D with the derived EORTC-8D given that the standalone instrument is shorter (10 vs 30 questions). Our analysis ignores the responses to the standalone instrument, however, and only utilises the derived responses as they were found to be highly correlated (r = 0.93) and the standalone EORTC-8D instrument, despite being shorter, had more missing responses (which was possibly due to an ordering effect in the questionnaire because the standalone instrument was included after the C30). Note that phase 2 of Cancer 2015 does not include the EORTC-8D as a standalone instrument.

Analysis

Using the baseline health state values, we assessed the normalities of the distribution of each instrument using both the Shapiro–Wilk W test and the Shapiro–Francia W test. Skewness and kurtosis were also assessed.

We then assessed the correlation both within dimensions (using Spearman’s rank) and the health state values as a whole (using Pearson’s R). We used this correlation matrix analysis to consider convergent validity (i.e. the degree to which an instrument/dimension correlates with another instrument/dimension measuring the same concept) [24]. We expect there to be convergent validity in the items which are similar, e.g. those measuring physical dimensions of health and those measuring psychological dimensions. Strength of correlation was based on the following thresholds: r = 0–0.2 (very weak), r = 0.2–0.4 (weak), r = 0.4–0.7 (moderate), r = 0.7–0.9 (strong), r = 0.9–1.0 (very strong) [25]. It is important to note that the EQ-5D does not claim to perform measurement within its dimensions (i.e. it does not measure mobility), but instead provides a simple classification system; however, correlations within dimensions are commonly undertaken in assessments of validity [26, 27]. We additionally explored ceiling effects in each instrument by considering the relationship between item dimensions in one instrument and full health in the other instrument as measured by a health state value of 1, e.g. EORTC-8D item responses when the EQ-5D-3L is one and vice versa.

Agreement between the instruments was examined using a Bland–Altman plot [28]. This plotted the difference between EORTC-8D and EQ-5D-3L values against the mean of the values for each individual. The mean difference provides the estimate of bias while the limits of agreement, LOA (based on a ±1.96 × SDdifference interval), provide an estimate of the influence of random variation. If there was good agreement between the EORTC-8D and the EQ-5D-3L, then only 5% of points would lie outside of the LOA. Agreement was further assessed by estimating the intra-class correlation coefficient (ICC) [29] (two way mixed effects with absolute agreement). Strength of agreement was based on the following thresholds: ICC = 0–0.2 (poor), ICC = 0.2–0.4 (fair), ICC = 0.4–0.6 (moderate), ICC = 0.6–0.8 (strong) and ICC > 0.8 (almost perfect) [29, 30].

To understand the construct validity of each measure, that is whether the instrument is sensitive (or indeed more sensitive) to different covariates [24], we compared mean health state values using paired t tests and ANOVAs where appropriate, and estimated the standardised effect size (difference in means divided by the standard deviation). The covariates included age, gender, site of recruitment, insurance status, smoking status, performance/functioning status (measured using the Eastern Oncology Cooperative Group (ECOG) performance status scale) at baseline and over time, initial treatment intention (as an indicator for severity: none, curative, palliative), planned initial follow-up point (again as an indicator for severity), status at follow-up (dead or alive), site of tumour and staging of the disease. We hypothesised that the EORTC-8D would have a greater ability to discriminate between the disease characteristics (functioning, severity, stage) than the EQ-5D-3L. We additionally hypothesised that for the patient-level characteristics (age, gender, insurance status, etc.) both instruments should have similar levels of discriminatory power as they are unrelated to condition.

QALYs were estimated using the area under the curve method. The average time of follow-up was 434 days. Those who died were given a health state value of zero at their date of death and included in the QALY calculation. Correlation between the generic QALYs and condition-specific QALYs was assessed using Pearson’s R correlation coefficient. The sensitivity of the QALY estimates to various covariates (as described above) was also explored in bivariate analyses in order to further assess construct validity. As above, we hypothesise that there will be more discrimination with the condition-specific QALYs than with the generic QALYs for the covariates which reflect disease characteristics, but they will have equal discriminatory power for the patient-level characteristics. Regression analysis was employed to further examine the extent to which the difference in QALYs (condition-specific minus generic QALYs) was influenced by baseline patient demographics, disease characteristics, indicators of severity, change in patient’s performance/functioning (ECOG) status overtime and the difference in baseline health state values. A linear model was imposed and the regression was multivariate with all variables included at the same time.

All statistical analyses were undertaken in STATA MP version 13.0.

Results

Cancer 2015 recruited its first patient in November 2011, and as of February 2015 there were 1829 patients enrolled in the cohort; however, not all patients have complete PROMs data. We have baseline EQ-5D-3L values for 1715 patients and EORTC-8D values for 1689 patients, and the complete case sample (where there is a baseline value for both instruments) is 1678. We are able to estimate generic and condition-specific QALYs for 1157 patients. Note that 269 patients (nearly 15% of those recruited) have died.

Table 1 presents the descriptive statistics for the sample at baseline. The sample is elderly (mean age 62), the majority are male (54%), and a large number have other co-morbidities as measured by the Charlson Comorbidity Index (mostly diabetes and arthritis) [31]. The cohort purposely included a private hospital in the sample (in order to make treatment comparisons at a later date, and also because Australia has a two-tiered health care system); this hospital contributed 19% of the patients, but 43% of the total sample have insurance cover for hospitals. The hospital insurance variable can be considered to be reflective of income, as at a certain income threshold private health insurance is incentivised (i.e. an additional tax is imposed on high income earners who do not have insurance).

Table 1 Baseline sample descriptive statistics

In terms of disease, breast cancer and prostate cancer contribute the most patients (20% and 15%, respectively) to the cohort, but there is representation across the spectrum of tumour sites. All stages of cancer (staged via the staging method appropriate to the tumour site) are represented, and the majority of patients are noted to have curative treatment intentions at enrolment (82%), although some patients have palliative treatment intentions. The large majority of patients (67%) have an ECOG performance status which aligns with normal activity [32].

Baseline values

The EORTC QLQ-C30 summary measures are presented in Table 2; the mean functioning score was 79, mean symptom score was 19, and the mean global health score was 69. The mean EQ-5D-3L health state value was 0.748, while the mean health state value for the EORTC-8D was 0.829. The range of health state values is shown in Fig. 1 which plots the histograms for the baseline values for each instrument. The data are skewed and non-normal, and this is further supported in formal statistical tests (EQ-5D-3L Shapiro–Wilk test z = 11.8, p < 0.001; EORTC-8D Shapiro–Wilk test z = 10.1, p < 0.001; EQ-5D-3L Shapiro–Francia test z = 10.9, p < 0.001; EORTC-8D Shapiro–Francia test z = 9.5, p < 0.001).

Table 2 Descriptive statistics for health status
Fig. 1
figure 1

Histogram of baseline EQ-5D and EORTC-8D values

There is considerable variability across dimension responses in both instruments; see “Appendix”. The use of the highest level (no problems) in the EQ-5D-3L ranged from 91.6% of responses for usual activities to 48.3% of responses for pain/discomfort. For the EORTC-8D, the use of the highest level ranges from 77.8% of responses for nausea and 30.9% of responses for fatigue and sleep disturbance.

The convergent validity of the instruments is assessed by considering the correlations across the dimensions and the health state values. Table 3 shows that correlations between the dimensions of the EQ-5D-3L and the EORTC-8D are mostly moderate, particular in the dimensions which appear to be assessing similar constructs, e.g. physical functioning (EORTC-8D) and mobility (EQ-5D-3L), pain (EORTC-8D) and pain/discomfort (EQ-5D-3L), emotional functioning (EORTC-8D) and anxiety/depression (EQ-5D-3L). The correlation between the baseline health state values is 0.755, considered to be strong [25]. The correlations between the baseline health state values and the baseline EORTC QLQ-C30 summary scores are strong/very strong, ranging from 0.730 to 0.917, except for the correlation between the EQ-5D-3L health state value and the global quality of life summary score which is 0.651 (moderate).

Table 3 Correlations between health state dimensions at baseline

In the assessment of ceiling effects, Tables 4 and 5 show good content validity; when one instrument records a value of full health, this corresponds with the higher levels in each dimension in the other instrument. An exception to this is the fatigue and sleep disturbance dimension in the EORTC-8D; 40% of the responses are not at level 1 when their EQ-5D profiles suggest they are in full health. This suggests that the generic PBM, in this context, would fail to pick up impairments in fatigue and sleep disturbance.

Table 4 EORTC-8D responses when EQ-5D-3L = 1 (percentages)
Table 5 EQ-5D-3L responses when EORTC-8D = 1 (percentages)

The ICC is 0.595 which suggests the agreement between the measures is moderate. The Bland–Altman plot in Fig. 2 suggests that there are small mean differences between the two instruments at baseline, but relatively wide limits of agreement. 6.97% of the data points are found to lie outside of the LOA suggesting poor agreement between the two measures, and this is particularly the case for the lower values of HRQoL.

Fig. 2
figure 2

Bland–Altman plot of EORTC-8D and EQ-5D-3L at baseline

An analysis of the sensitivity of each instrument to various subgroups including patient and disease characteristics (see Table 6, columns 1–4) finds that both the EQ-5D-3L and the EORTC-8D are sensitive to gender (females have lower baseline health state values), admission to hospital (public patients have lower health state values), smoking status (smokers, including ex-smokers, have lower health state values), stage of disease (metastatic cancer patients have lower health state values), hospital insurance (those without insurance have lower health state values), expected future follow-up (those with plans for follow-up at three months—i.e. more advanced disease—have lower baseline health state values) and ECOG score (those with worse scores have lower health state values). There is also variation in cancer site, and both instruments find that prostate cancer patients have the highest baseline health state values, while patients with lung cancer and cancer of the unknown primary (CUP) have the lowest baseline health state values. Although both instruments identify statistically significant differences within the covariates, it is notable that the variation in values is greater for the EQ-5D-3L. However, the standard deviation for the EORTC-8D is smaller, such that the estimated effect sizes (not shown) are larger for the EORTC-8D, although the differences in effect sizes between the EQ-5D-3L and the EORTC-8D are not significant. These findings imply that our initial hypothesis that the EORTC-8D would have greater discriminatory power with respect to the disease characteristics (functioning, severity, stage) is not refuted.

Table 6 Differences in baseline health state values and QALYs

QALYs

The estimated mean QALYs when using the EQ-5D-3L is 0.860 (range −0.108 to 3.138); the estimated mean QALYs when using the EORTC-8D is 0.909 (range 0.001–3.078); thus, the QALY estimates are higher for the condition-specific measure and the range is narrower (see Table 2). The difference while small (0.049) is statistically significant (p < 0.001, paired t test). The generic and condition-specific QALYs are very strongly correlated (Pearson’s R = 0.959); see Fig. 3.

Fig. 3
figure 3

Correlation in QALY estimates

The sensitivity of both types of QALYs to variations in the sample characteristics is presented in Table 6, columns 5–8. There are many similarities with the relationships that were found for baseline health state values (columns 1–4), although some significant relationships are no longer apparent (for example, there is no difference in QALY estimates in terms of whether the patient was recruited in a public or private hospital). Most notable is that generic QALYs and condition-specific QALYs have a similar ability to discriminate across patient and disease characteristics, reporting similar effect sizes.

Table 7 presents the results of a multivariate regression examining the differences in QALY estimates derived using the condition-specific measure (EORTC-8D) and the generic instrument (EQ-5D-3L). The average patient condition-specific QALYs are higher than generic QALYs, and the results in Table 7 suggest that this can be explained in part by the variation in baseline health state values, smoking status, changing ECOG status, advanced disease, death, and having prostate or bone and soft tissue cancer. A large variation in baseline health state values results in a greater difference in QALYs gained. In terms of clinically relevant variables, patients with prostate cancer (and marginally for those with bone and soft tissue cancer) relative to breast cancer patients have greater differences in terms of condition-specific and generic QALYs. Similarly those who experienced a decline in their ECOG performance relative to those who did not change performance status from baseline to their last follow-up point also have larger differences in QALY estimates, while those identified at baseline as having advanced disease thus requiring earlier follow-up and those who died during the course of follow-up led to smaller differences between the condition-specific QALY and the generic QALY.

Table 7 Regression results examining the difference in QALYs

Discussion

The health economics discipline has been debating condition-specific measures in the literature for a number of years [3, 5, 33, 34]. Recently there has been a plethora of condition-specific measures developed [3539], but their use in decision-making remains limited [14]. This paper further informs the debate by testing the validity, responsiveness and sensitivity of a CSPBM for cancer. The EORTC-8D has previously been found to be broadly comparable to the EQ-5D [40], but that was within the same dataset that the EORTC-8D was developed from; thus, wider evaluation is required. This study provides one of the first external assessments of the instrument in comparison with EQ-5D-3L (see Hatswell et al. [41] for a comparison of EORTC-8D to SF-6D).

Descriptive analysis found that the mean health state value for the EORTC-8D was higher than for the EQ-5D-3L. Lloyd et al. [42] also found that the EORTC-8D scores were higher than EQ-5D-5L scores (a newer 5-level version of the EQ-5D [43]) in a group of men with metastatic castration-resistant prostate cancer. This may be a function of the EORTC-8D having a higher ‘floor’, and the lowest possible health state value is 0.292 compared with the EQ-5D-3L floor of −0.594. The greater range of values for the EQ-5D-3L may be observed because there is a greater opportunity for there to be lower values due to the theoretically plausible wider range of values that are available. Note that in the provisional EQ-5D-5L tariff for England the minimum value for the worst health state (55555) is −0.281 [44], while in other countries the worst health state values range from −0.446 [45] to −0.148 [46]. Further comparative analysis should be undertaken to consider the effect of the 5-level version, and note that Lloyd et al. [42] used a crosswalk from the 3L to the 5L. The EQ-5D-5L has been included in the next phase of Cancer 2015 [16].

Our assessment of convergent validity found that the dimensions and instruments were strongly correlated, while the analysis of content validity found few ceiling effects. Despite this, the agreement between the instruments was poor, with considerable variation in values for those with lower baseline HRQoL. Similar wide confidence intervals have been reported by others when comparing alternative generic MAUIs [47, 48].

The condition-specific QALYs estimated using the EORTC-8D were significantly higher than those derived from differences in the EQ-5D-3L over time (although the difference was small). Both the generic and condition-specific QALYs were found to be similarly sensitive to a number of patient and disease characteristics. Multivariate regression analysis of the difference in QALY estimates at a patient level found variation in baseline health state values had a large influence on the difference in QALYs gained, such that higher baseline EORTC-8D health state values relative to EQ-5D-3L values yield higher condition-specific QALYs compared to generic QALYs. This is despite the fact that higher baseline values mean there is less utility space in which to improve, given that health state values are bounded at 1. However, this may be driven by greater variation at the lower end of the health state values which would reaffirm the Bland–Altman results presented earlier (Fig. 2) where the poor agreement was driven by the patients with lower (mean) baseline HRQoL who also happened to have larger baseline differences.

Previous analysis [40] suggests that the EORTC-8D produced outcome values that are as valid, responsive and sensitive as EQ-5D-3L values. Our findings align with this, and at baseline the EQ-5D-3L and EORTC-8D values have a similar ability to discriminate between groups. This is also carried through to the QALY estimation where both generic and condition-specific QALYs appear equally responsive and sensitive to disease characteristics. When specifically considering the difference in QALYs, we find that this is most sensitive to differences in baseline health state values which are larger for those with lower HRQoL and in patients with declining performance and for particular sites. Therefore, researchers producing QALYs estimates from the EORTC-8D and decision makers utilising this information are advised to be cautious if their target group includes such patients. Caution is also advised if researchers/decision makers are using the instruments interchangeably (as may be the case in modelled economic evaluations) as the health state values differ considerably between instruments.

A limitation of this study is that while the cohort is rich in information it is not a clinical trial, and therefore, treatment effects vary. More research is required to compare the generic PBM and CSPBM in a trial setting. A further future limitation is that an additional PBM using the EORTC QLQ-C30 is underdevelopment (QLU-C10D) [4951]. While the EORTC-8D classification system was derived using data from multiple myeloma patients, the new measure utilises data from multiple countries and multiple types of cancer to derive its classification system and in addition aims to produce country-specific preference weights for a range of different countries including the UK and Australia. Both instruments draw on the EORTC QLQ-C30 which the EORTC Quality of Life Group suggests is supplemented by disease-specific modules (e.g. QLQ-BR23 for breast, QLQ-MY20 for multiple myeloma). Therefore, it may be that more specificity is required with oncology assessments and both of these CSPBM require further supplementation.

There is growing concern with respect to the high cost of personalised or targeted drugs for cancer treatment [52, 53]; the greater financial risk means that it is even more important to accurately measure the outcomes of treatment to estimate if treatment offers value for money. Our research suggests that CSPBMs offer both similarities and differences to generic PBMs, and while this difference equates to marginally higher QALYs in our cohort, further research is required to confirm if these higher QALYs offer a more accurate reflection of HRQoL gains [54].