Individual reprints of this article are not available. Address correspondence to Elizabeth A. Hahn, MA, Center on Outcomes, Research and Education, Evanston Northwestern Healthcare, 1001 University Pl, Suite 100, Evanston, IL 60201
Center on Outcomes, Research and Education, Evanston Northwestern Healthcare, Institute for Healthcare Studies, and Robert H. Lurie Comprehensive Cancer Center, Feinberg School of Medicine, Northwestern University, Evanston, IL; Département de la Recherche Clinique et du Développement de l'Assistance Publique-Hôpitaux de Paris, Hôpital Saint-Louis, Paris, France
To many clinicians, the assessment of health-related quality of life (HRQL) seems more art than science. This belief is due in part to the lack of formal training available to clinicians regarding HRQL measurement and interpretation. When HRQL is used systematically, it has been shown to improve patient-physician communication, clinical decision making, and satisfaction with care. Nevertheless, clinicians rarely use formal HRQL data in their practices. One major reason is unfamiliarity with the interpretation and potential utility of the data. This unfamiliarity causes a lack of appreciation for the reliability of data generated by formal HRQL assessment and a tendency to regard HRQL data as having insufficient precision for individual use. This article discusses HRQL in the larger context of health indicators and health outcome measurement and is targeted to the practicing clinician who has not had the opportunity to understand and use HRQL data. The concept and measurement of reliability are explained and applied to HRQL and common clinical measures simultaneously, and these results are compared with one another. By offering a juxtaposition of common medical measurements and their associated error with HRQL measurement error, we note that HRQL instruments are comparable with commonly used clinical data. We further discuss the necessary requirements for clinicians to adopt formal, routine HRQL assessment into their practices.
Measurement in medicine is not new; in fact, it has been an integral component of medical diagnosis and treatment since the beginning of the clinical practice of medicine. From the start of their schooling and training, clinicians are taught the utility of measurements such as height, weight, vital signs, pathology reports, and laboratory chemistry values. Clinicians are also trained in obtaining the qualitative aspects of a medical history but are not routinely taught the quantitative measurement of patient-reported outcomes such as health-related quality of life (HRQL). As a result, HRQL measurement may seem more like an art than a science to many clinicians. However, increasing evidence suggests that routine, formal assessment of HRQL improves care on multiple levels. For example, adding HRQL assessment to clinical practice has led to improved problem identification.
Assessment of HRQL has been successfully used to change and influence patient and physician communication, resulting in improved patient satisfaction in a community practice setting.
The mechanisms by which routine assessment of HRQL might improve clinical practice include (1) aiding detection of physical or psychosocial problems that otherwise might be overlooked, (2) monitoring disease and treatment, (3) allowing precisely timed alterations in therapeutic plans, (4) facilitating patient-physician communication, and (5) improving the delivery of care.
It is also possible to routinely use HRQL instruments in clinical practice to evaluate the efficacy of interventions designed to prevent or treat common problems experienced by patients.
The first is the availability of an acceptable set of measures from which to choose. These HRQL measures must be brief and simple to administer, complete, score, and interpret. The second critical factor involves clinical relevance and ease of use, ideally with results presented in a structured format that includes comparison (reference) data for these assessments.
The results and interpretation of HRQL information must be delivered in a manner that facilitates and guides interventions. Finally, buy-in from both clinic staff and patients is essential so that routine HRQL assessment can be effectively implemented.
This article discusses HRQL in the larger context of health indicators and health outcome measurement. The discussion is targeted to the practicing clinician who has not had the opportunity to understand and use HRQL data. If one extends to HRQL assessment the observation that health care professionals not only tolerate but also depend on measurements inherently associated with error, whether from examination or laboratory findings, the future of HRQL assessment in clinical practice is bright. To support this statement, we offer a juxtaposition of common medical measurements and their associated error (when it has been studied), with HRQL measurement error and discuss what will likely be necessary for clinicians to adopt formal, routine HRQL assessment into their practices. Interested readers can find more detailed information about incorporating HRQL into practice in 2 other articles published in this issue.
Although the quality of cancer care has traditionally been measured with such clinical outcomes as survival and tumor response, recognition of the importance of patient-reported outcomes is increasing.
During the course of disease and/or treatment, patients may experience many symptoms, including weight loss, fever, fatigue, and pain; treatment adverse effects, such as shortness of breath, fatigue, dizziness, hair loss, nausea, and pain; and challenges to their ability to cope with physical and emotional changes.
After completion of treatment, patients must contend with physical, emotional, and social problems related to the direct effects of the disease, consequences of treatment, and individual or family factors.
Physical problems may include organ dysfunction, infertility, second malignancies, and recurrence; social problems may include employability and insurability; emotional problems may stem from fears of recurrence, adjustment to physical limitations, loss of job flexibility, and posttreatment mood and stress disorders. Systematic attention should be directed to the full range of patient concerns to better address patients' needs both during and after treatment.
A second model (Concept of health-related quality of life and of patient-reported outcomes. In: Chassany O, Caulin C, eds. Health-Related Quality of Life and Patient-Reported Outcomes: Scientific and Useful Outcome Criteria. Paris, France: Springer-Verlag; 2003:23-34) clarifies the source of data (eg, patient, observer) and its relationship to the HRQL outcome. Figure 1, an adaptation of these 2 models, displays at the top an essentially unidirectional (causal) pathway from biological and physiological processes, such as a disease or chronic condition, which often result in symptoms. Symptoms in turn can produce limitations in functional status and in the ability to engage in normal everyday activities. Over time, such an impact can have detrimental effects on patients' general views of their own health and even on self-esteem or sense of personal value. All of this can contribute to a decline in one's overall evaluation of quality of life. When the evaluation of quality of life is limited to the context of health and illness, it is usually referred to as health-related quality of life.
FIGURE 1. Measures of patient outcomes as organized by current models.
Moving from left to right across the top of Figure 1, the strength of the association weakens. Thus, for example, biological and physiological variables, measured in numerous ways across medicine, can be expected to correlate most strongly with measures of symptoms and most weakly with overall HRQL (although there is a relationship across the expanse of the model). This general model linking clinical and HRQL variables provides the organizing framework for the discussion that follows.
To be useful, instruments must be both reliable and valid. Reliability refers to an instrument's dependability, expressed as the extent to which it measures something consistently and produces the same score on repeated applications. Validity refers to the extent to which an instrument measures what it proposes to measure. This article focuses on reliability of measurement. Using multiple clinical examples, we summarize relevant statistics, methods, and standards for assessing reliability and then describe the reliability of common physiological and self-report instruments.
METHODS
We conducted a literature review based on sources known to the authors, supplemented by a review of familiar sources in support of the effectiveness of formal practice-based HRQL assessment. We focused attention on clinical measurement studies that offered sufficient information about measurement error to allow comparison with error in HRQL measurement. To enable these comparisons, we recorded literature-based standards for high, moderate, and low reliability across commonly reported reliability (precision) statistics. We then categorized each of the selected clinical measurements and the reliability information into 1 or more of these predetermined classifications, according to the reviewed literature.
Definition of Measurement Error
Any measurement has some degree of error because of imperfect calibration of the measuring device, misunderstanding by the patient of a question, or the inherent lability of the characteristic.
The classic linear model for an observed value is X = T + e, in which X represents the observed value for a patient on some variable, T is the patient's true score, and e represents the difference between the true score and the observed score.
Systematic error might affect all observations equally or it might affect certain types of observations differently than others and be considered a type of bias. A miscalibrated thermometer that always records a temperature 3° higher than the actual measurement is an example of systematic error because it affects all observations. A thermometer reading affected by an object's color or density reflects a biased thermometer because characteristics not directly related to temperature are influencing it. Random temperature measurement error might be present if a person reading the thermometer occasionally transposed the digits. Biases and random errors are also present in outcome measurement.
The term reliability is sometimes defined as “freedom from random error.”
It is used generically for 2 different characteristics of a measure: repeatability and internal consistency. The characteristic of repeatability can be measured over time (test-retest reliability), over observers (interrater reliability), or across different versions of an instrument (alternate forms reliability). A clinical analogy is the measurement of blood pressure, in which one might want to know the reliability of measures taken during a 24-hour period (test-retest) or by different health care professionals (interrater reliability). Internal consistency refers to the extent to which a set of questions measures a single underlying dimension, such as fatigue, depression, or physical functioning. It is analogous to measures of reliability of laboratory tests of replicate samples. The formulas used to estimate reliability for repeatability or internal consistency are equivalent. Table 1 provides 3 classification categories for reliability statistics (high, moderate, low) based on the type of data (nominal, ordinal, interval/ratio) and suggests ranges that we have found useful for classifying a measure or test in terms of reliability.
TABLE 1. Guidelines for Instrument Reliability and Precision
Creating a Common Ground: Reliability and Precision Statistics
Nominal and Ordinal Data. Recording the presence or absence of a symptom is an example of a nominal (named) variable; recording a symptom as none, moderate, or severe is an example of an ordinal (ordered) classification. The relevant statistic for estimating the reliability of a value on a nominal or ordinal scale is the κ or weighted κ, respectively.
A κ statistic quantifies the amount of agreement between 2 or more measurements that is greater than the amount expected by chance alone. The Kendall coefficient of concordance can also be used with ordinal data.
The repeated measurements may be over time, over observers, or over different forms of a test. If κ=0, there is just chance agreement; if κ<0, there is even less than chance agreement (a rare occurrence); if κ>0, there is greater than chance agreement; and if κ=1, there is perfect agreement. Table 1 lists some recommended criterion values for good (κ=0.40-0.74) and excellent (κ>0.74) agreement. Some authors have proposed even finer distinctions among levels of agreement, for example, 0.41 to 0.60 for moderate agreement, 0.61 to 0.80 for substantial agreement, and 0.81 to 1.00 for almost perfect agreement.
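The chance-corrected agreement that κ quantifies can be computed directly from a 2-rater agreement table. The following is a minimal sketch in Python; the counts are hypothetical and do not come from any study cited here:

```python
# Minimal sketch of Cohen's kappa for 2 raters classifying the same
# patients on a nominal scale (eg, symptom present vs absent).
# The agreement table below is hypothetical.

def cohens_kappa(table):
    """table[i][j] = number of patients placed in category i by rater 1
    and category j by rater 2."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # observed proportion of agreement (diagonal of the table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    # agreement expected by chance alone, from the marginal totals
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# 100 patients: rows = rater 1, columns = rater 2
table = [[40, 10],
         [5, 45]]
print(round(cohens_kappa(table), 2))  # 0.7 -> "good" agreement per Table 1
```

Note that observed agreement here is 85%, yet κ is only 0.70 once chance agreement (50% from the marginals) is removed; this is why raw percentage agreement overstates reliability.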
Interval and Ratio Data. The most common units of measurement are based on interval or ratio scales, and several useful reliability statistics exist. Just as κ can be used to assess the 3 types of repeatability (test-retest, interrater, alternate forms) for nominal or ordinal data, the intraclass correlation coefficient (ICC)
assesses the 3 types for continuous measures. For nominal data, κ is mathematically equivalent to the ICC. For ordinal and interval data, weighted κ and the ICC are equivalent in certain conditions.
Numerous versions of ICCs are available; choosing the most appropriate one depends on several factors, including the types of raters and the types of patients.
In a simple example in which there is interest in assessing the test-retest reliability of an HRQL measure, a 1-way random-effects analysis of variance technique would be used. Recall the model for an observed value: X = T + e, in which T represents the patient's true score (also termed the error-free score, steady-state value, or signal). In a population of patients, T will vary around a mean value μ with a variance of σ²_T, and the random error e has a variance of σ²_e. The total variation in the scores can be partitioned into 2 parts: (1) variability among patients (σ²_T) and (2) variability of the random errors (σ²_e). For this example, the ICC is defined as the ratio of the between-subject variance to the total variance: r_ICC = σ²_T/(σ²_T + σ²_e). Another way of thinking about this is the ratio of the variance of the true scores (σ²_T) to the variance of the observed scores (σ²_T + σ²_e). This ratio can range from 0 to 1. Values near 0 indicate that almost all the variation in score is due to measurement error and that the measure is unreliable. Values near 1 (>0.90 in Table 1) indicate that there is minimal measurement error and that the measure is very reliable.
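In practice the variance components are estimated from the between- and within-patient mean squares of the 1-way analysis of variance. A minimal sketch, using hypothetical test-retest HRQL scores invented for illustration:

```python
def icc_oneway(scores):
    """ICC(1,1) from a 1-way random-effects ANOVA.
    scores: one list of k repeated measurements per patient."""
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(s) for s in scores) / (n * k)
    means = [sum(s) / k for s in scores]
    # between-patient and within-patient (error) mean squares
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2
              for s, m in zip(scores, means) for x in s) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# hypothetical HRQL scores for 5 patients, each measured twice
test_retest = [[60, 62], [70, 68], [55, 57], [80, 79], [65, 66]]
print(round(icc_oneway(test_retest), 2))  # ~0.98: high reliability (Table 1)
```

Because the within-patient differences here are tiny relative to the spread among patients, nearly all of the observed variance is true-score variance, which is exactly what an ICC near 1 signifies.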
An alternative to ICCs was proposed by Altman and Bland.
This approach involves plotting the differences of observed pairs of measurements against their mean values, creating limits of agreement (mean ± 2s, in which s is the SD of the differences), and examining trends using linear regression analysis. Although this more visual approach is often more easily understood by nonstatisticians, it relies on statistical significance tests of differences rather than on tests of consistency or equivalence, and it lacks a single measure that would be preferable, especially when more than 2 methods are compared.
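The Altman-Bland limits of agreement are easy to compute even without the accompanying plot. A sketch with hypothetical paired systolic blood pressure readings from 2 measurement methods (values invented for illustration):

```python
import math

def limits_of_agreement(x, y):
    """Altman-Bland limits: mean difference +/- 2 SD of the differences."""
    d = [a - b for a, b in zip(x, y)]
    mean_d = sum(d) / len(d)
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (len(d) - 1))
    return mean_d - 2 * sd_d, mean_d + 2 * sd_d

# systolic blood pressure measured by 2 hypothetical methods
method_a = [120, 130, 140, 110, 125]
method_b = [122, 128, 143, 111, 124]
lo, hi = limits_of_agreement(method_a, method_b)
# roughly 95% of differences between methods are expected in [lo, hi]
print(round(lo, 1), round(hi, 1))  # -4.7 3.5
```

A clinician would then judge whether differences of this magnitude matter clinically, which is the intended strength of the approach, rather than reading off a single summary coefficient.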
The Pearson correlation coefficient (r) is an estimate of the linear association between 2 interval/ratio variables. It is calculated from the SDs of the 2 variables (s_z and s_x) and their covariance (s_zx): r = s_zx/(s_z s_x). The correlation coefficient can range from -1 to +1. Values near 0 indicate almost no linear association between the 2 variables, and values near -1 or +1 indicate that 1 variable can be almost perfectly predicted from the value of the other. Some investigators use r as a substitute for r_ICC, but these 2 coefficients yield different types of information and are not generally interchangeable.
Internal consistency has the same conceptual basis as the aforementioned stability (or repeatability) measures of reliability. Internal consistency can be interpreted as the ratio of the variance of the true values among patients to the variance of the observed values. If each patient completes a multi-item instrument or answers the same question on several occasions, the average of these observations should have higher reliability than a score based on a single answer. This occurs because the measurement error is presumably random; when the values are averaged, the error is averaged out and thus decreases. Given that the error variance of the mean of k random values is σ²_e/k, the reliability (r_α) of a k-item score is r_α = σ²_T/(σ²_T + σ²_e/k). In laboratory studies and for multi-item instruments, the implication is that as the number of assessments is increased, the reliability will increase. Increasing the number of assessments (or questions) will have the greatest impact on the reliability of a test when each question has a large measurement error relative to the variation of the true values. One sees diminishing returns with increasing questions. Internal consistency reliability is most commonly assessed using the Cronbach coefficient α, and values greater than 0.90 are considered the standard for individual-level applications (Table 1).
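Both the Cronbach coefficient α and the diminishing returns from lengthening a test can be sketched in a few lines of Python; the item responses and the single-item reliability of 0.5 are hypothetical:

```python
def cronbach_alpha(items):
    """Cronbach's alpha. items: one list of k item responses per patient."""
    n, k = len(items), len(items[0])

    def var(v):  # sample variance
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)

    item_vars = [var([p[i] for p in items]) for i in range(k)]
    total_var = var([sum(p) for p in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def k_item_reliability(r1, k):
    """Reliability of a k-item score from a single-item reliability r1,
    ie, k*sigma2_T / (k*sigma2_T + sigma2_e) rewritten in terms of r1."""
    return k * r1 / (1 + (k - 1) * r1)

# hypothetical responses of 4 patients to a 3-item fatigue scale
responses = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [1, 2, 1]]
print(round(cronbach_alpha(responses), 2))          # 0.95
print([round(k_item_reliability(0.5, k), 2) for k in (2, 4, 8, 16)])
# [0.67, 0.8, 0.89, 0.94] -- each doubling of length buys less reliability
```

The second printout makes the diminishing-returns point concrete: going from 2 to 4 items gains 0.13 in reliability, while going from 8 to 16 items gains only 0.05.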
The standard error of measurement (SEM) is expressed on the same scale as the quantity being measured. The SEM is defined in terms of the SD (σ_S) and the reliability (either r_α or r_ICC): SEM = σ_S√(1 − r). If a measure has a reliability of 0.80 (common for many HRQL scales), the error of measurement associated with any individual score is 45% of the SD. If the reliability decreases to 0.50 (uncommon for most HRQL scales), then the SEM is 70% of the SD. One way to interpret this statistic is to note that we would expect a person's observed score to fall within the interval of ±1 SEM around the true score 68% of the time and within ±2 SEM 95% of the time. If the reliability is 0.80 and the SD is 10, the SEM is 10√(1 − 0.80), or 4.5. Thus, if the estimated true score was 60, we would expect that 68% of the time the observed score would fall within the interval 60 ± 4.5, or between 55.5 and 64.5. In a clinical setting, it is also of interest to know how big a difference one might expect if the person takes the same test on 2 occasions when his or her true value actually does not change. The SD of the difference of 2 scores is √2 × SEM. This can be used to estimate a confidence interval for the estimated true score. For example, if the true score for a patient was 62 on the first occasion and the patient is tested a second time without a change in true score, the probability is 68% that the second score will be in the interval 62 ± √2 × 4.5, or between 55.7 and 68.3. The SEM can be used to help interpret the meaningfulness of intrapatient change. Recent research has suggested that a change of less than 1 SEM is rarely clinically meaningful (Table 1).
A number of issues influence the evaluation of these reliability statistics. The most critical is how the information is to be used. If the measure is to be used in a patient management decision at the individual level, higher levels of reliability are required than for comparisons among groups of patients.
If the measure is being used as a screening tool to identify patients in need of additional assessment, the criteria for adequate reliability can be lowered. Another important feature is that this measure of reliability is closely linked to the population in which one wants to use the measure. Clinicians and researchers need to be aware of the characteristics of the sample used to assess the reliability of a test or measure. The more heterogeneous the population, the larger the differences between patients and thus the larger σ²_T. Thus, the reliability estimate will tend to be higher when there is a mixture of patients who would be expected to have values across the entire range of the measure, as would occur when there are patients both with and without a condition. Finally, we need to be aware of the type of reliability that was measured and whether it is appropriate to our study setting. Are we interested in whether a measurement at one point in time will agree with one taken later, or whether the patient's assessment will agree with the caretaker's assessment? The use of item response theory models is contributing to additional advances in the reliability of HRQL measurement.
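The dependence of reliability on sample heterogeneity follows directly from the variance-ratio definition of the ICC. A small numeric illustration (the variances are invented), holding the error variance fixed while the between-patient variance grows:

```python
def reliability(var_true, var_error):
    """r_ICC = between-patient variance / total variance."""
    return var_true / (var_true + var_error)

var_error = 25  # the instrument's error variance is unchanged throughout
for var_true in (25, 100, 400):  # increasingly heterogeneous samples
    print(var_true, round(reliability(var_true, var_error), 2))
# 25 0.5 / 100 0.8 / 400 0.94 -- the very same instrument looks more
# reliable in a sample spanning a wider range of true scores
```

This is why a reliability coefficient estimated in a mixed sample of affected and unaffected patients cannot be assumed to hold in a narrower clinical population.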
Figure 1 illustrates an organizing model for this report, wherein it is hypothesized that the link between clinical and HRQL variables is stronger for self-reported disease symptoms than for more general health perceptions. The overlay of the source of data onto this model helps to clarify that patient-rated data, although paramount given the definition of HRQL as patient focused, are not the only sources of information regarding patient status. An example of the model at work can be seen in the case of cystic fibrosis, as depicted in Figure 2. In cystic fibrosis, varying degrees of association are observed across related measurements, from physiological to clinician- and caregiver-reported patient status to various types of patient-reported outcomes. The strongest association with patient HRQL is self-reported dyspnea, whereas the weakest (but still significant) association is with physiological variables (Figure 2).
FIGURE 2. Proximal vs distal associations of clinical and health-related quality-of-life (HRQL) variables in cystic fibrosis based on a review of the literature.
Table 2 reports data from selected studies of the reliability (degree to which error is reduced) of common clinical and HRQL measurements, according to the criteria outlined in Table 1.
Designed to be representative of common health measurements, rather than comprehensive, Table 2 provides a range of reliability estimates within each measurement category (see also Figure 1). For example, the reproducibility of vital sign measurements spans all 3 columns in Table 2: from high reproducibility for the classification of tachycardia, bradycardia, and systolic hypertension to low reproducibility for systolic hypotension.
Similarly, the reliability of commonly used HRQL measures varies across questionnaires and across subscales within a questionnaire (eg, 36-Item Short-Form Health Survey physical functioning vs role-physical).
As mentioned previously, awareness of the characteristics of the sample used to assess the reliability of each measure is important. The information in Table 2 should be used only as an overall summary of the possible range of reliability.
TABLE 2. Degree of Error in Common Health Measurements
Association Between Patient-Reported HRQL and Biological and Physiological Measurements
The criteria generally used to measure the activity of a disease, such as a biological value, a physiological performance, or a radiographic image, do not by themselves reflect the perceptions and subjective state of the patient. Two patients with an identical biological value or physiological score may experience a different impact on their perceptions of symptoms or HRQL. For the same patient, a physical performance objectively assessed in a laboratory is not necessarily similar to the physical ability of the patient in everyday life.
For many conditions, correlation levels reported in the literature between a physical measurement of performance or a functional capacity (eg, forced expiratory volume in asthma or chronic obstructive pulmonary disease) and the measurement of the severity of symptoms or the physical dimension of HRQL seldom exceed 0.40 and are generally lower than 0.20.
Hemoglobin level, although related directly to oxygenation and therefore energy, also rarely correlates with self-reported fatigue and function beyond r=0.40, suggesting less than 15% shared variability in these conceptually linked measurements.
Another example is among patients with peripheral arterial occlusive claudication, in which the correlation between hemodynamic parameters and angiogram score vs self-reported functional disability and HRQL is low.
Still another example in osteoarthritis showed substantial discordance among radiographic osteoarthritis, physician-based diagnosis, and patient-reported pain.
The growing body of evidence linking patient-reported outcomes to clinical indicators suggests that although there is some common ground, there is even more uniqueness to the 2 types of information, and both have value. Across self-report and clinical measurements alike (Table 2), some of the lack of agreement is due to measurement error. Since there is clearly error in both clinical and self-report data and they converge only modestly in most cases, we suggest that self-report information is necessary to complete an accurate understanding of a patient's current HRQL.
Association Between Patient- and Physician-Assessed HRQL
Patient-reported outcomes provide additional information on treatment effects and patient perceptions that are not adequately captured by objective criteria and clinician-reported outcomes. By definition, HRQL is subjective. Therefore, patients are the best source to rate their own HRQL or perceived health and well-being. Patients' ratings of their experiences of disease or treatment often differ in both degree and type from those of health care professionals.
Furthermore, in some conditions, such as cancer, chronic heart failure, chronic obstructive pulmonary disease, or rheumatoid arthritis, baseline HRQL scores (especially physical domains) predict survival.
This predictive value has been recently extended to show that, in addition to the prognostic value of baseline patient self-report data, change over time forecasts outcome in advanced lung cancer.
Association Among Different Patient-Reported Outcomes
Even though in a given disease a logical association exists between the severity of symptoms and a worsening perception of HRQL by the patient (Figure 1), there are situations in which the measurement of symptoms does not reflect the patient's subjective experience of daily life. For example, irritable bowel syndrome is a functional and benign disease, but its long-term course is composed of symptomatic flares that significantly affect health perceptions.
The absence of pain or abdominal discomfort at a given time (eg, during a consultation with the physician) is not synonymous with a good HRQL score. The patient may be anxious about when the next symptomatic bout will occur, may be limited in social activities, or may be constrained by having to take drugs and pay attention to food. The fear or anticipation of a flare may be a more significant handicap than the flare itself. Thus, the clinician cannot infer all aspects of HRQL.
Reliability Classification of Physiological and Self-report Measurements
Both HRQL and other patient-reported outcomes are sometimes labeled as subjective by clinicians because they are based on individual perceptions. However, we argue that the distinction between subjective and objective should not depend on who makes the rating; in other words, a measurement is not considered objective just because it is made by a clinician.
In fact, ratings of patient performance or other aspects of well-being by clinicians are often discordant with the self-ratings provided by patients, leading one to question the objectivity of the clinician rater. Even so-called objective morphological measures, such as tumor size or change, can lack reliability when a subjective observer must interpret results. More than a quarter century ago, Moertel and Hanley
evaluated the readings of 40 radiographs of lung tumors by 5 radiologists. They showed intraobserver misclassification rates of 3% to 14% of tumors using Response Evaluation Criteria in Solid Tumors and World Health Organization criteria for response and interobserver misclassification of 10% to 43% of tumors using these criteria for progression. This degree of misclassification places tumor response, long held to be objective, in the low reliability column of Table 2.
The considerably higher misclassification rate for progression compared with response raises concern regarding the recent Food and Drug Administration shift (Guidance for Industry: Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics; US Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research; May 2007) to time to progression as a primary surrogate end point, unless such an end point can be shown to be associated with improved HRQL or survival.
Studies not dealing with oncology, such as in classification of fractures, have also reported poor to moderate intraobserver and interobserver agreement.
Similarly, the knowledge of treatment assignment may influence decisions regarding drug dose adjustments even when objective rules exist for those adjustments.
Self-reported pain is used to titrate analgesic medication and to determine the potential utility of intraspinal opioids among patients with cancer and others with acute and chronic pain.
These studies have subsequently led to the widespread use of both intrathecal and epidural opioids for pain management and relief of pain in numerous settings, including cancer, obstetrical labor and delivery, postsurgical management, and other acute and chronic pain syndromes.
Asthma. In asthma treatment, patient-reported outcomes are primary indicators of disease status and progress. Accordingly, in a recent asthma clinical trial comparing combination fluticasone propionate and salmeterol with placebo, the Asthma Quality of Life Questionnaire was the primary trial end point. Improvement from baseline to end point (12 weeks) was greater in the combination group than in the placebo group, and the differences between groups exceeded the prespecified minimally important difference (0.5 on a scale ranging from 1 to 7) for all 4 dimensions and for the global score.
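The trial's decision rule (a between-group difference is taken as clinically meaningful only if it exceeds the prespecified minimally important difference of 0.5 on the 1-to-7 scale) can be sketched as follows; the change scores used here are hypothetical, not trial data:

```python
# Minimally important difference (MID) logic from the asthma trial above.
# The MID of 0.5 on the 1-7 AQLQ scale comes from the text; the change
# scores below are hypothetical values for illustration only.
AQLQ_MID = 0.5

def exceeds_mid(mean_change_active: float, mean_change_placebo: float,
                mid: float = AQLQ_MID) -> bool:
    """True if the between-group difference exceeds the prespecified MID."""
    return (mean_change_active - mean_change_placebo) > mid

print(exceeds_mid(1.2, 0.4))  # True: difference of 0.8 exceeds the 0.5 MID
print(exceeds_mid(0.8, 0.5))  # False: difference of 0.3 does not
```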
Self-report of Symptoms and Adverse Events in Human Immunodeficiency Virus Disease. The perception of patients with human immunodeficiency virus (HIV) regarding symptoms related to multidrug antiretroviral therapies may differ from that of clinicians. In a recent validation study, a 20-item self-reported HIV symptom index captured more frequent and more bothersome symptoms than did open-ended clinician interviews.
Compared with self-report, clinicians underreported the presence and severity of symptoms. Reports by clinicians showed greater variability by site and poorer test-retest reliability, and clinician-reported severity scores were less strongly associated with functional status, global quality of life, and survival than were self-reports. Thus, the perception of patients with HIV about their symptoms may be more informative than that of clinicians. Finally, a discrepancy exists between clinician-based diagnosis and self-report of depression among patients with AIDS.
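Test-retest reliability of the kind compared above is commonly summarized as the correlation between scores from two administrations of the same instrument. A minimal sketch, using invented scores on a 1-to-5 symptom scale rather than data from the study cited:

```python
# Test-retest reliability as a Pearson correlation between two
# administrations of the same instrument. The scores below are invented
# for illustration; a common rule of thumb requires r >= 0.70.
import statistics

def pearson_r(x, y):
    """Pearson correlation between paired score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

time1 = [3, 5, 2, 4, 5, 1]  # scores at first administration
time2 = [3, 4, 2, 4, 5, 2]  # scores on readministration
print(round(pearson_r(time1, time2), 2))  # prints 0.94: high stability
```

An instrument whose repeat administrations correlate this strongly would sit in the high-reliability range; the poorer clinician test-retest reliability noted above corresponds to a substantially lower correlation.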
Self-report in Functional Gastrointestinal Disorders. Discrepancies between physician diagnoses (eg, based on the standardized Rome II criteria) and patient self-report have been described for functional gastrointestinal disorders such as irritable bowel syndrome, diarrhea, and constipation. For example, the estimated rate of constipation in a population may differ depending on whether the estimate is based on a frequency definition (<3 stools per week) or on self-perception.
Some patients who routinely have fewer than 3 stools per week may not feel constipated, and conversely, some patients may feel very constipated if they do not have 1 stool per day, although they may be considered healthy by a clinician.
This discrepancy in criteria raises the question of which source is more appropriate, especially for these functional disorders: the physician's diagnosis based on norms or the patient's perception and its impact on his or her satisfaction, well-being, and HRQL.
Self-report in Rheumatoid Arthritis. For rheumatoid arthritis, accepted disease severity indicators are based on clinical examination of joints and functional assessment, as well as patient self-report of symptom severity and impact on functioning. Patient-reported outcomes tend to be highly correlated with findings on clinical examination and with the familiar American College of Rheumatology criteria of 20%, 50%, and 70% improvement.
Indeed, in a rheumatoid arthritis clinical trial, an index of the 3 core data set patient questionnaire measures distinguished the efficacy of active treatment from that of placebo as effectively as the American College of Rheumatology 20% response criteria (ACR20) or the Disease Activity Score (DAS).
Thus, in nonresearch settings it may be more efficient (and more relevant to the patient) to use self-report in place of examination results.
CONCLUSION
The practice of medicine is as much art as science. Clinicians in daily practice depend on physical examination, laboratory, radiographic, and other measurements to assess and care for patients, yet they rarely use formal assessment of patient-reported outcomes as part of routine clinical practice. Given the importance of HRQL assessments in the lives of patients with chronic conditions such as cancer, this underuse is difficult to justify. One major barrier to routine use of HRQL instruments in clinical practice is the perception that they are not sufficiently reliable or trustworthy to support individual diagnosis and treatment decisions. This perception has been perpetuated by test developers themselves, many of whom are measurement scientists trained to focus on error and precision, at times at the expense of meaning and usefulness.
The purpose of this article is to place HRQL in the larger context of health indicators, including routine clinical measurements used every day in practice. Through this exercise, it can be seen that reliability of measurement varies both for patient-reported outcomes and for clinical measurements such as blood pressure, heart rate, tumor measurement, and ultrasonographic assessment of carotid wall thickness. If the requirements placed on patient-reported HRQL measures for clinical use were relaxed to a level comparable to that applied to other measurements used in clinical care, the fidelity of HRQL assessments would compare favorably.
Given the importance of HRQL to people with chronic diseases, the advent of computer-assisted assessment, and the emergence of electronic patient records, we suggest it is time to convert practice behavior to routine HRQL monitoring as a way to promote excellence in patient health care. Future research should focus on overcoming technical and system barriers to such a conversion, determining optimal ways to complement clinical and physiological data with self-report data, evaluating the efficacy of routine monitoring in clinical practice, and determining the cost-effectiveness of routine monitoring in chronic illness care.
Acknowledgments
We thank Amy Eisenstein for her assistance in researching details in Table 2.
REFERENCES
Frost MH, Bonomi AE, Cappelleri JC, Schünemann HJ, Moynihan TJ, Aaronson NK; Clinical Significance Consensus Meeting Group. Applying quality-of-life data formally and systematically into clinical practice.
Randomized trial of co-ordinated psychosocial interventions based on patient self-assessments versus standard care to improve the psychosocial functioning of patients with cancer.
The effectiveness of the use of patient-based measures of health in routine practice in improving the process and outcomes of patient care: a literature review.
Clinical Significance Consensus Meeting Group. An exploration of the value of health-related quality-of-life information from clinical research and into clinical practice.
Measuring the symptom experience of seriously ill cancer and noncancer hospitalized patients near the end of life with the memorial symptom assessment scale.
Concept of health-related quality of life and of patient-reported outcomes. In: Chassany O, Caulin C, eds. Health-Related Quality of Life and Patient-Reported Outcomes: Scientific and Useful Outcome Criteria. Paris, France: Springer-Verlag; 2003:23-34.
Clinical Significance Consensus Meeting Group. Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life.
Validity of a measure of the frequency of headaches with overt neck involvement, and reliability of measurement of cervical spine anthropometric and muscle performance factors.
Measuring physical function in community-dwelling older persons: a comparison of self-administered, interviewer-administered, and performance-based measures.
Early change in patient-reported health during lung cancer chemotherapy predicts clinical outcomes beyond those predicted by baseline report: results from Eastern Cooperative Oncology Group Study 5592.
US Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research. Guidance for industry: clinical trial endpoints for the approval of cancer drugs and biologics. May 2007.
Implantable Drug Delivery Systems Study Group. Randomized clinical trial of an implantable drug delivery system compared with comprehensive medical management for refractory cancer pain: impact on pain, drug-related toxicity, and survival.
An index of the three core data set patient questionnaire measures distinguishes efficacy of active treatment from that of placebo as effectively as the American College of Rheumatology 20% response criteria (ACR20) or the Disease Activity Score (DAS) in a rheumatoid arthritis clinical trial.