Statistical significant change versus relevant or important change in (quasi) experimental design: some conceptual and methodological problems in estimating magnitude of intervention-related change in health services research

This paper aims to identify problems in estimating and interpreting the magnitude of intervention-related change over time, or responsiveness, assessed with health outcome measures. Responsiveness is a problematic construct and there is no consensus on the appropriate index to quantify change between baseline and post-test. This paper gives an overview of several responsiveness indices. Thresholds for effect size (or responsiveness index) interpretation were introduced some thirty years ago by Cohen, who standardised the difference scores (d) with the pooled standard deviation (d/SDpooled). However, many effect sizes (ES) have been introduced since Cohen's original work, and in the formula of one of these the mean change scores are standardised with the SD of those change scores (d/SDchange). When health outcome questionnaires are used, this effect size is applied on a wide scale and is known as the Standardized Response Mean (SRM). However, its interpretation is problematic when it is used as an estimate of magnitude of change over time and interpreted with the thresholds set by Cohen for the effect size (ES), which is based on SDpooled. Thus, when the SRM is used, application of these well-known cut-off points for pooled standard deviation units, namely 'trivial' (ES<0.20), 'small' (0.20≤ES<0.50), 'moderate' (0.50≤ES<0.80) or 'large' (ES≥0.80), may lead to over- or underestimation of the magnitude of intervention-related change over time due to the correlation between baseline and outcome assessments. The same risk arises whenever Cohen's thresholds are taken for granted for any other version of effect size index used as an estimate of intervention-related magnitude of change. For those researchers who use Cohen's thresholds for SRM interpretation, this paper demonstrates a simple method to avoid over- or underestimation.


Introduction
Methodological problems in estimating change in outcome with well-known measures of quality of life or health status have gained a significant place on the research agenda in clinical evaluation research. These methodological problems also seem to be relevant in integrated care research, in which the integrated approach is compared with standard care practice, when quality of life or health status outcome measures are used. Furthermore, improving methods to estimate change may contribute to the development of evidence-based practice. This article was written because researchers in the field of health services research often seem unaware of the wide variety of indices that may contribute to the understanding of an intervention's or programme's effect (in addition to its statistical significance) in terms of health-related quality of life (HRQL) outcome or health-related functional status (HRFS). In attempts to improve healthcare delivery with a new approach, researchers may need to distinguish between patients whose improvement is 'small', 'moderate' or 'large' before this new approach becomes general practice. The problem of testing differences between the new approach group and a standard care group goes together with the dilemma that, with large samples, trivial differences between these groups may be statistically significant.
This article is published in a peer reviewed section of the International Journal of Integrated Care.
There is a growing recognition that assessing an intervention's effect should not only focus on the statistical significance of the differences in health outcome between the experimental care and control group, but should also focus on the relevance or importance of these outcomes. To estimate the magnitude of the difference between change scores in both groups, the difference between mean change scores is expressed in standard deviation units with the effect size index (ES). To compare the magnitude of change assessed in the experimental group ($\Delta_E$) with the change assessed in the control group ($\Delta_C$), the idea of effect size between groups can be turned on its side and applied to measurement instruments to estimate the amount of change over time within a group. Change over time indices are also applied to measurement instruments to evaluate how sensitive they are to detecting change in before-after studies. In the literature on psychometrics or clinimetrics, the concept of responsiveness was introduced to denote the magnitude of change over time or sensitivity to change over time. However, many responsiveness indicators have been proposed, resulting in numerous effect size indices (ES). Most of the indicators agree on the numerator (the change score between baseline and post-treatment) but there is little agreement on the appropriate denominator. Since a general convention for effect size interpretation is used for almost any ES out of this effect size family, researchers run the risk of overestimating or underestimating an intervention's effect. This paper gives, on the one hand, an overview (not an exhaustive enumeration) of several responsiveness indices that may be relevant for evaluation research in health care. On the other hand, it gives a simple solution for underestimation or overestimation for two widely used ES.
Health services research is heavily dependent on valid health measures, e.g. of health-related quality of life (HRQL) or health-related functional status (HRFS). These concepts have become important in the measurement of intervention outcome and are used as comparable outcomes in cost-effectiveness evaluation. However, in evaluation studies quality of life has turned out to be a 'kaleidoscopic' concept, since no consensus exists with regard to the meaning of the concept in either the research community or the clinical community. Furthermore, the operationalization of the concept of (health-related) quality of life is heavily dependent on the disciplinary perspective in outcome assessment. This lack of consensus has given rise to the development of a myriad of measures involving different components whose conceptual dimensions vary [1]. Therefore, instruments labelled as quality of life measures "may appear as health status, physical functioning, emotional functioning, perceived health status, symptoms, mood, need satisfaction, well being, and, often, several of these at the same time" [2]. During the last 10 to 15 years, there has been an exponential increase in the development and use of instruments to measure the outcomes of medical interventions from the patient's perspective. A family of more than 150 instruments was identified in 75 studies [3]; in 1996, Spilker et al. catalogued nearly 215 measures in their second edition of "Quality of Life and Pharmacoeconomics in Clinical Trials" [4]. Since there is no consensus on the theoretical construct of quality of life [2, 5-8] or on the universe of domains belonging to this concept (hence the ongoing discussion on the selection of items by which it is operationalized), we prefer concepts such as health-related functional status. Functional status reflects the ability to perform the tasks of daily life in physical, emotional and social domains.
There is also a growing agreement on the components of these constructs and the validity of their measurement; for example, by validating these self-report measures with evidence-based measures [9-11]. By using the term health-related functional status (HRFS) in this paper, we implicitly assume that a change in health status or functioning is indirectly related to the patient's subjective experience of quality of life.
For health care administrators or other health professionals who feel the need to measure HRFS as an outcome in evaluation of, for example, the effectiveness of hip-replacement by comparing integrated care with standard care [12], it is essential to know that the choice of available health status instruments is related to the methodological debate on the psychometric properties of instruments (in contrast to outcomes such as physiologic measures). Consequently, this choice is also associated with methodological issues relating to the interpretation of outcome in terms of the magnitude of intervention-related change over time in HRFS or the assessment of the magnitude of differences in outcome between experimental (e.g. managed care, transmural care, shared care) and control groups (standard care, usual care).
Because improving the functional status of patients has become a central therapeutic goal of treatment for many diseases, it is important that health administrators, clinicians and researchers develop a common understanding of:
- which measure is likely to be the most appropriate one in the context of the disease and the evaluation of, for example, an interdisciplinary or integrated approach;
- the methods to assess intervention-related change (responsiveness of outcome measures); and
- the methods by which a valid interpretation of the magnitude of that change in terms of relevance or importance can be achieved.
In the current paper, the methods to assess the effectiveness of an intervention in terms of change over time (responsiveness) will be discussed, since valid assessment of the magnitude of patients' improvement, deterioration or absence of change seems to be becoming important for identifying stable, improved and deteriorated patient groups when evaluating the direct costs of new interventions in the context of disease management.

The psychometric properties of HRFS outcome measurement tools
When the reliability and validity of health-related functioning measures have been established, these psychometric properties are generally accepted conditions for use of these measures in evaluation research.
However, the appropriateness of an instrument designed to measure change over time in persons is not only determined by its reliability and validity. Measuring change in order to evaluate the efficacy of, for example, new care interventions requires the instrument to be sensitive to detecting change when patients improve in physical function after that intervention. Over the last 15 years, this property has become well known through the widely used concept of responsiveness. Responsiveness of health status measures has been denoted as one of the 'holy trinity' of necessary psychometric properties of health status instruments (reliability, validity and responsiveness), although other researchers classify responsiveness as longitudinal validity [13]. To quantify responsiveness, several effect sizes are used as estimates of the amount of change detected with an instrument. One of the aims of this paper is to address some methodological issues relating to the assessment of change over time in health-related functional status and the meaning of the magnitude of this change in scores within experimental and control groups. Traditionally, the many generations of researchers who have evaluated the efficacy of care-related interventions have based their decisions on the statistical significance of the within-group (intervention-related) change over time or on any statistically significant difference in change from repeated measurements between experimental (care) and control groups (with the underlying hypothesis that the experimental group should show a higher mean change in terms of improvement compared to the control group) [12, 14]. In some cases, investigators eager for results are likely to detect a statistically significant (but very small) change in scores related to the intervention, simply due to large sample size.
Consequently, even when a statistically significant change of trivial magnitude is detected, the p<0.05 doctrine unwittingly pushes into the background the question of how meaningful, important, relevant, or substantial the change is. Significance tests support the decision as to whether the change is due to chance fluctuation or can be functionally related to the (medical) intervention. The observed statistical significance does not indicate the magnitude of change. In spite of this, some researchers implicitly suggest that smaller p-values represent larger, and thus more 'relevant', effects [15].
Against this background, the objectives of this paper can be formulated in terms of the following topics:
- Responsiveness is a construct that is used with different theoretical definitions and with a wide variety of operationalisations by effect size indices.

Responsiveness, a problematic construct
To give greater meaning to the interpretation of the amount of change in scores on health-related functional status instruments, the concept of responsiveness was introduced in publications. For evaluation studies, the usefulness of an HRFS instrument depends on its ability to detect a change that is clinically meaningful. Clinically meaningful refers to a change that justifies alteration in management of the disease or to a change that indicates the efficacy of an innovative type of intervention in domains of HRFS. Responsive measures discriminate between trivial and substantial changes within groups and, consequently, show the difference in change between those groups. Thus, the term responsiveness is used as an indicator of the instrument's sensitivity to change, as well as an indicator of the magnitude of intervention-related change over time. The term responsiveness, however, is a confusing one for the beginner who encounters it in the literature, since papers addressing intervention-related change in terms of HRFS may refer to a varying composite of aspects. As appears from a selection of scientific papers, the term responsiveness is used as an operational definition of:
- 'an indicator of the sensitivity of an instrument to detect change over time' [17-22], or even the extent to which a measure is sensitive to real change [23];
- 'a measure of clinically relevant change in health' [57, 58], although some investigators prefer the term 'clinically significant change' [59, 60].
Qualitative terms such as 'clinically important' need at least a gold standard. However, such a standard is not available for HRFS measures. A substitute that is often used for a gold standard for HRFS is an external criterion. The blinded observation of a health professional can be used as an external criterion for justifying the interpretation in terms of clinically relevant or important change in HRFS. Another external criterion or yardstick for the interpretation of changes in HRFS is the patient's perception of the importance of change after (for example) a specific intervention.
Husted et al. [61] distinguished internal responsiveness from external responsiveness by defining internal responsiveness as the ability of a measure to detect change over time, whereas external responsiveness was defined as the extent to which change in a measure relates to corresponding change in a reference measure [11, 62, 63]. Despite the clarification of the concept of responsiveness offered by this recently published classification, the assessment of change in HRFS over time in evaluation research is quantified using a variety of approaches. For the sake of clarity, we will therefore use the concepts in this paper with the following meanings:
- responsiveness: the psychometric property of a measurement instrument, namely its sensitivity to detect difference between two points in time (change over time) within groups;
- meaningful or relevant difference: the amount of change in scores or the magnitude of change within and between groups, according to statistical or other quantitative criteria (e.g. effect size indices);
- clinically relevant or clinically important change in scores on a health-related functional status measure: the magnitude of change that is linked to an external criterion of relevance.
The purpose of a study and its study design may require different psychometric properties of the outcome measure. Some studies require the measure to detect differences between subjects at a single point in time (discriminative instruments), i.e. the ability to differentiate between groups 'who have a better HRFS and those who have a worse HRFS' [53, 64, 65]. Other studies require the instrument's ability to detect change over time within subjects (evaluative instruments) [66-68]. Consequently, in randomised clinical trials (RCT) or quasi-experimental designs, HRFS instruments should have both properties, namely: 1. the ability to reliably estimate change between baseline and post-test within an experimental and a control group, and 2. the ability to estimate the difference in change over time by comparing the average change assessed in, e.g., patients receiving standard care and patients receiving the new care intervention in order to determine the intervention-related effect, when it is hypothesised that subjects assigned to the care innovation group will change (on average) more than those in the control group.

Responsiveness and the instrument's scope: generic versus specific measures
An important criterion for choosing an instrument in order to detect change in HRFS is its generic or disease-specific scope, which will depend on the objectives of the specific study. Generic health status measures seek a broad perspective that is not specifically related to the restricted scope of the HRFS of a specific disease. Therefore, generic measures allow investigators to compare health status across different diseases and interventions [69]. Generic measures are health-related to the extent that disease, injury, treatment, intervention, or policy influences them [70]. Disease-specific measures focus on the disease being studied, allowing greater sensitivity to intervention-related change compared to generic measures. The responsiveness of a health status instrument is an important issue in the decision to use disease-specific or generic measures of health-related functional state.

For example, for those cases in which therapeutic effects are likely to be modest and undramatic [12, 19, 71], a better sensitivity to change over time of an instrument is a necessary condition. In health services research, where statistically significant change over time and more substantial change (improvement) are hypothesised for patients assigned to the experimental group of managed, shared or integrated care, effects are not likely to be large or impressive. Using disease-specific outcome measures gives an opportunity to tap more precisely the intervention-related improvement in domains of health that may have deteriorated due to the disease, whereas generic measures contain items that are not likely to be linked to domains of health status that may change due to the disease or handicap of the patients in the study. Although the question of whether instruments that are tailored to the disease are superior to measures of general function in terms of sensitivity to change has not been settled definitively, a growing number of studies indicate that disease-specific measures seem to be more responsive than generic measures [36, 42, 47, 51, 72-76].

Effect size (ES) as indicator of responsiveness
Mean differences in outcomes between baseline and post-intervention of a test can be standardised to quantify a care intervention's effect in units of standard deviation (SD). Consequently, standardising mean change over time with a standard deviation allows comparison of a particular intervention's different outcomes, independent of the measuring units. The resulting statistical measure is known as effect size (ES) index. In many evaluation studies, standardised change over time in HRFS (ES) is used in comparisons of groups who were treated differently. This method of expressing change scores in a so-called effect size index seems to be an appropriate method to estimate the magnitude of change over time in before-after study designs.
The effect size index tells us something very different from the p-value, which indicates the obtained probability of a Type I error in a test of statistical significance. If a p-value is annotated as statistically significant, rejecting the null hypothesis does not imply that the effect was important in any way, nor does a non-significant p-value indicate a trivial result [77-80]. Criticism of statistical hypothesis testing has a long history [81], and even Jacob Cohen [15, 82] "played a prominent role in the anti-hypothesis-testing charge" [83]. The adoption of a fixed level of significance may lead to the situation in which two researchers obtain identical intervention effects but different p-values (0.04 and 0.06) due to (slightly) different sample sizes, leading to different decisions. Thus, p-values are confounded by the joint influence of sample size and effect size [84], which makes the rejection of the null hypothesis not very informative. Another criticism of null hypothesis testing is that it is foolish to ask: 'Are the effects of A and B different?' "They are always different - in some decimal place - for any A and B" [85]. Since then, quantitative investigators in medical and social sciences have proposed a variety of supplementary effect size indices, some of which we will clarify. Reporting effect sizes without appropriate statistical tests and associated p-values is misleading and potentially dangerous if the number of observations that is required to detect a difference has not been estimated by means of a power analysis. Effect size statistics should be provided to supplement statistical testing (not as a substitute for it), and only when the outcome is sufficiently extreme from what would have been expected on the basis of chance (p<α).
It should be noted that during the debate on 'significance testing', several vocal leaders in psychology and education research called for the universal reporting and interpretation of empirically produced effect sizes [86, 87].
There are myriad estimates of effect size from which the researcher can choose [88], and the question arises as to which of the effect size measures 'that could be summoned up for a given problem should a researcher report?' [83, 84]. The most elegant solution to this problem would seem to be for authors to include the sufficient statistics so that every reader can compute whichever effect size index they believe is best suited to the situation. Table 1 gives an overview of responsiveness measures in repeated measurement study designs. It shows that there is no consensus on the mathematical way to determine the magnitude of the difference between scores gained on two different occasions: researchers classify the extent of responsiveness and magnitude with effect sizes in which the mean change is divided by one of several standard deviations (see Table 1). The researcher's decision as to which SD to take is either a well-considered choice or one that is copied from well-reputed colleagues and has no further justification. However, in giving meaning to standardised mean change in terms of 'trivial', 'small', 'moderate', or 'large' effects using the thresholds that Cohen [16] provided us with some thirty years ago, it seems to have been forgotten that these cut-off points were calculated with the pooled standard deviation ($SD_P$). Consequently, applying these thresholds to mean change scores standardised with the standard deviation of the change scores ($SD_{change}$), which is not equal to $SD_P$, may be misleading. For the indices in Table 1 (except: T-Test, Normalized ratio, and relative and efficacy indices), these thresholds are used indiscriminately, which may have contributed to the confusion in this area [61].
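The situation of two researchers with identical effects but different p-values can be illustrated with a minimal sketch. It assumes a two-sided large-sample z-test for two independent, equal-sized groups with a known common SD; the standardised effect of 0.20 and the group sizes of 180 and 200 are hypothetical values chosen for illustration, not data from any study.

```python
import math

def two_sided_p_value(effect_size, n_per_group):
    """Two-sided p-value of a z-test for a standardised mean difference.

    Assumes two independent groups of equal size and a known common SD
    (a large-sample approximation, used here only for illustration).
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    # P(|Z| > z) for a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))

effect = 0.20  # hypothetical standardised difference, identical in both studies
p_smaller_study = two_sided_p_value(effect, 180)
p_larger_study = two_sided_p_value(effect, 200)

print(f"n = 180 per group: p = {p_smaller_study:.3f}")
print(f"n = 200 per group: p = {p_larger_study:.3f}")
```

Under these assumptions the two p-values fall on opposite sides of 0.05 although the effect size is the same, which is why the effect size is reported alongside the test rather than inferred from it.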

Effect size interpretation: the threat of internal and external validity of (quasi) experimental research by overestimation or underestimation
In the practice of health-related quality of life research, most researchers remain primarily interested in the statistical significance of the change in health-related functional status or quality of life in pre-post designs. In combination with, e.g., the T-test approach, substantial effects can be detected [96-98] with an estimate of effect size. If a p-value is annotated as statistically significant, rejecting the null hypothesis does not imply an effect of important magnitude; likewise, a non-significant p-value does not indicate a trivial result [77-80], although some researchers implicitly deem more important those results with smaller p-values.
In the last decade, however, a growing number of longitudinal intervention studies have focussed on questions like "If the change between baseline and outcome is statistically significant, what can we say about the magnitude (or amount) of change over time that has been detected? Can we interpret this difference in terms of an important difference or as a relevant (substantial) change?" To answer these questions, responsiveness, i.e. the ability of quality of life outcome measures to detect change over time, has become crucial in the past decade. However, the estimation of responsiveness is neglected in many evaluation studies in which it could give information on the importance of change due to intervention-related effects, supplementary to the statistical significance of change over time (e.g. before and after intervention) [99, 100]. Reporting effect sizes without appropriate statistical tests and associated p-values is misleading and potentially dangerous when the number of observations that is required to detect a difference has not been estimated with a power analysis. Effect size statistics should be provided to supplement (not as a substitute for) statistical testing, and only when the outcome is sufficiently extreme from what would have been expected on the basis of chance (p<α).
Noteworthy in this respect is that in the field of psychological research, editorial policy indicates that "until there is a real impediment to doing so, authors should routinely present an effect size estimate along with the outcome of a significance test" [84, 86, 87]. Table 1 shows that several quantitative indices have been developed that belong to the family of effect sizes (standardized differences), each calculated with a different denominator in the formula $(\bar{X}_1 - \bar{X}_2)/SD$: for example, the SD of stable subjects, the SD of the baseline assessment, the SD of the observed change score (improved, stable subjects) etc. Obviously, there is no consensus on how to express a difference in terms of standard deviation units. Only in a small number of publications is this lack of consensus on the most appropriate effect size indicator signalled [13, 90, 101-104].
Despite the fact that different opinions exist on the method to estimate the magnitude of difference between groups or the magnitude of change within groups, researchers use the straitjacket of thresholds Cohen provided us with some 30 years ago [16]. Moreover, these thresholds are taken for granted by many researchers for every version of effect size index. With regard to the correct use and interpretation of effect size indices as estimates of intervention-related magnitude of change, we must revisit some basic assumptions:
- the ES was developed and elaborated by Cohen to estimate power, or the sample size necessary to detect relevant change, on the basic principle of independent, equal-size samples with a common within-population standard deviation $\sigma$;
- when this ES is used to calculate the sample size needed to detect change in paired samples or in a repeated measurement design, it must be adjusted for correct use of Cohen's power tables and sample size tables. However, this adjusted ES cannot be interpreted with Cohen's thresholds for effect size interpretation in evaluation research.

Independent samples
Cohen represented the effect size (ES) on some dependent or outcome measure used in an experiment in terms of the difference (using the symbol $d'$ to denote this ES) between the treatment and control group, expressed in units of the common within-population standard deviation (in samples this standard deviation is estimated with the pooled standard deviation), as follows:
[Formula A] $d' = \dfrac{\bar{X}_1 - \bar{X}_2}{SD_{pooled}}$
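As an illustration of the independent-samples index described above, the following minimal sketch computes the difference between two group means in pooled standard deviation units and labels it with the cut-off points quoted in this paper. The function names are hypothetical, the pooled SD is taken as the square root of the mean of the two variances (which assumes equal-sized groups), and the input values are invented for illustration.

```python
import math

def pooled_sd(sd_1, sd_2):
    """Pooled SD for two equal-sized groups: sqrt((SD1^2 + SD2^2) / 2)."""
    return math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)

def effect_size_independent(mean_1, mean_2, sd_1, sd_2):
    """Difference between group means in pooled-SD units."""
    return (mean_1 - mean_2) / pooled_sd(sd_1, sd_2)

def cohen_label(es):
    """Cohen's thresholds as quoted in this paper, applied to |ES|."""
    size = abs(es)
    if size < 0.20:
        return "trivial"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "moderate"
    return "large"

# Hypothetical mean improvement scores, for illustration only
es = effect_size_independent(mean_1=5.0, mean_2=3.0, sd_1=4.0, sd_2=4.0)
print(round(es, 2), cohen_label(es))
```

The labels are meant for this pooled index only; the sections that follow explain why they do not transfer unchanged to indices with a different denominator.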

Dependent samples or paired observations
The difference or change in matched observations within subjects is standardised by the common within-population $\sigma$, according to Cohen [16] (p. 13), but due to the removal of the variation in many extraneous characteristics of the subjects, the index must be adjusted [16] by dividing $d'$ by $\sqrt{1-r}$. Cohen used the symbol $d$ to denote this adjusted ES (in evaluation research often labelled as Standardized Response Mean).
[Formula B] $d = \dfrac{d'}{\sqrt{1-r}}$
where $d'$ = effect size for independent samples, $d$ = adjusted effect size, and $r$ = correlation between baseline and outcome.
This $\sqrt{1-r}$ correction of the denominator of formula A is necessary for a proper use of power and sample size tables, since these assume $2(n-1)$ degrees of freedom where, in the case of paired observations, only $n-1$ are actually available [16]. This consequence for power and sample size estimation is something different from the use of the effect size $d$ in evaluating the efficacy of a new intervention in terms of the amount of change in health status, which was not the aim of Cohen's work.
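The adjustment for paired observations can be sketched in a few lines. The function name and the input values are hypothetical; the sketch only shows how the same independent-samples effect size grows as the correlation between baseline and outcome increases.

```python
import math

def adjusted_effect_size(d_prime, r):
    """Cohen's adjustment for paired observations: d = d' / sqrt(1 - r)."""
    if not -1.0 <= r < 1.0:
        raise ValueError("r must lie in [-1, 1)")
    return d_prime / math.sqrt(1.0 - r)

# Hypothetical values: the same d' under increasing pre-post correlation
for r in (0.0, 0.5, 0.75):
    print(f"r = {r:.2f}: d = {adjusted_effect_size(0.50, r):.2f}")
```

Because the adjusted value depends on r, reading it against thresholds that were set for the unadjusted index can shift the classification.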

Overestimation or underestimation of effect by using Cohen's thresholds for SRM
When effect sizes are calculated as the standardized difference in mean score to evaluate the magnitude of difference in HRFS, for example between an intervention group (interdisciplinary or integrated care) and a control group, formula [A] should be used. The effect size can be calculated by pooling the estimates (pooled standard deviation) derived from sample data. In contrast to this independent sample case, effect sizes are also used in evaluation studies (pre-post study designs) as estimates of the responsiveness or change over time within groups. In these study designs, effect sizes are also used to give meaning to the magnitude of change. However, Cohen's thresholds cannot simply be carried over to dependent samples, since these cut-off points of the magnitude of the difference were not established as a rule of thumb with the effect size $d$ (dependent samples) but with the index $d'$ (independent samples). Thus, we argue that Cohen's thresholds are based on the assumption of a common within-population standard deviation (with matched pairs sample data we use the raw within-group pooled SD), resulting in an effect size we annotate as $ES_P$. Consequently, in matched pairs studies these thresholds cannot be used interchangeably for the SRM, due to the role of the correlation between repeated measures or between scores from paired samples. In this part of the article the attention is focussed on the standardized change in mean score between two points in time within a single group, estimated with the within-group effect size. In relation to the use of Cohen's rule of thumb for effect size interpretation, we evaluate the consequences of the calibration of the SRM with the $ES_P$ and the role of the correlation between pre- and post-test scores.
To investigate how serious the discrepancies in effect size interpretation can be, we first elaborate a theoretical example and then use a sample of studies to evaluate the seriousness of these differences in practice. To evaluate the seriousness of the discrepancies between SRM and $ES_P$, the correlation of the subjects' repeated measurements was needed. Empirical data were collected for the purpose of secondary analysis to draw conclusions in terms of the relative size of the SRM to the $ES_P$ in relation to the size of the correlation. Applying Cohen's thresholds (which are based on the pooled estimate of effect) to interpret the SRM may on the one hand lead to similar results or subtle and trivial differences, but on the other hand also to meaningful shifts in classification of the amount of estimated change. In this article we analysed 148 SRMs interpreted using Cohen's rule of thumb and compared these SRMs with Cohen's $ES_P$ calculated with the same data. Furthermore, for the range of the correlation coefficient (r) from 0.01 to 0.99, we calculated the SRM values that correspond to Cohen's cut-off points 0.20, 0.50 and 0.80 of the pooled effect size.
To study the consequences of the impact of the association or correlation between repeated measures, we restrict the analysis to two effect size indices suitable for the evaluation and interpretation of magnitude of change over time (or responsiveness) within one group, namely the SRM and the $ES_P$.
The $ES_P$ introduced by Cohen was made comparable to the SRM, where the standard deviation of the change scores ($SD_{change}$) is used as the denominator, in which, as we will demonstrate below, the correlation between baseline and outcome scores is involved.
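The calculation of SRM values that correspond to Cohen's cut-off points across the range of r can be sketched as follows. This is a minimal illustration, not the authors' own procedure: it assumes equal baseline and outcome standard deviations, so that the SD of the change scores equals the pooled SD multiplied by sqrt(2(1 - r)), and the function name is hypothetical.

```python
import math

COHEN_CUTOFFS = (0.20, 0.50, 0.80)  # cut-off points for the pooled effect size

def srm_cutoffs(r, cutoffs=COHEN_CUTOFFS):
    """SRM values that correspond to Cohen's cut-offs at correlation r.

    Assumes equal baseline and outcome SDs, so that
    SD_change = SD_pooled * sqrt(2 * (1 - r)).
    """
    if not -1.0 <= r < 1.0:
        raise ValueError("r must lie in [-1, 1)")
    scale = math.sqrt(2.0 * (1.0 - r))
    return tuple(c / scale for c in cutoffs)

# A few illustrative correlations out of the 0.01 to 0.99 range
for r in (0.01, 0.25, 0.50, 0.75, 0.99):
    small, moderate, large = srm_cutoffs(r)
    print(f"r = {r:.2f}: small >= {small:.2f}, "
          f"moderate >= {moderate:.2f}, large >= {large:.2f}")
```

Under this assumption the adjusted cut-off points coincide with Cohen's original ones only at r = 0.50; below that correlation they are lower, above it they are higher.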
The SRM is the ratio between the mean change score and the variability (the standard deviation) of that change score within the same group.

$$\mathrm{SRM} = \frac{\text{mean change score}}{SD_{\text{change}}}$$
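As an illustration of the SRM definition, the following minimal Python sketch computes the index from paired baseline and follow-up scores. The function name `standardized_response_mean` and the five patient scores are hypothetical and serve only to show the arithmetic; the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator is intended.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """Mean change score divided by the SD of the individual change scores."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow_up must contain paired scores")
    # Change is follow-up minus baseline, so a negative SRM means lower scores.
    changes = [post - pre for pre, post in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)  # stdev() is the sample SD (n - 1)


# Hypothetical scores for five patients (illustrative only, not study data).
baseline = [12, 15, 9, 11, 14]
follow_up = [10, 12, 9, 8, 13]

srm = standardized_response_mean(baseline, follow_up)
print(f"SRM = {srm:.2f}")
```

Because only the change scores enter the denominator, the correlation between the two assessments is absorbed into the SRM without appearing explicitly in the calculation.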
One of our purposes was to get an indication of how the SRM varies with the size of the correlation between pre- and post-test scores when the correct pooled effect size estimate is used. An example may illustrate the role of r, the correlation of a person's health status measurements over time. In a study in which the outcome of an intervention was evaluated with an HRFS measure, a lower mean score after the intervention was hypothesised in the case of improvement. The investigator finds a mean score of 11.12 at baseline with a standard deviation of 4.43, and a mean score of 9.16 (SD: 4.88) at follow-up. The estimate of the common within-group standard deviation is $\sqrt{(SD_{\text{baseline}}^2 + SD_{\text{outcome}}^2)/2}$, thus 4.66, and the pooled effect size d' (ES_P) is then calculated as $(11.12 - 9.16)/4.66 = 0.42$. Before we compare the ES_P and the SRM in relation to the correlation between repeated measurements, we must first resolve how formulas C and D can be equated. According to Cohen, the difference between means for dependent samples is standardised by a value "which is $\sqrt{2(1-r)}$ as large as would be the case were they independent" [16].
If we take Cohen's original work [16] as being valid, we will have to rectify interpretations of the meaning of the estimated magnitude according to the results of this analysis. In previous work, we published two studies [55, 71] in which 40 Standardised Response Mean indices were interpreted according to Cohen's thresholds for pooled estimates of standard deviation (ES_P), of which 20 turned out to be overestimations or underestimations of the intervention-related effect (Table 3).
In another study [107], we analysed this problem using results from other researchers. This secondary analysis of data from other studies revealed that 23% of the estimated effect sizes did not fall in the same magnitude-of-change category according to Cohen's thresholds (Table 4).
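The arithmetic of this example can be checked with a short Python sketch. The means and standard deviations are those given in the text; the correlations 0.2, 0.5 and 0.8 are hypothetical, because the example does not report r. The helper `sd_of_change` uses the standard identity for the variance of a difference between two correlated variables, which is an assumption about how the SD of the change scores would be obtained from summary statistics.

```python
from math import sqrt


def pooled_sd(sd_baseline, sd_follow_up):
    """Common within-group SD: root of the mean of the two variances."""
    return sqrt((sd_baseline ** 2 + sd_follow_up ** 2) / 2)


def sd_of_change(sd_baseline, sd_follow_up, r):
    """SD of change scores from the two SDs and their correlation r."""
    return sqrt(sd_baseline ** 2 + sd_follow_up ** 2
                - 2 * r * sd_baseline * sd_follow_up)


# Summary statistics from the example in the text.
mean_baseline, sd_baseline = 11.12, 4.43
mean_follow_up, sd_follow_up = 9.16, 4.88

mean_change = mean_baseline - mean_follow_up  # improvement is a lower score
sd_p = pooled_sd(sd_baseline, sd_follow_up)
es_p = mean_change / sd_p
print(f"pooled SD = {sd_p:.2f}, ES_P = {es_p:.2f}")

# Hypothetical correlations: the same mean change gives a different SRM.
srm_by_r = {r: mean_change / sd_of_change(sd_baseline, sd_follow_up, r)
            for r in (0.2, 0.5, 0.8)}
for r, srm in srm_by_r.items():
    print(f"r = {r:.1f}: SRM = {srm:.2f}")
```

Under these assumptions the SRM rises as r increases while ES_P stays fixed, and the two indices nearly coincide around r = 0.5.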
To avoid invalid interpretations in the evaluation of responsiveness with the SRM index, we have calculated, for every value of the correlation between baseline and follow-up score, the corresponding ES_P values for Cohen's thresholds of 0.20 = small, 0.50 = medium, and 0.80 = large. Indices that lie within the interval that corresponds with these thresholds are not depicted.
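A minimal sketch of such a calibration is given below. It assumes equal standard deviations at baseline and follow-up, in which case the SD of the change scores equals the common SD multiplied by $\sqrt{2(1-r)}$. The names `srm_equivalent` and `adjusted_thresholds` are hypothetical, and the sketch may differ in detail from the procedure used to produce the published figure.

```python
from math import sqrt

COHEN_THRESHOLDS = {"small": 0.20, "medium": 0.50, "large": 0.80}


def srm_equivalent(es_p, r):
    """SRM matching a pooled effect size at correlation r (equal SDs assumed)."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return es_p / sqrt(2 * (1 - r))


def adjusted_thresholds(r):
    """Cohen's cut-off points re-expressed on the SRM scale for a given r."""
    return {label: srm_equivalent(cut, r)
            for label, cut in COHEN_THRESHOLDS.items()}


# One row per correlation from 0.01 to 0.99, the range named in the text.
table = {step / 100: adjusted_thresholds(step / 100) for step in range(1, 100)}

for r in (0.10, 0.50, 0.90):
    row = table[r]
    print(f"r = {r:.2f}: small >= {row['small']:.2f}, "
          f"medium >= {row['medium']:.2f}, large >= {row['large']:.2f}")
```

Under the equal-SD assumption the adjusted cut-off points coincide with Cohen's values only at r = 0.50; they are lower for weaker correlations and higher for stronger ones.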
To classify the magnitude of change estimated with the SRM more precisely, this effect size index is adjusted for every value of the correlation coefficient (r) between baseline and follow-up assessments and brought into line with Cohen's thresholds for effect size. Figure 1 [...] range of effect must be valued as 'small' with r = 0.14 ($0.65/\sqrt{2}/\sqrt{1-0.14} = 0.49$).
In contrast with the fixed threshold values 0.20, 0.50 and 0.80 in Figure 1 [...]
Ever since Jacob Cohen wrote his well-known book [16], the effect size has been a problematic parameter in evaluation research, and several promising alternatives (for example, the "Reliable Change Index") have been developed [109], improved and criticised [35, 110-113]. In future studies, statistical computer programmes may be able to give the researcher additional information on some intervention effect indices (notwithstanding the fact that no consensus exists on a method for signifying the magnitude of change within and between experimental and control groups that is meaningful in particular intervention contexts). Nevertheless, implementing effect sizes as standard in the reporting of statistical results may require researchers to change long-held patterns of behaviour.
The values used to classify the difference between means as small, medium, and large were arbitrary but seemed reasonable, as Cohen stated some 30 years ago. In the debate over which standardizing unit of the difference one should take in a within-group situation, we propose that estimating the magnitude of change by using either the SD of the change score or the pooled SD is preferable to the use of the SD at baseline as proposed by Kazis [...]. These thresholds of Cohen are now being cited without distinguishing between the units by which the assessed change over time is standardised. This is surprising, since there is no doubt that his rule of thumb was derived from the pooled SD as the estimate of the common within-group variance. Moreover, routine action in calculating effect sizes may have led to a reduced awareness of factors originally considered only in the calculation of power and sample size. For instance, calculating the power to detect a change or difference without using the information on r can lead to the wrong inferences [16].
In evaluation research on treatment-related quality of life, researchers seem to overlook the fact that, in assessing change over time within one subject, the experimental technique of 'self-matching' reduces the proportion of the total variance due to extraneous variables not related to the treatment or intervention per se [115].
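The dependence of power on r can be sketched with a normal approximation for a two-sided paired comparison. Everything in this example is an assumption made for illustration: the equal-SD identity used to convert the pooled effect size, the sample of 30 pairs, the three correlations, and the function name `paired_power`. An exact calculation would use the noncentral t distribution instead.

```python
from math import sqrt
from statistics import NormalDist


def paired_power(es_p, r, n_pairs, alpha=0.05):
    """Approximate two-sided power for a paired design (normal approximation).

    The pooled effect size is first re-expressed in units of the SD of the
    change scores, assuming equal SDs at both time points.
    """
    d_z = es_p / sqrt(2 * (1 - r))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The negligible rejection probability in the opposite tail is ignored.
    return NormalDist().cdf(abs(d_z) * sqrt(n_pairs) - z_crit)


# Hypothetical design: pooled effect size 0.42 and 30 pairs of measurements.
power_by_r = {r: paired_power(0.42, r, n_pairs=30) for r in (0.2, 0.5, 0.8)}
for r, power in power_by_r.items():
    print(f"r = {r:.1f}: approximate power = {power:.2f}")
```

With the pooled effect size and sample size held constant, the approximate power in this sketch increases with r, which is why ignoring the correlation at the planning stage can misstate the required number of subjects.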
We may conclude that the rule of thumb proposed by Cohen can induce differences in the interpretation of the size of estimated effects. At present it does not appear to us that a single set of rules that is unequivocal or normative at some level is available. We have begun to explore alternative methods in effect size estimation and have assessed the interrelation between two effect sizes as estimates of magnitude of change over time within groups. Because effect sizes are reported increasingly often, it is important that all aspects of estimating the magnitude of change be inspected. One of these aspects is the consequence of the hidden role of the correlation coefficient between repeated measurements, which increases the risk of incorrect conclusions. This initial effort may provide a modest step toward the development of a precise and useful index in quality of life assessment in clinical trials.

Recommendations for practice and research
So long as no consensus is reached on standards for evaluating, using and interpreting effect size estimates of intervention-related change in evaluation research, there is an important need to develop uniform and widely accepted criteria to give meaning to the size of an effect. This lack of precision is relevant not only when evaluating intervention-related change within and between groups but, even more importantly, in the estimation of power in the planning phase of a trial. Standardisation of effect size interpretation needs reference ranges of health-related functional status assessed with population surveys. Furthermore, longitudinal research is needed to discriminate between changes in HRFS over time in a sample drawn from the general population and change in a sub-sample of chronically ill patients. In other words, with knowledge about a reference range of an indicator of health-related functional status in the general population, we can recognise that there are differences. Furthermore, with a longitudinally assessed estimate of autonomous change in the same sample, we will be able to better understand the meaningfulness of intervention-related effects.
In studies on the measurement of health-related quality of life and HRFS, effect sizes (ES) have been used as surrogates for clinically relevant change when change over time in outcome was substantial. However, ES do not provide a complete understanding of the meaningfulness of the observed change. Patients have to perceive a change in the performance of daily activities in order to rate the direction and degree of change; moreover, even when this perceived change is small in magnitude, it may still be perceived as a significant one by the patient. According to Osoba [116], the significance of change as perceived by the subject 'should be of paramount consideration' in future attempts to define the meaningfulness of change in HRFS or health-related quality of life. The development of multi-item transition measures may cover change in the relevant underlying domain more representatively [107, 117]. Therefore, we suggest that measures that assess more concrete aspects of the patient's HRFS will provide greater accordance between serial and transition measures of change.
However, when a patient rates a reduction in (for example) difficulty in climbing stairs as 'large', it does not necessarily imply that the patient will view this subjectively significant change as being important. Future areas of research aimed at quantification of meaningful change in HRFS should also include the importance patients assign to that change, even if it is experienced as being small. One piece of research has produced examples that seem promising extensions of transition questions. In this approach, the respondent rates the direction and the degree of perceived change by assigning a value that has meaning to the respondent for the experienced change, as well as by rating the degree of importance the respondent assigns to the perceived change. In the evaluation of intervention-related change, the importance assigned to a small improvement in one item of a domain of HRFS may outweigh a moderate deterioration in another item belonging to the same domain.
Finally, the following key issues in the debate on methods for estimating clinically important change have increasingly become apparent: 'significance of intervention effects: significance to whom?' [93], 'who is to say what is important?' [90] and 'ask patients what they want' [94, 118-120]. To give clinically relevant meaning to change scores obtained at two different points in time using HRFS instruments, several investigators suggest that the current approaches could be improved by taking more explicit account of patients' perceptions and expectations. A new paradigm is incorporating individual patient perspectives, expectations and preferences with respect to the effects of (innovative) interventions in the outcome measures. With scoring systems based on individualised measures such as the so-called Goal Attainment Scale (GAS) or Patient Specific Index (PCI), each patient essentially receives his or her 'own instrument', and these instruments seem to show an improved sensitivity to change in health-related functional status when compared with conventional methods [75, 92, 95, 121-125].
Methodological studies focussed on improving the longitudinal validity or responsiveness of health outcome measurement should be aimed at supporting health professionals, investigators and administrators in understanding and critically evaluating the appropriateness of health status measures and the methods for estimating and interpreting change in patient-assessed health outcomes. Health professionals increasingly stress that patients' preferences are essential sources of information in the realisation of effective care and of the expected outcome of planned change in the process of care delivery. The operationalisation of the patient's perception of the severity of limitation in domains of health-related functioning, or of individual preference or weighted relevance of items of health-related functional status measures, is still in its infancy. However, for health administrators and decision-makers, investigation into the validity of patient-specific HRFS instruments used to evaluate the outcomes of innovative care, and standardisation of methods, are required. HRFS instruments cannot be used in the evaluation of treatment and care without a valid way of ascertaining what change in measured difference scores means.