Reliability estimates are calculated by comparing experts' subjective ratings.

Inter-rater reliability of defense ratings has been determined as part of a number of studies. In most studies, two raters listened to an audiotaped interview or session while following a written transcript, blind to subject identity and session number. Sessions were presented in random order to prevent bias (e.g., rating earlier sessions as having more lower-level defenses than later sessions). Raters independently marked each defense on the transcript and tallied the occurrences of each defense to yield a quantitative profile of defenses. Raters subsequently discussed and arrived at consensus ratings for each session, although the consensus ratings did not contribute to the determination of reliability.

Table 9.4 displays the inter-rater reliabilities obtained in seven studies: two early ones using qualitative ratings and five more recent ones using quantitative ratings. In a field trial using many different clinicians (Perry et al., 1998), the inter-rater reliability of ODF was as good as that of the commonly used Global Assessment of Functioning (the current GAF, or Axis V), while the stability at one and six months was actually higher for ODF than for GAF. This probably reflects the more trait-like nature of defensive functioning, which makes ODF more resistant than GAF to fluctuations that coincide with episodes of psychiatric disorder.

Table 9.4. Interrater reliability and stability of defenses rated by the DMRS

No. = number of defenses rated per session; ODF = overall defensive functioning; defense levels are given as median (range).

Qualitative ratings
Perry & Cooper, 1989: inter-rater ODF .53/.74*
Perry et al., 1998: inter-rater ODF .68; stability of ODF .75/.51**

Quantitative ratings
Lingiardi et al., 1999: inter-rater No. .83, ODF .87, defense levels .87 (.67-.95)
Despland et al., 2001: inter-rater ODF .80, defense levels .80
Perry, 2001: inter-rater No. .83, ODF .85, defense levels .625 (.52-.80); stability No. .14, ODF .48***, defense levels .47 (.08-.73)
Herzoug, 2002: inter-rater ODF .83
Drapeau et al., 2003: inter-rater defense levels .79

*Figures are for inter-rater reliability/reliability of two consensus ratings.

**Stabilities are one-month/six-month.

***Stability figure is session to session over five weekly sessions.

The quantitative reliabilities for the number of defenses rated in a session (No.) and for ODF are generally above an intraclass R of .80 (respective median figures from Table 9.4: 0.83 and 0.84). The median reliability for the defense levels is also close (0.795). Figures for the 28 or so individual defenses would undoubtedly be lower, although these were usually not reported. The reliability of any individual defense varies much more widely; defenses occurring at very low base rates in a given case are the most problematic. Studies that use a small, consistent group of trained raters and that have good variability in subjects' defensive functioning will generally obtain the higher median reliability figures.
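The intraclass correlations behind such figures are computed from the two raters' quantitative profiles. The sketch below is a minimal Python implementation of one common variant, the two-way random-effects, absolute-agreement, single-rater ICC (Shrout and Fleiss's ICC(2,1)); the studies above do not state which ICC model they used, so the choice of variant and the example ratings are assumptions for illustration only.

```python
import numpy as np

def icc_2_1(scores):
    """Two-way random-effects, absolute-agreement, single-rater ICC (ICC(2,1)).

    scores: array of shape (n_subjects, n_raters), e.g. each rater's count of
    defenses per session (No.) or ODF score per session.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    ss_err = ((x - x.mean(axis=1, keepdims=True)
                 - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical ODF ratings by two raters over eight sessions.
ratings = [[4.8, 5.0], [5.2, 5.1], [4.1, 4.4], [5.6, 5.5],
           [3.9, 4.2], [5.0, 4.9], [4.5, 4.6], [5.3, 5.4]]
print(round(icc_2_1(ratings), 2))
```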

The stabilities of defense ratings were examined in two studies (see Table 9.4). Using qualitative ratings, Perry et al. (1998) obtained a one-month stability for ODF of 0.75. However, when quantitative ratings are used, which are more sensitive to change, the figure is lower: examining week-to-week variability in five consecutive psychotherapy sessions from a sample with personality and depressive disorders, Perry (2001) found a stability of 0.48 for ODF, whereas there was virtually no stability for the number of defenses per session. Drapeau et al. (2003) found that over a 4-session Brief Psychodynamic Investigation, the number of defenses decreased significantly from session to session as the level of distress decreased. The range of stability coefficients for the number of defenses, as well as for the defense levels, appears to indicate that some of these are more sensitive to state effects, but that on average close to 50% of the variance reflects a stable defense repertoire. This figure rises to 57% if corrected for measurement error (Perry, 2001). Interestingly, the high adaptive level defenses showed the lowest stability, suggesting that among patients with depressive and personality disorders these defenses are the most sensitive to disruption by state effects, such as mood or stress. This further suggests that with improvement, increased psychological resilience might be accompanied by an increase in both the proportion and the stability of high adaptive defenses across time.
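One standard way to arrive at such a corrected figure is to disattenuate the observed stability coefficient by the reliabilities of the two measurements. The sketch below assumes the classical correction-for-attenuation formula and plugs in the median ODF inter-rater reliability from Table 9.4; the chapter does not spell out the exact calculation, so treat this as an illustration rather than a reproduction of Perry's (2001) method.

```python
from math import sqrt

observed_stability = 0.48  # week-to-week stability of ODF (Perry, 2001)
reliability = 0.84         # median ODF inter-rater reliability from Table 9.4,
                           # assumed here to apply at both time points

# Classical correction for attenuation: divide the observed correlation by the
# geometric mean of the reliabilities of the two measurements.
corrected = observed_stability / sqrt(reliability * reliability)
print(round(corrected, 2))  # ~0.57, i.e. roughly 57% of the variance is stable
```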


URL: https://www.sciencedirect.com/science/article/pii/S0166411504800347

Measuring electrical skin resistance on acupuncture points

Filadelfio Puglisi, in Auricular Acupuncture Diagnosis, 2010

INTER-RATER RELIABILITY

Inter-rater reliability here is the proportion of points for which rater B confirms the finding of rater A (point below or above the 2 MΩ threshold) when B measures a point immediately after A has measured it. The comparison must be made separately for the first and the second measurement.

In the first measurement, A and B agreed on 182 of 228 points, corresponding to 79.8% agreement.

In the second measurement, A and B again agreed on 182 of 228 points (not the same points as in the first measurement), corresponding again to 79.8% agreement.

Now the question is whether these high percentages of agreement are meaningful or partly due to chance. To answer this question I have adopted Cohen's kappa coefficient, which is easily applied to data belonging to two classes.21 It must be underlined, however, that Cohen's kappa values are normally judged as not directly comparable across studies.

In our case rater A had kappa = 0.506 and rater B kappa = 0.585 in the intra-rater tests, while in the inter-rater tests kappa was 0.580 for the first measurement and 0.535 for the second. Such kappa values seem to indicate moderate agreement in both the intra- and the inter-rater tests, slightly above midway between kappa = 0 (results due to chance alone) and kappa = 1 (raters in perfect agreement).
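As a rough sketch of how such values are obtained, the Python snippet below computes Cohen's kappa for two raters and two classes (point below vs. above the 2 MΩ threshold). The 2x2 table in the example is hypothetical: the chapter reports only the agreement counts (182 of 228) and the resulting kappa values, not the full cross-tabulation.

```python
def cohens_kappa(table):
    """table[i][j] = number of points rater A put in class i and rater B in class j."""
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion of points on the diagonal.
    p_o = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement, from each rater's marginal class frequencies.
    p_e = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical split: 118 points rated "below" by both raters, 64 "above" by both,
# 46 disagreements split evenly, giving observed agreement 182/228 ≈ 0.798.
example = [[118, 23], [23, 64]]
print(round(cohens_kappa(example), 3))  # ~0.57 with this particular split
```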


URL: https://www.sciencedirect.com/science/article/pii/B9780443068669000064

Orthopedic Neurology

M.M. Danzl PT, DPT, PhD, NCS, M.R. Wiegand PT, PhD, in Orthopaedic Physical Therapy Secrets (Third Edition), 2017

7 What is the interrater and intrarater reliability of the following?

A. SEMMES-WEINSTEIN MONOFILAMENT TESTING FOR LIGHT TOUCH

Reports of interrater reliability for the assessment of light touch using Semmes-Weinstein monofilaments have ranged from good to only slight or fair, while intrarater reliability has been assessed as moderate to good. Inconsistent standardization of testing procedures, variation in the peripheral nerve tested, and the presence or absence of pathology in the subject may explain the variation in reported reliability for light touch.

B. VIBRATION SENSIBILITY TESTING

Vibration testing stimulates pacinian corpuscles and assesses the function of large-diameter, rapidly adapting peripheral nerves and the dorsal column-medial lemniscal central pathways. Using mechanical testing devices, the intrarater reliability of the assessment of vibration sense has been described as good, and interrater reliability as moderate. Multiple regression analysis showed age and height to be associated with minimal threshold values of the feet but not of the hands.

C. TWO-POINT DISCRIMINATION SENSIBILITY TESTING

Although numerous studies have described the reliability of two-point discrimination testing, interpreting these results and applying them to clinical practice has been hampered by the lack of standardized testing procedures and the inability to quantify subject cognitive function. Reported reliability has ranged from good and moderate to poor. The cooperation of the subject and the ability of the subject to attend to the stimulus have been suggested to influence two-point discrimination measures, as do central training effects.

There appears to be little carry-over between static two-point discrimination tests and function, although moving two-point discrimination testing (which tests rapidly adapting afferent fibers) has been shown to correlate with object identification tests. Likewise, the sensitivity of two-point discrimination testing to detect change over time is poor.

Reports of the reliability of two-point discrimination testing vary according to the age and sex of subjects, the peripheral nerve tested, and whether the subject is symptomatic or asymptomatic. Testing procedures also vary with the starting position (wide or narrow distances), the amount of pressure applied, and the instrument used to apply the stimulus.

It is questionable whether any reliability measures of sensibility can be used as a reference to judge the presence of pathology. It is recommended that results from any sensory testing procedures not be used as the sole means of developing diagnoses of peripheral or central nervous system origin.


URL: https://www.sciencedirect.com/science/article/pii/B9780323286831000199

Assessment

Jack J. Blanchard, Seth B. Brown, in Comprehensive Clinical Psychology, 1998

4.05.3.1.1 Reliability

Several investigations of inter-rater reliability reveal poor to good agreement. Using the SIDP-R, Pilkonis et al. (1995) found that inter-rater agreement for continuous scores, whether the total SIDP-R score or scores for Clusters A, B, and C, was satisfactory (ICCs ranging from 0.82 to 0.90). Inter-rater reliability for the presence or absence of any personality disorder with the SIDP-R was moderate, with a kappa of 0.53. Because of infrequent diagnoses, mixed diagnoses, and the number of subthreshold protocols, kappas for individual diagnoses were not provided in this study.

Stangl et al. (1985) conducted SIDP interviews with 63 patients (43 interviews were conducted jointly, and 20 interviews were separated by up to one week). The kappa for the presence or absence of any personality disorder was 0.66. Only five personality disorders occurred with enough frequency for kappa to be calculated: dependent (0.90), borderline (0.85), histrionic (0.75), schizotypal (0.62), and avoidant (0.45). Using the SIDP in a small sample of inpatients, Jackson, Gazis, Rudd, and Edwards (1991) found inter-rater agreement ranging from adequate to poor for the five specific personality disorders assessed: borderline (kappa = 0.77), histrionic (0.70), schizotypal (0.67), paranoid (0.61), and dependent (0.42).

The impact of informant interviews on the diagnosis of personality disorders and inter-rater agreement for the SIDP was assessed by Zimmerman, Pfohl, Stangl, and Corenthal (1986). Inter-rater agreement (kappa) for the presence or absence of any personality disorder was 0.74 before the informant interview and 0.72 after the informant interview. Kappas for individual personality disorders were all 0.50 or above. Reliability did not appear to benefit or be compromised by the use of informants. However, the informant generally provided additional information on pathology and, following the informant interview, diagnoses that had been established with the subject only were changed in 20% of the cases (Zimmerman et al., 1986).

In an examination of the long-term test–retest reliability of the SIDP, Pfohl, Black, Noyes, Coryell, and Barrash (1990) administered the SIDP to a small sample of depressed inpatients during hospitalization and again 6–12 months later. Information from informants was used in addition to patient interviews. Of the six disorders diagnosed, three had unacceptably low kappas (below 0.50): passive-aggressive (0.16), schizotypal (0.22), and histrionic (0.46). Adequate test–retest reliability was obtained for borderline (0.58), paranoid (0.64), and antisocial (0.84).


URL: https://www.sciencedirect.com/science/article/pii/B0080427073000031

Operationalization of Global Alzheimer’s Disease Trials

Lynne Hughes, Spencer Guthrie, in Global Clinical Trials for Alzheimer's Disease, 2014

10.3.4 Rater Training

For any AD trial or program to meet its goals, all study scales must be administered uniformly and consistently across all sites in all countries. Experience has shown three key factors that require special management on AD trials: robust and consistent training for raters and Clinical Research Associates (CRAs); completion of worksheets and case report form transcription; and staff turnover at sites. All of these factors can contribute to inconsistent rating. In addition, continuous “real time” (or as close to this as possible) monitoring of scale data throughout the lifespan of the clinical trial will help identify inconsistencies at an early stage and thus permit retraining as required.

Training includes rater training and assurance of inter-rater reliability. This training brings all raters closer to the “gold standard” norm, although it cannot guarantee 100% consistency. However, the objective is to identify those raters at either extreme of the curve. To succeed in these objectives, the following options should be considered:

Rater training, testing, and/or certification on primary scales at investigator meetings

Tapes sent by mail (to sites for rater test and/or from sites for expert evaluation)

Visits to sites by expert consultants or trained professionals

Follow-up rater skills assessment at predetermined intervals

Web-based rater training methods

On-site training/follow-up if there is rater turnover at the site.

Effective implementation of an inter-rater reliability training program involves the selection of a qualified expert in the desired assessment scale. Copyrights and/or usage fees associated with certain scales must also be understood and taken into account in planning. Principal investigators must also pre-select qualified raters from their existing staff members and verify their availability for the investigator meeting, as well as for the duration of the program. It is becoming more commonplace for a sponsor company to require that, for a site to qualify for trial participation, it have two or more raters trained and available for the trial, which places an additional burden on the sites.

It is also critical that each CRA understands how to score properly, reviews the worksheets carefully for proper completion, and makes sure the scores are correctly transcribed to the Case Report Form (CRF). CRAs are therefore trained on the rating scales in a manner similar to the investigators, to ensure consistency.

For rater training, local language training may aid comprehension and result in more consistent data. The majority of the scales are now available in many languages and have been fully validated and translated by their owners before being released to the sponsor company.

In terms of scale fatigue—either by the subject or the rater—we are not aware of objective data on the number of scales and duration of scale assessments that an AD subject can undertake and still provide a reliable, consistent and robust outcome. Neither have we seen objective data with regards to the “learning” effect of scale completion, although we are aware of such concerns.

We would always recommend that a subject completes the scale(s) representing the primary outcome of the trial first, that the scales are administered in a set order, and that imaging modalities, which may upset the subject, are performed after the scale assessments. It is often suggested that the lumbar puncture (LP) procedure be performed on a separate day from the scale assessments, preferably afterwards, so that anxiety or distress from the LP itself does not bias completion of the scales.

How is reliability measured?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable. For example, if you measure the temperature of a liquid sample several times under identical conditions and the readings agree closely, the measurement is reliable.

What are the 3 ways of measuring reliability?

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).
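As an illustrative sketch of these three types of consistency, the Python snippet below computes a test-retest correlation, an inter-rater correlation, and Cronbach's alpha for internal consistency on small made-up data sets; the numbers are purely hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation, used here for test-retest and inter-rater consistency."""
    return float(np.corrcoef(x, y)[0, 1])

def cronbach_alpha(items):
    """Cronbach's alpha for internal consistency.

    items: array of shape (n_respondents, n_items).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical data.
time1 = [12, 15, 9, 20, 17]          # same test, two occasions
time2 = [13, 14, 10, 19, 18]
rater_a = [3, 4, 2, 5, 4]            # same subjects, two raters
rater_b = [3, 5, 2, 4, 4]
questionnaire = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [4, 4, 5]]

print("test-retest r:", round(pearson_r(time1, time2), 2))
print("inter-rater r:", round(pearson_r(rater_a, rater_b), 2))
print("Cronbach's alpha:", round(cronbach_alpha(questionnaire), 2))
```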

What are the reliability estimates?

Ideally, estimates of reliability are derived from scores on parallel forms of a test. With this approach, referred to in this research as the parallel-form approach, the estimate of reliability is the correlation between parallel forms of a test taken one or two months apart.

Which of the following methods is to estimate reliability?

We can estimate reliability using the test–retest method, the alternate-forms method, and/or the internal-consistency method. We can also estimate scorer reliability.