The Diagnostic Inventory for Screening Children: Psychometric, factor, and validity analyses
Canadian Journal of Behavioural Science, 1990

Kevin C.H. Parker, Kingston General Hospital, Kingston, Ontario
Marian K. Mainland, Kitchener-Waterloo Hospital, Kitchener, Ontario
Jeanette R. Amdur, Kitchener-Waterloo Hospital, Kitchener, Ontario

ABSTRACT

The Diagnostic Inventory for Screening Children (DISC) was designed as a second-stage screening device for referred children from birth to five years. It is intended to fill the gap between tests for first-stage screening of nonreferred children and a complete diagnostic assessment by a Psychologist or Psychometrist. The results provide scores in eight skill areas that are intended for use in the development of treatment programs or follow-up assessment. The DISC was standardized on a sample of 500 children, and norms were obtained on a separate sample of 571 children. Reliability studies indicate that scores are accurate to within one or two points, reflecting good reliability. Content validity is based on the selection of items by experts from a well-developed task pool. Construct validity can be seen in the factor structure. A single common factor with specificities greater than zero for each scale was the best fit for children under three years and a good fit for older children. Criterion validity was assessed in two small studies. These showed reasonable congruence between the DISC and other test scores, and child workers' impressions of children.

Résumé

This article discusses the DISC (Diagnostic Inventory for Screening Children), a second-stage screening device designed for referred children from birth to five years of age. The device is intended to fill the gap between first-stage screening tests for nonreferred children and the complete diagnostic assessment carried out by a psychologist or psychometrist. The results give scores in five skill areas that will be used in developing treatment programs or a case evaluation. The DISC was standardized on a sample of 500 children; norms were then established on a separate sample of 571 children. According to reliability studies, scores are accurate to within one or two points, which reflects good reliability. Content validity is based on the selection of items by experts from a well-structured pool of tasks. Construct validity can be detected within the factor structure. The best fit for children under three years, and a good fit for older children, was found to be a single common factor with specificities greater than zero for each scale. Two small studies assessed criterion validity and showed appreciable congruence between the DISC and other test scores, and the impressions of people working with children.

The Diagnostic Inventory for Screening Children (DISC) (Amdur, Mainland, & Parker, 1988) is a set of eight scales, each containing 27 items distributed as evenly as possible across the age range from birth to 60 months. The scales consist of items assessing: fine motor skills, gross motor skills, receptive language, expressive language, auditory attention and memory, visual attention and memory, self-help skills, and social skills. Each DISC scale yields a raw score that can be converted to a percentile value. Age equivalents can also be calculated based on the raw score. A sample item from the instruction manual is presented in Table 1 and representative tasks from the eight scales are presented in Table 2.

TABLE 1 Sample item from the DISC (Expressive Language 17)

TABLE 2 Sample tasks on DISC scales
Fine motor tasks: Holds rattle, grabs ring, pushes car, makes crayon stroke, folds hands, twiddles thumbs, cuts out circle.

This study was supported by the National Welfare Grants Program of the Department of National Health and Welfare, Canada, project number 455564I. Reprints may be obtained from Marian Mainland, Child and Family Centre, Kitchener-Waterloo Hospital, 835 King Street West, Kitchener, Ontario, Canada, N2G 1G3. We would like to acknowledge the constructive criticism from several anonymous reviewers.

The DISC was designed in a clinical setting to meet specific needs. The Child and Family Centre at Kitchener-Waterloo Hospital is a government-funded children's outpatient psychiatric centre. In the late 1970s an increasing number of preschool children were being referred, and there was a corresponding increase in the number of assessments and diagnostic screenings. None of the existing screening instruments met the needs of the clinic. Complete assessments by Psychologists or Psychometrists using tests like the Stanford-Binet Intelligence Scale (Terman & Merrill, 1972) or the Bayley Scales of Infant Development (Bayley, 1969) were too expensive to use with each child. Mass screening devices such as the Denver Developmental Screening Test (Frankenburg & Dodds, 1969) were too insensitive to distinguish among children who had already been identified as having some difficulty.

The DISC is an assessment device designed to meet the needs of one particular setting. Naturally, different settings will have different needs, and the DISC will be more or less appropriate for those needs. However, the DISC has properties that would make it useful in a fairly broad range of settings. Because it was designed to be cost-effective in an applied setting, the DISC does not represent a significant change in the science of developmental assessment. Instead, it attempts to break the assessment into practical scales that are easily understood and applied in schools, nurseries, public health units, doctors' offices and treatment clinics.
Moreover, it was designed to be administered by Public Health Nurses, Early Childhood Education Workers, Teachers, Psychometrists, or staff with equivalent training, using a relatively inexpensive set of equipment, in a 15 to 40 minute assessment period.

The DISC is intended for what might be called second-stage screening. In this role, it is an instrument that can be used to make decisions regarding the need for further and more detailed assessment of a referred child. Although this renders the DISC inappropriate for first-stage or mass screening, it makes it potentially valuable for agencies that receive significant numbers of preschool referrals but do not have the resources to do complete psychological assessments on each child referred. If a problem is detected, staff can design a program to develop skills related to the area of weakness identified by the DISC. This is the only programming that can be derived in a direct manner from the DISC. If the child's skills do not improve with this programming, or if the delay is profound, a referral for more complete assessment and treatment should be made, if possible.

We will review the psychometric data, factor structure, and validity data for the DISC in three sections. The first section describes a variety of results derived from the normative data for the DISC. The other two sections report simple validity studies.

ANALYSES OF THE NORMATIVE DATA SET

Standardization: In the assessment of child development, a large pool of tasks common to many tests seems to exist. An extensive survey of the developmental literature, as well as developmental and intelligence tests for preschoolers, yielded a large array of tasks suitable for the DISC. Clear routes from task sources to the DISC items are difficult to establish. Each DISC item was designed from first principles, specifying material, instructions, and scoring criteria. Care was taken to avoid violation of copyrights.
An ethics review committee required that only children with no apparent developmental delay be used in the standardization and norming studies. In the absence of an established definition of "no apparent developmental delay", children were accepted for participation if they were living at home, had been born no more than one month premature, had no apparent behavioural or emotional problems, and had English as the first language spoken in the home. The children were recruited through nursery schools, day care centres, family YMCA and YWCA programmes, church nurseries, and personal contacts. Four hundred children in eleven age groups from birth to 60 months were recruited and tested by three university students. Testing was carried out between April and December, 1979.

Each candidate item was assessed against a number of criteria. The ideal DISC item for a given target age was one that was highly correlated with its own scale, had a steep item characteristic curve, and was passed by exactly 75% of the children of the target age in the sample. However, few items satisfied these criteria perfectly. After items that were hard to administer or score, and items showing low correlations with their own scales, had been eliminated, there were some significant age gaps in the scales. Most notably, some of the scales allowed all subjects above 36 months to pass all items.

In a second stage, items were revised and added to fill age range gaps, and dropped where inappropriate or redundant. A further sample of 100 subjects was obtained, and the revised item pool was administered to the new sample. Testing was carried out between January and March, 1980. A final pool of 216 items (eight scales of 27 items) was produced from the items standardized on the new group.

Normative Study: Once the standardized scales had been derived, a new sample of subjects was obtained in order to establish norms.
Approximately fifty subjects were obtained at each of eleven age ranges, for a total of 571 subjects. The two youngest age groups each spanned three months (0-3 and 4-6 months) and the nine eldest each spanned six months. The sample was selected to match the univariate distributions of 1971 and 1976 Canadian census data (Statistics Canada, 1971, 1976) as closely as possible on the variables of: age, gender, education and occupation of both parents, job title of the head of the household, ethnic group, marital status of guardian, and urban/rural residence. All subjects were from southwestern Ontario. Using a procedure similar to that used by Bayley (1969), examiners were assigned quotas of children defined in terms of the sampling variables. Where a surplus of eligible subjects was available, the required number was selected randomly. Where a shortage existed, subjects were actively recruited to meet the relevant quota. The same ethical protections were applied in the normative study as in the standardization study. Testing was carried out between April, 1980 and June, 1981.

Results

Covariates of DISC Score: The relationship between DISC scores and demographic characteristics of the children was assessed in a series of one-way analyses of variance, using each DISC scale score as a dependent variable. Three relationships were significant at the .01 level (Expressive Language by County, Expressive Language by Urban/Rural Residence, and Social Skills by County), but none accounted for more than 2% of the variance. The correlations of DISC scores with the Blishen Index (Blishen, 1967), with the Education of the Male Parent, and with the Education of the Female Parent were also measured. No correlation was found to be significant.

Age correlates quite strongly with every DISC score, ranging from .97 on the Fine Motor scale to .92 on the Social Skills scale. In each case there was a strong quadratic component of DISC score predicting age.
When the subjects were broken into the 11 age groups and correlations were computed within age groups, the magnitudes dropped quite substantially with the restriction in range. Within the youngest age group the mean correlation between age and DISC score was .52. For the eldest age group the mean correlation was .04. The magnitude of the mean within-age-group correlation of DISC score and age was significantly correlated with the age of the group (r = .87, p < .05), congruent with the finding of a quadratic relationship between DISC score and age for the sample as a whole.

Item correlations: For the normative sample as a whole, correlations between passing or failing an item and the corresponding scale score were universally significant at the .05 level. The correlations of items at the young and old ends of the scale were substantially and significantly lower than those at the middle. These results are not very informative. Each item has a rather narrow age range in which it provides measurement variance. The large age-related changes in children from birth to five years overwhelm any differences between scales; because every scale correlates highly with age, every scale correlates with every other scale. The differences among the correlations of a given item with the eight DISC scales were tiny compared to the confidence intervals for correlations. Overall, 77% of items correlated most strongly with their own scales, or correlated equally highly with their own and another scale, but few showed significant differences from the seven other correlations. On this aggregate basis, the number of cases where an item correlated most strongly with its own scale departed significantly from chance levels.

When correlations were computed within age groups, the proportion of variance accounted for by age was sharply reduced, and other variance components (e.g. scale-specific variance) became more visible.
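The effect of restricting the age range on a correlation with age can be sketched numerically. The example below is only an illustration under stated assumptions: a score that rises linearly with age plus independent error, ages spread evenly over the range, and a hypothetical error variance (`ERROR_VARIANCE`) chosen so that the full-range correlation is about .95. None of these values comes from the DISC data.

```python
import math


def uniform_variance(width):
    """Variance of a variable spread evenly over an interval of this width."""
    return width ** 2 / 12.0


def expected_r(age_variance, error_variance, slope=1.0):
    """Correlation of age with score when score = slope * age + error."""
    explained = slope ** 2 * age_variance
    return math.sqrt(explained / (explained + error_variance))


# Hypothetical error variance (in squared score units), not a DISC estimate.
ERROR_VARIANCE = 32.4

# Full birth-to-60-month range versus a single six-month age group.
r_full = expected_r(uniform_variance(60.0), ERROR_VARIANCE)
r_band = expected_r(uniform_variance(6.0), ERROR_VARIANCE)

print(round(r_full, 2), round(r_band, 2))  # prints: 0.95 0.29
```

With the error variance held constant, narrowing the age range from 60 months to 6 months is enough on its own to pull the expected correlation from about .95 to about .29, which is the kind of drop described for the within-age-group correlations.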
To avoid the attenuating impact of strongly skewed distributions, only items with moderate pass/fail rates were sampled. We decided to examine the distributions of correlations rather than the individual values because the small sample size renders the individual correlations rather unstable. For example, with a sample size of 50 and three scores (say Item A, Item B and age), three correlations can be computed. With a correlation of .30 between Items A and B, and a correlation of .60 between Item A and age, Item B needs to have a correlation with age of .34 or less to be significantly different (p < .05) from Item A's correlation with age. The criterion established for computing a scale-item correlation was a pass rate for the item between .25 and .75.

Few of the correlations of an item with its own scale were significantly different from all seven alternates. However, the simple probability of selecting one predicted observation from among eight is .125. In effect, this means a statistical test with an alpha of .125 was done every time we looked for an item to be most strongly correlated with its own scale. For the 206 items sampled (some appearing in two age groups), there was only one case where an item failed to correlate most highly with its own scale. That particular item did correlate most highly with its own scale in another age group.

The correlation of an item with its scale is somewhat inflated because the variance of the item also contributes to the variance of the scale. For this reason, item by scale correlations are often computed after the item has been removed from the scale. When this was done for the 206 items sampled, 71% of the items correlated more strongly with their own scales than with any other scale. The design of the DISC required that items be distributed throughout a broad range of abilities in such a way that only a small number of items contribute to the variance of children at any given age.
For this reason, the loss of one item can produce a substantial drop in scale reliability, with a resultant attenuation of correlation. A simple manipulation of the formula for correction of attenuation (Jensen, 1980) can show this effect. The drop in the correlation will be proportional to the ratio of the square roots of the reliability of the scale including the item and the reliability of the scale with the item removed. For the 206 items sampled here, this resulted in a mean attenuation factor of .94. This loss of reliability puts the correlation of an item with its own item-removed scale at an average 6% disadvantage relative to its correlations with the other scales. Using the ratio of the square roots of the reliabilities before and after item removal, all item by scale correlations were individually corrected for this attenuation. After correction, 77% of items correlated more strongly with their own scales than with the other scales.

Reliability analyses: The reliability of each DISC scale was assessed by three correlations: the corrected split-half correlation, the correlation between scores determined by the tester and by an observer of the session, and the correlation between the scores obtained by two examiners testing the same child one week apart. The corrected split-half reliability coefficients ranged from .98 to .99 across the eight scales. For observer reliability, the values were consistently .99. Test-retest inter-tester correlation coefficients for the scales ranged from .94 to .98 over a one-week interval.

While computations of this sort are commonly reported for tests like the DISC, it should be noted that the items selected for the DISC scales are not normally distributed across levels of difficulty, but are in a rectangular distribution.
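The item-removal attenuation described above (the ratio of the square roots of the two reliabilities) can be sketched as follows. This is a minimal illustration, not the authors' actual computation: the function names are hypothetical, and the reliabilities and correlation in the example are invented values chosen to give a factor close to the reported mean of .94.

```python
import math


def attenuation_factor(rel_with_item, rel_item_removed):
    """Expected shrinkage of an item-scale correlation after item removal.

    Ratio of the square roots of the scale reliability with the item
    removed and the scale reliability with the item included.
    """
    if not (0.0 < rel_with_item <= 1.0 and 0.0 < rel_item_removed <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return math.sqrt(rel_item_removed) / math.sqrt(rel_with_item)


def correct_item_removed_r(r_item_removed, rel_with_item, rel_item_removed):
    """Undo the attenuation by dividing the correlation by the factor."""
    return r_item_removed / attenuation_factor(rel_with_item, rel_item_removed)


# Hypothetical values: reliability falls from .76 to .67 when the item is dropped.
factor = attenuation_factor(0.76, 0.67)
corrected = correct_item_removed_r(0.40, 0.76, 0.67)

print(round(factor, 3), round(corrected, 3))
```

A factor of about .94 means the item-removed correlation is expected to be about 6% smaller than it would be against a scale of undiminished reliability; dividing by the factor restores the comparison with the other seven scales, which lose no items.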
The DISC covers such a range of ability that a large proportion of the items (those that are obviously too easy or too hard for a child) add no precision of measurement to the assessment of any given child, but contribute substantially to measures of reliability when the complete age range is assessed.

One solution to this problem is to measure reliability for subsets of the sample stratified by age, as was done with the item-scale correlations. Because the developmental constructs being measured are strongly correlated with age, this procedure will reduce the variance substantially, and will consequently reduce the reliability. In fact, it is possible to manipulate the reliability to almost any desired value by restricting or expanding the range of ages included in any given sample. For this reason, it seems more appropriate to report the standard error of measurement as an index of the reliability of a scale at a given age. This index is reported in the same units as the scale but, more importantly, it grows with the spread of scores and shrinks as reliability increases. Thus, it tends to account for the "tradeoff" between variance restrictions and reliability.

The two youngest age groups in the norms span a three-month age range instead of the six-month age range of the older groups. All values were computed and reported for each of the two youngest groups, both separately (0 to 3, and 4 to 6 months) and combined (0 to 6 months), to allow reference to the three-month norm groups as well as comparison to the six-month age groups. Values were computed for each of the 10 six-month age groups and for the combined sample as well.

Table 3 reports the corrected split-half reliabilities for the various age groups for each of the DISC scales. The median reliability is .76. When the 10 six-month age groups are used, the mean reliability across scales correlates .81 with age (df = 8, p < .05).
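The tradeoff between score spread and reliability can be made concrete with the classical test theory formula for the standard error of measurement, SEM = SD * sqrt(1 - reliability). The sketch below assumes that formula; whether the tabled values were computed in exactly this way is an assumption, and the standard deviations and reliabilities in the example are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM, in the same units as the scale score."""
    if sd < 0.0:
        raise ValueError("standard deviation must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical narrow age group: small spread, modest reliability.
sem_narrow = standard_error_of_measurement(sd=2.0, reliability=0.76)

# Hypothetical full age range: large spread, very high reliability.
sem_wide = standard_error_of_measurement(sd=8.0, reliability=0.985)

print(round(sem_narrow, 2), round(sem_wide, 2))  # prints: 0.98 0.98
```

The two hypothetical samples differ sharply in reliability (.76 versus .985) yet give the same standard error of about one raw-score point, which is why the standard error is less sensitive than the reliability coefficient to how wide an age range is sampled.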
Table 4 reports the standard error of measurement corresponding to each value in Table 3. The standard error of measurement ranges from 0.42 to 1.47, with a peak in the 37-42 month age group and lows at either end of the age range. For the 10 six-month age groups, the mean standard error of measurement across scales correlates .90 (df = 7, p < .05) with a quadratic component of age. Standard errors of measurement for the Self Help and Social Skills scales are roughly twice those for the other six scales for children over 30 months.

Factor analyses: When the 10 six-month age groups were subjected to a common factor analysis, two age groups showed two eigenvalues greater than one, and the other eight showed only one eigenvalue greater than one. The first factor accounted for between 21% and 73% of the variance, with a median of 36%. The proportion of variance accounted for by the first factor dropped with age (r = .88, df = 8, p < .05). There was no significant correlation between the variance accounted for by the second factor and age (r = .13, df = 8, p > .05). A one-factor solution was chosen as the preferred uniform solution. All variables loaded positively on the single general factor. Self Help and Social Skills had relatively low loadings for all age groups older than 30 months.

Table 5 presents specificity values for each of the scales and age groups reported in Tables 3 and 4. Specificity values within six-month age groups are generally quite respectable. They range from -.17 to .69. Like the standard error of measurement, specificity values have a significant quadratic component with age (r = .93, df = 7, p < .05), with a peak just before age three. Because the theoretical minimum is .00, the negative values are examples of measurement error, and give some sense of the magnitude of instability of the estimates provided in Table 5. A specificity of -.17 reflects a cumulated error of at least .17, suggesting that positive or negative errors of that magnitude or greater could easily have occurred for other values. Twenty-eight of the 80 six-month age group specificity values are higher than the corresponding squared multiple correlations. Eleven of these 28 are from the Self Help and Social Skills scales. Except when this relationship reflects measurement error, it indicates that more of the reliable variance is unique to the scale than is attributable to factors common to the other seven scales of the DISC. Although there are clear instances of measurement error, the central tendency of the specificities is toward moderate positive values.

Discussion

In examining the relationships between DISC scores and demographic variables, the ANOVAs and the correlations resulted in 64 inferential statistics. Of these, only three were significant at the .01 alpha level. This indicates that, for the normative sample, DISC scores were relatively insensitive to a variety of socioeconomic factors. These demographic correlates were expected to be low because of the nature of the DISC and the nature of the normative population. For example, in research with intelligence, Jensen (1980) has reported that the correlation between socioeconomic status variables and IQ is lower for younger children than for older children. Although the DISC is not an IQ test, we expected a similar phenomenon, because the environmental effects of socioeconomic status on development appear to require time to be fully evident. We also expected a small relationship between socioeconomic status and DISC score because of the relatively homogeneous nature of the population being sampled, due to the exclusion of children with any apparent developmental delay.

A confidence interval for a DISC score will depend on the application. One-sided intervals would be appropriate for the DISC when it is used for the detection of delay.
A one-sided 95% confidence interval can be achieved for all of the scales in every age group by taking all values up to and including the observed score plus two. For more than three quarters of the scale by age group combinations (67 of 88), all values below the observed score plus one will give a one-sided 95% confidence interval. The Self Help and Social Skills scales account for 15 of the 21 standard errors of measurement high enough to require the larger confidence interval.

The development of the DISC was such that a reasonable amount of validity data can be culled from the internal characteristics of the normative database. The items were developed from tasks that were selected by experts as representative of the constructs measured by the scales they comprise. Because each item was chosen from a construct-specific pool, the extent to which an item is convergent with its own scale and divergent from other scales is a strong indication of validity.

Unfortunately, this theoretical truth has some practical problems. Balthazard and Woody (1985) have given a fascinating description of some of the problems inherent in factor analysis of scales of hypnotic ability. Like the DISC, hypnosis scales include a number of items designed to provide measurement variance more at certain levels of ability than at others. There are "easy" items and "hard" items. Much of their material is quite pertinent to the analysis of developmental scales in general and the DISC in particular. For example, any attempt to examine the factor structure of the underlying items of the eight DISC scales encounters the problem of dichotomous (pass/fail) items with different levels of difficulty. Items of similar difficulty have a higher maximum correlation than items that are far apart in difficulty. When DISC items were subjected to factor analyses not reported here, the resultant factors were defined more by item difficulty than by any other construct.
This exactly parallels what Balthazard and Woody (1985) found. Biserial and tetrachoric correlations aimed at correcting this problem apparently produce their own problems (Balthazard and Woody, 1985).

The factor structure of DISC items was not determined. Instead, a sample of item by scale correlations was examined, and factor analyses of the DISC scale scores (not items) were carried out within six-month age groups. In conjunction with this, scale specificities were measured using the procedure outlined by Silverstein (1976). Due to the instability of correlation-related statistics based on a sample size near 50, we have reported and assessed data by combining the results of groups of items rather than by focusing on each individual item.

When 208 item by scale correlations were sampled within 10 six-month age groups, items were found to correlate most highly with their own scales almost 100% of the time. When scales were corrected by removing the correlated item, 71% showed a highest correlation with the corrected scale, despite an average .94 attenuation of scale reliability. After correction for this attenuation, 77% of sampled items correlated most highly with their own scales.

The selection of factor analysis methods depends on the underlying model of the variables being examined. Two basic models could be offered for the factor structure of DISC scales. A common factors model assumes that the scales share one or more common underlying factors, with some variance being unique to each scale (specificity). A principal components model assumes that all variance is accounted for by factorial components, allowing no uniqueness. The DISC was clearly designed from an implicit common factor model, with "development" as a single common factor and all the rest of the reliable variance being specific to the scale. There is no simple statistical test that will allow a choice between these two factor models, but the relationships in the data may guide selection.
Eight of ten common factor analyses showed a second eigenvalue of less than one, suggesting a single-factor solution. There is a strong linear correlation between the age of the sample and the variance accounted for by the first common factor (r = .88), combined with the absence of a significant correlation of age with the simple variance. Scale specificity (and standard error of measurement) peaks in the 30-35 month age group as part of a significant quadratic relationship. Reliability correlates .81 with age and .71 with first factor variance. The correlation between reliability and the first factor is exactly what we would expect from an indirect relationship attributable to a common correlation with age (.81 × .88 = .71).

Our conclusion based on these observations is as follows. Up to age three, a single common factor with scale-specific unique variance accounts for the data. Within this range, the common factor accounts for less variance as age increases and the scales account for more variance. At age three, a small number of children begin to pass all items on a scale, and this phenomenon increases with age. This results in ceiling effects. We assume that these ceiling effects are the major contributors to the quadratic components of standard error of measurement and scale specificity, via restriction of range. It may also be the case that principal components factors become more appropriate after age three. If that is the case, the quadratic correlation of specificities with age represents the transfer of variance from the first factor to scale-specific variance before age three, and from scales to a second factor after age three.

In aggregate, these findings give a strong indication of the reliability and validity of the DISC. Scores are accurate to within one or two points, indicating good reliability. Content validity is based on the selection of items by experts from a well-developed task pool.
Criterion validity can be found in the correlations between items and scales that provide convergent and discriminant validity criteria. Despite strong mutual correlations with age, items tend to correspond most closely to their own scales. Construct validity can be seen in the factor structure. A single common factor can be seen most strongly in the age groups under three years, and is not inconsistent with the data from age groups over three. To the extent that the common factor model is valid, it suggests that the DISC measures general development and, at the same time, supports the notion that each scale has a unique, nonzero component of reliable variance.

STUDY 1

Concurrent Validity of the DISC, Denver and Binet

Although internal analyses can provide substantial validity data in a test like the DISC, it is more satisfying to use external criteria to assess concurrent validity. With the help of local agencies, we designed and implemented a small validity study that examined the concurrent scores of 40 children assessed with three commonly used tests. Every child was administered the DISC and the Denver Developmental Screening Test (Frankenburg and Dodds, 1969); 29 of the children were administered the Stanford-Binet (Terman and Merrill, 1972), and the other 11 children were administered the Bayley Scales of Infant Development (Bayley, 1969), too small a sample to report here. The primary basis for the selection of these tests was pragmatic; these were the tests already in common use with the target agencies. Of course, this means that these were the tests chosen by the target agencies as the most appropriate for the screening task they were attempting to complete, and they had been in routine use long before the availability of the DISC.

Method

In Study 1, 40 children referred to three community agencies for clinical assessment and treatment were assessed for developmental delays using three instruments.
These children were referred to the agencies because of physical handicaps (e.g. spina bifida and cerebral palsy), emotional problems, behavioural problems, family environment concerns, or speech delays.

Certain methodological compromises had to be made. Randomization of order of assessment was planned but not implemented because of the agencies' concerns about interference with normal procedures, so tests were administered in nonrandom orders. It was not possible to estimate order effects with this data structure. Moreover, agency and order effects were confounded, so agency effects were also indeterminate. The relationships reported here must therefore be considered to be confounded by both order effects and agency effects, and the results must be interpreted with this in mind. Note that the nature of the analyses is such that order and agency effects should have no selective impact on one test compared to another. Where there are scaling differences as a result of these confounds, correlation coefficients will be insensitive to them because the rank order is preserved. Where there are interactions of test and order, rank orders will be somewhat disturbed and correlations will be somewhat reduced. However, all tests should suffer equally, since the correlations are relational statistics, unlike tests of central tendency.

The DISC has eight scales and the Denver has four. Using a schema produced by Sattler (1965), it was possible to estimate scores for seven scales derived from the Binet. This produced a matrix of eight DISC scales by 11 other scales. Rather than examine a matrix of 88 correlations, each based on either 40 or 29 subjects, we devised a summary procedure for the results. Five judges (all child clinicians familiar with the DISC, Binet and Denver) were asked to classify each of the matrix entries according to their estimates of the relative magnitude of correlation they would expect (High, Low or Medium/Unknown).
The results were then tabulated and summarized. In 32% of the cases, all five judges agreed and in 28%, four out of five agreed. The 10 Binet and five Denver entries for which four or five judges predicted high values were labelled "High". The 14 Binet and 10 Denver entries for which four or five judges predicted low values were labelled "Low". The remaining 32 Binet and 16 Denver entries were labelled "Medium/Unknown".

Results
The number of items passed (and assumed passed for items below the basal) was computed for each scale of the DISC, Denver and Stanford-Binet (Sattler, 1965). The correlations between the eight DISC scales and the 11 other scales were computed. All of these zero-order correlations were significant at the .05 level. The 32 correlations between DISC and Denver scales were all .64 or greater. Correlations between the overall Stanford-Binet score and each DISC scale ranged from .41 to .84. Correlations between DISC and Sattler's Binet Scales ranged from .64 to .78.

The sums of scores on each of the eight DISC scales, and each of the scales of the other two tests, correlated significantly with age. To control for possible effects of this relevant third variable, partial correlations with age effects removed were computed for each combination of the DISC subtests with the Denver and Stanford-Binet tests and subtests. As would be expected, not every partial correlation is significant. The results of judges' expectations of correlations were used to sort the partial correlations into High, Medium/Unknown, and Low expectation groups. The pattern of mean partial correlations was as expected. The mean partial correlation for values expected to be high was .68 on the Denver and .62 on the Binet. The mean partial correlation for the values expected to be low was .40 on the Denver and .12 on the Binet.

Correspondence between the test score classifications was also measured.
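The age-partialled correlations described in the Results above can be illustrated with a minimal sketch. It assumes the standard first-order partial correlation formula; the paper does not state which computation was used, and the function names and all data values below are hypothetical, not values from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """First-order partial correlation of x and y with z held constant."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical scores for six children (NOT data from the study).
age_months = [12, 18, 24, 36, 48, 60]
disc_score = [5, 8, 10, 15, 19, 25]
denver_score = [4, 7, 11, 14, 20, 23]

zero_order = pearson_r(disc_score, denver_score)
age_removed = partial_r(disc_score, denver_score, age_months)
print(f"zero-order r = {zero_order:.2f}, partial r (age removed) = {age_removed:.2f}")
```

Because both scores rise with age, the zero-order value in such data is dominated by age; the partial value reflects only what the two scales share beyond age.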
On each DISC scale, scores were classified as: average or above, below average but normal, possible delay, or probable delay (Amdur & Mainland, 1984). These corresponded to scores above the 50th percentile, scores between the 25th and 50th, scores between the 10th and 25th, and scores below the 10th percentile. On the Denver scale, the results from a given child can be classified as: normal, questionable, or abnormal (Frankenburg & Dodds, 1969). On the Binet, the IQ score can be classified as: very superior, superior, high average, normal or average, low average, borderline defective, or mentally defective (Terman & Merrill, 1972). All results were classified, and the correspondences between DISC classifications and Denver or Binet classifications were measured using Kendall's Tau C and the gamma statistic. Only three values failed to show a significant correspondence: DISC Expressive Language by Denver, DISC Self Help by Binet, and DISC Gross Motor by Binet.

Although these statistics show significant correspondence, there were substantial discrepancies between DISC and Denver or Binet scores. When other scales identify problems, so does the DISC. On the Denver, 8 of 40 children showed scores in the abnormal range. In every case where the Denver showed an "abnormal" score, the DISC showed at least two scales with probable delay. On the Binet, 7 of 29 showed IQs less than 70. The DISC showed two or more delayed scales for every Binet IQ below 70. However, the DISC also identified problems in children passed over by the other tests. On the DISC, 31 children out of 40 showed two or more probable delays. Of the children with two or more probable delays, 22 had either questionable or normal scores on the Denver. On the Binet, 19 of 26 children with two or more DISC probable delays showed IQs greater than 70.

Discussion
Substantial correlations with selected Denver and Stanford-Binet scales demonstrated that the DISC was measuring similar constructs.
While the sample size is too small to assess each of the 88 correlations separately, an aggregate procedure showed the appropriate pattern of convergent and discriminant validity, even when age effects were partialled out. Thus, the DISC scales showed the requisite high and low values with respect to specific scales: Gross Motor to Gross Motor, Fine Motor to Fine Motor, and so on. This correspondence of sums of scores as measured by the correlations was supported by a tabulation of classification correspondences among the tests.

It is also the case that the DISC identified more children as delayed than did either the Denver or the Binet. All of these children had already been through a first stage screening process that indicated possible delay in some area before being referred for treatment and assessment. Only 20% of this sample would have been referred had they been screened with the Denver, 24% if a Binet IQ less than 70 had been the criterion, and 78% if a DISC criterion of two or more probable delays had been used. The low referral rate for the Denver is quite consistent with Cadman, Walter, et al.'s (1988) finding of a 94% false negative rate for the Denver scores in a large sample of children from South-Western Ontario. Unlike the Denver, the Binet was being used in an artificially restricted manner, and the low referral rate indicates simply that the Binet is inappropriately used in this role.

There is no absolute criterion for developmental delay. However, because the data were taken from children at the second stage of screening, we believe that the substantial majority of the sample had developmental problems of some sort. In this context, the 78% rate of positives is preferable to the rates for the Denver (20%) and the Binet IQ (24%) because fewer delayed children would have been missed. Note that these data do not address the issue of false positive identification.
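The three referral rates quoted in this discussion follow directly from the counts given in the Results of Study I. The arithmetic can be sketched as below; the helper name `referral_rate` is hypothetical and introduced only for illustration.

```python
def referral_rate(flagged, assessed):
    """Percentage of assessed children who meet a referral criterion."""
    return 100.0 * flagged / assessed


# Counts as reported in the Results of Study I.
criteria = {
    "Denver abnormal": (8, 40),
    "Binet IQ below 70": (7, 29),
    "DISC, two or more probable delays": (31, 40),
}

for label, (flagged, assessed) in criteria.items():
    rate = referral_rate(flagged, assessed)
    print(f"{label}: {flagged}/{assessed} = {rate:.1f}%")
```

Note that the Binet rate is based on the 29 children who received that test, while the Denver and DISC rates are based on all 40.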
STUDY II
Construct Validity of the DISC, Denver, and Binet

The process used to develop the DISC laid a good foundation for assessing construct validity. Judges assembled pools of tasks that they considered appropriate for each scale. This planful development allows the use of internal characteristics of the test to make inferences about validity. It is important, however, to assess construct validity against other criteria as well. We decided to attempt this task by using judges who were experts on the children being assessed by the DISC, rather than the tasks comprising the scales. If the DISC measures the eight constructs it purports to measure, then professionals who know children well should find DISC results to be congruent with their perception of a given child. We measured this by using a forced choice procedure where clinicians were asked to relate a set of results to a particular child.

Method
Workers from the agencies in Study I were asked to compare test results with their impressions of children they knew well. Within each of three agencies, the children who had been the subjects of Study I were paired on the basis of age and sex. Four data sheets were prepared for each pair of children: one sheet recorded the children's names; a second sheet recorded the classification of delay for each child for each of the eight DISC scales, labelling one list of results A and the other B; a third sheet reported the Denver classification for each of the children; and a fourth recorded the Binet (or Bayley) classification. The A and B labels were assigned randomly on each page. The same child could be "A" on the DISC page, but "B" on the Denver or Binet pages. The children's names and the prepared protocols were given to workers familiar with both children. The workers included: a psychologist, a psychometrist, a physiotherapist, a speech therapist, behaviour therapists, and resource teachers.
Workers were asked to identify independently which child was "A" and which was "B" in each case, for each test. Protocols were prepared for 11 pairs of children. All had DISC and Denver data, nine had Binet data, and two had Bayley data. The Bayley data were too sparse to report here.

Results
The test classifications did not always differentiate one child from another. The two lists of eight DISC scores presented to judges always differentiated the two pair members. The Denver score differentiated only two of 11 pairs of children (i.e. in nine pairs, the Denver classifications were the same for both children). The Stanford-Binet IQ classification (i.e. "mentally defective", "borderline defective", "low average", etc.) differentiated between the members of seven of nine pairs. When classifications were identical, judges were not asked to try to differentiate them.

The number of workers asked to discriminate between a given pair of children varied from agency to agency. One agency had one worker for each of two pairs, one agency had two workers for each of two pairs, and one agency had six workers for each of seven pairs. The failure of the test classifications to discriminate between certain pairs of children, and the variable number of raters, led to considerable variation in the number of ratings for each test. There were 48 ratings for DISC results, 34 for Binet, and seven for Denver.

The staff making the choices correctly matched results with children 81% of the time with the DISC, and 71% of the time with each of the Denver and the Stanford-Binet. In each case, the staff were able to match children with results at a rate greater than chance (p < .05, based on binomial probabilities).
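A binomial comparison against chance of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' computation: chance is taken as .5 because each rating is a forced choice between two children, and the count of 39 correct DISC matches is inferred as the whole number closest to the reported 81% of 48 ratings. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumed count: 39 of 48 is the whole number closest to the reported 81%.
disc_correct, disc_ratings = 39, 48
p_value = binomial_upper_tail(disc_correct, disc_ratings)
print(f"P(X >= {disc_correct} of {disc_ratings} by chance) = {p_value:.2e}")
```

The sketch treats ratings as independent, which is itself an assumption, since several workers rated the same pairs of children.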
When the failure of staff to discriminate correctly is combined with the failure of the test classifications to discriminate between children, the rate of successful classification is significantly diminished for the Stanford-Binet intelligence categories (52%) and the Denver categories (10%), but the rate for the DISC remains the same (81%).

Discussion
In her bachelor's thesis, Illman (1987) compared the results of preschool teacher ratings with independently obtained DISC scores. When pairs of teachers predicted DISC scores, the multiple correlations ranged from .55 to .87. Illman's type of study can be considered to provide information about how well the differences between children's scores on a given DISC scale predict teachers' ratings of them. Study II assesses how well the differences among DISC scores of a given child provide information that distinguishes one child from another in a way that workers can recognize. The present study assesses the accuracy of the information provided by the pattern of DISC scores, not the accuracy of single DISC scores.

When children's results were reported as simple summaries, adults familiar with the children were able to match children with test results about 70-80% of the time. The discriminations using DISC results were marginally correct more often than those using either Stanford-Binet or Denver results, but the differences were not statistically significant. The most important findings, however, are those which indicate how often the children's patterns of test scores provided enough accurate information to allow discrimination among them. It is here that the DISC stands out, allowing 81% accurate discrimination while the Binet allows 52%, and the Denver 10%. Thus the DISC provides more discriminating information in this role than the other tests, and that additional information is at least as accurate as that provided by the Binet and Denver summaries.
This result serves to support the assertion that the higher "referral" rates found for the DISC in comparison to the Binet and Denver in the previous study are appropriate. The DISC is more likely to find probable delays, it is more likely to provide information unique to a given child, and staff are at least as likely to identify unique DISC information with a given child as they are to identify unique Denver or Binet information with a given child.

It is important to keep in mind that this study is small and was based on using the DISC in its intended role as a second stage screen, using the Binet and Denver inappropriately. The DISC scores are more descriptive of the children than the Stanford-Binet or Denver summary scores. We would expect a substantial increase in the valid information produced by the Binet in the hands of a carefully trained and experienced clinician. The Denver is a first stage screen, easier to administer to lots of children, but providing a great deal less information than the DISC.

General Discussion
The findings presented have addressed the development, norming, and standardization of the DISC, presenting reliability data and some basic validity data from both the normative sample and a small clinical sample. The reliability data indicated substantial agreement and stability within scales. The reliability is such that the upper bound for a one-sided 95% confidence interval will be the observed score plus one for 75% of the six-month age group by DISC scale combinations.

Item by scale correlations were sampled in a variety of ways. Given the instability of the correlation coefficients, the frequency with which items correlated most strongly with their own scales was quite acceptable. The pattern of item by scale correlations provides some evidence of criterion validity (both convergent and discriminant).
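The one-sided 95% confidence bound mentioned above can be illustrated with a minimal sketch. It assumes the classical standard error of measurement, SD × √(1 − reliability), which the text does not confirm as the method used; the function names and every numeric value below are hypothetical and are not taken from the DISC norms.

```python
import math
from statistics import NormalDist


def sem(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def one_sided_upper_bound(observed, sd, reliability, level=0.95):
    """Upper limit of a one-sided confidence interval around an observed score."""
    z = NormalDist().inv_cdf(level)  # about 1.645 when level is 0.95
    return observed + z * sem(sd, reliability)


# Hypothetical values, NOT taken from the DISC norms.
observed_score = 14
scale_sd = 1.5
scale_reliability = 0.90

upper = one_sided_upper_bound(observed_score, scale_sd, scale_reliability)
print(f"Observed {observed_score}, one-sided 95% upper bound {upper:.2f}")
```

With these illustrative values the bound falls less than one point above the observed score, which is the kind of result the "observed score plus one" statement describes.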
Factor analyses indicated that a common factor model with one factor was a good fit to most age groups, especially those under three years of age. Almost every scale had specific variance at almost every age level. The proportion of variance that was scale specific rose with age to three years and fell after that.

Two small validity studies were also presented. The pattern of results reported in these studies combines with the reliability studies to indicate that the DISC is a very promising instrument for the purpose it was designed to serve. The DISC was more likely to lead to a referral than the Denver or the Binet IQ classifications; DISC scores provided more information than Denver or Binet summary scores, and the DISC was at least as accurate in the additional information it provides.

The DISC has a number of limitations. A good example is the low reliability of some scales in the oldest age group. In fact, this is not a serious problem when the test is used as it is intended. If a child in one of the oldest two age groups is identified as delayed, his or her performance will have been determined by achieving scores typical of children in younger age groups with higher reliabilities (e.g. a 60-month-old child would have to achieve scores typical of a 48-month-old child or younger to be identified as probably delayed on any scale).

The normative population imposes certain limits as well. All of the children were from South-Western Ontario, and the application of these norms to another region may or may not be appropriate. Our experience with users from many areas of Canada suggests that the test generalizes well. However, agencies using the DISC are strongly encouraged to create local norms. The exclusion of apparently delayed children from the normative sample means that the norms are not representative of the entire population.
Thus, the test is technically limited to making inferences about the likelihood that a particular child is a member of the particular population of children "with no apparent delays".

At present, the DISC research team has a number of research projects involving the DISC in progress: we have completed a French standardization and norming (Amdur, Mainland, Parker, & Portelance, in press), we have virtually completed a very brief first stage screen based on the DISC, and we are analyzing longitudinal data on a group of high-risk infants. In each of these projects, there were validity studies embedded in the development of the tests. Yet, some significant gaps remain in the research database for the DISC. Of special concern is the need for solid data on the predictive validity of the DISC and data on the frequency of false positive referrals. We are also aware that the concurrent validity studies we report here are small and methodologically imperfect. We strongly encourage the third party assessment of DISC validity and reliability, especially in these areas of concern.

References
Amdur, J.A. & Mainland, M.K. (1984). The Diagnostic Inventory for Screening Children. Kitchener-Waterloo Hospital.

First submitted 3 January 1989
