The Development of the DISC Pre-School Screen (DPS)


ABSTRACT

The DPS is a first stage screen designed for the detection of developmental delay in children from 5 to 52 months (and up to 59 with proration of scores). The test was developed using Item Response Theory techniques. The development of the test is described and classic psychometric data are reported along with data on the Item Response Theory parameter estimates. Each child is administered 12 items, chosen on the basis of the child's age in months. Corrected split-half reliability is .77 with a standard error of measurement of 1.2 items. Design features which maximise discrimination at the decision point suggest that these are lower bound estimates on precision. Scores on the DPS correlate -.68 with a single question asking about parental concern with development and -.64 with the same question asked of pre-school teachers.

The Diagnostic Inventory for Screening Children (DISC) was developed in the early 1970s as a test allowing "second stage" screening of children referred for developmental assessment (Amdur, Mainland, & Parker, 1988). The concept of a second stage screen came from the demand for preliminary assessment of children who had already been identified as possibly delayed and referred for further assessment. At the Child and Family Clinic at Kitchener-Waterloo Hospital it made sense to do this second stage screening before a complete multi-disciplinary assessment because of the number of pre-schoolers being referred.

The DISC proved popular in South-Western Ontario with a number of agencies including public health, pre-schools and infant development programmes. As a growing number of agencies began to use the test we discovered a somewhat disconcerting phenomenon. Many agencies were choosing to use the DISC as a first stage screen for children for whom there was no reason to suspect delay. While we were reasonably confident of the appropriateness of the DISC for the task of second stage screening of referred children, we were less certain of the appropriateness of the DISC for first stage screening. Moreover, we were confident that we could produce a better first stage screen than the DISC using elements of the DISC.

We interviewed agency staff who were using the DISC as a first stage screen, and they indicated that they were uncomfortable with other instruments, usually focusing on the Denver Developmental Screening Scale (Frankenberg and Dodds, 1964). Many workers indicated that their agency policies required them to use the Denver, but that they rarely saw children who were delayed enough to produce abnormal results on the Denver. This was true even when they were confident that many of the children that they had assessed were truly delayed. They found that the DISC provided them with data that supported their impressions of children's delays more frequently than the Denver, and that DISC data were acceptable to agencies receiving referrals.

This impression was supported when we examined some data we collected in a validity study using the DISC (Parker, Mainland and Amdur, 1990). In this study, 40 children who had been referred to treatment agencies were assessed with the DISC, Denver, and either the Stanford-Binet (Terman & Merrill, 1972) or the Bayley Scales of Infant Development (Bayley, 1968). These children had been referred with spina bifida, cerebral palsy, speech delays, behaviour problems, emotional problems, or concerns about the family environment. Among these children, it was clear that the DISC was much more sensitive to delays than the Denver. The Denver produced "Abnormal" results for only 8 of the 40 children (20%), and a further 16 were given "Questionable" results. Seven of the 29 children assessed with the Stanford-Binet had IQs less than 70, and the Denver was largely identifying these children with retardation syndromes, and ignoring children with less profound or more specific delays. The DISC results showed that 31 of the children had two or more scales with "Probable Delay", and four more children had one scale with a "Probable Delay". The DISC results were quite congruent with the clinical judgements of the agency workers, suggesting that the DISC was not only more sensitive to delays than the Denver, but also at least as specific.

With these results in hand, and the clear local demand for a primary screen that had properties like the DISC, we decided to develop a primary screening device based on the DISC. Like the DISC, the DISC Pre-school Screen (DPS) is a test developed in clinical settings to meet a specific clinical need.

Development of the DISC Pre-school Screen

Once we had decided to develop a first stage screen, the obvious first step was to exploit the items and data available from the DISC. Access to the DISC item pool gave us a strong head start in developing a first stage screening test. We had at our disposal a pool of 216 standardised and normed items covering eight major domains of child development. We also had easy access to a network of agency workers who provided suggestions and support. The strong support of agency staff familiar with the DISC allowed us to experiment with the structure of the DISC Pre-school Screen (DPS) and to try out our successive drafts with minimal delay.

The first draft of the DISC Pre-school Screen (DPS 1.0) was prepared by selecting seven items from each of the eight DISC scales (total of 56). We selected items that required a minimum of equipment for administration. Items were ordered so that a child would be given one item from each scale in a total administration of eight items. Start points were chosen according to the start points used on the DISC. Minor issues of format and preliminary estimates of validity were addressed in a study of 40 children who were given both DISC and DPS 1.0 administrations (Butler, 1985). A modified first draft was prepared and circulated to agencies that were already using the DISC. We asked workers familiar with the DISC to examine the DPS 1.0, try it out and offer their comments. At the same time, we applied for and were given a grant from the Hospital for Sick Children Foundation to develop the DPS.

By the fall of 1988 we had data from 412 screen administrations, courtesy of agencies that had used the untried screen in trial projects and of master's thesis research by Sharyn Pope at the University of New Brunswick. The sample has many unknown characteristics, because it was a sample of convenience.

Item response theory was the primary psychometric model for the development of the DPS (e.g. Hambleton and Swaminathan, 1985). This family of statistical models allows test development to proceed based on the relationship between item performance and test scores, largely independent of the composition of a particular sample of test takers. Any given test is assumed to be a sample of items measuring some construct (referred to as a "latent trait"). The psychometric characteristics of the test are a function of the properties of the items included in the test. Item characteristics can be described using one, two, or three parameters which define a function relating the probability of passing an item to the ability of the test taker. The precision of the function is maximised by adjusting the three parameters using maximum likelihood estimation techniques. In practice, a computer programme adjusts the three parameters in an iterative fashion to find the values that have the highest probability of reproducing the sample data.

A three parameter item response model estimates: 1) the difficulty of an item (the test score that corresponds to a 50% pass rate), 2) the discrimination of an item (a measure of how quickly the probability of passing an item rises from zero to one as a function of the ability of the test-taker) and 3) the pseudo-guessing level (the probability of a test taker passing an item beyond his or her ability). Pseudo-guessing is most relevant to multiple choice tests where a subject can easily be correct by chance. For all practical purposes, pseudo-guessing is zero in a developmental test, so we chose to set the third parameter equal to zero and concentrate only on discrimination and difficulty -- a two parameter model.

The final item response model will describe the probability of passing an item in a two-parameter model as a function in which the difficulty of the item and the discrimination of the item are constants and the ability of the subject is the only variable. Ability is a latent trait that must be estimated concurrently with the discrimination and difficulty parameters in the iterative estimation procedure. Rather than use this classic two-parameter model, we chose a variant of the model. We chose to substitute age for the latent trait of subject ability. This modification embeds a validity component into the model, and reduces the number of estimated parameters by one.
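As a minimal sketch of the age-based two-parameter model described above, the function below assumes a logistic item curve (the text does not state whether a logistic or normal-ogive form was used). The names `pass_probability`, `difficulty_months` and `discrimination`, and the example item values, are illustrative and are not taken from the DPS materials.

```python
import math

def pass_probability(age_months, difficulty_months, discrimination):
    """Two-parameter logistic item curve with age standing in for ability.

    difficulty_months: age at which half of the children pass the item.
    discrimination: controls how steeply the pass rate rises with age.
    The logistic form is an assumption; pseudo-guessing is fixed at zero.
    """
    exponent = -discrimination * (age_months - difficulty_months)
    return 1.0 / (1.0 + math.exp(exponent))

# Hypothetical item: half of 24-month-olds pass, discrimination 0.30 per month.
for age in (18, 24, 30):
    print(age, round(pass_probability(age, 24.0, 0.30), 3))
```

Because age is observed rather than estimated, only the two item parameters remain to be fitted for each item.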

The use of an eight-item screen for the DPS 1.0 proved to have been an unfortunate choice. The span of only eight administered items did not provide enough information for a detailed statistical analysis. Many items were found to be misplaced according to the two-parameter model, so the Trial 1.0 data were inadequate for estimation under the two-parameter model, and questionable for many classical analyses.

When the DPS 1.0 was used with children, eight items were administered to each child. The eight items to be administered to a child were chosen on the basis of the child's age in months. In order to test the adequacy of the start points, the relationship between age and DPS 1.0 score (out of eight) was examined. If the start points were appropriate there should be no significant correspondence between age and DPS 1.0 score. This is because we planned that the use of age-dependent start points would partial out the effects of age from the scores and the interpretation of test scores could be independent of age. In fact there were significant effects for both the Trial 1.0 (Chi-square = 517.0)

It was apparent from the analyses of the DPS 1.0 that the items from the Self Help and Social Skills scales had the lowest discrimination parameters. Of 10 items with discrimination parameters under 0.30, seven were from the Self Help and Social Skills scales, and the other three were from items with difficulty levels outside the range of measurement of the scale (and hence with suspect discrimination parameters). This was consistent with previous findings (Parker, Mainland & Amdur, 1991). The Self Help and Social Skills scales are psychometrically sound, but have the lowest reliability, lowest loading on a single common factor of development and highest specificity. While these scales have more than adequate characteristics on their own, they do not merge completely with the other six scales on the DPS 1.0.

DPS 2.x & 2.0: Development

The DPS 2.0 was developed in two stages. In the first stage, the DPS 1.0 item set was reduced by deletion of all Self Help and Social Skills items. Our modified item response theory two parameter model (i.e. estimating difficulty and discrimination, but using age instead of a latent trait) was recalculated based on data derived from the DISC normative sample for the reduced item set. One of the advantages of item response theory is that it provides a means of estimating standard error of measurement as a continuous function of score. When this was done, it was possible to identify gaps in the item coverage and to determine the difficulty level of items required to fill the gaps.
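The idea of a standard error that varies continuously along the scale can be sketched with the usual information function of a two-parameter logistic model, in which each item contributes a² · P · (1 − P). This is an illustration only: the logistic form, the function names and the item values are assumptions, and the standard error is expressed here on the age scale because age replaces the latent trait. A gap in item coverage shows up as a region where the standard error rises.

```python
import math

def item_information(age, difficulty, discrimination):
    """Information contributed by one item at a given age (logistic form assumed)."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (age - difficulty)))
    return discrimination ** 2 * p * (1.0 - p)

def standard_error(age, items):
    """Standard error at a given age: one over the root of the summed information."""
    total = sum(item_information(age, b, a) for b, a in items)
    return 1.0 / math.sqrt(total)

# Hypothetical item set (difficulty in months, discrimination) with no items
# between 12 and 24 months.
items = [(6.0, 0.4), (9.0, 0.4), (12.0, 0.4), (24.0, 0.4), (27.0, 0.4)]
for age in (9, 18, 27):
    print(age, round(standard_error(age, items), 2))
```

In this toy item set the standard error is largest at 18 months, which is where a new item of matching difficulty would add the most information.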

In the second stage, new items from the six DISC scales other than Self Help and Social Skills were added to the draft version of the DPS 2.x and the second stage version was analysed. Three times as many items as required were included and assessed in the preparation of the DPS 2.x. Generally, those items with the best discrimination parameter were retained. Where differences were small, items were chosen to minimise clustering of items from the same DISC scale, and items with little or no administrative apparatus were chosen when possible. This produced a 54-item scale which was analysed using Norm Group data.

Method

Subjects.
All data for this analysis were derived from the DISC normative sample.

Analyses.

A data set for DPS 2.x items was abstracted from the DISC normative sample. Following modification, data for DPS 2.0 were obtained in the same way. These data were subjected to a two parameter analysis using the Norm Group data. Both DPS 2.x and 2.0 included all the items from the DPS 1.0 except those from Self Help and Social Skills, and six new items from the DISC were added, as described above. The two versions differed only in the start points for administration. The source scales for the items are listed in Table 1.

Results

The data for the two-parameter model are listed in Table 1. The mean discrimination parameter for the revised version is significantly higher than the first version (t = 2.68, df = 106, p. < .01).

Determination of start points for the DPS 2.x was an iterative process. It was decided to administer 12 items to each child in the DPS 2.x. Because the DPS 2.x is intended for the detection of delay, it was decided to maximise the discrimination power of the 12-item sets at a level below the mean for a given age on the Norm Group data, and as close as possible to the predicted value of the cut-off criterion. Most scales maximise discrimination at about the mean of the sample distribution. The design of our scale requires that discrimination information be a maximum at the criterion for detection of delay, which would be substantially lower than the mean score.

In order to adjust start points, we treated the 54 items of the DPS 2.x as if they were a 54-item scale with all items administered to each child. Regression equations were computed for the 54-item mean score for a given age in months as a quadratic function of age (R = .995, F = 51923.8, df = 2, 564) and for standard deviation as a function of age, mean score and mean score squared (RA = .466, F = 127.6, df = 3, 439). This was not a standard regression analysis as the values were mean scores weighted by sample size rather than raw scores. This technique tends to inflate the FA and RA statistics and also stabilises the point estimates of the means (which is why we used it). Using estimated mean scores and standard deviations, an estimate of the score corresponding to z = -1.0 (16th percentile) was determined for each age in months. Using these scores, start points were chosen so that for each age group, the seventh item in the 12 item screen had a difficulty level that corresponded to the estimated z = -1.
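The start-point rule can be sketched as follows. This is a hedged illustration, not the procedure actually used: it assumes items are numbered in order of difficulty, so that a 54-item score of s corresponds roughly to item number s, and every regression coefficient shown is a hypothetical placeholder rather than a value from the DISC norms.

```python
def predicted_mean(age, c0, c1, c2):
    """Mean 54-item score as a quadratic function of age in months."""
    return c0 + c1 * age + c2 * age ** 2

def predicted_sd(age, mean, d0, d1, d2, d3):
    """Standard deviation as a function of age, mean score and mean score squared."""
    return d0 + d1 * age + d2 * mean + d3 * mean ** 2

def start_item(target_score, n_items=54, set_size=12, pivot=7):
    """1-based start item that places the pivot-th administered item at the target score.

    Assumes item number and difficulty rank coincide (an assumption of this sketch).
    """
    start = round(target_score) - (pivot - 1)
    return max(1, min(start, n_items - set_size + 1))

# Hypothetical coefficients, for illustration only.
age = 30
mean = predicted_mean(age, 1.0, 1.4, -0.008)
sd = predicted_sd(age, mean, 1.0, 0.0, 0.1, -0.001)
target = mean - sd          # score corresponding to z = -1.0
print(start_item(target))
```

The clamp keeps the 12-item window inside the 54-item pool, so the latest possible start is item 43.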

This means that a child with an ability one standard deviation below the mean will have six items easier than his or her ability level and six items at or above his or her ability level. This means that the child will get as many questions as possible that provide information about ability for that child, and the resultant score will be as precise as the item set allows. By way of contrast, a child who is at the mean will have about ten items easier than his or her ability level (this varies with age) and only about two at or above his or her level. As a result, there are not as many items providing good information about this subject and the estimate of score is less accurate. Thus the 12 items for each age group were chosen to maximise discrimination among the children performing at the 16th percentile, at the cost of lower precision at other ability levels.

When the distribution of 12-item DPS 2.x scores based on DISC normative data was examined, we had achieved the discrimination characteristics we were seeking. These data were calculated including only children older than 4 months and younger than 51 months to avoid floor and ceiling effects. However, despite our best efforts to reduce the relationship between age and screen score by choosing the correct start points, the relationship was still significant (Chi-square = 625.9, df = 540, p. < .01, Cramer's V = .35) albeit substantially reduced. There was a small but significant correlation between age and 12-item DPS 2.x score (r. = .197, df = 416, p. < .001).

On inspection, it could be seen that the regression equation had produced start points too easy for the youngest children and too hard for the eldest children (a regression to the mean phenomenon). Adjustments were made to the slope of the regression equation (only) to minimise this effect, and produce the DPS 2.0. When the revised data were analysed the relationship with age had been eliminated (Chi-square = 547.8, df = 540, p. is not significant, Cramer's V = .33). There was no significant correlation between age and 12-item DPS 2.0 score (r. = -.044, df = 416, p. is not significant).

Table 2 lists classic psychometric measures for both the 54-item (i.e. as if we had administered every item to each child) and 12-item analyses (i.e. as if we administered only 12 items, with the start point carefully chosen to match the age of the child) of the Norm Group data configured as the DPS 2.0. Note the substantial increase in the reliability of the 12-item version over the 8-item DPS 1.0 (Spearman-Brown increased from .58 to .80).

The data in Table 3 indicate the distribution of scores on the 12-item DPS 2.0 for children from 5 to 50 months. The cumulative percentages indicate that there is substantial leeway in the choice of potential cut-off scores depending on the proportion of children an agency chooses to refer. If all children with scores of 6 or less are referred for further testing, these data suggest that approximately 8% of children from a normal population would be referred. A criterion of 5 or less would refer about 6% of tested children. A criterion of 7 or less would refer about 15%. These estimates will depend on the accuracy of the DISC norm data to predict DPS 2.0 performance and were to be considered as estimates until data were collected from a new sample. If the population of children being screened is a high risk group of some kind (e.g. low birth weight children), the expected referral rates will be substantially higher because the proportion of delayed children is higher. Given the design of the DPS 2.0, the proportion of referred children ought to be relatively constant regardless of age in a uniform population.

Trial of DPS 2.0

We decided to try the DPS 2.0 with large numbers of children to establish preliminary normative data and some validity data on the relationship between DPS scores and two other kinds of measures. Once again, we were given strong agency support.

Method

Subjects.

Data came from 12 sources throughout Ontario, including a very large data set from the Elgin-St. Thomas Public Health Unit, which used the draft screen on a trial basis for pre-kindergarten screening of about 800 children.

From the perspective of a researcher, the data were collected in a non-systematic manner. Although the people doing the data collection were well trained, competent, and systematic in their approach to using the test, there was no control over subject selection. One outcome of this was the odd age distribution of the subjects. There were only 49 children tested 40 months of age or younger. There were 592 from 41 to 51 months inclusive, and 300 over 51 months. Thus the youngest group of children was too sparse for solid statistical analysis.

Analysis.

In analysing the data, the first problem was how to deal with the very large number of protocols with missing item data and refused items. We decided to try each of three strategies and assess the results. When both missing and refused items were treated as missing data, only 697 cases were available with complete data, and the corrected split-half reliability was .58. When misses and refuses were both treated as failures, the corrected split half reliability was .70 for 941 cases. When misses were treated as missing data and refuses were treated as failures, the corrected split-half reliability was .63 for 822 cases.

We chose to treat refused items as if they had been failed, because doing this increases the internal consistency. Although the same could be said of items missed by the examiners, we decided that we could not justify calling an item not administered by the examiner a failure by the child, despite the improvement in reliability. Therefore the scoring convention used in subsequent analyses is that a missed item is treated as missing data and a refused item is treated as a failure to pass the item.
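The scoring convention and the corrected split-half computation can be sketched as below. Several details are assumptions made for illustration: the response labels, the odd-even split (the text does not say how the halves were formed), and the function names.

```python
import math

def score_protocol(responses):
    """Return 0/1 item scores, or None if the examiner missed any item.

    responses: list of "Yes", "No", "Refuse", or None (item not administered).
    Refusals count as failures; a missed item makes the protocol incomplete.
    """
    if any(r is None for r in responses):
        return None
    return [1 if r == "Yes" else 0 for r in responses]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corrected_split_half(protocols):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    scored = [s for s in map(score_protocol, protocols) if s is not None]
    odd = [sum(s[0::2]) for s in scored]    # 1st, 3rd, 5th ... items
    even = [sum(s[1::2]) for s in scored]   # 2nd, 4th, 6th ... items
    r = pearson(odd, even)
    return 2 * r / (1 + r)                  # step up to full test length
```

Dropping incomplete protocols while scoring refusals as failures mirrors the third of the three strategies compared above.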

The structure of the DPS 2.0 was based on item response theory. Specifically a two parameter model was used, assessing item difficulty and item discrimination with respect to age. Because the sample sizes were relatively small for such an analysis, it was decided that a two-parameter model could not be estimated with any precision. We also noted that the structure of the test favoured a one-parameter model. The use of a constant number of items (12) with start points shifting with age causes the items to be treated as interchangeable units, differing only in difficulty, not discrimination.

A Rasch single parameter model (i.e. estimating only difficulty and assuming that discrimination is constant from item to item) was estimated in each case. Where numbers of subjects warranted, Rasch difficulty levels were estimated for each item. In order to complete this analysis, it was necessary to break the subjects into groups with the same start point (who therefore received the same set of 12 items) and analyse each group separately. The normal item response theory latent trait estimate of subject ability had to be used in this case rather than age, because all children in each group were of the same age.

Results

The results of the estimates of item difficulty are listed in Table 4. Note that item difficulty is on a common scale within a column, but that each column is scaled differently. It was apparent from the analysis of item difficulty that some items were consistently misplaced in order of difficulty on this draft (e.g. item number 52, which is easier than every item numbered higher than item 46 and some lower than 46). As a result of this analysis it was decided to reorder the items as indicated in Table 4 for use with DPS 3.0.

One impact of this lack of rank order would be to reduce the internal consistency of the 12-item scale collapsed over specific items. The intent of the DPS is for any particular item to function as an effective first (easiest), second, third, and so on up to the twelfth (hardest) item, depending on the age of the child being tested. Thus the same item plays twelve different difficulty roles in the scale depending on age. This can only work properly if the items are ordered and spaced properly. It was apparent that the item ordering for these older items was not correct. For this reason, the split-half reliability of the scale was computed separately for each start point (and therefore each particular set of twelve items) and the weighted mean was computed using a technique outlined in Hedges and Olkin (1985) for use in computing mean correlation. For the younger ages, the sizes of samples with the start point were often as small as one or two children, so we only report data from samples with at least 30 children. The result was a mean corrected split-half reliability of .66 for 765 subjects who were started with items 36 to 43.
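One common way to average correlations across groups is to transform each to Fisher's z, weight by n − 3, and transform the weighted mean back. The sketch below assumes that approach; the text cites Hedges and Olkin (1985) without naming the exact procedure, and it does not say whether the Spearman-Brown correction was applied before or after averaging. The example values are hypothetical.

```python
import math

def weighted_mean_correlation(rs, ns):
    """Average correlations through Fisher's z, weighting each by n - 3."""
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]            # Fisher z transform
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)                     # back to the correlation scale

# Hypothetical split-half values for three start-point groups.
print(round(weighted_mean_correlation([0.60, 0.70, 0.65], [50, 200, 120]), 3))
```

The n − 3 weights reflect the sampling variance of Fisher's z, so larger start-point groups dominate the mean.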

It is convenient to compute a confidence interval that matches the use of the test. With a standard deviation of 2.0, the standard error of measurement becomes 1.16. The interval from the centre of one item's range to its edge would be 0.5. (Note that a test score of 6 occupies a theoretical range from 5.5 to 6.5, a width of 1.0, with the assigned score occupying the centre of the range.) The distance from the centre of one item's range to the far edge of the next item's range would be 1.5, which produces a one-sided 10% confidence interval (given an SEM of 1.16). Assuming that a score of six or lower is chosen as the criterion for suspecting delay, if a child produces a score of six, then the chances that the true score was as high as eight would be 10% or less under classical test theory. The confidence interval is one-sided because we don't care if the score is less than 4 -- it won't change our interpretation in the least. The same argument can be made with a criterion of seven or lower: a score of seven has a 10% or smaller chance of coming from a true score as high as nine. Thus, whatever criterion is used, a one-item buffer or "don't know" range should account for uncertain scores.
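The arithmetic behind this interval can be checked with the classical formula SEM = SD · sqrt(1 − reliability) and a normal tail probability. The sketch assumes that the .66 reliability reported above is the value used and that measurement error is normally distributed; the function names are illustrative.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM."""
    return sd * math.sqrt(1.0 - reliability)

def upper_tail(z):
    """One-sided probability that a standard normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

sem = standard_error_of_measurement(2.0, 0.66)
tail = upper_tail(1.5 / sem)    # 1.5 score units from the centre of the obtained score
print(round(sem, 3), round(tail, 3))
```

Under these assumptions the SEM comes out at roughly 1.17, in line with the reported 1.16, and the one-sided tail is close to 10%.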

The intent for the DPS is that the scores obtained will be invariant with age. For children from 39 months to 52 months of age, scores were cross-tabulated with age. These age groups were selected because they have a large enough sample size, and are within the design range of ages where there is no need to prorate for age. After elimination of children with missing items or with incorrect start points, there were 485 subjects. Chi-square with 143 degrees of freedom was 198.2, which is significant at the .01 level. This indicates that there is a significant relationship of some kind with age. Cramer's V for this data set is .19, indicating that the magnitude of the relationship is small, and Spearman's Rho is -.03, indicating that it is not a linear trend. Inspection of the table indicates that performances at three ages (42, 46, and 51 months) were either better than average (42 months) or worse than average (46 and 51 months). The distortions were small -- one item or less -- suggesting that examination of the means would be worthwhile.
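Cramer's V can be recovered from the reported Chi-square as sqrt(Chi-square / (N · (k − 1))), where k is the smaller of the number of rows and columns. The table dimensions used below are an inference rather than a reported fact: 143 degrees of freedom with 14 monthly age groups imply 12 score categories, since (12 − 1) · (14 − 1) = 143.

```python
import math

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V for an n_rows by n_cols contingency table."""
    k = min(n_rows, n_cols)
    return math.sqrt(chi_square / (n * (k - 1)))

# Reported values: Chi-square = 198.2, N = 485; 12 x 14 table inferred from df = 143.
v = cramers_v(198.2, 485, 12, 14)
print(round(v, 2))
```

With these inputs the result agrees with the reported value of .19.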

Chi-square looks at the distribution of scores using a non-parametric model. An analysis of variance of these same data assesses the distribution of scores from a parametric model and produced an F (13, 469) of 1.88 which is significant at the .05 level. The corresponding Omega squared is .023.
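For a one-way analysis of variance, Omega squared can be written in terms of F and its degrees of freedom as df_between · (F − 1) / (df_between · (F − 1) + N), with N = df_between + df_within + 1. The sketch below applies that identity to the reported F; the function name is illustrative.

```python
def omega_squared(f_value, df_between, df_within):
    """Omega squared for a one-way ANOVA, computed from F and its degrees of freedom."""
    n_total = df_between + df_within + 1
    effect = df_between * (f_value - 1.0)
    return effect / (effect + n_total)

# Reported values: F(13, 469) = 1.88.
w2 = omega_squared(1.88, 13, 469)
print(round(w2, 3))
```

The result matches the reported .023, i.e. age accounts for a little over 2% of the variance in scores.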

Both of these analyses make it apparent that the distribution of scores varies slightly, but significantly across the 39 to 52 month age span. However, in clinical use of the DPS the entire range of the score distribution (0-12) is not used. The DPS was designed to make a dichotomous decision (i.e. delayed/not delayed) around a criterion score, with a narrow range of declared uncertainty. The DPS was also designed to maximise precision of measurement at the cut-off point, but sacrificed precision at higher and lower scores. The correspondences between age and score, based on the entire distribution of scores, are therefore possibly irrelevant to the test as it is to be used.

We decided to analyse the data in the form that they would be used. The data were divided into three categories: possible indication of a delay, uncertain indication, and no indication of delay. The correspondence between these three categories and age was examined in a separate analysis for each of two cut-off criteria. In one analysis, all scores below seven (11.49% of scores) were labelled "possible delay", all scores equal to seven (6.26%) were labelled "maybe" and scores above seven were labelled "no delay" (82.25%). The analysis was repeated using a "maybe" value of eight with percentages of 17.75 for "possible delay", 12.53 for "maybe" and 69.72 for "no delay". The selection of a "maybe" category only one item wide was governed by the one-sided 10% confidence interval of 1.5. For both the seven and eight criteria there was no significant impact of age (Chi-square with 26 degrees of freedom was 29.6 for a "maybe" of seven and 26.5 for a "maybe" of eight). Both values are clearly non-significant, and are, in fact, very close to the expected value of Chi-square in a random situation. Note that the change from 13 categories (scores from 0 to 12) to three categories (possible delay, maybe, no delay) caused a reduction in degrees of freedom from 143 to 26 for Chi-square analyses, with corresponding increases in average cell size and in the power of the test to detect a difference.
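The three-way categorisation can be expressed as a small function. This is an illustrative sketch; the function name and its `maybe_score` parameter are not part of the DPS materials.

```python
def classify(score, maybe_score=7):
    """Map a 12-item DPS score to one of the three decision categories.

    maybe_score is the single score value treated as uncertain
    (seven or eight in the two analyses described in the text).
    """
    if score < maybe_score:
        return "possible delay"
    if score == maybe_score:
        return "maybe"
    return "no delay"

for s in (6, 7, 8):
    print(s, classify(s), classify(s, maybe_score=8))

# Degrees of freedom for 3 categories crossed with 14 monthly age groups.
print((3 - 1) * (14 - 1))
```

Moving the `maybe_score` from seven to eight shifts children from "no delay" to "maybe" and from "maybe" to "possible delay", which is why the referral percentages rise.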

The distribution of test scores in the age range 39 to 52 is given in Table 5 along with the expected distributions based on the data from the DISC norms. The differences between expected and observed distributions are small enough to be ignored for clinical purposes. Different agencies may choose to use different criteria for referral for more detailed testing. These data indicate that a criterion of six or lower will refer 11.5% of children from 39 to 52 months of age, and leave about 6.3% of children in an uncertain category. A criterion of seven or lower will lead to the referral of 17.8% of children and leave 12.5% of children in an uncertain category. These distributions are very close to the expected distributions, suggesting that they will generalise well to the (untested) younger age groups as well.

There remained the problem of dealing with children older than 52 months. These are children for whom we were unable to find items difficult enough to form the most difficult items of the 12-item set. However, the structure of the test is such that this should not be a crippling problem. The most discriminating items on each testing are the ones of middle difficulty (i.e. roughly the fifth, sixth, seventh and eighth items). For older children, there will not be enough of the most difficult items, but the most discriminating items are available -- albeit as the last few (most difficult) rather than the middle items.

While new items are being assessed in an attempt to find difficult items for these children, a strategy of proration has been attempted as an interim measure. The appropriate proration values were sought by using the data from the younger age group to establish criterion percentiles. One reasonable cut-off score for referral is a value of 6 (the 11.5th percentile). A score of 7 or lower will include 17.74% of children, and a score of 8 or lower will include 30.3% of children. Across the age range from 53 months to 63 months, the 11.5th percentile changes with age from a score of 6 to a score of 9, the 17.75th percentile moves from 7 to 9, and the 30th percentile moves from 8 to 9. The curve plotting these percentiles approximates a decelerating quadratic curve asymptotic at 9.

Given the decelerating change in score with age it appeared legitimate to collapse the norms across increasingly larger age ranges. We decided to move from the two month intervals at the end of the DPS to a three month interval for 53 to 55 months and a four month interval from 56 to 59 months. Given the asymptotic nature of the curve, it was decided not to prorate after 59 months. The appropriate proration procedure is as follows: subtract one from the obtained score for children 53 to 55 months and subtract 2 from the obtained score for children 56 to 59 months. The resultant score can be interpreted much as the scores for younger children, but the percentiles are not exact. A score of 6 or lower is at the 7th rather than the 11th percentile, a score of 7 is at the 14th rather than the 18th percentile, and a score of 8 is at the 33rd rather than the 30th percentile. Given a sample size of 126 children in this age range, it would be inappropriate to attempt to refine the proration any more than this.
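The proration rule stated above can be written as a short function. The function name and the decision to raise an error past 59 months are choices made for this sketch; the text says only that proration is not applied after 59 months.

```python
def prorated_score(obtained_score, age_months):
    """Apply the interim proration for children older than 52 months.

    53 to 55 months: subtract one. 56 to 59 months: subtract two.
    Younger children keep the obtained score unchanged.
    """
    if 53 <= age_months <= 55:
        return obtained_score - 1
    if 56 <= age_months <= 59:
        return obtained_score - 2
    if age_months > 59:
        raise ValueError("no proration is defined past 59 months")
    return obtained_score

print(prorated_score(8, 54), prorated_score(8, 57), prorated_score(8, 40))
```

The prorated score is then compared with the same cut-off used for younger children, bearing in mind that the percentiles are only approximate.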

The correlation between age and the 54-item score (assuming passing of items before the start point and failure of items past the end point) was computed. Unlike the 12-item score, we expect this to be strongly correlated with age. The relationship is non-linear, and this is reflected in a significant quadratic component. Age and squared age have a multiple correlation (including the linear and quadratic components) with the 54-item score of .95 for 822 children with complete data ranging in age from 1 to 78 months.

Discussion

We have outlined the main psychometric analyses of the data collected using the DPS 2.0. The scale performed much as was predicted using the data from the normative data set of the DISC. Item order was incorrect, but this is reasonable to expect on a first attempt, given problems with item ceilings in the DISC data set. Although the internal consistency was lower than expected (.68 instead of .80) the standard error of measurement was not far from the expected value (1.5 as opposed to 1.1). Moreover, the changes made in item order should work to improve internal consistency. Because the change in item order does not change more than one or two of the particular set of items administered to a child, the summed scores are unlikely to change very much.

Without recourse to statistical analysis, it was apparent that a number of minor revisions would have to be made to the DPS 2.0. These revisions were primarily aimed at reducing the likelihood of administrative errors.

A frequent problem was the use of the wrong start point in administering the test. The most common error of this type was the use of an item number as if it were the age starting point for administration (e.g. item #37 for a 37-month-old child instead of item #33 -- the proper start point). This was easy to fix by changing the administration form: item numbers were eliminated. Another common error was the failure to administer all of the 12 age-appropriate items to a child. The revision aimed at this problem was to change the recording of a response. Instead of inserting checks into three columns, the tester was asked to circle one of three words for each item, depending on the response of the child ("Yes", "No", and "Refuse"). The third common administration error was the failure to test every element of multiple element items (e.g. identifies 6 of 8 colours). To correct this, the response recording was organised in a way that made omissions more salient.

Having completed these analyses and attendant changes we produced the DPS 3.0. We were comfortable in accepting the DPS 2.0 as a standardised, face-valid scale with reliabilities as reported in the text for the age range from 40 to 52 months. The success of the DISC normative data in predicting the performance of the items for older children was good with respect to the distribution of scores, and weaker with respect to item order and reliability. It was reasonable to expect that the DPS 3.0 would be similar for the younger age groups, producing equally reliable data. The proration to 59 months will probably be accurate, but a new set of more difficult items is currently being assessed, so that the proration is likely to be only a transitional measure.

To this point we have reported minimal data assessing the validity of the scale (good age correlations). Validity depended on the construction of the scale (which was a classically face valid test based on both expert nomination and statistical selection among nominated items), and the high item-wise correlation with age. The high correlation of age with performance is evident in the high multiple correlation of the 54-item score with age and squared age.

Reliability and validity of DPS 3.0

As part of the development of the DISC Pre-school Screen, it was considered important to collect data assessing validity. With a small budget and the primary use of data volunteered by co-operating agencies, the scope for validation instruments was rather limited. It was decided to produce a brief checklist-type developmental questionnaire to be completed by a parent as the primary validation instrument.

A review of the paediatric literature on high risk pregnancies and deliveries produced a number of potential questionnaire topics. Other topics were culled from the questionnaires used by a large number of agencies that kindly sent us copies of the forms that they use routinely. Items were drafted with attention to use of the simplest language possible. Responses were organised as a Yes/No forced choice with spaces for clarification or amplification. While the bulk of the items are worded so that a Yes response is associated with increased risk, a number are also quite clearly inverted. The criterion for the valence of the question was clarity of expression. The final product was organised into six sections: demographic data, prenatal history of the mother, birth history of the child, infant history, childhood history, and family history. Embedded in the childhood question was an eight-part section that asked the parent to indicate any specific concerns she might have with the development of the child. This section was organised to correspond with the eight DISC scales.

Method

Subjects.

A list of agencies using the DISC is maintained for purposes of sharing new information about the DISC. Agencies on this list were sent a letter asking if they would be interested in participating in the development of the DISC Pre-school Screen. Twenty-four agencies volunteered to collaborate in the development of the DISC Pre-school Screen.

In the Fall of 1989, they were sent copies of the DPS 2.0 and (somewhat belatedly) copies of the developmental questionnaire for parents and for pre-school teachers. Five of the volunteer agencies sent in completed copies of the questionnaire with the DPS 2.0 scores that they had collected, for a total of 62 subjects with questionnaires. Not every questionnaire was complete, and not every child had a score on the DPS 2.0. As a result, the sample size is different for almost every statistic. These data are identified as the DPS 2.0 data.

Revisions were made to the DPS and the Parent Questionnaire following the DPS 2.0 data collection. The revised versions (including DPS 3.0) were sent to agencies in late 1989. Eleven agencies supplied data based on these revised materials. These data are labelled the DPS 3.0 data.

Although fathers were invited to complete the Parent questionnaire, all of the submitted forms were completed by mothers.

Results

Reliability of DPS 3.0

The corrected split-half reliability of the DPS 3.0 was measured based on 87 children from the DPS 3.0 data, between 5 and 52 months inclusive, collapsed across age groups. The value was .77 -- a substantial increase from the value of .66 found in the DPS 2.0, and very close to the value predicted from the data derived from the DISC norms (.80). The standard error of measurement was 1.22, very close to the DPS 2.0 value of 1.16 and to the value of 1.1 predicted from the DISC norm data.
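The two statistics reported above are linked by standard classical test theory formulas. The sketch below assumes that "corrected" refers to the Spearman-Brown correction and that the standard error of measurement is SD x sqrt(1 - reliability); the text does not state the formulas used, so both are assumptions, and the function names are illustrative. Under those assumptions, the reported values imply a score standard deviation of roughly 2.5 items.

```python
import math


def spearman_brown(r_half: float) -> float:
    """Step a half-test correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM under classical test theory: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def implied_sd(sem: float, reliability: float) -> float:
    """Invert the SEM formula to recover the score SD it implies."""
    return sem / math.sqrt(1 - reliability)


# Values reported for the DPS 3.0: reliability .77, SEM 1.22.
sd_dps3 = implied_sd(1.22, 0.77)
print(round(sd_dps3, 2))  # 2.54
print(round(standard_error_of_measurement(sd_dps3, 0.77), 2))  # 1.22
```

The implied standard deviation is a back-calculation for illustration only; it is not a figure reported in the text.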

Properties of validation instruments.

The parent questionnaire was organised into clusters of related items. Each cluster of items was treated as a separate measurement scale by scoring each item as 1 when the answer indicated risk and as 0 when the answer indicated no risk. The values for reliability are reported in Table 6 for both the DPS 2.0 and DPS 3.0 data collections. Reliabilities were computed using a corrected split-half correlation, because of the substantial skew in the distributions.
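The scoring rule for a cluster can be sketched as follows. Because some items are inverted, each item needs its own record of which response indicates risk. The item names below are hypothetical and are not the wording of the actual questionnaire; treating an unanswered item as 0 is also an assumption, since the text does not say how incomplete questionnaires were scored.

```python
from typing import Dict, Optional


def score_cluster(answers: Dict[str, Optional[str]],
                  risk_answer: Dict[str, str]) -> int:
    """Count the items in one cluster whose answer indicates risk.

    answers     maps item name -> "Yes", "No", or None if unanswered
    risk_answer maps item name -> the response that indicates risk
    Unanswered items contribute 0 (an assumption, not from the text).
    """
    total = 0
    for item, risky in risk_answer.items():
        if answers.get(item) == risky:
            total += 1
    return total


# Hypothetical item names, for illustration only.
birth_history_key = {
    "premature": "Yes",
    "needed_oxygen": "Yes",
    "went_home_with_mother": "No",  # inverted item: "No" indicates risk
}
child = {
    "premature": "Yes",
    "needed_oxygen": "No",
    "went_home_with_mother": "No",
}
print(score_cluster(child, birth_history_key))  # 2
```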

The cluster of items assessing specific developmental concern was dropped from the second analysis (i.e. using the DPS 3.0) for two reasons. First, the correlation with the DPS 2.0 was found to be very low (-.04); second, the reliability was too high (.95). The latter reason seems paradoxical until it is recalled that if the eight different developmental concerns were indeed specific, there ought to be a low internal consistency.

The reliabilities were generally modest to low. Prenatal History (r. = .62, .58 in DPS 2.0 and DPS 3.0 data collections respectively) and Birth History (.63, .52) values are low enough to suggest caution in trying to interpret the scale clinically. Infant History is quite respectable (.80, .86). The drop in reliability for Childhood History (.82, .43) probably reflects a typographical error on the second version of the scale that misaligned the question and answer columns, leading to some confusion among those answering and scoring. Family History was respectable, but not high (.78, .75).

The parent scales can be compared to the teacher questionnaire, and the results for relationships between some single teacher items and the parent scales are reported in Table 7. The teacher's indication of concern about the child's delay correlates with both Infant (r. = .28) and Child History (r. = .37). The teacher's indication of a medical problem that would tend to indicate delay correlates with Birth (r. = .41) and Infant History (r. = .41). A question about social or family concerns correlates with the Family scale on the Parent Questionnaire (r. = .49). These results suggest that the teachers' perceptions of delay are compatible with parental reports of infant and childhood history (i.e. more recent history) while perceptions of medical problems implying delay are more congruent with the parental Birth and Infant history (i.e. more distant history). Perceptions of social or family problems tally reasonably well with parental report of family issues, and little else.

Validity measures.

DPS 3.0 scores show no correlation with age or age squared between 4 and 53 months (R = .164, df = 2, 84, p. is n.s.). The distribution of classifications (delay, maybe, no delay) is not associated with age (Chi-square = 63.0, df = 80, p. is n.s., N = 114), although the data are sparse for such a large table. Nevertheless, the data suggest that the DPS 3.0 score is independent of age for these data (as we hoped it would be).
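The non-significance of the multiple correlation can be checked from the reported figures alone. The sketch below converts R to an F ratio in the usual way and uses the closed-form upper tail of the F distribution that holds when the numerator has exactly 2 degrees of freedom. The function names are illustrative, and the calculation is our reconstruction rather than the analysis actually run.

```python
def f_from_multiple_r(r: float, df_num: int, df_den: int) -> float:
    """F ratio for testing a multiple correlation R against zero."""
    r2 = r * r
    return (r2 / df_num) / ((1 - r2) / df_den)


def p_value_f_two_numerator_df(f: float, df_den: int) -> float:
    """Upper-tail probability for F(2, df_den).

    Closed form valid only when the numerator df is exactly 2:
    P(F > f) = (1 + 2f / df_den) ** (-df_den / 2).
    """
    return (1 + 2 * f / df_den) ** (-df_den / 2)


# Reported values: R = .164 with df = 2, 84.
f_stat = f_from_multiple_r(0.164, 2, 84)
p = p_value_f_two_numerator_df(f_stat, 84)
print(round(f_stat, 2), round(p, 2))  # 1.16 0.32
```

An F of about 1.16 with a p value near .32 agrees with the reported lack of a significant relationship between DPS 3.0 score and age.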

Correlations of each parent scale with the raw score on the screen (DPS 2.0 and DPS 3.0 separately) were computed, but only for those children who fell between the ages of 4 and 53 months. Other children show ceiling effects because of the limits of the scale in the current administration. All correlations were tested as one-tailed hypotheses. One item answer was also tested because of its specific interest: the question asking if the parent believes her child is developing normally. Results are listed in Table 8.

Parental concern about the child's development (a single question) is strongly correlated with both DPS 2.0 and DPS 3.0 scores (r. = -.63, -.68). The small increase is consistent with the increased reliability of the DPS 3.0 over the DPS 2.0. Childhood history also shows a consistent correlation with DPS scores (r. = -.54, -.47). The other values are somewhat inconsistent, as might be expected based on low reliability and poor replication between teacher and parent report. Family History does replicate well from parent to teacher report, but does not replicate from DPS 2.0 to 3.0.

Data on Teacher items are reported for DPS 3.0 data only. The teacher's concern for the child's development correlates highly with DPS 3.0 score (r. = -.64), and moderately with parental concern (r. = .48).

Parental judgement of delay corresponds well with DPS 3.0 classification (Chi-square = 16.3, df = 2, p. < .0001, Spearman rho = .58), as does teacher concern (Chi-square = 26.3, df = 2, p. < .0001, Spearman rho = .52). In 10 of 11 cases where parents expressed concern about development, the DPS 3.0 indicated delay (Sensitivity = .91). In 26 of 35 cases where parents indicated no delay, the DPS 3.0 indicated no delay (Specificity = .76), and one case fell into the uncertain category. For teachers, in 18 of 34 cases where teachers indicated delay, DPS 3.0 scores indicated delay (Sensitivity = .62), with 5 cases falling in the "maybe" range. In 40 of 46 cases where teachers did not indicate delay, the DPS 3.0 indicated no delay (Specificity = .87).
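The sensitivity and specificity figures above can be reproduced from the reported counts if the cases that fell into the uncertain ("maybe") category are left out of the denominator. That exclusion is our inference, made because it recovers the reported values; the text does not state it explicitly. The function name is illustrative.

```python
def agreement_rate(agreeing: int, total: int, uncertain: int = 0) -> float:
    """Proportion of cases where the DPS agrees with the judge,
    after dropping cases the DPS placed in the uncertain category
    (the exclusion is an assumption, not stated in the text)."""
    return agreeing / (total - uncertain)


# Parents: 10 of 11 concerned cases flagged; 26 of 35 unconcerned cases
# cleared, with 1 uncertain.
parent_sensitivity = agreement_rate(10, 11)
parent_specificity = agreement_rate(26, 35, uncertain=1)

# Teachers: 18 of 34 concerned cases flagged, with 5 uncertain;
# 40 of 46 unconcerned cases cleared.
teacher_sensitivity = agreement_rate(18, 34, uncertain=5)
teacher_specificity = agreement_rate(40, 46)

print(f"{parent_sensitivity:.2f} {parent_specificity:.2f}")    # 0.91 0.76
print(f"{teacher_sensitivity:.2f} {teacher_specificity:.2f}")  # 0.62 0.87
```

With the uncertain cases kept in the denominator, the parent specificity would be 26/35 = .74 and the teacher sensitivity 18/34 = .53, which do not match the reported figures.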

Discussion

The reliability of the six validity scales derived from the parent questionnaire is generally respectable, although the prenatal and birth history scales show somewhat lower reliability than might be desired in a clinical scale. This is reasonable, given the variety of potentially unconnected events associated in these scales. The correlation between the DPS and the scales is lower for the more distant history scales and higher for the more recent history scales with the exception of Family history.

Relationships between the teacher and parent questionnaire items allow assessment of their validity. Not all teacher items support all parent scales. SES and adjustment to pre-school seem to be unrelated to developmental issues. Social stressors seem to correspond from parent to teacher ratings, but also seem to be unrelated to development.

The three parent scales with the highest reliability and the most recent history of the child show the highest correlation with scores on the DPS. Moreover, the correlation between parents' expressed concern with a child's development and score on the screen is a remarkable -.63 (DPS 2.0) and -.68 (DPS 3.0). The teacher judgements of child development are also supportive of validity. Sensitivity of the DPS 3.0 may be better with respect to parental judgement of delay than with teacher judgement, but sample sizes are too small to test this possibility.

The data reported in these studies provide quite reasonable support for the contention that the DPS is a valid measure of developmental delay in the children we tested. The children tested were a good cross-section of children in south-western Ontario who attend pre-schools with a therapeutic orientation as well as a very good cross-section of children coming to pre-kindergarten screening clinics. The data were disproportionately representative of children over 39 months of age. While we have data on younger children, they are sparse. The goodness of fit between data on older children and what was predicted from DISC norms suggests that the DPS for the younger group will perform as predicted, too. However, the younger group has not been well evaluated in these studies.

The validity studies are reasonably persuasive. DPS 3.0 scores are independent of age, but strongly related to parent and teacher judgements of delay and risk factors collected from developmental history questionnaires.

A number of tasks lie ahead in the development of the DPS. It will be important to evaluate the test using larger numbers of children under three years of age. We would also benefit from a few more items at both the youngest and eldest extremes of the test. At present we can assert with reasonable confidence that the DPS 3.0 is a first stage screen that has proven to be a valid predictor of possible developmental delay for children between 39 and 53 months of age. We can also infer with reasonable confidence that it will prove to be equally valid for children as young as 4 months and as old as five years, and that it will be possible to extend the test both to younger ages and older ages.

It is important not to use this test to rule out delay. It is intended to be a first stage screen of children with no other indication of delay. If the DPS shows no indication of delay, but a parent, teacher or other professional is concerned about development, a more detailed assessment is indicated.

References.

Amdur, J. A., & Mainland, M. K. (1984). The Diagnostic Inventory for Screening Children. Kitchener-Waterloo Hospital.
Amdur, J. A., Mainland, M. K., & Parker, K. C. H. (1988). The Diagnostic Inventory for Screening Children, Second Edition. Kitchener-Waterloo Hospital.
Amdur, J. A., Mainland, M. K., Parker, K. C. H., & Portelance, F. (in press). Méthode d'évaluation diagnostique du développement des enfants. Kitchener-Waterloo Hospital.
Bayley, N. (1969). Bayley Scales of Infant Development. New York: Psychological Corporation.
Butler, J. (1986). Validation of the Screen for the Diagnostic Inventory for Screening Children. Unpublished bachelor's thesis. University of Waterloo.
Cadman, D., Walter, S. D., Chambers, L. W., Ferguson, R., Szatmari, P., Johnson, N., & McNamee, J. (1988). Predicting problems in school performance from pre-school health, developmental and behavioural assessments. Canadian Medical Association Journal, 139, 31-36.
Frankenburg, W. K., & Dodds, J. E. (1969). Denver Developmental Screening Test. Colorado: University of Colorado Medical Centre.
Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, Fl.: Academic Press.
Illman, C. (1987). A validity study of the Diagnostic Inventory for Screening Children (DISC) using teacher observations. Unpublished bachelor's thesis. Waterloo, Ontario: Wilfrid Laurier University.
Parker, K. C. H., Mainland, M. K., & Amdur, J. A. (1990). The Diagnostic Inventory for Screening Children: Psychometric, factor and validity analyses. Canadian Journal of Behavioural Science, 22, 361-376.
Terman, L. M., & Merrill, M. A. (1972). Stanford-Binet Intelligence Scale: Manual for the Third Revision, Form L-M. Boston: Houghton Mifflin Co.


