group methodologies; (h) tended to use within-subjects (75%) versus between-subjects designs (25%); (i) had an average memory load of about 54 words, with studies in the SR versus OR class tending to have longer stimulus lists than the SR versus semantic class; (j) used trait words as stimuli more than nouns; (k) primarily used stimulus lists that contained unrelated words; (l) used a small number of encoding tasks on average, with SR versus OR studies tending to use more tasks than did SR versus semantic studies; (m) tended to present stimuli at a fixed rate or measure participants' reaction times; (n) used a variety of modalities to present stimulus materials, with the majority using computers; (o) tended to use free recall as the dependent measure, with a small percentage using recognition and very few using cued recall; (p) tended to allow participants a fixed time period to respond during the retrieval task; (q) were slightly less likely to use distractor tasks (47%) than not (53%) between presentation and retrieval; and (r) tended to use an incidental learning paradigm (87%) in which participants did not expect a recall test rather than an intentional learning situation in which testing was expected.

Overall SRE. After summarizing study characteristics, we analyzed the entire set of SRE studies to test the hypothesis that SR results in greater memory than OR or semantic encoding. These analyses appear at the bottom of Table 1 and show that SR encoding does promote better recall on average than other types of encoding, as evidenced by a mean weighted effect size that differed significantly from the 0.00 value that indicates exactly no effect, d = 0.50, 95% confidence interval (CI) = 0.45-0.54. Also as expected, the assumption of homogeneity of effect sizes was rejected, Q(128) = 451.40, p < .0001. Consistent with the conclusion that study results were inconsistent, homogeneity could not be achieved until we discarded 34 (26%) outlying effect sizes. The resultant mean effect size was still significant, d = 0.45, 95% CI = 0.39-0.50. Thus, the hypothesis of an overall SRE across the literature was supported, although its magnitude varied considerably.
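The quantities reported above (a weighted mean d, its 95% CI, and the homogeneity statistic Q) can be illustrated with a minimal sketch. It assumes the common fixed-effect approach in which each effect size is weighted by the inverse of its sampling variance; the exact formulas used for the reported values are not stated here, so the function name `fixed_effect_summary` and the input data are hypothetical. Under this approach Q has k - 1 degrees of freedom, so Q(128) would correspond to 129 effect sizes.

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance weighted mean d, its 95% CI, and the homogeneity Q.

    Assumption: fixed-effect weighting (w_i = 1 / v_i); the sampling
    variances v_i are supplied by the caller.
    """
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    # Weighted mean effect size
    mean_d = sum(w * d for w, d in zip(weights, effects)) / total_w
    # Standard error of the weighted mean and a normal-theory 95% CI
    se = math.sqrt(1.0 / total_w)
    ci = (mean_d - 1.96 * se, mean_d + 1.96 * se)
    # Q: weighted sum of squared deviations from the mean (df = k - 1)
    q = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, effects))
    return {"mean_d": mean_d, "ci": ci, "Q": q, "df": len(effects) - 1}

# Hypothetical effect sizes and sampling variances, for illustration only.
summary = fixed_effect_summary([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(summary)
```

A mean effect differs significantly from 0.00 when its CI excludes zero, and homogeneity is rejected when Q exceeds the chi-square critical value for its degrees of freedom.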

Cross-Literature Models for SRE Magnitude

Following the overall analysis, we fitted models using coded study characteristics to explain variation in effect sizes. With our first model, we examined whether the SRE varied as a function of studies' manipulation class (SR-semantic vs. SR-OR). As predicted, the SRE did vary as a function of manipulation class, with a significantly smaller SRE for the SR-OR versus the SR-semantic class (see Table 3). However, also as expected, each of these mean SREs was highly significant. Effect sizes within each class were also found to be heterogeneous; it was necessary to remove 11 studies (18%) from the SR-semantic class and 14 studies (20%) from the SR-OR class to achieve homogeneity. After removal of these studies, however, the resulting mean effect sizes were still significant (see Table 1). Thus, results show that, although study effect sizes are inconsistent, the SRE does tend to occur when one compares SR encoding with semantic encoding and OR, as predicted.

To test our hypothesis that the SRE would be smaller when the comparison task promotes both relational and item-specific processing, we performed a model test by collapsing across manipulation class. As the second model in Table 3 shows and as we predicted, when the comparison tasks used in studies in the literature promoted both relational and item-specific processing, the SRE was significantly smaller than when the comparison task promoted either relational or item-specific processing. Within the class of studies that used tasks that were judged to promote both relational and item-specific processing, the effect sizes were homogeneous; moreover, this class differed significantly from the separate classes for relational and item-specific processing. However, the relational and item-specific classes did not differ significantly from each other. In all three classes, significant SREs were observed.
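A categorical model of this kind can be sketched as a partition of the total heterogeneity into a between-class and a within-class component. The sketch below assumes the standard fixed-effect partition (Q total = Q between + Q within); the function names, class labels, and data are hypothetical and are not taken from Table 3.

```python
def weighted_mean_and_q(effects, variances):
    """Return (weighted mean d, sum of weights, homogeneity Q) for one set."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    mean_d = sum(w * d for w, d in zip(weights, effects)) / total_w
    q = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, effects))
    return mean_d, total_w, q

def categorical_model(groups):
    """groups maps a class label to a list of (d, sampling variance) pairs."""
    pooled = [pair for pairs in groups.values() for pair in pairs]
    grand_mean, _, q_total = weighted_mean_and_q(
        [d for d, _ in pooled], [v for _, v in pooled]
    )
    class_means, q_between, q_within = {}, 0.0, 0.0
    for label, pairs in groups.items():
        mean_d, total_w, q = weighted_mean_and_q(
            [d for d, _ in pairs], [v for _, v in pairs]
        )
        class_means[label] = mean_d
        q_within += q  # heterogeneity left inside this class
        q_between += total_w * (mean_d - grand_mean) ** 2  # class vs. grand mean
    return {
        "class_means": class_means,
        "Q_between": q_between,            # df = number of classes - 1
        "Q_within": q_within,              # df = k - number of classes
        "Q_total": q_total,
        "df_between": len(groups) - 1,
        "df_within": len(pooled) - len(groups),
    }

# Hypothetical classes and values, for illustration only.
model = categorical_model({
    "both": [(0.2, 0.04), (0.4, 0.04)],
    "single": [(0.8, 0.04), (1.0, 0.04)],
})
print(model)
```

A significant Q between indicates that the class means differ, and a nonsignificant Q within a class indicates that its effect sizes are homogeneous.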

Two continuous variables, (a) time between encoding and memory tasks and (b) length of stimulus presentation, were significant predictors of the magnitude of SR effect sizes across the literature. Specifically, the SRE tended to increase as the time between the encoding and memory tasks increased and to decrease as the length of stimulus presentation grew longer. Note that the latter finding is based on the minority of studies that used fixed stimulus presentation times (i.e., k = 51). As Table 4 shows, both of these patterns generalized across the two manipulation classes. We next attempt to explain inconsistencies within the SR-semantic and SR-OR manipulation classes, respectively.
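A continuous-predictor model of this kind can be illustrated with a weighted least squares regression of effect size on a single moderator. This is a minimal sketch under assumptions: weights are inverse sampling variances, and the slope's standard error takes the fixed-effect meta-analytic form (based on the weights rather than on the residual mean square). The function name `weighted_regression` and the data are hypothetical.

```python
import math

def weighted_regression(x, effects, variances):
    """Regress effect sizes on one continuous moderator x with weights 1 / v."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total_w
    d_bar = sum(w * di for w, di in zip(weights, effects)) / total_w
    # Weighted sums of squares and cross-products about the weighted means
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxd = sum(
        w * (xi - x_bar) * (di - d_bar)
        for w, xi, di in zip(weights, x, effects)
    )
    slope = sxd / sxx
    intercept = d_bar - slope * x_bar
    se_slope = math.sqrt(1.0 / sxx)  # assumed fixed-effect standard error
    # Residual heterogeneity not explained by the moderator (df = k - 2)
    q_resid = sum(
        w * (di - intercept - slope * xi) ** 2
        for w, xi, di in zip(weights, x, effects)
    )
    return {
        "slope": slope,
        "intercept": intercept,
        "z": slope / se_slope,
        "Q_residual": q_resid,
    }

# Hypothetical moderator values (e.g., a delay), effect sizes, and variances.
fit = weighted_regression([1.0, 2.0, 3.0], [0.2, 0.4, 0.6], [0.04, 0.04, 0.04])
print(fit)
```

A positive slope corresponds to an effect that grows with the moderator, as reported for the time between encoding and memory tasks; a negative slope corresponds to the pattern reported for presentation length.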

Moderators of SR-Semantic Effect Size Magnitude

Theoretical moderators. For the SR-semantic manipulation class, model tests for relatedness of stimuli reveal that studies using highly related stimulus items obtained a smaller mean SRE than those in the low-relatedness class (see Table 5). Indeed, the mean SRE in the high-relatedness class was not significant, and effect sizes in this class were consistent in contrast to the low-relatedness set in which a significant mean SRE occurred.

A model for type of stimulus materials used shows that, consistent with the elaboration hypothesis, the mean SRE for studies that used traits was significantly greater than that produced in studies that used nouns or other types of stimulus materials (see Table 5; as Table 6 shows and as discussed below, this pattern also appears for the SR-OR subliterature). However, the mean SRE for both the traits and the nouns classes was significant. Effect sizes for both classes of stimulus materials were inconsistent.

The next model was assessed to examine whether the semantic-encoding task used was related to SRE magnitude. Although there were eight classes of tasks in all, a few clear differences did emerge (see Table 5 for specific differences). Consistent with the organization hypothesis, the mean SRE for the fits-category class was marginally smaller than the mean SRE for the synonym judgment class (p = .057); however, it was not smaller than other classes of semantic tasks. Also, studies that used desirability ratings tended not to observe a significant mean SRE, in contrast to studies in the other classes. Studies that used tasks in which participants had to generate definitions and those that used synonym judgment tasks obtained significantly larger mean SREs than those that used desirability ratings. However, desirability judgments did not differ significantly from any of the other tasks, and the synonym judgment class differed significantly only from tasks involving desirability judgments.
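Pairwise comparisons between class means, such as the fits-category versus synonym judgment contrast, can be illustrated with a simple z test on two weighted means. This is a sketch under assumptions: each class mean has variance equal to the reciprocal of its summed inverse-variance weights, and no correction for multiple comparisons is applied. The procedure actually used for the reported contrasts is not specified here, and the function name `contrast` and the inputs are hypothetical.

```python
import math

def contrast(mean_a, weight_sum_a, mean_b, weight_sum_b):
    """Two-sided z test for the difference between two weighted mean effects.

    weight_sum_* is the sum of inverse-variance weights in each class,
    so 1 / weight_sum is the variance of that class mean.
    """
    diff = mean_a - mean_b
    se = math.sqrt(1.0 / weight_sum_a + 1.0 / weight_sum_b)
    z = diff / se
    # Two-sided p value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical class means and summed weights, for illustration only.
z, p = contrast(0.3, 50.0, 0.9, 50.0)
print(round(z, 3), round(p, 4))
```

With many task classes, the number of pairwise contrasts grows quickly, which is one reason marginal results such as p = .057 are best read cautiously.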

Only three studies in the entire sample tested children as

