Investigating the Effects of False Memory: Examining whether a false memory effect can be demonstrated for word lists through the DRM paradigm

The present study examined whether a false memory effect could be demonstrated for categorized word lists. It drew inspiration from previous studies in the field of false memory, most notably in replicating and extending the Roediger-McDermott study and utilizing the Deese-Roediger-McDermott (DRM) paradigm. In the current study, students were presented with categorized word lists of two sizes, containing either 4 or 8 studied exemplars per category, and were then given a recognition test containing two types of words: previously presented words and novel words. It was predicted that the false alarm rate would be higher for new words from the studied categories than for new words from unstudied categories, and that the false alarm rate would be greater for the categories that contained more exemplars. The results showed that false alarm rates were greater for new words from the studied categories than for new words from unstudied categories, and that the false alarm rate was greater for the larger categories.

Keywords: false memory, list learning, information recall, memory retrieval

The subject of false memories has long been a topic of contention, debate, and fascination in the field of psychology research. At various points in history, this subject has also made its way into general public discourse, particularly in cases where the reliability of memory had been called into question. This has most often taken place when serious allegations were brought forth decades after the events supposedly took place. Such cases bear on the topic of false memories because they rely on the subjective memories of the individuals who claim to have experienced the events in question firsthand.

A recent example of such a case that came to the forefront of public discussion was the accusation brought against Brett Kavanaugh, who was announced as the nominee to the Supreme Court of the United States in 2018. As reported by Re (2018), Christine Blasey Ford alleged that Mr. Kavanaugh had sexually assaulted her in 1982. Given that nearly four decades had passed since those alleged events, the case was inexorably linked to the reliability of memory and the potential prevalence of false memories. The common preconception surrounding memory has been the belief that memory was generally accurate and reliable, especially in the case of particularly salient events. However, numerous studies that have been conducted in the field of psychology have cast considerable doubt on whether memory was as consistently reliable as it was commonly believed to be. Moreover, it has been remarked in the research of Roediger and McDermott (1995) that “when people know that their accuracy in recollecting cannot be verified, they may even be more easily led to remember events that never happened than they are in the lab” (p. 812).

As mentioned prior, the phenomenon of false memory has long been studied by researchers working in the field of psychology. Likely one of the first psychologists to study false memory in an experimental fashion was Bartlett who had participants read a folktale titled “The War of the Ghosts” and asked them to repeatedly recount what occurred in the story (Roediger and McDermott, 1995, p.803). His results showed that with repeated reproductions of the original story, the recollection of the story became increasingly distorted by the subjects. Significantly, Bartlett noted a difference between reproductive and recollective memory where the former referred to accurate duplication of original material into memory and its subsequent retrieval, whereas the latter referred to the “active process of filling in missing elements while remembering” (Roediger and McDermott, 1995, p.803). After Bartlett’s seminal research, many follow-up studies followed his lead in presenting study participants with story-based passages and examining the frequency with which false memories occurred. However, this was not the universal approach and Underwood's research served as one of the notable exceptions to this story-based approach to examining false memory (Roediger and McDermott, 1995, p.803). A different experimenter who utilized the list learning paradigm in examining the phenomenon was Deese who tested his subjects’ recollection of word lists. Deese wanted to determine how often participants would recall words that were not actually present in the word lists that they were given and created 36 lists that consisted of 12 words each in order to test how often this phenomenon would occur (Roediger and McDermott, 1995, p.804). The study conducted by Roediger and McDermott set forth to replicate and extend Deese’s experiment. Their study was composed of two experiments, the first of which replicated the Deese study.

For their first experiment, the researchers generated six lists based on those used in Deese’s experiment, with each list composed of 12 associates of a critical nonpresented word, with the associates drawn from the Russell and Jenkins word association norms (Roediger and McDermott, 1995, pp. 804-805). The six words that were not present in the lists but that were closely associated with the other words within the lists were termed the ‘lures’ (Roediger and McDermott, 1995, p.804). Following the presentation of the six lists, the experimenters gave the study participants a 42-item recognition test that consisted of 12 words that were present on the lists and 30 words that were not. The 30 words that were not present on the original lists came in three varieties: 12 items that were entirely unrelated to any items on any of the six original lists, 12 words weakly related to the items on the previous lists, and finally the six critical words which were the basis for the generated associates that composed the lists presented prior. The result of this experiment showed that participants recalled the words that appeared on the six lists with a probability of .65, and also recalled the critical lure with a probability of .40. Words that were neither presented beforehand nor were the critical lures appeared with only a probability of .14, indicating that the study participants were not merely guessing (Roediger and McDermott, 1995, pp. 804-806). Seeing as the critical lures appeared nearly at the same rate as the words that were presented, the researchers concluded that a false recall effect had occurred. Moreover, study participants also stated that they were highly confident that the critical lures did appear on the lists with which they were presented.
Given these results, the researchers concluded that the Deese paradigm was an effective way to study false memory and designed a second experiment to expand on the results of the first so as to explore this phenomenon further (Roediger and McDermott, 1995, p.806).

The second experiment conducted by Roediger and McDermott consisted of 24 15-item lists that were similar in design to the lists used in the first experiment. The researchers wished to determine the nature of the subjective experiences that subjects had when they falsely recognized words that were not presented to them, specifically whether they made a judgment of remembering the words or merely of knowing them (Roediger and McDermott, 1995, pp. 806-807). This distinction was derived from Tulving’s procedure that distinguished between the experiences of ‘remembering’ and ‘knowing.’ In the Roediger and McDermott study, a ‘remembering’ judgment referred to the cases where the subject was able to recount and relive the experience of seeing the lure on the list. On the other hand, ‘knowing’ took place when the participants believed that the lure was on the list but were unable to relive the experience of seeing it previously (Roediger and McDermott, 1995, p.807). The result of this second experiment was that the participants recalled the critical lure for 55% of the lists, a rate that the researchers noted as being even greater than the one obtained in the first experiment. The researchers hypothesized that this could have occurred due to the increased number of words in each list (Roediger and McDermott, 1995, p.808). Moreover, the researchers found that the subjects believed that they truly experienced the lures as being on the original lists, seeing as they claimed that they actively remembered seeing them.

Given the results of these two experiments, the researchers concluded that list learning paradigms could be used to study the false memory phenomenon. In addition, Roediger and McDermott (1995) stated that the strength of an individual’s belief that something took place and that individual’s perceived salience of the event were insufficient to reliably conclude that the event genuinely took place (p. 812). In short, the fact that false memories took place frequently in their experiments led the researchers to conclude that the phenomenon of false memory was genuine.

Another significant study on the effect of false memory in the context of cued recall of categorized lists was performed by Smith, Ward, Tindell, Sifonis, and Wilkenfeld (2000) which built upon the discoveries made by Roediger and McDermott. This study hypothesized that category knowledge could be used to guide episodic recall and reconstruction, and that it could be shown to have a guiding effect on the reconstruction of memories due to the embedded cues that could aid the retrieval process (Smith, Ward, Tindell, Sifonis, & Wilkenfeld, 2000, p. 386). The researchers also predicted that the likelihood that false memories would take place would depend on the output dominance and typicality of the omitted word. Output dominance was defined as the frequency with which a given item was listed during simulated recall procedures, or in other words, how often a word would appear when subjects were instructed to guess the composition of a list that they had not seen (Smith et al., 2000, p. 387). The typicality of a word was defined as how similar a given word was to the general category. In essence, the experiments performed by Smith et al. (2000) were set up to demonstrate that memory retrieval was, generally speaking, a constructive process and that the characteristics of the individual omitted list items determined how easily these said items could intrude during the recollection of that list. In short, the researchers were interested in learning how category knowledge would affect the creation of false memories, and how the typicality and output dominance of the omitted words would affect their rate of intrusion (Smith et al., 2000, p. 387). The three experiments that were conducted by the researchers all utilized the list-learning procedure.

The first experiment set out to show that if common members of categories were omitted from a list, false recall of the omitted items would occur (Smith et al., 2000, p. 388). In other words, the researchers hypothesized that if a list were presented to subjects without its most common element, the subjects would recall the list as if the omitted item had never been removed. In setting up for the experiment, the researchers generated nine lists from Rosch’s semantic category study with 15 items per category with the most typical item omitted (Smith et al., 2000, p. 388). The researchers predicted that intrusions would occur frequently and therefore the phenomenon of false memory would be demonstrated. Following the initial presentation of the lists, participants were instructed to recall the items on the list. The results showed that critical intrusions did indeed occur at high frequencies, that they occurred more frequently in the case of greater delays between initial presentation and testing, and that the correlation between the frequency with which an intrusion occurred and the output dominance of the intruded word was r = .72 (Smith et al., 2000, p. 389). Therefore, the researchers found a positive correlation between the rate of intrusion and the output dominance of the intruding word.

Following this experiment, Smith et al. (2000) carried out a second experiment that was designed to observe the effects of gradation in order to determine the interaction between high, medium, and low output dominance items and their respective rates of intrusion, hypothesizing that high output dominance items would be falsely recalled more frequently. The result of this experiment confirmed this and showed that items that were higher in typicality and output dominance were indeed recalled more often than items that were lower in typicality and output dominance (Smith et al., 2000, p. 390). However, it was also determined through statistical analysis that the rate of recall was more closely related to the output dominance of a given item rather than its typicality. Thus, the researchers concluded that the rate of recall, both accurate and false, depended on how easily the word came to mind – a concept that they defined as retrieval fluency (Smith et al., 2000, p. 391).

Finally, in their third experiment, the researchers aimed to demonstrate that the rate of false recall of items would be correlated with the output dominance of those items (Smith et al., 2000, p. 392). In other words, the items that would be falsely recalled the least would be the ones with the lowest output dominance and vice versa. This was found to be the case as the items with higher output dominance were recalled more often than those lower in output dominance (Smith et al., 2000, p. 393). Moreover, the researchers also collected the rates of confidence reported by the subjects and these data showed that the study participants were completely confident that they were correct in claiming that the omitted items were presented to them prior as often as 48% of the time (Smith et al., 2000, p. 393).

All three experiments conducted by Smith et al. (2000) further demonstrated evidence for the existence of false memories in categorized list recall. These experiments showed that intrusions happened often, that they occurred with greater frequencies with increased delay between presentation and testing, and that priming also increased how often intrusions occurred. The results also showed that the output dominance of an item also could play an important role in both false and accurate recall of that item (Smith et al., 2000, p. 395). The findings of these experiments were in line with the previous research conducted on this topic.

It was a fact of note that the category lists generated for the two experiments discussed above were based on older category norm tools. A study conducted by van Overschelde, Rawson, and Dunlosky (2004) hypothesized that these older tools had become ill-suited for contemporary usage due to changing societal norms and social knowledge and consequently needed to be brought up-to-date. In order to do this, the researchers expanded and improved the frequently-used Battig and Montague category norms (van Overschelde, Rawson, & Dunlosky, 2004, p. 289). The researchers gathered 600 undergraduate study participants and presented them with 70 categories that were modified to reflect the aforementioned methodological changes (van Overschelde et al., 2004, p.290). The subjects were instructed to generate as many responses as possible to the category cues that they were given. The results were analyzed by the researchers in order to calculate the generational stability of the terms, amongst other factors. The results confirmed their hypothesis and indicated that the knowledge of category membership had changed significantly for many of the categories that were used in the Battig and Montague category norm tool since 1969, the year that that tool was developed (van Overschelde et al., 2004, p.295). The researchers consequently concluded that there was a genuine need to use “up-to-date norms in psychological research” (van Overschelde et al., 2004, p.295).

Given the promising results of prior studies conducted on the potential of utilizing the list learning paradigm to study the false memory phenomenon, the present study set out to replicate and extend those results so as to further the analysis and understanding of the phenomenon of false memory. The first goal of this study was therefore to replicate the effects of false memory as related to the Deese-Roediger-McDermott paradigm through the use of categorized word lists. The second was to determine whether an increased number of old exemplars presented in a given category would impact the false alarm rate for new items associated with the same studied category. Given the fact that the associate lists that were used to generate the category lists for both the studies conducted by Smith et al. (2000) and by Roediger and McDermott (1995) were shown to be potentially outdated, the exemplars selected for this study were chosen from the work of van Overschelde et al. (2004). The study was set up as a 2 X 2 repeated measures ANOVA design with two category sizes (4 vs. 8 studied exemplars per category) and two types of items (old items vs. new items). It was hypothesized that the false alarm rate would be higher for new words from the studied categories than for the new words from the unstudied categories. If confirmed, this would demonstrate that the current experiment’s results would be in line with the results of the experiments conducted by Roediger and McDermott (1995). It was also hypothesized that the false alarm rate would be greater for the exemplar categories with more members as compared to the false alarm rate for the categories with fewer exemplars. 
This was expected given the fact that there were 15 items in each list in the second experiment conducted by Roediger and McDermott (1995) as opposed to 12 items in each list in the first and given the fact that the false recall rate obtained in their second experiment was greater than the false recall rate obtained in their first experiment.

Method

Participants
University students enrolled in two sections of a senior laboratory course in cognitive psychology participated in the study. There were 22 students in section 1 and 24 students in section 2, for a total of 46 participants in the study. Subjects were not compensated for their participation as they took part in the study as part of a mandatory course exercise.

Apparatus and Materials
The subjects were presented with stimuli from a study list consisting of 12 different categories with 4 exemplars selected from each of the first 6 categories, and 8 exemplars selected from each of the latter 6 categories, for a grand total of 72 study word exemplars. These exemplars were selected from the van Overschelde et al. (2004) categorized word pool. All stimuli were presented on a screen at the front of the classroom through the use of the PowerPoint computer software as well as an overhead projector.

Procedure
The experiment was conducted in a computer laboratory classroom. Subjects sat at spots distributed throughout the classroom, and therefore different individuals were located at different distances from the screen and viewed the screen from different angles. The screen was positioned at one end of the classroom.

The subjects were presented with a total of 72 study word exemplars, with 4 exemplars from each of the first 6 distinct categories and 8 exemplars from each of the second 6 distinct categories. The items were shown in a random order, but in such a manner that half of the items in each category were shown in each half of the study list. The items were presented at a rate of 3 seconds per item.
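
The ordering constraint described above (random order, with half of each category's items in each half of the study list) could be produced along the following lines. This is a minimal sketch rather than the procedure the study actually used; the function name `build_study_list`, the placeholder category labels and words, and the fixed seed are all assumptions made for illustration.

```python
import random

def build_study_list(categories, seed=None):
    """Return a study order in which each category contributes half of
    its exemplars to each half of the list (hypothetical helper)."""
    rng = random.Random(seed)
    first_half, second_half = [], []
    for name, exemplars in categories.items():
        items = [(name, word) for word in exemplars]
        rng.shuffle(items)              # random split within the category
        mid = len(items) // 2
        first_half.extend(items[:mid])
        second_half.extend(items[mid:])
    rng.shuffle(first_half)             # random order within each half
    rng.shuffle(second_half)
    return first_half + second_half

# Placeholder materials: 6 categories of 4 exemplars and 6 of 8 (72 words).
categories = {f"cat4_{i}": [f"cat4_{i}_w{j}" for j in range(4)] for i in range(6)}
categories.update({f"cat8_{i}": [f"cat8_{i}_w{j}" for j in range(8)] for i in range(6)})

study_list = build_study_list(categories, seed=1)
print(len(study_list))  # 72
```

Splitting each category before shuffling the halves guarantees the half-and-half constraint by construction, instead of reshuffling until a random order happens to satisfy it.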

Following this initial presentation, there was a 4-minute retention interval between the study period and the recognition test, which was occupied with distributing the physical recording sheets and instructing the study participants how to proceed. Following this, the participants were presented with a 144-item recognition test which included 72 old studied items and 72 new items. The old studied items consisted of 24 items from the 4-exemplar categories and 48 items from the 8-exemplar categories. The 72 new items consisted of 4 new exemplars from each of the 6 different 4-exemplar study categories, 4 new exemplars from each of the 6 different 8-exemplar study categories, and 4 new exemplars from each of 6 new, previously unstudied categories. These items were presented in random order at a presentation rate of 6 seconds per item. Therefore, the independent variable that was manipulated was whether the words presented were old words from the studied categories with 4 exemplars, old words from the studied categories with 8 exemplars, new exemplars from the 4-exemplar study categories, new exemplars from the 8-exemplar study categories, or words from previously unstudied categories. The category size was counterbalanced across the two sections of the class: the categories that contained 4 exemplars for one section contained 8 exemplars for the other section, and vice versa.
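
The composition of the recognition test can be checked with simple arithmetic. The sketch below only restates the counts given in the procedure; the constant names and dictionary keys are labels introduced here for illustration.

```python
N_CATEGORIES_PER_TYPE = 6   # categories of each kind (4-exemplar, 8-exemplar, unstudied)
NEW_PER_CATEGORY = 4        # new exemplars tested from every category

test_items = {
    "old-4": N_CATEGORIES_PER_TYPE * 4,                        # studied words, 4-exemplar categories
    "old-8": N_CATEGORIES_PER_TYPE * 8,                        # studied words, 8-exemplar categories
    "new-4": N_CATEGORIES_PER_TYPE * NEW_PER_CATEGORY,         # new words, 4-exemplar categories
    "new-8": N_CATEGORIES_PER_TYPE * NEW_PER_CATEGORY,         # new words, 8-exemplar categories
    "new-unstudied": N_CATEGORIES_PER_TYPE * NEW_PER_CATEGORY, # new words, unstudied categories
}

n_old = test_items["old-4"] + test_items["old-8"]
n_new = sum(count for label, count in test_items.items() if label.startswith("new"))
print(n_old, n_new, n_old + n_new)  # 72 72 144
```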

All participants were asked to indicate whether they believed that any given item was presented to them prior. The results were recorded by the participant physically and processed electronically. These results indicated each participant's individual accurate recall rate for the old words from the categories with 4 items, their accurate recall rate for the old words from the categories with 8 items, their false alarm rate for new words from the categories with 4 items, their false alarm rate for new words from the categories with 8 items, and their false alarm rate for the items from the novel categories.
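
Scoring a participant's sheet amounts to computing, for each of the five item types, the proportion of items judged as previously presented. A minimal sketch follows; the response format (pairs of item type and a yes/no judgment) and the function name `score_sheet` are assumptions, since the recording sheets are not described in detail.

```python
from collections import defaultdict

def score_sheet(responses):
    """Proportion of 'presented before' judgments per item type.

    responses: iterable of (item_type, said_old) pairs, where item_type is
    one of 'old-4', 'old-8', 'new-4', 'new-8', 'new-unstudied'.
    For old items the proportion is a hit rate; for new items it is a
    false alarm rate.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for item_type, said_old in responses:
        total[item_type] += 1
        yes[item_type] += bool(said_old)
    return {item_type: yes[item_type] / total[item_type] for item_type in total}

# Toy sheet with a handful of judgments (not the study's data).
sheet = [
    ("old-4", True), ("old-4", True), ("old-4", False), ("old-4", True),
    ("new-4", False), ("new-4", True), ("new-4", False), ("new-4", False),
    ("new-unstudied", False), ("new-unstudied", False),
]
rates = score_sheet(sheet)
print(rates)
```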

Results
The results indicated that the 24 4-exemplar old studied items were recalled by the participants at a mean recall rate of .78 which was greater than the mean recall rate for the 48 8-exemplar old studied items which was .74. At the same time, the mean rate of false alarms for new exemplars for the 4-exemplar study categories was .15, which was lower than the mean rate of false alarms for the new exemplars from the 8-exemplar study categories which was .18. The mean rate of false alarms for new exemplars from the previously unstudied categories was .11. This showed that the false alarm rate was the lowest for the previously unstudied categories, as hypothesized. Additionally, it ought to be noted that the present study design was imbalanced due to the presence of new words from previously unstudied categories on the test which was a category of words that did not have an old word equivalent category counterpart.

These results were in line with the conclusions from the Smith et al. (2000) paper that indicated that the most accessible incorrect responses would be the most likely to occur. The accessibility of the intrusions of the new words from the previously studied categories could be explained as an effect of priming that led to these terms being more accessible than the new terms from the unstudied categories. Moreover, the rates of false recall of the critical lures reported by Roediger and McDermott (1995) for their first and second experiments were .40 and .55 respectively, which were considerably higher than the false alarm rates of .15 and .18 obtained in this study. This difference could be partially explained by differences in procedures as well as the disparate list lengths between the two studies.

It was notable that counterbalancing was performed between the two sections and this consisted of switching which categories contained 4 or 8 exemplars. This fact showed that the results were not based on the specific items in each list.

The data collected were transcribed by the researchers into an electronic format from the physical sheets that were initially completed by the students. These data are provided in Table 1 (available upon special request), and they were subsequently analyzed with a 2 X 2 repeated measures ANOVA using the IBM SPSS statistical analysis software. The main effect of test type (old vs. new) was significant, F(1,45) = 436.8, MSe = 0.037, p < .001. This indicated that test performance was above what could be obtained by chance; in other words, the students were not merely guessing on the recognition test. The F-statistic for the category effect (4-exemplar vs. 8-exemplar) was F(1,45) = .58. Seeing as this F-statistic was less than 1, there was no significant main effect of category size in isolation. Finally, the interaction between test type and category size was significant, F(1,45) = 17.31, MSe = 0.04, p < .001. This finding suggested that a smaller number of items within a category led to an increase in the hit rate as well as to a decrease in the false alarm rate. Conversely, when there was a greater number of items in a category, the hit rate decreased and the false alarm rate increased.
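
When both factors of a 2 X 2 design vary within subjects, each effect has one numerator degree of freedom, so each F(1, n - 1) equals the square of a one-sample t statistic computed on a per-participant contrast score. The sketch below illustrates that identity with made-up rates for four hypothetical participants. It is not the SPSS analysis reported above, and the function name `contrast_f` and all data values are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def contrast_f(scores):
    """F(1, n - 1) for a one-df within-subject contrast: the squared
    one-sample t of the per-participant contrast scores against zero."""
    n = len(scores)
    t = mean(scores) / (stdev(scores) / sqrt(n))
    return t ** 2

# Hypothetical proportions of "old" responses per participant, in the
# order (old-4, old-8, new-4, new-8). These are not the study's data.
data = [
    (0.80, 0.75, 0.10, 0.20),
    (0.70, 0.70, 0.20, 0.15),
    (0.85, 0.75, 0.15, 0.25),
    (0.75, 0.70, 0.10, 0.20),
]

test_type = [(o4 + o8) - (n4 + n8) for o4, o8, n4, n8 in data]    # old vs. new
category = [(o4 + n4) - (o8 + n8) for o4, o8, n4, n8 in data]     # 4 vs. 8 exemplars
interaction = [(o4 - n4) - (o8 - n8) for o4, o8, n4, n8 in data]  # test type x category

for label, scores in [("test type", test_type),
                      ("category size", category),
                      ("interaction", interaction)]:
    print(f"{label}: F(1, {len(data) - 1}) = {contrast_f(scores):.2f}")
```

With the 46 participants of the present study, the same computation would give 45 denominator degrees of freedom, matching the F(1,45) values reported above.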

Because an interaction effect was found, paired-samples t-tests were performed to explore its nature. The first test compared old words from the categories with 4 exemplars (old-4) against old words from the categories with 8 exemplars (old-8). The second compared new words from the categories with 4 exemplars (new-4) against new words from the categories with 8 exemplars (new-8). The result for the first pair (old-4 vs. old-8) was t(45) = 3.21, p = .002, and thus was significant. The result for the second pair (new-4 vs. new-8) was t(45) = 2.28, p = .027, and was therefore also significant.

Discussion
The primary goal of the present study was to replicate and expand the previous body of research analyzing the false memory phenomenon through the use of the list learning paradigm. Firstly, the current study replicated the effect of false memory as observed in the experiments performed by Roediger and McDermott (1995). Secondly, the present study also aimed to determine whether an increased number of old exemplars in a given category would impact the false alarm rate for new items associated with that same category. It was predicted that the obtained false alarm rate would be greater for new words belonging to the previously studied categories than the false alarm rate for the new words that belonged to new categories. It was also hypothesized that the false alarm rate would be greater for the exemplar categories that contained a greater number of presented members rather than fewer. In other words, it was predicted that the false alarm rate would be greater for new words belonging to the categories that had 8 previously presented exemplars as compared to the false alarm rate for the new words belonging to the categories that had only 4 previously presented exemplars.

The results of the study showed that the mean recall rate for the 4-exemplar categories was .78, greater than the .74 mean recall rate for the 8-exemplar categories, as predicted. Moreover, the mean false alarm rate for new words from the 4-exemplar categories was .15, lower than the false alarm rate for new words from the 8-exemplar categories which was .18. This confirmed the hypothesis that the false alarm rate for the 8-exemplar categories would be greater than the false alarm rate for the 4-exemplar categories. Moreover, both false alarm rates exceeded the .11 false alarm rate for new words from the unstudied categories. This led the researchers to conclude that both hypotheses were confirmed by the results of this study.

These results were expected given that they were consistent with the results obtained by Smith et al. (2000), who found that the most accessible incorrect responses would be the most likely to intrude or cause false alarms. The results obtained by the present study were also in line with the research performed by Roediger and McDermott (1995), who obtained a greater rate of false recall with lists that contained a greater number of studied items.

In addition, it was found that the test type effect was significant and showed that the test performance of the subjects was above chance, F(1,45) = 436.8, MSe = 0.037, p < .001. On the other hand, the category effect in isolation was not found to be significant, F(1,45) = .58. Notably, a significant interaction effect between test type and category size was obtained, F(1,45) = 17.31, MSe = 0.04, p < .001, which demonstrated a negative relationship between hit rate and category size, and a positive relationship between false alarm rate and category size. Put another way, a smaller number of items in a given category predicted a higher hit rate and a lower false alarm rate, and vice versa.

These results led the researchers to conclude that it was possible to observe the false memory phenomenon through the use of the list learning paradigm. This indicates that the paradigm could be extended and utilized in future studies to further the understanding of the false memory phenomenon.

The present study was limited by a relatively small number of participants and by the fact that the study subjects did not have precisely equal experiences of the words presented, because they viewed the presentation screen from differing angles and were seated at different distances from it. Future research could extend the present findings by generating lists with a greater contrast in the number of exemplars per list, comparing lists that contain far more exemplars with lists that contain far fewer. It may also be interesting to observe how adjusting the retention interval between presentation and recognition testing may affect false alarm rates.

Clearly, more research is required into the contentious topic of false memories. Given the fact that many prominent cases that take place in many diverse disciplines from legal to therapeutic environments have to do with individual subjective judgments of events long past, it is crucially important to consider the potential and the probability of the false memory phenomenon and the related phenomenon of false recall. The study of these two related phenomena is absolutely vital so as to better understand the nature of human memory in general, and to gain insight into how to reliably identify and avoid the phenomenon of false recall when it does inevitably occur in particular. Only armed with this knowledge can one ever hope to positively hold the guilty responsible for the things that they have truly done in the distant past, while simultaneously defending the good names and the rights of the innocent accused by individuals that mistake their false memories for events that genuinely took place.

REFERENCES

Re, G. (2018, September 17). California professor Christine Ford claims Kavanaugh sexually assaulted her: 'It derailed me'. Retrieved from https://www.foxnews.com/politics/california-professor-christine-ford-claims-kavanaugh-sexually-assaulted-her-it-derailed-me

Roediger, H. L., III, & McDermott, K. B. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(4), 803-814. doi:10.1037//0278-7393.21.4.803

Smith, S. M., Ward, T. B., Tindell, D. R., Sifonis, C. M., & Wilkenfeld, M. J. (2000). Category structure and created memories. Memory & Cognition, 28(3), 386-395. doi:10.3758/bf03198554

van Overschelde, J. P., Rawson, K. A., & Dunlosky, J. (2004). Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50(3), 289-335. doi:10.1016/j.jml.2003.10.003

© 2025 Bogdan ZADOROZHNY