Educational testing

Educational testing shifted drastically in the early 1900s due to the influence of Lewis M. Terman (1877-1956) and Carl Campbell Brigham (1890-1943), two fervent eugenicists who believed that intelligence was a static trait that could be measured by tests. Prior to the creation of the Binet scale to determine intelligence, testing in education was strictly concerned with achievement and consisted of chapter and unit exams in a particular course. Such testing is used to determine how much knowledge a student has retained or can reproduce on a particular topic. In 1905, with the publication of the original Binet scale by Alfred Binet (1857-1911), testing shifted from being purely achievement-based to being about aptitude. The main distinction between achievement and aptitude testing is that the latter is used to predict future behaviors.

In the United States specifically, aptitude testing was supported by eugenicists such as Terman who assumed that a) intelligence was hereditary; b) intelligence was a linear characteristic of humans; and c) intelligence was normally distributed within a population. These three assumptions were the basis for Terman’s development of the US IQ test, which itself was modeled on Binet’s original intelligence scale. Terman’s test identified what he termed high-grade defectives, placing them under surveillance (Gould, 1996, p. 209). As a supporter of the eugenics movement himself, Terman was instrumental in creating the multiple-choice standardized version of the IQ test and pushing for its widespread use in schools. He also set the norm for intelligence testing at a score of 100, and helped with the evolution of IQ testing into college entrance exams such as the SAT. Thus, eugenic thinking is connected with educational testing both historically, because of the involvement of eugenicists in educational testing, and ideologically, in how tests based on eugenic ideology are created and used in the educational system.

Test Creation: Norm-referenced interpretation
Norm-referenced tests are tests where a child’s “performance is compared to a normative sample in order to interpret the child’s performance relative to a reference population” (Spaulding, Szulga, & Figueroa, 2012, p. 178, emphasis added). Depending on which group is chosen, a reference population can be as small as a classroom or a school or as large as all Grade 8 students in a given province or state. Looking at the ways in which Binet and Terman created their reference populations shows how norm-referenced interpretation works.

To begin, both Binet and Terman chose children for their original scales whom they deemed “typical” of a particular age. They then gave these children a set of test items to establish what “typical” children of that age could accomplish. This meant that if a particular 8-year-old was unable to complete the same tasks as the “typical” 8-year-old, that child would be labeled as mentally deficient to varying degrees depending on how far removed the child’s performance was from that of the “typical” child. The emergence of the notion of a “typical” child of a particular age marked the creation of the first norm in educational testing.

Later, in Terman and Merrill’s 1937 revision of the Stanford-Binet Scale, the normative sample of students was based on 3,184 subjects, all American-born, all belonging to the white race, from 17 communities in 11 states, selected from “average schools” (p. 12). Terman believed that intelligence was normally distributed; he therefore eliminated any test scores and test items that did not fall into a normal distribution for his reference population. Any student who took the 1937 Stanford-Binet IQ test had their performance compared to the normally distributed scores of 3,184 white American students from “average schools.” Terman and Merrill’s test was used to shape children’s futures by determining which students required special educational training or which students should be placed in the honors versus the vocational track in high school.
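
The dependence of a norm-referenced score on the chosen reference population can be illustrated with a minimal Python sketch. It is not Terman and Merrill's scoring procedure. It assumes a simple standardization in which a raw score is expressed as a distance from the normative sample's mean and rescaled so that the sample mean maps to 100, the norm mentioned earlier. The function name, the sample data, and the standard deviation of 15 are illustrative assumptions, not values taken from the sources cited here.

```python
from statistics import mean, pstdev


def norm_referenced_score(raw_score, normative_sample,
                          scale_mean=100.0, scale_sd=15.0):
    """Rescale a raw score relative to a normative sample.

    Hypothetical helper: scale_sd=15 is an assumed convention,
    not a figure from the article's sources.
    """
    sample_mean = mean(normative_sample)
    sample_sd = pstdev(normative_sample)
    if sample_sd == 0:
        raise ValueError("normative sample has no spread")
    # Distance from the sample mean, in units of the sample's spread.
    z = (raw_score - sample_mean) / sample_sd
    return scale_mean + scale_sd * z


# Two made-up normative samples of raw scores.
sample_a = [10, 12, 14, 16, 18]
sample_b = [14, 16, 18, 20, 22]

# The same raw score of 14 is "typical" against sample_a
# but well below the norm against sample_b.
print(round(norm_referenced_score(14, sample_a), 1))  # 100.0
print(round(norm_referenced_score(14, sample_b), 1))  # 78.8
```

The child's performance is identical in both calls; only the reference population changes, and with it the interpretation of the score.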

Test Use: Stratification/Tracking
Norm-referenced interpretation is still commonly used in education as a placement method in both high school and university, a practice variously referred to as stratification, tracking, sequencing, or streaming.

Often, though not always, tracking in high school is done by testing students on a norm-referenced scale, whether that be an actual intelligence test or a test of a specific subject matter based on a defined norm or average. In the UK, for example, the 11+ exams were first proposed by Cyril Burt, a eugenicist who believed that intelligence was set by age 11. In his mind, testing students at the age of 11 would allow the government to determine the best path for students continuing their education, whether that be grammar school, apprenticeship, or the world of work. By contrast, in the Alberta mathematics curriculum there is no defined test that students are required to take, but there are quotas for how many students are expected to be in each track once they enter secondary school. Each student’s performance in mathematics in Grade 9 is therefore compared to that of all the other students currently in Grade 9 in a particular school, such that 40-60% of students going into Grade 10 are placed in the highest track, 25-35% in the middle track, and 15-25% in the lowest track (Alberta Curriculum Branch, 1970). Thus, the norm-referenced interpretation for Grade 10 mathematics courses is based on the reference population of Grade 9 students in a particular school.

Norm-referenced testing is also used to “select students who are prepared for college [or university]” (Cimetta, D’Agostino, & Levin, 2010, p. 10). In his role as chairman of the College Entrance Examination Board (CEEB) from 1923-1926, C.C. Brigham, a known eugenicist, was part of the movement to select university students based on intelligence. Up until this time, the CEEB had an entrance examination that required students to write essays on literature and history. However, given that most of the students writing the exam were New England boarding school boys attempting to get into Ivy League universities such as Harvard and Princeton, students were generally accepted if their family could pay tuition. Brigham based the Scholastic Aptitude Test (SAT) on the Stanford-Binet IQ test, while also increasing the difficulty of the questions, as he had found, when testing his students at Princeton, that the questions in the IQ test were too easy. The SAT was first officially administered on June 23, 1926, when 8,040 students took the test and had their scores reported to the universities where they were applying (Lemann, 2000). Although the SAT was not actually used to determine acceptance at that time, for several years following its introduction students’ test scores were kept and compared to their first-year grades to determine the reliability of the SAT itself and its ability to predict which students were qualified for university study. From these humble beginnings, the SAT grew from being used to determine Harvard scholarship recipients in 1934 to an entrance requirement for any student going to university.
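
Returning to the quota-based placement described above for Alberta, the following Python sketch shows how placement by quota makes a student's track depend on who else is in the school. It is an illustration only and not the procedure of the Alberta Curriculum Branch: the function name, the student labels, the marks, and the exact shares of 50%, 30%, and 20% (chosen from within the ranges cited above) are all assumptions.

```python
def assign_tracks(marks, quotas=(("highest", 0.50),
                                 ("middle", 0.30),
                                 ("lowest", 0.20))):
    """Place students into tracks by rank within one school.

    marks: dict mapping a (hypothetical) student label to a Grade 9 mark.
    quotas: assumed shares per track, listed from highest to lowest.
    Ties keep the order in which students were listed.
    """
    ranked = sorted(marks, key=marks.get, reverse=True)
    n = len(ranked)
    placements = {}
    start = 0
    cumulative = 0.0
    for i, (track, share) in enumerate(quotas):
        cumulative += share
        # The last track absorbs any rounding remainder.
        end = n if i == len(quotas) - 1 else round(cumulative * n)
        for student in ranked[start:end]:
            placements[student] = track
        start = end
    return placements


# Two made-up schools; the first student in each has a mark of 70.
school_a = {"a1": 70, "a2": 65, "a3": 60, "a4": 55, "a5": 50,
            "a6": 45, "a7": 40, "a8": 35, "a9": 30, "a10": 25}
school_b = {"b1": 70, "b2": 98, "b3": 95, "b4": 92, "b5": 89,
            "b6": 86, "b7": 83, "b8": 80, "b9": 77, "b10": 74}

print(assign_tracks(school_a)["a1"])  # highest
print(assign_tracks(school_b)["b1"])  # lowest
```

A mark of 70 leads to the highest track in one school and the lowest in the other, because the quotas are filled relative to the reference population rather than against a fixed standard.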

Conclusion
Educational testing is still living and fighting with the legacy of eugenics in its past, from the eugenicists who created the exams to the ideology behind our ideas of intelligence. One way in which testing is changing is that tests like the SAT are no longer taken as a measure of intelligence by all. There are movements to remove the requirement from university acceptance in the US altogether, and recent reports suggest that more than 800 four-year universities no longer require standardized tests for entrance (FairTest, 2015). However, norm-referenced interpretations are still being used to track students in secondary schools in varying ways, allowing that legacy to continue.

-Michelle Hawks

Alberta Curriculum Branch (1970). Revision of the high school mathematics curriculum. Edmonton, AB: Department of Education.
Block, N. J. & Dworkin, G. (Eds.). (1976). The IQ controversy: Critical readings. New York: Pantheon Books.
Cimetta, A. D., D’Agostino, J. V., & Levin, J. R. (2010). Can high school achievement tests serve to select college students? Educational Measurement: Issues and Practice, 29(2), 3-12.
Darling-Hammond, L. (2004). From “separate but equal” to “No Child Left Behind”: The collision of new standards and old inequalities. In D. Meier, & G. Wood (Eds.) Many children left behind: How the No Child Left Behind Act is damaging our children and our schools (pp. 3-32). Boston, MA: Beacon Press.
Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited (Presidential Address). Educational Measurement: Issues and Practice, 24(3), 21-28.
Gould, S. J. (1996). The mismeasure of man. New York: W.W. Norton & Company.
Jacoby, R. & Glauberman, N. (1995). The bell curve debate: History, documents, opinions. New York: Times Books.
Lemann, N. (2000). The big test: The secret history of the American meritocracy. New York: Farrar, Straus and Giroux.
Lindeman, R. H. (1967). Educational measurement. Glenview, Ill: Scott, Foresman and Company.
Lucas, S. R. (1999). Tracking inequality: Stratification and mobility in American high schools. New York: Teachers College Press.
National Center for Fair and Open Testing (FairTest) (2015). College and university admissions testing. Retrieved from: http://www.fairtest.org/university
Norris, S. P., Macnab, J. S., & Phillips, L. M. (2007). Cognitive modeling of performance on diagnostic achievement tests: A philosophical analysis and justification. In J. P. Leighton & M. J. Gierl (Eds.) Cognitive diagnostic assessment for education (pp. 61-84). New York: Cambridge University Press.
Spaulding, T. J., Szulga, M. S., & Figueroa, C. (2012). Using norm-referenced tests to determine severity of language impairment in children: Disconnect between US Policy makers and test developers. Language, Speech, and Hearing Services in Schools, 43, 176-190.
Terman, L. M. & Merrill, M. A. (1937). Measuring intelligence: A guide to the administration of the new revised Stanford-Binet tests of intelligence. Cambridge, MA: Houghton Mifflin Company.
