Thomas N. Robb
Kyoto Sangyo University
Jay Ercanbrack
Kyoto Sangyo University
In order to study the effect of direct test preparation on TOEIC gain scores, two samples of students (i.e., English majors and non-majors) at a Japanese university were divided into three treatment groups: 1) TOEIC Preparation, 2) Business English and 3) "General" (four-skills) English. The results indicate that usage of TOEIC preparatory materials led to a statistically significant gain on post-test scores for the non-majors' reading component only. The authors conclude that TOEIC preparatory materials are of little benefit to students enrolled in a comprehensive program of English language study, but might boost the score of the reading component of students enrolled in a university-level general English course in Japan.
There is a clear tendency for students, not only in Japan, but around the world, to study for a test by reviewing past tests and concentrating their efforts on the types of language and test items that are known to appear on such tests. It is equally clear that if a test can be prepared for, then the test no longer can be said to measure general proficiency. Rather, it measures how well people have studied for the test.
Thus there is the inherent danger of the test becoming the tail that wags the curriculum dog. Henning expressed this concern thus: "If there is no concerted effort to subordinate testing to explicit curricular goals, there is an ever-present potential danger that tests themselves with all their inherent limitations will become the purpose of the educational encounter by default" (1990:380).
This study was designed to determine whether "teaching for the test" in a Japanese university setting does, in fact, result in higher test scores. Specifically, we set out to determine if students who use material designed for TOEIC test preparation or for "Business English" achieve higher gain scores than students who study an equal amount of time with standard language study materials.
The authors have found little previous research related to gain scores on the TOEIC test and only two studies, Alderson & Wall (1993) and Alderson & Hamp-Lyons (1996), concerning the TOEFL examination. These, however, were more concerned with the 'washback effect' of such test preparation on the actual content of classes and contained no objective data concerning the effects of coaching on subsequent test scores.
Fortunately, several studies have been conducted in relation to yet another standardized test, Educational Testing Service's SAT ("Scholastic Aptitude Test"), an exam which is administered to native-speaking high school students and is required as part of the application process for most American universities. Powers' (1993) "Coaching for the SAT: A Summary of the Summaries and an Update" is a thorough survey of such studies, particularly those which employ meta-analytical techniques to synthesize previous research. Much of what follows below has been drawn from this survey.
Preparation programs for academic aptitude and language proficiency tests are currently abundant. Taken together, they constitute a vast industry within the private educational sector. Students are propelled toward such programs by a desire to succeed on tests where the perceived stakes are high. As noted by several researchers (e.g., Mehrens and Kaminsky 1989), the higher the stakes of a test, the greater the desire for guided test preparation and practice.
Yet, despite the great popularity of test preparation courses and programs, relatively little research has been done to document whether special preparation can have a markedly positive effect on test scores. A resolution of this issue is obviously crucial for the creators of standardized tests, as well as for the test-takers themselves. As mentioned, if preparation via coaching in test-taking techniques and strategies is found to be effective, it would indicate that test scores are not reliable indicators of academic ability or language proficiency, but rather reflect, at least to some degree, an ability to take tests. If such a situation exists, the validity of the tests is called into question.
Commercial coaching companies often report considerable gains in test scores by their clientele as proof of the effectiveness of their coaching programs. However, as Powers (1993) points out, variations in an individual's test scores from one test administration to another can be expected to occur, and for a variety of reasons. First, test score gains may be the result of a "practice effect", wherein test takers have a greater sense of comfort, familiarity, and confidence when retaking a test than they possessed in their initial experience with the same exam, regardless of whether or not they have been coached in "test-wiseness" (Bachman 1990:114) strategies. Score increases may also reflect growth in an individual's ability over time, encouraged by a number of factors, rather than the direct influence of test coaching programs. Furthermore, variations in scores, either increases or decreases, may be due simply to measurement error. Upon retesting, it is quite usual for some examinees to show large increases in scores, and for others to show large decreases. This phenomenon of regression to the mean was demonstrated in a study by Johnson et al. (1985). Examining an SAT coaching program, the authors noted that the gains or losses recorded by coached students varied greatly depending on their initial test scores. Students who scored lowest on their first encounter with the SAT tended to make the greatest gains upon retesting, while those who at first scored most highly were likely to make the smallest gains or largest drops in their second round scores.
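Regression to the mean can be illustrated with a small simulation. The sketch below is not drawn from Johnson et al. (1985) or any other cited study. It assumes a hypothetical test whose observed score is a stable "true" score plus random measurement error, with no coaching and no real growth between sittings. The score scale, the standard deviations, and the function names are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_retest(n_students=10_000, true_sd=100.0, error_sd=40.0, seed=0):
    """Simulate two sittings of a test with no coaching and no real growth.

    Returns a list of (first_score, gain) pairs. All parameters are
    hypothetical and are not estimates from any real test.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n_students):
        ability = rng.gauss(500.0, true_sd)           # stable "true" score
        first = ability + rng.gauss(0.0, error_sd)    # observed score, sitting 1
        second = ability + rng.gauss(0.0, error_sd)   # observed score, sitting 2
        records.append((first, second - first))
    return records


def mean_gain_by_group(records, fraction=0.2):
    """Mean gain of the lowest and highest scorers on the first sitting."""
    ordered = sorted(records, key=lambda r: r[0])
    k = int(len(ordered) * fraction)
    low = statistics.mean(gain for _, gain in ordered[:k])
    high = statistics.mean(gain for _, gain in ordered[-k:])
    return low, high


low_gain, high_gain = mean_gain_by_group(simulate_retest())
print(f"lowest fifth on first sitting:  mean gain {low_gain:+.1f}")
print(f"highest fifth on first sitting: mean gain {high_gain:+.1f}")
```

Under these assumptions the lowest scorers show a positive mean gain and the highest scorers a negative one, even though nobody's ability has changed. This is why gains reported for a coached group cannot be interpreted without a comparable uncoached group.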
In contrast to the relative dearth of similar research concerning other large-scale, commercially available tests, several other studies have been conducted in the past two decades dealing with the effect of coaching on SAT scores (e.g., Becker 1990; Der Simonian and Laird 1983; Kulik, Bangert-Drowns and Kulik 1984; and Messick and Jungeblut 1981). As summarized by Powers (1993), and taking into account that simply repeating the test may lead to gains of around 15 points on the verbal section and 12 points on math (College Board 1991), these studies reveal that even the most well-known commercial coaching programs (e.g., Stanley Kaplan, Inc., and the Princeton Review) produce only modest score gains, typically 15 to 25 points each on the 200-800 point verbal and mathematical sections of the test. It has also been found that the scores of students who have undergone coaching receive a slightly higher boost on the math section than on the verbal section of the test (Becker 1990, Messick and Jungeblut 1981). In any case when, as suggested by Messick (1982), improvements in percentile ranking for coached students (usually consisting of only a few points) are considered, the gains made by coached SAT-takers appear to be meager. It seems clear then, at least in the case of the SAT, that the effects of coaching may fall far short of students' expectations.
Beyond the direct influence of coaching, differences in scores between coached and uncoached students may reflect other factors and tendencies associated with each group. Powers (1981) offers evidence to show that students who utilize formal or commercialized coaching services also make greater use of other test preparation resources than their uncoached peers. For example, they are more likely to conduct their own review of relevant subject matter, read supplementary test preparation books, and attend review sessions provided by their own schools. Such tendencies make it difficult, if not impossible, to objectively assess the impact that coaching programs alone may have on those who enroll in them.
One related question worth considering with regards to the test coaching issue is this: if coaching does have an effect (and it appears, as we have seen, to have at least some small impact on SAT scores), then what precisely is the reason for this effect? That is, what aspect of coaching is helping to raise test scores? One of the few studies to address this issue, again in the context of the SAT, was that of Johnson et al. (1985). It found that as a result of coaching, many test-takers were able to complete more items on both sections of the SAT. Since providing a correct answer on even a portion of the previously unmarked items would lead to higher overall scores, this newly developed ability was seen by the researchers as being the instrumental element in test score improvements for this set of coached students.
Though, as mentioned, few studies have analyzed the effect of coaching on language proficiency tests, one significant review (Kulik, Bangert-Drowns, and Kulik 1984) compares the SAT with a variety of other standardized aptitude tests, both academic and psychometric (e.g., GRE-Q, Stanford-Binet, WISC). It found that the effect of coaching for the other tests was much greater (approximately three times) than for the SAT. This may possibly be attributed to the preponderance of relatively simple test item formats found on the SAT. Item format is considered pivotal with regard to coaching since complex formats have been found to be more coachable than those of a simple nature (Powers 1986).
While the literature on the effectiveness of preparation courses and programs for language proficiency tests is bleakly sparse, there are several studies which have concerned themselves with the washback effects of such tests on EFL/ESL classrooms (Wesdorp 1982; Hughes 1988; Khaniya 1990; Wall and Alderson 1993; and Alderson and Hamp-Lyons 1996, among others). Washback, a term popular in British applied linguistics and commonly referred to as "backwash" in the field of general education, may be understood as the influence that a test has on teaching and learning. The "Washback Hypothesis", as explained by Alderson and Wall, assumes that "teachers and learners do things they would not necessarily otherwise do because of the test" (1993:117).
The concept of washback presupposes a belief in the notion that tests are prominent determiners of classroom practices and events. Accordingly, the term itself is neutral in that the influence of a test may be either positive or negative in nature. That is, a "poor" test yields negative washback while a "good" test will have effects perceived as positive. As summarized by Alderson and Wall (1993), some of the negative effects tests have been suspected of producing include narrowing or distortion of the curriculum (Vernon 1956; Madaus 1988; Cooley 1991), loss of instructional time (Smith et al. 1989), reduced emphasis on skills that require complex thinking or problem-solving (Frederickson 1984; Darling-Hammond and Wise 1985) and test score "pollution", meaning gains in test scores without a parallel improvement in actual ability in the construct under examination (Haladnya, Nolan, and Haas 1991).
In contrast, some researchers (i.e., Swain 1985 and Alderson 1986) emphasize the potential positive aspects of test influence and urge the creation of tests which, through constructive washback, will have enlightening effects on language curricula. Certain researchers (i.e., Morrow 1986 and Frederickson and Collins 1989) have suggested that a test's validity should be determined by the degree to which it has a positive influence on teaching. Morrow (1986) refers to this as "washback validity", while Frederickson and Collins (1989) have introduced the term "systemic validity" to refer to a similar process.
Remarkably, while claims of washback and its effects, both positive and negative, are numerous in educational literature, Alderson and Wall (1993) point out that little empirical evidence has been provided to support the argument that tests do indeed influence teaching, that is, that washback actually exists. Assertions concerning washback in past studies have been based primarily on anecdotal evidence, chiefly opinions and impressions gathered from teachers and administrators.
To amend this lack of objective data, Alderson and Hamp-Lyons (1996) set out to investigate the existence and extent of washback in one educational setting. Using a combination of classroom observations and interviews with teachers and students, they targeted preparation classes for the English proficiency test TOEFL (Test of English as a Foreign Language), a test of particular importance to non-native speakers of English interested in entering degree programs at American universities. Two teachers received extensive observation in both their TOEFL preparation and regular English classes, and an attempt was made to separate the effects of individual teacher style from TOEFL washback.
The study did not investigate the question of whether or not TOEFL preparation courses were effective in raising scores on the test and, to this date, no previous research of this kind appears to have been conducted. Only the processes of teaching and learning were observed and examined in order to determine the extent of TOEFL washback in this setting. The authors concluded that the TOEFL did indeed affect both what and how teachers taught, but that the effect differed in degree and kind from teacher to teacher. More importantly, they suggested that it is not a test alone that causes washback, but the way that test is approached by administrators (who may determine the necessity of large class sizes), materials writers (who may fail to give proper guidance to teachers on possible ways to teach with a certain set of materials), and teachers themselves (who may devote little energy to finding alternative or innovative ways to teach test preparation classes) which actually creates the phenomenon of washback for a given language proficiency test.
It is, then, against this broad backdrop of hypothesis and information concerning washback, generally, and the effects of coaching on test scores, specifically, that the present study was conducted.
The study was carried out with two distinct samples of freshman students at Kyoto Sangyo University: English majors (henceforth 'Majors') in the Faculty of Foreign Languages and Non-Majors from other faculties of the university taking freshman English courses offered by the school's English Language Education and Research Center. These two samples will be treated separately since there are important differences between them that make integration of the data unwise:
The majors were taking seven 90-minute classes per week in English. These students were pseudo-randomly assigned by the school to one of 8 sections ("kumi" in Japanese). Students in a particular section at Kyoto Sangyo University take classes together in all but two of their seven 'practical' courses. It was not feasible to vary the content of all of the courses, so only two courses, "Extensive Reading" and "Listening/Pronunciation", were used in the experimental design. The other five courses taken by the majors included "Intensive Reading", "Grammar", "Composition", "Conversation" and "General Cultural Studies". An assumption was made that the content of the other courses would be roughly similar and would therefore not jeopardize the validity of the study. Two of the sections, 7 & 8, were English majors with a specialization in International Relations. These students had a slightly different program with a content course instead of grammar. The results with these classes both included and excluded were essentially the same, so this minor difference will henceforth be ignored.
Most but not all non-majors had two classes per week, one of which, "Applied English", was part of this study. The other class, "Reading" was a traditional reading class which concentrated on the careful reading and understanding of short passages of text. While it would have been better if this class, too, had been included in the study, this was unfeasible. Since all students were receiving a like amount of this reading practice, however, this additional class should have made little difference in the overall outcome. The Applied English class met for a maximum of 27 class times during the school year for a maximum of 40.5 hours of contact time.
The initial level of the majors was considerably higher than that of the non-majors, as would be expected. One implication of this was that the same materials could not be used for both sets of students in most cases.
The majors, having chosen English as their primary area of study for the next 4 years, could be assumed to be more interested in English and more highly motivated than the non-majors.
The English majors were much more likely to do home assignments. This was not so much a matter of intrinsic motivation as a consequence of the fact that their English courses were required. If they had failed to meet the instructor's expectations, they would have had to repeat the course. This, in turn (depending on the number of other failures), might have set back their year of graduation. For the non-majors, the course was not required. If they failed, they could take courses in a variety of other subjects to garner sufficient credits for graduation.
The hypothesis tested was that the gain scores of all students, regardless of method of study, would be equal.
The researchers had no control over the composition of the classes. The majors in groups 1 through 6 (English--Language & Culture) were assigned pseudo-randomly by the University administration. Groups 7 and 8 (English--International Relations) were assigned alphabetically.
For the non-majors, the students are grouped according to their desired second foreign language. For example, all Business Majors who desired to study French were formed into one or more classes. Since there are normally too many students for a single class, the students are further divided into multiple classes according to their total score on the university's entrance examination. For this experiment, the instructors were assigned to six of the classes which contained the highest ranking students for certain combinations of major + second foreign language.
Majors (Treatment & Instructor)

Group   Reading         Listening
1       TOEIC (A)       TOEIC (H)
2       TOEIC (B)       TOEIC (H/I)
3       TOEIC (C)       TOEIC (I)
4       General (D)     General (J)
5       Business (E)    Business (K)
6       General (F)     General (L)
7       General (F)     General (M)
8       Business (G)    Business (K)

Letters in parentheses identify the instructor.

Non-majors (Treatment & Major)

Instructor   Group   Treatment   Major
X            101     General     Business
X            133     TOEIC       Economics
X            141     Business    Law
Y            218     General     Business
Y            292     TOEIC       Engineering
Y            228     Business    Economics
As can be seen from the above, the Non-major design has a neat 3 x 2 arrangement (treatments x instructors) while there are only two instances of "major" instructors teaching two sections for the entire period. In both of these cases, the instructor has classes of the same treatment.
The students were informed both in the course catalog and in their first class that a basic purpose of the course (regardless of treatment) was to achieve a high score on the TOEIC examination. Students were told that 30% of their final mark for the course would be based on the improvement in their test scores between the initial TOEIC examination (in May) and the final examination in January.
A shortened, demonstration version of the TOEIC (Form MT-93) was administered to all sections as a class activity one to two weeks prior to the actual pre-test. A list of 'hints' and test-taking strategies was also provided to all students in Japanese.
This preparation was considered important as one way to offset the "practice effect", mentioned previously, whereby students generally score higher on second and successive administrations of a test merely due to greater familiarity with the test itself.
In this study we were more interested in assessing the effect of the variation in the teaching of language content rather than differences arising from the acquisition of test-taking strategies. While it was inevitable that the students in the 'TOEIC' treatment would have more exposure over the course of the year to such test-taking strategies, we felt that this could be partially offset by familiarizing all students with the examination beforehand.
The pre-test was administered on May 11, 1996, approximately one month after the start of the school year. For administrative reasons, it was impossible to schedule the test any earlier. The instructors in the TOEIC treatment, in particular, were instructed to avoid any classwork which could be termed "TOEIC test preparation" until the test was over. The 'general' and 'business' treatments had no other exposure to TOEIC-type questions during the course of the experiment, save for the one sample test and the actual pre- and post-tests.
The post-test was administered on January 18, 1997, which was the day after the conclusion of the final term. The pre-test had revealed differences among some of the test groups that were significant at the 0.05 level. It was decided to compensate for these differences in the final analysis by using an analysis of covariance (ANCOVA). Possible intervening variables such as age, club activities, etc. were also taken into account. The results of the pre- and post-test are presented below in the section on "Results and Analysis."
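The logic of an analysis of covariance with a single covariate can be sketched in a few lines. The code below is an illustration only, not the Systat procedure used in this study. It computes the treatment F-ratio by comparing the residual sum of squares of a model that regresses the post-test on the pre-test alone with that of a model that also allows a separate intercept for each treatment (with a common slope). The function names and the synthetic scores are hypothetical.

```python
import random
from statistics import fmean


def _cross(xs, ys):
    """Sum of cross-products of deviations from the respective means."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def ancova_f(groups):
    """One-way ANCOVA with one covariate.

    groups maps a treatment label to a list of (pre, post) pairs.
    Returns (F, df_treatment, df_error) for the treatment effect on the
    post-test after adjusting for the pre-test.
    """
    all_pre = [pre for pairs in groups.values() for pre, _ in pairs]
    all_post = [post for pairs in groups.values() for _, post in pairs]

    # Residual SS when post is regressed on pre alone (treatment ignored).
    ss_reduced = (_cross(all_post, all_post)
                  - _cross(all_pre, all_post) ** 2 / _cross(all_pre, all_pre))

    # Residual SS with a separate intercept per treatment and a common slope:
    # pool the within-group sums of squares and cross-products.
    sxx = sxy = syy = 0.0
    for pairs in groups.values():
        pre = [p for p, _ in pairs]
        post = [q for _, q in pairs]
        sxx += _cross(pre, pre)
        sxy += _cross(pre, post)
        syy += _cross(post, post)
    ss_error = syy - sxy ** 2 / sxx

    df_treat = len(groups) - 1
    df_error = len(all_pre) - len(groups) - 1
    f_ratio = ((ss_reduced - ss_error) / df_treat) / (ss_error / df_error)
    return f_ratio, df_treat, df_error


def fake_class(rng, n, boost):
    """Hypothetical (pre, post) scores; not data from this study."""
    scores = []
    for _ in range(n):
        pre = rng.gauss(300.0, 60.0)
        post = 0.8 * pre + 70.0 + boost + rng.gauss(0.0, 40.0)
        scores.append((pre, post))
    return scores


rng = random.Random(1)
demo = {
    "General": fake_class(rng, 50, 0.0),
    "Business": fake_class(rng, 50, 0.0),
    "TOEIC": fake_class(rng, 50, 30.0),
}
f_ratio, df1, df2 = ancova_f(demo)
print(f"F({df1}, {df2}) = {f_ratio:.3f}")
```

Because the pre-test absorbs much of the variation in post-test scores, the error term is smaller than in a plain analysis of variance on the post-test, and initial differences between groups are not mistaken for treatment effects.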
Two of the once-weekly classes of the English majors were used for the experiment, their extensive reading class (I-B) and their listening/pronunciation class (I-F). For the extensive reading course, students in all groups were required to read over 1000 pages a year and to write summaries of what they read in a notebook kept for that purpose. Thus the three treatments were only different in the materials used for the in-class component.
Texts for each course were selected according to the following factors:
For most of the treatments, the cost of the required materials was greater than students would normally be willing to pay for a university course bearing only 2 credits. Students were thus required to pay a maximum of ¥2500 per course, the rest of the expense being subsidized by grant funds from TOEIC.
Reading: For classwork, the students used the SRA "Reading Laboratory 2c" exclusively. There was no 'up front' instruction from the teacher. This is the material normally used in this course for the in-class component of the I-B extensive reading course.
Listening: The students used the following texts.
Improving Your Pronunciation (Meirindo + tape)
The main text, On Target for the TOEIC (Longman) was shared between the reading and listening sections, each of which did the sections relevant to their particular course. This text was chosen over other available texts because it contained the most 'pedagogical material' in addition to the ubiquitous practice items and vocabulary lists. Copies of the audio tape for this text were distributed to all students. There was also a Japanese language companion text with notes, translations of vocabulary, etc. In addition, the following texts were used in order to provide a sufficient volume of material for the year course:
Reading: Building Skills for TOEIC (Pifer)
TOEIC Kisokara Gambare! - Vocabulary
Listening: TOEIC Kisokara Gambare! - Listening (with tape)
Practice with English Reduced Forms
Eigo Onseigaku no Kiso (The Basics of English Phonetics)
Business Objectives (Oxford) was shared between the reading and listening sections, each doing the relevant parts. Copies of the audio tape for this text were distributed to all students. Other supplementary texts used were:
Reading: English by Newspaper (Heinle & Heinle)
Listening: Business Venture
English by Newspaper, while not a business text, was adopted after two considerations: 1) there were no 'business reading' texts available for students at the low-intermediate level, and 2) the material in English by Newspaper contained numerous articles in business and related fields.
Main Text: High Impact (Longman) + Workbook & Tapes
High Impact is a four skills text written primarily for Japanese 'false beginners.' It was targeted at a lower level than the corresponding text used for the English majors, since it was assumed (correctly) that the non-majors' general level of English proficiency would be considerably lower.
Main Text: On Target for the TOEIC (Addison-Wesley)
Japanese companion text + Tapes
TOEIC kara Gambare! (Vocabulary)
Main Text: Business Basics (Oxford) + Tapes
This text was selected for similar reasons to High Impact -- it was deemed to be targeted at the correct level for non-major students.
The content of each course was thus dictated by the assigned texts. No direct control was exerted over the instructors to conform to a specific lesson plan. The instructors, however, were encouraged to coordinate their class with the other instructors teaching the same materials, either through face-to-face meetings or by keeping a log into which each could report their progress. (The teachers for the TOEIC treatment reading classes, in particular, taught on different days and thus rarely had a chance to meet each other in person.)
Non-Majors
                  TOTGAIN   LISTGAIN   READGAIN
Business
  N OF CASES           53         53         53
  MEAN GAIN         6.415     -6.981     13.396
  STANDARD DEV     71.403     49.907     40.641
General
  N OF CASES           46         46         46
  MEAN GAIN        12.609      0.326     12.283
  STANDARD DEV     77.386     48.217     51.020
TOEIC
  N OF CASES           50         50         50
  MEAN GAIN        53.300      5.400     47.900
  STANDARD DEV     80.930     44.845     51.844

Majors
                  TOTGAIN   LISTGAIN   READGAIN
Business
  N OF CASES           60         60         60
  MEAN GAIN        60.917     31.333     29.583
  STANDARD DEV     58.226     47.299     33.119
General
  N OF CASES           83         83         83
  MEAN GAIN        52.349     28.795     23.554
  STANDARD DEV     59.614     42.667     41.647
TOEIC
  N OF CASES           73         73         73
  MEAN GAIN        80.000     40.479     39.863
  STANDARD DEV     64.253     46.091     43.922
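To give a sense of scale for the raw gains in the tables above, the difference between two groups can be expressed in pooled standard deviation units (Cohen's d). This measure is not reported in the original analysis, and it uses the unadjusted gain scores rather than the covariate-adjusted means, so it should be read only as a rough illustration. The function name is hypothetical; the means, standard deviations, and group sizes are copied from the Non-Majors READGAIN column.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / sqrt(pooled_var)


# Non-Majors reading gains (mean, standard deviation, n) from the table above.
toeic = (47.900, 51.844, 50)
general = (12.283, 51.020, 46)
business = (13.396, 40.641, 53)

d_toeic_general = cohens_d(*toeic, *general)
d_toeic_business = cohens_d(*toeic, *business)
print(f"TOEIC vs General:  d = {d_toeic_general:.2f}")
print(f"TOEIC vs Business: d = {d_toeic_business:.2f}")
```

Both values fall roughly between 0.6 and 0.8, which is conventionally described as a medium to large difference. The formal test of the treatment effect remains the analysis of covariance reported in the next section.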
Analysis Of Variance (Non-Majors)
Systat version 5.0 was used to perform an analysis of variance on the data. In order to save space only the most useful data are reported below. For each population (Majors and Non-majors) tests were performed on the scores on the January 18 administration.
As discussed earlier, the pre-test score was used as a covariate to compensate for initial differences between the groups. In two cases with the Non-Majors, total score and reading score, a significant difference between treatments appeared. When the students' self-reports of additional English classes were taken into consideration, the treatment effect on the total score was no longer significant (p = 0.092). For the reading score, the treatment effect remained significant (p = 0.020) after the questionnaire variables, including previous overseas experience, were added to the model.
Table 3 -- Analysis of Total Scores (TOTAL118) for Non-majors

DEP VAR: TOTAL118   N: 149   MULTIPLE R: 0.547   SQUARED MULTIPLE R: 0.300

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$           34003.100     2      17001.550      3.331  0.039
TOTAL511        309338.355     1     309338.355     60.601  0.000
ERROR           740151.649   145       5104.494
------------------------------------------------------------------------------
Analysis of TOTAL118 (Non-majors) with the intervening variable 'OTHERCL' (Other English classes) added to the equation.

DEP VAR: TOTAL118   N: 127   MULTIPLE R: 0.618   SQUARED MULTIPLE R: 0.382
(22 cases deleted due to missing data -- No questionnaire)

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$           23321.456     2      11660.728      2.429  0.092
TOTAL511        245203.451     1     245203.451     51.069  0.000
OTHERCL          74208.663     1      74208.663     15.456  0.000
ERROR           585769.217   122       4801.387
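The F-ratios and probabilities in Table 3 can be checked by hand. Each F-ratio is the treatment mean square divided by the error mean square, and because the treatment effect has exactly 2 degrees of freedom, the upper-tail probability of the F distribution has a simple closed form. The sketch below recomputes the two TREAT$ rows from the printed sums of squares. The function names are hypothetical, and the closed-form probability is valid only when the numerator has 2 degrees of freedom.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F-ratio: effect mean square divided by error mean square."""
    return (ss_effect / df_effect) / (ss_error / df_error)


def p_value_two_numerator_df(f, df_error):
    """Upper-tail probability of an F(2, df_error) variate.

    Uses the identity P(F > f) = (1 + 2f/df_error) ** (-df_error / 2),
    which holds only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f / df_error) ** (-df_error / 2.0)


# TREAT$ and ERROR rows of Table 3, pre-test covariate only.
f_a = f_ratio(34003.100, 2, 740151.649, 145)
p_a = p_value_two_numerator_df(f_a, 145)

# TREAT$ and ERROR rows of Table 3 with OTHERCL added.
f_b = f_ratio(23321.456, 2, 585769.217, 122)
p_b = p_value_two_numerator_df(f_b, 122)

print(f"covariate only: F = {f_a:.3f}, p = {p_a:.3f}")
print(f"with OTHERCL:   F = {f_b:.3f}, p = {p_b:.3f}")
```

The recomputed values agree with the printed F-RATIO and P columns to three decimal places. The same two functions can be applied to the TREAT$ rows of the later tables, since every treatment effect in this study has 2 degrees of freedom.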
Table 4 -- Analysis of Listening Scores (LIST118) for Non-majors

DEP VAR: LIST118   N: 149   MULTIPLE R: 0.392   SQUARED MULTIPLE R: 0.153

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$             986.725     2        493.362      0.240  0.787
LIST511          53887.223     1      53887.223     26.257  0.000
ERROR           297586.783   145       2052.323
Table 5 -- Analysis of Reading Scores (READ118) for Non-majors

DEP VAR: READ118   N: 149   MULTIPLE R: 0.586   SQUARED MULTIPLE R: 0.343

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$           21184.068     2      10592.034      6.435  0.002
READ511         116316.437     1     116316.437     70.661  0.000
ERROR           238685.870   145       1646.109

DEP VAR: READ118   N: 127   MULTIPLE R: 0.653   SQUARED MULTIPLE R: 0.426

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$           14030.878     2       7015.439      4.060  0.020
READ511          99584.655     1      99584.655     57.630  0.000
CLUB               805.702     1        805.702      0.466  0.496
OUTSIDE            753.678     1        753.678      0.436  0.510
OSEAS             7172.884     1       7172.884      4.151  0.044
ERROR           207360.867   120       1728.007

POST HOC TEST OF READ118
USING MODEL MSE OF 1604.609 WITH 120. DF.

MATRIX OF PAIRWISE MEAN DIFFERENCES:
            BUSIN     GEN'L     TOEIC
BUSIN       0.000
GEN'L      -4.198     0.000
TOEIC      21.141    25.339     0.000

SCHEFFE TEST.
MATRIX OF PAIRWISE COMPARISON PROBABILITIES:
            BUSIN     GEN'L     TOEIC
BUSIN       1.000
GEN'L       0.898     1.000
TOEIC       0.079     0.032     1.000
Table 6 -- Analysis of Scores for Majors

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$           17884.770     2       8942.385      2.811  0.062
TOTAL511        519685.583     1     519685.583    163.389  0.000
ERROR           674302.491   212       3180.672

DEP VAR: LIST118   N: 216   MULTIPLE R: 0.539   SQUARED MULTIPLE R: 0.291

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$            2951.144     2       1475.572      0.879  0.417
LIST511         140482.164     1     140482.164     83.670  0.000
ERROR           355946.662   212       1678.994

DEP VAR: READ118   N: 216   MULTIPLE R: 0.581   SQUARED MULTIPLE R: 0.338

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO      P
TREAT$            5029.187     2       2514.593      2.071  0.129
READ511         130148.453     1     130148.453    107.167  0.000
ERROR           257463.134   212       1214.449
Our hypothesis was that the gain scores of all students, regardless of method of study, would be equal. This was confirmed in all but one instance: Non-Major students who used the TOEIC preparation materials showed a significantly greater gain on the Reading Section than those who studied using regular materials or business materials. The TOEIC-treatment students of one instructor for the Non-Majors actually demonstrated a gain of 77 points overall, with 56 of them in the reading section.
Although the gains for one instructor are considerably greater than those of the other instructor, the pattern is similar in that the Reading Section always shows a greater gain than the listening section, and the TOEIC treatment shows a greater gain than the other two treatments which are similar in their total gain scores. No differences emerged on the follow-up questionnaire (Appendix B) which would account for this difference in the results.
It is also clear that whatever gains there might have been with the majors were 'washed out' by the many other courses which they were taking concurrently. Some might claim that it would have been wiser to alter the content of all classes during the week so that clearer results could have been obtained. This, however, would have resulted in an artificial curriculum, one which would not exist in a normal university. Since English majors would take TOEIC preparation as only one element of their course of study, our model closely approximates a possible actual implementation.
One surprising result is that the Non-Major students, with the exception of Instructor Y's TOEIC section, improved very little over the course of the year and in some cases, even showed 'negative gain'. This can be taken as a testament to the poor attitude of Japanese university students towards their 'general education' subjects. The instructors reported that they could assign little homework since there was little expectation that the students would actually do it. Thus most of the students' exposure was limited to the 26 class meetings. It appears that the activities carried out in class did not, for many students, result in any real 'learning' that could be translated into improved TOEIC scores.
MEAN GAIN         TOTAL   LISTENING   READING
Instructor X
  Business        -11.4       -21.7      10.3
  General          -8.2        -8.2       0.0
  TOEIC            24.7       -13.3      38.0
Instructor Y
  Business         27.9        10.8      17.1
  General          27.2         4.6      30.9
  TOEIC            77.6        21.3      56.3
Even with the one section that did show great improvement, we cannot ascertain how much of this gain can be attributed to greater 'test wiseness' as opposed to greater knowledge of English. It would appear, however, that a greater knowledge of the schema of the written genre appearing on the TOEIC examination might have been a significant factor. This and other possible causes are discussed in the following section.
Although all groups generally showed improvement over the course of the year, one salient difference between the Non-Majors and Majors lies in where the improvement took place. With the majors, the improvement in the Listening and Reading scores was almost equal, whereas with the Non-Majors, there was little gain in the listening component (-6.9, 0.3 and 5.4 for the three treatments) and a greater rate of improvement in the reading section (13.3, 12.2 and 47.9). There was little improvement in listening even though both of the instructors used English as the medium of instruction. We can tentatively postulate the following reasons for this:
Figure 1
While this study seems to suggest that TOEIC materials can be effective for improving the reading component scores of non-major students at a Japanese university, our results are by no means conclusive. The non-major students, for example, had initial scores far below those of the English majors. It could be that students in this low score range can benefit more from such instruction than can those at a higher level of ability.
Further, the TOEIC course was a substitute for the standard general English course, which might have placed greater emphasis on English for communicative purposes. Forcing students to study TOEIC preparatory material might, therefore, be doing them a disservice if communicative ability is the goal of the program.
Care needs to be taken when applying these findings to other teaching situations. Further studies are required to confirm whether these results apply to students of differing levels of ability, nationality, or motivation or in other educational settings such as in-company training programs and language schools.
Alderson, J. Charles, (1986). Innovations in Language Testing? in Portal, M. (ed.), 93-105.
Alderson, J.C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.
Alderson, J. Charles and Liz Hamp-Lyons (1996). "TOEFL preparation courses: a study of washback." Language Testing 13, 3, 280-297.
Amer, Aly Anwer (1993), "Teaching EFL students to use a test-taking strategy" Language Testing 10, 1, 71-78.
Bachman, Lyle F., (1990). Fundamental Considerations in Language Testing, Oxford University Press, Oxford.
Becker, B. J. (1990). Coaching for the Scholastic Aptitude Test: Further synthesis and appraisal. Review of Educational Research, 60, 373-417.
Cooley, W.W. (1991). Statewide student assessment. Educational Measurement: Issues and Practice 10, 3-6.
Darling-Hammond, L. and Wise, A.E., (1985). Beyond standardization: state standards, and school improvement. The Elementary School Journal 85, 315-36.
DerSimonian, R. and Laird, N. M. (1983). Evaluating the effect of coaching on SAT scores: A meta-analysis. Harvard Educational Review, 53, 1-15.
Frederickson, J.R. (1984). The real test bias: influences of testing on teaching and learning. American Psychologist 39, 193-202.
Frederickson, J.R. and Collins, A. (1989). A systems approach to educational testing. Educational Researcher 18, 27-32.
Haladyna, T.M., Nolen, S.B. and Haas, N.S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher 20, 2-20.
Henning, Grant (1990). "Priority Issues in the Assessment of Communicative Language Abilities", Foreign Language Annals, 23:5 October 1990, 379-384.
Hughes, A. (1988). Introducing a needs-based test of English language proficiency into an English-medium university in Turkey. In Hughes, A., ed., 134-53.
Hughes, A., ed. (1988). Testing English for university study. ELT Document 127, London: Modern English Publications.
Johnson S. T., Asbury, C. A., Wallace M. B., Robinson S. & Vaughn J. (1985), The effectiveness of a program to increase Scholastic Aptitude Test scores of Black students in three cities. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, April 1985.
Khaniya, T.R. (1990), Examinations as instruments for educational change: investigating the washback effect of the Nepalese English exams. Unpublished PhD dissertation, University of Edinburgh.
Kulik, J.A., Bangert-Drowns, R.L. & Kulik, C.C. (1984). Effectiveness of coaching for aptitude tests. Psychological Bulletin, 95, 179-188.
Madaus, G.F. (1988), The influence of testing on the curriculum. In Travers, L., editor, Critical issues in curriculum (87th yearbook of the Society for the Study of Education), Part 1, Chicago, IL: Chicago University Press, 83-121.
Messick, S and Jungeblut, A. (1981), Time and method in coaching for the SAT. Psychological Bulletin, 89, 191-216.
Morrow, K. (1986). The evaluation of tests of communicative performance, in Portal (ed.).
Portal, M. (ed.), Innovations in Language Testing. London: NFER/Nelson.
Powers, Donald E. (1993), Educational Measurement: Issues and Practice, Summer 1993, 24-31.
Smith, M.L., Edelsky, C., Draper, K., Rottenberg, C. and Cherland, M. (1989), The role of testing in elementary schools. Los Angeles, CA: Center for Research on Educational Standards and Student Tests, Graduate School of Education, UCLA.
Swain, M. (1985). Large-scale communicative testing, in Lee, Y.P., Fok, C.Y.Y., Lord, R. and Low, G. (eds), New Directions in Language Testing. Hong Kong: Pergamon Press.
Vernon, P.E. (1956). The Measurement of Abilities (2nd edn.) London: University of London Press.
Wall, D. and Alderson, J.C. (1993). Examining Washback: the Sri Lankan Impact Study, Language Testing 10, 41-70.
Wesdorp, H. (1982) Backwash effects of language testing in primary and secondary education. Stichting Centrum voor onderwijsonderzoek van de Universiteit van Amsterdam.
We would like to acknowledge the generous assistance of Steve Ross, who offered advice at every stage of this project, from its inception to the final report.
We would like to thank the following organizations for their assistance with this research.
The Chauncey Group International Ltd. and IIEC (Japan) for the funding that made this research possible.
Oxford University Press and Addison-Wesley/Longman for generously allowing us to duplicate the tapes of their texts locally at a reduced rate for research purposes.
Non-Majors -- Pre-Test (May 11, 1996)
TOTAL511 LIST511 READ511
Both instructors
Business
N OF CASES 53 53 53
MEAN 339.906 180.377 159.528
STANDARD DEV 81.886 38.566 56.664
General
N OF CASES 46 46 46
MEAN 327.283 172.391 154.891
STANDARD DEV 56.977 29.359 41.318
TOEIC
N OF CASES 50 50 50
MEAN 303.900 166.600 137.300
STANDARD DEV 79.259 34.941 60.603
Instructor X
Business
N OF CASES 29 29 29
MEAN 332.759 179.828 152.931
STANDARD DEV 68.473 33.580 49.290
General
N OF CASES 19 19 19
MEAN 347.105 178.158 168.947
STANDARD DEV 57.839 36.485 38.427
TOEIC
N OF CASES 23 23 23
MEAN 298.696 167.609 131.087
STANDARD DEV 69.335 31.001 51.234
Instructor Y
Business
N OF CASES 24 24 24
MEAN 348.542 181.042 167.500
STANDARD DEV 96.487 44.599 64.656
General
N OF CASES 27 27 27
MEAN 313.333 168.333 145.000
STANDARD DEV 53.042 22.997 41.067
TOEIC
N OF CASES 27 27 27
MEAN 308.333 165.741 142.593
STANDARD DEV 87.903 38.548 68.097
Non-Majors -- Post-Test (January 18, 1997)
TOTAL511 LIST511 READ511
Both Instructors
Business
N OF CASES 53 53 53
MEAN 336.400 169.473 168.266
STANDARD DEV 9.896 6.270 5.601
General
N OF CASES 46 46 46
MEAN 337.808 173.214 164.945
STANDARD DEV 10.538 6.680 5.988
TOEIC
N OF CASES 50 50 50
MEAN 369.633 175.702 192.188
STANDARD DEV 10.229 6.447 5.798
Instructor X
Business
N OF CASES 29 29 29
MEAN 321.379 158.103 163.276
STANDARD DEV 75.674 50.399 44.406
General
N OF CASES 19 19 19
MEAN 338.947 170.000 168.947
STANDARD DEV 66.637 43.589 31.072
TOEIC
N OF CASES 23 23 23
MEAN 323.478 154.348 169.130
STANDARD DEV 63.932 29.974 43.788
Instructor Y
Business
N OF CASES 24 24 24
MEAN 376.458 191.875 184.583
STANDARD DEV 107.496 64.028 54.492
General
N OF CASES 27 27 27
MEAN 340.556 174.630 165.926
STANDARD DEV 80.160 42.289 55.575
TOEIC
N OF CASES 27 27 27
MEAN 385.926 187.037 198.889
STANDARD DEV 87.300 46.868 53.553
Majors
Pre-test (May 11, 1996)
TOTAL511 LIST511 READ511
Business
N OF CASES 60 60 60
MEAN 436.083 223.250 212.833
STANDARD DEV 83.184 53.622 46.243
General
N OF CASES 83 83 83
MEAN 430.361 213.313 217.048
STANDARD DEV 67.818 43.953 41.039
TOEIC
N OF CASES 73 73 73
MEAN 411.370 205.000 205.959
STANDARD DEV 68.345 37.352 48.309
Post-Test (January 18, 1997)
TOTAL118 LIST118 READ118
Business
N OF CASES 60 60 60
MEAN 490.743 249.797 242.203
STANDARD DEV 76.708 51.497 43.144
General
N OF CASES 83 83 83
MEAN 482.711 242.108 240.602
STANDARD DEV 69.235 44.394 39.820
TOEIC
N OF CASES 73 73 73
MEAN 491.370 245.479 245.890
STANDARD DEV 79.387 49.771 45.303
Follow-up Questionnaire (Translated into English)
Name___________________________ Student Number___________Day_____Period______
Please answer these questions about your course truthfully. This information
will be used in order to make the course better in the future. There is no
connection between this questionnaire and your grade for this course.
A. Questions about you.
During this school year did you,
1. belong to a club for studying English? Yes No
2. study English outside of school? Yes No
3. take any other 'general education' classes in English? Yes No
4. speak often with an English-speaking friend? Yes No
5. Before coming to this university did you, Yes No
go abroad? (Where?__________________ How long? ________________)
B. Questions about this course.
Agree Disagree
1. The pace of this class was too fast. 5 4 3 2 1
2. More time should have been spent on each exercise. 5 4 3 2 1
3. I think that I can read more quickly now thanks 5 4 3 2 1
to this course.
4. I think that I can understand what I read better 5 4 3 2 1
thanks to this course.
5. My ability to understand spoken English improved 5 4 3 2 1
thanks to this course.
6. The material was too difficult for me. 5 4 3 2 1
7. There was too much homework. 5 4 3 2 1
8. I listened to the tapes at home regularly. 5 4 3 2 1
9. What I learned will be useful to me in the future. 5 4 3 2 1
10. This class will help me get a higher score on the 5 4 3 2 1
TOEIC test.
11. I could understand almost everything that the 5 4 3 2 1
teacher said in English.
12. The contents of the class was interesting. 5 4 3 2 1
13. The teacher spoke in Japanese too much. 5 4 3 2 1
14. The teacher spoke too fast for me to understand. 5 4 3 2 1
The results of the questionnaire showed that, for most items, there was very little difference between the groups in their responses. Some items did, indeed, result in a 'significant difference', but the magnitude of the absolute difference in the values is so small that these 'significant differences' have little import.
For example, for item B-8, "I listened to the tapes at home regularly," we find the following results for the Non-Majors:
N Mean
TOEIC 50 2.1
REGULAR 46 2.4
BUSINESS 51 1.9
The Fisher PLSD post-hoc test reports a significant difference between the Business and Regular treatments, with p = .0045.
While this does show that students in the Regular treatment probably listened to their tapes more than those in the other treatment groups, the absolute difference between these two groups is only 0.5 on a scale of 1 to 5. More importantly, all groups are below the mid-point and have disagreed to some extent with the statement.
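The point about magnitude can be made concrete with a little arithmetic. The Python sketch below takes the three reported means for item B-8 and computes the largest gap between groups, that gap as a share of the range of the 1-to-5 scale, and whether every group falls below the scale mid-point of 3. The variable names are illustrative assumptions; only the three means come from the questionnaire results.

```python
# Item B-8 means for the Non-Majors, as reported above.
# On the questionnaire, 5 sits under "Agree" and 1 under "Disagree".
means = {"TOEIC": 2.1, "REGULAR": 2.4, "BUSINESS": 1.9}

SCALE_MIN, SCALE_MAX = 1, 5
midpoint = (SCALE_MIN + SCALE_MAX) / 2  # 3.0

# Largest absolute difference between any two groups (Regular vs. Business).
largest_gap = round(max(means.values()) - min(means.values()), 1)

# The same gap expressed as a fraction of the full width of the scale.
share_of_scale = largest_gap / (SCALE_MAX - SCALE_MIN)

# True when every group, on average, leans towards "Disagree".
all_below_midpoint = all(m < midpoint for m in means.values())

print(largest_gap, share_of_scale, all_below_midpoint)
```

Under these assumptions, the gap that reached statistical significance covers only one eighth of the available scale, and no group's mean reaches the mid-point.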
The results of Part B of the questionnaire are reported below.
N Sizes
TOEIC REGULAR BUSINESS
Non-Majors 50 46 51
Majors-List 80 80 80
Majors-Read 76 84 66
1. The pace of this class was too fast.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.8 2.5 2.3 T > B p = .0070
Majors-List 2.5 2.5 2.7 ns
Majors-Read 2.3 2.5 2.6 B > T p = .0490
2. More time should have been spent on each exercise.
TOEIC REGULAR BUSINESS Significance
Non-Majors 3.1 2.8 2.9 ns
Majors-List 3.4 3.1 2.9 T > B p = .0318
Majors-Read 2.9 3.2 3.0 R > T p = .0214
Note: One of the three TOEIC Major listening sections received an average score of 3.9 as opposed to 3.3 and 2.9 for the other sections. Thus the pace of the class was significantly greater in only one class, not the treatment as a whole.
3. I think that I can read more quickly now thanks to this course.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.7 2.7 2.4 ns
Majors-List N/A
Majors-Read 3.5 3.4 3.5 ns
4. I think that I can understand what I read better thanks to this course.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.9 2.6 2.4 T > B p = .0079
Majors-List N/A
Majors-Read 3.6 3.3 3.5 T > R p = .0370
Note: The Non-Majors all scored below the half-way mark of 3 on the Agree/Disagree scale. Despite the fact that it was only in the reading section that the TOEIC treatment improved more than the other treatments, the students themselves apparently did not perceive themselves as having improved in their reading ability.
5. My ability to understand spoken English improved thanks to this course.
TOEIC REGULAR BUSINESS Significance
Non-Majors 3.3 3.3 3.0 ns
Majors-List 3.4 3.7 2.9 R>T p=.0069; R>B p<.0001
Majors-Read 1.9 1.8 2.5 B>T p<.0012; B>R p<.0001
Note: The Major Regular course included a greater emphasis on pronunciation and sound discrimination exercises which might have caused this difference in perceived improvement. The nature of the Major Reading Business treatment required much more "teacher talk" which explains the higher rating.
6. The material was too difficult for me.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.8 2.4 2.4 T > B p = .0230
Majors-List 2.6 2.4 2.6 ns
Majors-Read 2.9 3.0 3.2 ns
7. There was too much homework.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.1 2.3 1.8 R > B p = .0067
Majors-List 2.1 1.9 1.8
Majors-Read 3.8 4.0 3.7
Note: The high ratings for the Major-Reading courses are due to the large amount of outside reading required for all the students, regardless of the treatment. Generally speaking, only the in-class work was varied depending on the treatment.
8. I listened to the tapes at home regularly.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.1 2.4 1.9 R > B p = .0045
Majors-List 2.4 3.0 2.5 R>T p=.0004; R>B p=.0046
Majors-Read N/A
9. What I learned will be useful to me in the future.
TOEIC REGULAR BUSINESS Significance
Non-Majors 3.6 3.3 3.4 ns
Majors-List 3.7 4.0 3.0 R>B p<.0001; T>B p<.0046
Majors-Read 3.6 3.5 3.4 ns
10. This class will help me get a higher score on the TOEIC test.
TOEIC REGULAR BUSINESS Significance
Non-Majors 3.8 2.9 3.2 T>B p=.0007; T>R p<.0001
Majors-List 3.7 2.5 2.8 T>B p=.0009; T>R p<.0001
Majors-Read 3.6 3.2 3.0 T>B p=.0013; T>R p=.0114
Note: Predictably, those who were directly studying TOEIC materials believed that this material was helpful, despite our experimental evidence to the contrary.
11. I could understand almost everything that the teacher said in English.
TOEIC REGULAR BUSINESS Significance
Non-Majors 2.6 2.8 2.7 ns
Majors-List 2.9 3.3 2.8 R>B p=.0066; R>T p=.0145
Majors-Read 3.2 3.3 3.3
Note: The instructor for one of the Major-Listening Regular sections conducted his class completely in Japanese and instructed his students not to respond to this question. Only two sections, with a combined N of 49, are included here.
12. The contents of the class was interesting.
TOEIC REGULAR BUSINESS Significance
Non-Majors 3.2 3.3 3.4 ns
Majors-List 3.4 3.4 2.7 T>B p=.0006; R>B p<.0001
Majors-Read 3.1 2.8 2.7 T>B p=.0044; T>R p=.0155
Note: Despite the fact that the bulk of the English Major graduates find themselves working in business, it appears that they do not find studying Business English interesting.
13. The teacher spoke in Japanese too much.
TOEIC REGULAR BUSINESS Significance
Non-Majors 1.6 1.7 1.5 ns
Majors-List 3.0 2.7 2.8
Majors-Read 2.0 1.2 1.1 T>B p<.0001; T>R p<.0001
Note: This item suffered from the same problem as item 11; thus only two sections were tabulated for the Majors-Listening Regular treatment.
14. The teacher spoke too fast for me to understand.
TOEIC REGULAR BUSINESS Significance
Non-Majors 3.2 2.9 2.8 ns
Majors-List 2.0 2.2 1.9 ns
Majors-Read 2.2 2.5 2.5 R > T p = .0096