A Study of the Effect of Direct Test Preparation on the TOEIC Scores of Japanese University Students

Thomas N. Robb
Kyoto Sangyo University

Jay Ercanbrack
Kyoto Sangyo University

Not for citation! This is the current draft of a work in progress. Any comments that you might have would be appreciated and appropriately acknowledged. Please post them to <trobb@cc.kyoto-su.ac.jp>.

Abstract

In order to study the effect of direct test preparation on TOEIC gain scores, two samples of students (i.e., English majors and non-majors) at a Japanese university were divided into three treatment groups: 1) TOEIC Preparation, 2) Business English and 3) "General" (four-skills) English. The results indicate that usage of TOEIC preparatory materials led to a statistically significant gain on post-test scores for the non-majors' reading component only. The authors conclude that TOEIC preparatory materials are of little benefit to students enrolled in a comprehensive program of English language study, but might boost the score of the reading component of students enrolled in a university-level general English course in Japan.

Introduction

There is a clear tendency for students, not only in Japan, but around the world, to study for a test by reviewing past tests and concentrating their efforts on the types of language and test items that are known to appear on such tests. It is equally clear that if a test can be prepared for, then the test no longer can be said to measure general proficiency. Rather, it measures how well people have studied for the test.

Thus there is the inherent danger of the test becoming the tail that wags the curriculum dog. Henning expressed this concern as follows: "If there is no concerted effort to subordinate testing to explicit curricular goals, there is an ever-present potential danger that tests themselves with all their inherent limitations will become the purpose of the educational encounter by default" (1990:380).

This study was designed to determine whether "teaching for the test" in a Japanese university setting does, in fact, result in higher test scores. Specifically, we set out to determine if students who use material designed for TOEIC test preparation or for "Business English" achieve higher gain scores than students who study an equal amount of time with standard language study materials.

Past Studies

The authors have found little previous research related to gain scores on the TOEIC test and only two studies, Alderson & Wall (1993) and Alderson & Hamp-Lyons (1996), concerning the TOEFL examination. These, however, were more concerned with the 'washback effect' of such test preparation on the actual content of classes and contained no objective data concerning the effects of coaching on subsequent test scores.

Fortunately, several studies have been conducted in relation to yet another standardized test, Educational Testing Service's SAT ("Scholastic Aptitude Test"), an exam which is administered to native-speaking high school students and is required as part of the application process for most American universities. Powers' (1993) "Coaching for the SAT: A Summary of the Summaries and an Update" is a thorough survey of such studies, particularly those which employ meta-analytical techniques to synthesize previous research. Much of what follows below has been drawn from this survey.

Preparation Programs for the SAT

Preparation programs for academic aptitude and language proficiency tests are currently abundant. Taken together, they constitute a vast industry within the private educational sector. Students are propelled toward such programs by a desire to succeed on tests where the perceived stakes are high. As noted by several researchers (e.g., Mehrens and Kaminsky 1989), the higher the stakes of a test, the greater the desire for guided test preparation and practice.

Yet, despite the great popularity of test preparation courses and programs, relatively little research has been done to document whether special preparation can have a markedly positive effect on test scores. A resolution of this issue is obviously crucial for the creators of standardized tests, as well as for the test-takers themselves. As mentioned, if preparation via coaching in test-taking techniques and strategies is found to be effective, it would indicate that test scores are not reliable indicators of academic ability or language proficiency, but rather reflect, at least to some degree, an ability to take tests. If such a situation exists, the validity of the tests is called into question.

Commercial coaching companies often report considerable gains in test scores by their clientele as proof of the effectiveness of their coaching programs. However, as Powers (1993) points out, variations in an individual's test scores from one test administration to another can be expected to occur, and for a variety of reasons. First, test score gains may be the result of a "practice effect", wherein test takers have a greater sense of comfort, familiarity, and confidence when retaking a test than they possessed in their initial experience with the same exam, regardless of whether or not they have been coached in "test-wiseness" (Bachman 1990:114) strategies. Score increases may also reflect growth in an individual's ability over time, encouraged by a number of factors, rather than the direct influence of test coaching programs. Furthermore, variations in scores - either increases or decreases - may be due simply to measurement error. Upon retesting, it is quite usual for some examinees to show large increases in scores, and for others to show large decreases. This phenomenon of regression to the mean was demonstrated in a study by Johnson et al. (1985). Examining an SAT coaching program, they noted that the gains or losses recorded by coached students varied greatly depending on their initial test scores. Students who scored lowest on their first encounter with the SAT tended to make the greatest gains upon retesting, while those who at first scored most highly were likely to make the smallest gains or largest drops in their second round scores.
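Regression to the mean can be illustrated with a small simulation. The sketch below is not drawn from Johnson et al. (1985); it assumes a hypothetical population in which each examinee has a fixed "true" score and each test administration adds independent measurement error. The mean of 500 and the two standard deviations are illustrative assumptions, and no coaching or learning takes place between the two administrations.

```python
import random
from statistics import mean


def simulate_retest(n=5000, seed=0, true_sd=80.0, error_sd=40.0):
    """Return (mean gain of lowest quartile, mean gain of highest quartile).

    All parameters are hypothetical; ability does not change between tests.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        ability = rng.gauss(500.0, true_sd)           # stable "true" score
        first = ability + rng.gauss(0.0, error_sd)    # first test = ability + error
        second = ability + rng.gauss(0.0, error_sd)   # retest = same ability, new error
        pairs.append((first, second))
    pairs.sort(key=lambda p: p[0])                    # rank examinees by first-test score
    q = n // 4
    low_gain = mean(s - f for f, s in pairs[:q])      # lowest initial scorers
    high_gain = mean(s - f for f, s in pairs[-q:])    # highest initial scorers
    return low_gain, high_gain


low_gain, high_gain = simulate_retest()
print(f"lowest quartile mean gain:  {low_gain:+.1f}")
print(f"highest quartile mean gain: {high_gain:+.1f}")
```

Under these assumptions the lowest initial scorers gain on average and the highest initial scorers lose on average, even though no one's ability has changed. This is why raw gains reported for a coached group are difficult to interpret without a comparison group.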

In contrast to the relative dearth of similar research concerning other large-scale, commercially available tests, several other studies have been conducted in the past two decades dealing with the effect of coaching on SAT scores (e.g., Becker 1990; DerSimonian and Laird 1983; Kulik, Bangert-Drowns and Kulik 1984; and Messick and Jungeblut 1981). As summarized by Powers (1993), and taking into account that simply repeating the test may lead to gains of around 15 points on the verbal section and 12 points on math (College Board 1991), these studies reveal that even the most well-known commercial coaching programs (e.g., Stanley Kaplan, Inc., and the Princeton Review) produce only modest score gains, typically 15 to 25 points each on the 200-800 point verbal and mathematical sections of the test. It has also been found that the scores of students who have undergone coaching receive a slightly higher boost on the math section than on the verbal section of the test (Becker 1990, Messick and Jungeblut 1981). In any case, when improvements in percentile ranking for coached students (usually consisting of only a few points) are considered, as suggested by Messick (1982), the gains made by coached SAT-takers appear to be meager. It seems clear then, at least in the case of the SAT, that the effects of coaching may fall far short of students' expectations.

Beyond the direct influence of coaching, differences in scores between coached and uncoached students may reflect other factors and tendencies associated with each group. Powers (1981) offers evidence to show that students who utilize formal or commercialized coaching services also make greater use of other test preparation resources than their uncoached peers. For example, they are more likely to conduct their own review of relevant subject matter, read supplementary test preparation books, and attend review sessions provided by their own schools. Such tendencies make it difficult, if not impossible, to objectively assess the impact that coaching programs alone may have on those who enroll in them.

One related question worth considering with regard to the test coaching issue is this: if coaching does have an effect (and it appears, as we have seen, to have at least some small impact on SAT scores), then what precisely is the reason for this effect? That is, what aspect of coaching is helping to raise test scores? One of the few studies to address this issue, again in the context of the SAT, was that of Johnson et al. (1985). It found that as a result of coaching, many test-takers were able to complete more items on both sections of the SAT. Since providing a correct answer on even a portion of the previously unmarked items would lead to higher overall scores, this newly developed ability was seen by the researchers as being the instrumental element in test score improvements for this set of coached students.

Though, as mentioned, few studies have analyzed the effect of coaching on language proficiency tests, one significant review (Kulik, Bangert-Drowns, and Kulik 1984) compares the SAT with a variety of other standardized aptitude tests, both academic and psychometric (e.g., GRE-Q, Stanford-Binet, WISC, etc.). It found that the effect of coaching for the other tests was much greater (approximately three times) than for the SAT. This may possibly be attributed to the preponderance of relatively simple test item formats found on the SAT. Item format is considered pivotal with regard to coaching since complex formats have been found to be more coachable than those of a simple nature (Powers 1986).

"Washback" Effects on Language Preparation Programs

While the literature on the effectiveness of preparation courses and programs for language proficiency tests is bleakly sparse, there are several studies which have concerned themselves with the washback effects of such tests on EFL/ESL classrooms (Wesdorp 1982; Hughes 1988; Khaniya 1990; Wall and Alderson 1993; and Alderson and Hamp-Lyons 1996, among others). Washback, a term popular in British applied linguistics and commonly referred to as "backwash" in the field of general education, may be understood as the influence that a test has on teaching and learning. The "Washback Hypothesis", as explained by Alderson and Wall, assumes that "teachers and learners do things they would not necessarily otherwise do because of the test" (1993:117).

The concept of washback presupposes a belief in the notion that tests are prominent determiners of classroom practices and events. Accordingly, the term itself is neutral in that the influence of a test may be either positive or negative in nature. That is, a "poor" test yields negative washback while a "good" test will have effects perceived as positive. As summarized by Alderson and Wall (1993), some of the negative effects tests have been suspected of producing include narrowing or distortion of the curriculum (Vernon 1956; Madaus 1988; Cooley 1991), loss of instructional time (Smith et al. 1989), reduced emphasis on skills that require complex thinking or problem-solving (Frederickson 1984; Darling-Hammond and Wise 1985) and test score "pollution", meaning gains in test scores without a parallel improvement in actual ability in the construct under examination (Haladyna, Nolen, and Haas 1991).

In contrast, some researchers (e.g., Swain 1985 and Alderson 1986) emphasize the potential positive aspects of test influence and urge the creation of tests which, through constructive washback, will have enlightening effects on language curricula. Certain researchers (e.g., Morrow 1986 and Frederickson and Collins 1989) have suggested that a test's validity should be determined by the degree to which it has a positive influence on teaching. Morrow (1986) refers to this as "washback validity", while Frederickson and Collins (1989) have introduced the term "systemic validity" to refer to a similar process.

Remarkably, while claims of washback and its effects, both positive and negative, are numerous in educational literature, Alderson and Wall (1993) point out that little empirical evidence has been provided to support the argument that tests do indeed influence teaching, that is, that washback actually exists. Assertions concerning washback in past studies have been based largely on anecdotal evidence, chiefly opinions and impressions gathered from teachers and administrators.

To amend this lack of objective data, Alderson and Hamp-Lyons (1996) set out to investigate the existence and extent of washback in one educational setting. Using a combination of classroom observations and interviews with teachers and students, they targeted preparation classes for the English proficiency test TOEFL (Test of English as a Foreign Language), a test of particular importance to non-native speakers of English interested in entering degree programs at American universities. Two teachers received extensive observation in both their TOEFL preparation and regular English classes, and an attempt was made to separate the effects of individual teacher style from TOEFL washback.

The study did not investigate the question of whether or not TOEFL preparation courses were effective in raising scores on the test and, to this date, no previous research of this kind appears to have been conducted. Only the processes of teaching and learning were observed and examined in order to determine the extent of TOEFL washback in this setting. The authors concluded that the TOEFL did indeed affect both what and how teachers taught, but that the effect differed in degree and kind from teacher to teacher. More importantly, they suggested that it is not a test alone that causes washback, but the way that test is approached by administrators (who may determine the necessity of large class sizes), materials writers (who may fail to give proper guidance to teachers on possible ways to teach with a certain set of materials), and teachers themselves (who may devote little energy to finding alternative or innovative ways to teach test preparation classes) which actually creates the phenomenon of washback for a given language proficiency test.

It is, then, against this broad backdrop of hypothesis and information concerning washback, generally, and the effects of coaching on test scores, specifically, that the present study was conducted.

Experimental Design

The study was carried out with two distinct samples of freshman students at Kyoto Sangyo University: English majors (henceforth 'Majors') in the Faculty of Foreign Languages and Non-Majors from other faculties of the university taking freshman English courses offered by the school's English Language Education and Research Center. These two samples will be treated separately since there are important differences between them that make integration of the data unwise:

  1. Contact hours/week.

    The majors were taking seven 90-minute classes per week in English. These students were pseudo-randomly assigned by the school to one of 8 sections ("kumi" in Japanese). Students in a particular section at Kyoto Sangyo University take classes together in all but two of their seven 'practical' courses. It was not feasible to vary the content of all of the courses, so only two courses, "Extensive Reading" and "Listening/Pronunciation", were used in the experimental design. The other five courses taken by the majors included "Intensive Reading", "Grammar", "Composition", "Conversation" and "General Cultural Studies". An assumption was made that the content of the other courses would be roughly similar and would therefore not jeopardize the validity of the study. Two of the sections, 7 & 8, were English majors with a specialization in International Relations. These students had a slightly different program with a content course instead of grammar. The results with these classes both included and excluded were essentially the same, so this minor difference will henceforth be ignored.

    Most but not all non-majors had two classes per week, one of which, "Applied English", was part of this study. The other class, "Reading" was a traditional reading class which concentrated on the careful reading and understanding of short passages of text. While it would have been better if this class, too, had been included in the study, this was unfeasible. Since all students were receiving a like amount of this reading practice, however, this additional class should have made little difference in the overall outcome. The Applied English class met for a maximum of 27 class times during the school year for a maximum of 40.5 hours of contact time.

  2. Level of English.

    The initial level of the majors was considerably higher than that of the non-majors, as would be expected. One implication of this was that the same materials could not be used for both sets of students in most cases.

  3. Motivation.

    The majors, having chosen English as their primary area of study for the next 4 years, could be assumed to be more interested in English and more highly motivated than the non-majors.

  4. Homework.

    The English majors were much more likely to do home assignments. This was not so much a matter of intrinsic motivation as a consequence of the fact that their English courses were required. If they had failed to meet the instructor's expectations, they would have had to repeat the course. This, in turn (depending on the number of other failures), might have set back their year of graduation. For the non-majors, the course was not required. If they failed, they could take courses in a variety of other subjects to garner sufficient credits for graduation.

Hypothesis

The gain scores of all students, regardless of method of study, would be equal.

Initial Setup

Class Configurations

The researchers had no control over the composition of the classes. The majors in groups 1 through 6 (English--Language & Culture) were assigned pseudo-randomly by the University administration. Groups 7 and 8 (English--International Relations) were assigned alphabetically.

For the non-majors, the students are grouped according to their desired second foreign language. For example, all Business Majors who desired to study French were formed into one or more classes. Since there are normally too many students for a single class, the students are further divided into multiple classes according to their total score on the university's entrance examination. For this experiment, the instructors were assigned to six of the classes which contained the highest ranking students for certain combinations of major + second foreign language.


Table 1 -- Treatment Groups

Majors (Treatment & Instructor)

         Reading                   Listening
Group    Treatment   Instructor    Treatment   Instructor

 1       TOEIC       A             TOEIC       H
 2       TOEIC       B             TOEIC       H/I
 3       TOEIC       C             TOEIC       I
 4       General     D             General     J
 5       Business    E             Business    K
 6       General     F             General     L
 7       General     F             General     M
 8       Business    G             Business    K

Non-majors (Treatment & Major)

Instructor X                          Instructor Y
Group   Treatment   Major             Group   Treatment   Major

 101    General     Business           218    General     Business
 133    TOEIC       Economics          292    TOEIC       Engineering
 141    Business    Law                228    Business    Economics


As can be seen from the above, the Non-major design has a neat 3 x 2 arrangement (treatments x instructors) while there are only two instances of "major" instructors teaching two sections for the entire period. In both of these cases, the instructor has classes of the same treatment.

Prior Information Provided to the Students

The students were informed both in the course catalog and in their first class that a basic purpose of the course (regardless of treatment) was to achieve a high score on the TOEIC examination. Students were told that 30% of their final mark for the course would be based on the improvement in their test scores between the initial TOEIC examination (in May) and the final examination in January.

Preparation for Pre-Test

A shortened, demonstration version of the TOEIC (Form MT-93) was administered to all sections as a class activity one to two weeks prior to the actual pre-test. A list of 'hints' and test-taking strategies was also provided to all students in Japanese.

This preparation was considered important as one way to offset the "practice effect", mentioned previously, whereby students generally score higher on second and successive administrations of a test merely due to greater familiarity with the test itself.

In this study we were more interested in assessing the effect of variation in the teaching of language content than in differences arising from the acquisition of test-taking strategies. While it was inevitable that the students in the 'TOEIC' treatment would have more exposure over the course of the year to such test-taking strategies, we felt that this could be partially offset by familiarizing all students with the examination beforehand.

Pre- and Post-Tests

The pre-test was administered on May 11, 1996, approximately one month after the start of the school year. For administrative reasons, it was impossible to schedule the test any earlier. The instructors in the TOEIC treatment, in particular, were instructed to avoid any classwork which could be termed "TOEIC test preparation" until the test was over. The 'general' and 'business' treatments had no other exposure to TOEIC-type questions during the course of the experiment, save for the one sample test and the actual pre- and post-tests.

The post-test was administered on January 18, 1997, which was the day after the conclusion of the final term. The pre-test had revealed some marginally significant differences among the test groups (at the 0.05 level). It was decided to compensate for these differences in the final analysis by using an analysis of covariance (ANCOVA), with the pre-test score as the covariate. Possible intervening variables such as age, club activities, etc. were also taken into account. The results of the pre- and post-test are presented below in the section on "Results & Analysis."

Conduct of the courses

English Majors

Two of the once-weekly classes of the English majors were used for the experiment, their extensive reading class (I-B) and their listening/pronunciation class (I-F). For the extensive reading course, students in all groups were required to read over 1000 pages a year and to write summaries of what they read in a notebook kept for that purpose. Thus the three treatments were only different in the materials used for the in-class component.

Texts for each course were selected according to the following factors:

  1. The material needed to be relevant to the specific treatment ('General', 'TOEIC' or 'Business' English).

  2. The material had to be targeted at the appropriate ability level for the students in that particular course. In general, less challenging material was required for the non-major groups.

  3. There had to be a sufficient volume of material for the number of class hours and expected hours of homework assignments.

  4. For the listening component, we required a text that was accompanied by tapes that the students could listen to at home. (A special license was arranged with each publisher to duplicate their tapes for a modest fee.)

For most of the treatments, the cost of the required materials was greater than students would normally be willing to pay for a university course bearing only 2 credits. Students were thus required to pay a maximum of ¥2500 per course, the rest of the expense being subsidized by grant funds from TOEIC.

Non-Majors

Quizzes

Quizzes were prepared for each of the taped listening sections of each text and for the "TOEIC Kiso Kara Gambare" vocabulary text. The teachers were to use them to check how well the students had prepared the assigned homework. (Japanese students are apt to ignore homework when there is no specific way to assess whether they have done it or not.) These quiz scores were centrally recorded, although differences in frequency and manner of administration rendered them unusable for purposes of this study. Nevertheless, they were important as a motivational tool and as an element of the final grade given to each student.

Questionnaire

A questionnaire was administered to all students at the end of the year to 1) gather statistical data on other activities which might have influenced their progress in English and 2) measure their attitudes towards different aspects of the course. In particular, data concerning additional English classes studied and prior overseas experience proved meaningful for proper interpretation of the data. The questionnaire and its analysis are presented in Appendix B.

Results & Analysis

Below are the resulting gain scores (post-test scores minus pre-test scores). The actual pre-test and post-test scores are presented in Appendix A.

Table 2 -- Net Gain in Scores by Treatment


Non-Majors
                      TOTGAIN   LISTGAIN  READGAIN

 Business
   N OF CASES               53          53          53
   MEAN GAIN             6.415      -6.981      13.396
   STANDARD DEV         71.403      49.907      40.641
 General
   N OF CASES               46          46          46
   MEAN GAIN            12.609       0.326      12.283
   STANDARD DEV         77.386      48.217      51.020
 TOEIC
   N OF CASES               50          50          50
   MEAN GAIN            53.300       5.400      47.900
   STANDARD DEV         80.930      44.845      51.844

Majors

                     TOTGAIN   LISTGAIN  READGAIN
 Business
  N OF CASES               60          60          60
  MEAN GAIN            60.917      31.333      29.583
  STANDARD DEV         58.226      47.299      33.119
 General
  N OF CASES               83          83          83
  MEAN GAIN            52.349      28.795      23.554
  STANDARD DEV         59.614      42.667      41.647
 TOEIC
  N OF CASES               73          73          73
  MEAN GAIN            80.000      40.479      39.863
  STANDARD DEV         64.253      46.091      43.922
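
Since the TOEIC total is the sum of the listening and reading sub-scores, each mean total gain in Table 2 should equal the sum of the corresponding listening and reading gains, up to rounding. The following sketch applies this consistency check to the Non-Majors means reported above. The function name and the data layout are illustrative assumptions, not part of the original analysis.

```python
# Mean gains for the Non-Majors as reported in Table 2: (listening, reading, total).
NON_MAJOR_GAINS = {
    "Business": (-6.981, 13.396, 6.415),
    "General": (0.326, 12.283, 12.609),
    "TOEIC": (5.400, 47.900, 53.300),
}


def total_gain_discrepancy(listening, reading, total):
    """Absolute difference between the reported total gain and listening + reading."""
    return abs((listening + reading) - total)


for treatment, (listening, reading, total) in NON_MAJOR_GAINS.items():
    gap = total_gain_discrepancy(listening, reading, total)
    print(f"{treatment:9s} listening + reading = {listening + reading:7.3f}, "
          f"reported total = {total:7.3f}, gap = {gap:.3f}")
```

A gap larger than rounding error in any row would suggest checking that row against the raw scores in Appendix A.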

Analysis Of Variance (Non-Majors)

Systat version 5.0 was used to perform an analysis of variance on the data. In order to save space, only the most useful data are reported below. For each population (Majors and Non-majors), tests were performed on the scores from the January 18 administration. As discussed earlier, the pre-test score was used as a covariate to compensate for initial differences in the groups. In two cases with the Non-Majors, total score and reading score, a significant difference appeared. When the students' self-reports of additional English classes and previous overseas experience were taken into consideration, the difference in total scores was no longer significant, while the difference in reading scores remained significant.

Results for Non-Majors

Total Score

A marginally significant effect was found (p=0.039) for the treatment when only the pre-test was used as a covariate in the analysis, but once the students' responses to the question concerning outside English study were taken into consideration, this significant difference disappeared (p=0.092).

Table 3 -- Analysis of Total Scores (TOTAL118) for Non-majors

DEP VAR:TOTAL118      N:     149  MULTIPLE R: 0.547  SQUARED MULTIPLE R: 0.300


                      ANALYSIS OF VARIANCE

SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

TREAT$           34003.100    2    17001.550       3.331       0.039
TOTAL511        309338.355    1   309338.355      60.601       0.000

ERROR           740151.649  145     5104.494

------------------------------------------------------------------------------
Analysis of TOTAL118 (Non-majors) with the intervening variable, 'OTHERCL' (Other English classes) added to the equation.

DEP VAR:TOTAL118      N:     127  MULTIPLE R: 0.618  SQUARED MULTIPLE R: 0.382
   (22 cases deleted due to missing data -- No questionnaire)

                        ANALYSIS OF VARIANCE


 SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

 TREAT$           23321.456    2    11660.728       2.429       0.092
 TOTAL511        245203.451    1   245203.451      51.069       0.000
 OTHERCL          74208.663    1    74208.663      15.456       0.000

 ERROR           585769.217  122     4801.387
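
The F-ratios in Table 3 can be checked by hand from the reported sums of squares and degrees of freedom: each mean square is the sum of squares divided by its degrees of freedom, and each F-ratio is the effect mean square divided by the error mean square. The sketch below reproduces the treatment F-ratios for both analyses. The function name is an illustrative assumption, and the p-values are not recomputed here.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F = (SS_effect / df_effect) / (SS_error / df_error)."""
    ms_effect = ss_effect / df_effect   # mean square for the effect
    ms_error = ss_error / df_error      # mean square error
    return ms_effect / ms_error


# TREAT$ with the pre-test (TOTAL511) as the only covariate.
f_treat = f_ratio(34003.100, 2, 740151.649, 145)

# TREAT$ with OTHERCL added to the model.
f_treat_adjusted = f_ratio(23321.456, 2, 585769.217, 122)

print(f"F(2, 145) = {f_treat:.3f}")
print(f"F(2, 122) = {f_treat_adjusted:.3f}")
```

The two values agree with the F-ratios of 3.331 and 2.429 printed in the Systat output.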



Listening Sub-Test Only

No significant difference appeared between the groups (p=0.787).

Table 4 -- Analysis of Listening Scores (LIST118) for Non-majors

 DEP VAR: LIST118      N:     149  MULTIPLE R: 0.392  SQUARED MULTIPLE R: 0.153

                        ANALYSIS OF VARIANCE

 SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

 TREAT$             986.725    2      493.362       0.240       0.787
 LIST511          53887.223    1    53887.223      26.257       0.000
 
 ERROR           297586.783  145     2052.323


Reading Sub-Test Only

A highly significant difference (p=0.002) appeared between the groups when only the pre-test was taken into consideration. Once the intervening variables 'CLUB' (participation in a club for studying English), 'OSEAS' (overseas experience) and 'OUTSIDE' (language classes outside the university) were added to the equation, the p-value increased 10-fold, nevertheless remaining significant at the p=0.02 level. A post hoc Scheffé test was then performed to determine where the significant difference lay. As indicated in Table 5, the results of the TOEIC treatment group turned out to be significantly different from only the General treatment group.

Table 5 -- Analysis of Reading Scores (READ118) for Non-majors

DEP VAR: READ118      N:     149  MULTIPLE R: 0.586  SQUARED MULTIPLE R: 0.343


                       ANALYSIS OF VARIANCE

SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

TREAT$           21184.068    2    10592.034       6.435       0.002
READ511         116316.437    1   116316.437      70.661       0.000

ERROR           238685.870  145     1646.109


DEP VAR: READ118      N:     127  MULTIPLE R: 0.653  SQUARED MULTIPLE R: 0.426


                       ANALYSIS OF VARIANCE

SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

TREAT$           14030.878    2     7015.439       4.060       0.020

READ511          99584.655    1    99584.655      57.630       0.000
CLUB               805.702    1      805.702       0.466       0.496
OUTSIDE            753.678    1      753.678       0.436       0.510
OSEAS             7172.884    1     7172.884       4.151       0.044

ERROR           207360.867  120     1728.007



POST HOC TEST OF  READ118


USING MODEL MSE OF     1604.609 WITH    120. DF.
MATRIX OF PAIRWISE MEAN DIFFERENCES:

                        BUSIN      GEN'L       TOEIC
             BUSIN       0.000
             GEN'L      -4.198      0.000
             TOEIC      21.141      25.339       0.000


SCHEFFE TEST.
MATRIX OF PAIRWISE COMPARISON PROBABILITIES:


                        BUSIN      GEN'L       TOEIC
             BUSIN      1.000
             GEN'L      0.898      1.000
             TOEIC      0.079      0.032       1.000


Results for Majors

Table 6 presents the analysis of variance for the Total, Listening and Reading Scores for the Majors. None resulted in a significant difference at the criterion level of 0.05. We can therefore state that for the majors, there was no significant difference in the test scores depending on the treatment.

Analysis Of Variance for Gain Scores (Majors)

Table 6 -- Analysis of Scores for Majors

TREAT$           17884.770    2     8942.385       2.811       0.062
TOTAL511        519685.583    1   519685.583     163.389       0.000

ERROR           674302.491  212     3180.672



DEP VAR: LIST118      N:     216  MULTIPLE R: 0.539  SQUARED MULTIPLE R: 0.291


                       ANALYSIS OF VARIANCE

SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

TREAT$            2951.144    2     1475.572       0.879       0.417
LIST511         140482.164    1   140482.164      83.670       0.000

ERROR           355946.662  212     1678.994





DEP VAR: READ118      N:     216  MULTIPLE R: 0.581  SQUARED MULTIPLE R: 0.338


                       ANALYSIS OF VARIANCE

SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P

TREAT$            5029.187    2     2514.593       2.071       0.129
READ511         130148.453    1   130148.453     107.167       0.000

ERROR           257463.134  212     1214.449


Discussion

Our hypothesis was that the gain scores of all students, regardless of method of study, would be equal. This was confirmed in all but one instance: Non-Major students in the TOEIC treatment showed a significantly greater gain on the Reading Section than those who studied using regular or business materials, although the post hoc Scheffé test located the significant difference only between the TOEIC and General treatments. The students in one instructor's TOEIC section for the Non-Majors actually demonstrated a gain of 77 points overall, with 56 of them in the reading section.

Although the gains for one instructor are considerably greater than those of the other instructor, the pattern is similar in that the Reading Section always shows a greater gain than the Listening Section, and the TOEIC treatment shows a greater gain than the other two treatments, which are similar in their total gain scores. No differences emerged on the follow-up questionnaire (Appendix B) which would account for this difference in the results.

It is also clear that whatever gains there might have been with the majors were 'washed out' by the many other courses which they were taking concurrently. Some might claim that it would have been wiser to alter the content of all classes during the week so that clearer results could have been obtained. This, however, would have resulted in an artificial curriculum, one which would not exist in a normal university. Since English majors would take TOEIC preparation as only one element of their course of study, our model closely approximates a possible actual implementation.

One surprising result is that the Non-Major students, with the exception of Instructor Y's TOEIC section, improved very little over the course of the year and in some cases, even showed 'negative gain'. This can be taken as a testament to the poor attitude of Japanese university students towards their 'general education' subjects. The instructors reported that they could assign little homework since there was little expectation that the students would actually do it. Thus most of the students' exposure was limited to the 26 class meetings. It appears that the activities carried out in class did not, for many students, result in any real 'learning' that could be translated into improved TOEIC scores.

Mean Gains for Non-Major Sections, By Instructor


                TOTAL   LISTENING    READING
  Instructor X

   Business
     MEAN GAIN      -11.4      -21.7         10.3
   General
     MEAN GAIN       -8.2       -8.2          0.0
   TOEIC
     MEAN GAIN       24.7      -13.3         38.0

  Instructor Y

   Business
     MEAN GAIN       27.9       10.8         17.1
   General
     MEAN GAIN       27.2        6.3         20.9
   TOEIC
     MEAN GAIN       77.6       21.3         56.3
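
Each mean gain is the post-test mean minus the pre-test mean for the same section. As a minimal sketch, the following recomputes the gains from the per-instructor means listed in Appendix A; the dictionary and function names are illustrative and not part of the original analysis.

```python
# Mean scores (TOTAL, LISTENING, READING) copied from Appendix A.
PRE = {
    ("X", "Business"): (332.759, 179.828, 152.931),
    ("X", "General"): (347.105, 178.158, 168.947),
    ("X", "TOEIC"): (298.696, 167.609, 131.087),
    ("Y", "Business"): (348.542, 181.042, 167.500),
    ("Y", "General"): (313.333, 168.333, 145.000),
    ("Y", "TOEIC"): (308.333, 165.741, 142.593),
}
POST = {
    ("X", "Business"): (321.379, 158.103, 163.276),
    ("X", "General"): (338.947, 170.000, 168.947),
    ("X", "TOEIC"): (323.478, 154.348, 169.130),
    ("Y", "Business"): (376.458, 191.875, 184.583),
    ("Y", "General"): (340.556, 174.630, 165.926),
    ("Y", "TOEIC"): (385.926, 187.037, 198.889),
}


def mean_gains(pre, post):
    """Return post-test minus pre-test means for every (instructor, treatment)."""
    return {
        key: tuple(after - before for before, after in zip(pre[key], post[key]))
        for key in pre
    }


GAINS = mean_gains(PRE, POST)

for (instructor, treatment), (total, listening, reading) in GAINS.items():
    print(f"Instructor {instructor} {treatment:<9}"
          f"{total:8.1f}{listening:8.1f}{reading:8.1f}")
```

Because the total score is the sum of the two section scores, the listening and reading gains in each row should add up to the total gain, which gives a quick consistency check on any row.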
 

Even with the one section that did show great improvement, we cannot ascertain how much of this gain can be attributed to greater 'test wiseness' as opposed to greater knowledge of English. It would appear, however, that a greater knowledge of the schema of the written genre appearing on the TOEIC examination might have been a significant factor. This and other possible causes are discussed in the following section.

Improvement in Reading vs Listening

Although all groups generally showed improvement over the course of the year, one salient difference between the Non-Majors and Majors lies in where the improvement took place. With the majors, the improvement in the Listening and Reading scores was almost equal, whereas with the Non-Majors, there was little gain in the listening component (-6.9, 0.3 and 5.4 for the three treatments) and a greater rate of improvement in the reading section (13.3, 12.2 and 47.9). There was little improvement in listening even though both of the instructors used English as the medium of instruction. We can tentatively postulate the following reasons for this:

  1. The students did not spend much time at home listening to the tapes. This is supported by their responses to the question "I listened to the tapes at home regularly", where the average of their self-reports was under 3.0, with '5' meaning 'agree' and '1' meaning 'disagree' (Figure 1).

    Figure 1

  2. The 'teacher talk' may have been significantly different in its essential nature from the language used on the TOEIC and therefore of little help in improving their scores on the listening test items.

  3. The main text for the 'TOEIC' group, On Target for the TOEIC, was used in the order that the material was presented in the book, where the listening material precedes the reading material. At the time of the post-test, then, the students had just completed the reading section, while the listening section had been finished months earlier.

  4. The Listening section is administered before the Reading section in the actual TOEIC examination. Since the students were not familiar with the test at the beginning of the year, it could well be that they tired towards the end of the pre-test. This would have resulted in poorer performance and a measurement which underestimated their actual ability in reading at that point (one result of the 'practice effect').

  5. Concerning the TOEIC treatment which demonstrated the largest gain on the reading section, one of the instructors noted a positive reaction among the students to one particular section of the text which dealt with finding details in written texts such as letters and advertisements. It appears that this kind of material was completely new to them and, indeed, it is not part of the general high school English curriculum in Japan. Thus not only did they actually learn some important skills here, but these are skills which were not covered to the same extent in the Business or General treatments' texts, although both of them did also contain some letters as part of their instructional content.

  6. One instructor reported that he had stressed more heavily the fact that improvement in the TOEIC score would be an important factor in their final grade. Actually, this policy applied to students in all treatments, and they were informed of this in the course prospectus at the beginning of the year. It seems, however, that this point may have received more emphasis in the TOEIC treatments.

  7. The Majors achieved similar gains on both sections due to their more balanced curriculum, as explained in the presentation of the experimental design in the section entitled "Contact Hours/Week" above.
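
The Non-Majors' listening and reading gains quoted at the start of this section can be approximately recovered by pooling the two instructors' section means from Appendix A, weighting each section by its number of cases. The sketch below does this; the names are illustrative, and small last-digit differences from the figures quoted above presumably reflect rounding.

```python
# Per-instructor (N, pre-test mean, post-test mean) for the Non-Majors,
# copied from Appendix A: Instructor X first, then Instructor Y.
LISTENING = {
    "Business": [(29, 179.828, 158.103), (24, 181.042, 191.875)],
    "General": [(19, 178.158, 170.000), (27, 168.333, 174.630)],
    "TOEIC": [(23, 167.609, 154.348), (27, 165.741, 187.037)],
}
READING = {
    "Business": [(29, 152.931, 163.276), (24, 167.500, 184.583)],
    "General": [(19, 168.947, 168.947), (27, 145.000, 165.926)],
    "TOEIC": [(23, 131.087, 169.130), (27, 142.593, 198.889)],
}


def pooled_gain(sections):
    """N-weighted mean gain across sections given as (n, pre_mean, post_mean)."""
    total_n = sum(n for n, _, _ in sections)
    weighted = sum(n * (post - pre) for n, pre, post in sections)
    return weighted / total_n


for skill, table in (("Listening", LISTENING), ("Reading", READING)):
    gains = ", ".join(
        f"{name} {pooled_gain(rows):+.1f}" for name, rows in table.items()
    )
    print(f"{skill}: {gains}")
```

Weighting by N matters here because the sections differ in size (19 to 29 students), so a simple average of the two instructors' gains would not equal the gain for the treatment as a whole.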

Conclusion

While this study seems to suggest that TOEIC materials can be effective for improving the reading component scores of non-major students at a Japanese university, our results are by no means conclusive. The non-major students, for example, had initial scores far below those of the English majors. It could be that students in this low score range can benefit more from such instruction than can those at a higher level of ability.

Further, the TOEIC course was a substitute for the standard general English course which might have placed greater emphasis on English for communicative purposes. Forcing students to study TOEIC preparatory material might, therefore, be doing them a disservice if communicative ability is the goal of the program.

Care needs to be taken when applying these findings to other teaching situations. Further studies are required to confirm whether these results apply to students of differing levels of ability, nationality, or motivation or in other educational settings such as in-company training programs and language schools.

Bibliography

Alderson, J. Charles, (1986). Innovations in Language Testing? in Portal, M. (ed.), 93-105.

Alderson, J.C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.

Alderson, J. Charles and Liz Hamp-Lyons (1996). "TOEFL preparation courses: a study of washback." Language Testing 13, 3, 280-297.

Amer, Aly Anwer (1993), "Teaching EFL students to use a test-taking strategy" Language Testing 10, 1, 71-78.

Bachman, Lyle F., (1990). Fundamental Considerations in Language Testing, Oxford University Press, Oxford.

Becker, B. J. (1990). Coaching for the Scholastic Aptitude Test: Further synthesis and appraisal. Review of Educational Research, 60, 373-417.

Cooley, W.W. (1991). Statewide student assessment. Educational Measurement: Issues and Practice 10, 3-6.

Darling-Hammond, L. and Wise, A.E., (1985). Beyond standardization: state standards, and school improvement. The Elementary School Journal 85, 315-36.

DerSimonian, R. and Laird, N. M. (1983). Evaluating the effect of coaching on SAT scores: A meta-analysis. Harvard Educational Review, 53, 1-15.

Frederickson, J.R. (1984). The real test bias: influences of testing on teaching and learning. American Psychologist 39, 193-202.

Frederickson, J.R. and Collins, A. (1989). A systems approach to educational testing. Educational Researcher 18, 27-32.

Haladyna, T.M., Nolen, S.B. and Haas, N.S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher 20, 2-20.

Henning, Grant (1990). "Priority Issues in the Assessment of Communicative Language Abilities", Foreign Language Annals, 23:5 October 1990, 379-384.

Hughes, A. (1988). Introducing a needs-based test of English language proficiency into an English-medium university in Turkey. In Hughes, A., ed., 134-53.

Hughes, A., ed. (1988). Testing English for university study. ELT Document 127, London: Modern English Publications.

Johnson S. T., Asbury, C. A., Wallace M. B., Robinson S. & Vaughn J. (1985), The effectiveness of a program to increase Scholastic Aptitude Test scores of Black students in three cities. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, April 1985.

Khaniya, T.R. (1990), Examinations as instruments for educational change: investigating the washback effect of the Nepalese English exams. Unpublished PhD dissertation, University of Edinburgh.

Kulik, J.A., Bangert-Drowns, R.L. & Kulik, C.C. (1984). Effectiveness of coaching for aptitude tests. Psychological Bulletin, 95, 179-188.

Madaus, G.F. (1988), The influence of testing on the curriculum. In Travers, L., editor, Critical issues in curriculum (87th yearbook of the Society for the Study of Education), Part 1, Chicago, IL: Chicago University Press, 83-121.

Messick, S and Jungeblut, A. (1981), Time and method in coaching for the SAT. Psychological Bulletin, 89, 191-216.

Morrow, K. (1986). The evaluation of tests of communicative performance, in Portal, M. (ed.).

Portal, M. (ed.) (1986). Innovations in Language Testing. London: NFER/Nelson.

Powers, Donald E.(1993), Educational Measurement: Issues and Practice, Summer 1993, 24-31.

Smith, M.L., Edelsky, C., Draper, K., Rottenberg, C. and Cherland, M. (1989), The role of testing in elementary schools. Los Angeles, CA: Center for Research on Educational Standards and Student Tests, Graduate School of Education, UCLA.

Swain, M. (1985). Large-scale communicative testing. In Lee, Y.P., Fok, C.Y.Y., Lord, R. and Low, G. (eds), New Directions in Language Testing. Hong Kong: Pergamon Press.

Vernon, P.E. (1956). The Measurement of Abilities (2nd edn.) London: University of London Press.

Wall, D. and Alderson, J.C. (1993). Examining Washback: the Sri Lankan Impact Study. Language Testing 10, 41-70.

Wesdorp, H. (1982) Backwash effects of language testing in primary and secondary education. Stichting Centrum voor onderwijsonderzoek van de Universiteit van Amsterdam.

Acknowledgements

We would like to acknowledge the generous assistance of Steve Ross, who offered advice at every stage of this project, from its inception to the final report.

We would like to thank the following organizations for their assistance with this research.

The Chauncey Group International Ltd. and IIEC (Japan) for the funding that made this research possible.

Oxford University Press and Addison-Wesley/Longman for generously allowing us to duplicate the tapes of their texts locally at a reduced rate for research purposes.

Appendix A

Non-Majors -- Pre-Test (May 11, 1996)

                      TOTAL511   LIST511    READ511
 Both instructors

   Business
     N OF CASES               53          53          53
     MEAN                339.906     180.377     159.528
     STANDARD DEV         81.886      38.566      56.664
   General
     N OF CASES               46          46          46
     MEAN                327.283     172.391     154.891
     STANDARD DEV         56.977      29.359      41.318
   TOEIC
     N OF CASES               50          50          50
     MEAN                303.900     166.600     137.300
     STANDARD DEV         79.259      34.941      60.603


   Instructor X

   Business
     N OF CASES               29          29          29
     MEAN                332.759     179.828     152.931
     STANDARD DEV         68.473      33.580      49.290
   General
     N OF CASES               19          19          19
     MEAN                347.105     178.158     168.947
     STANDARD DEV         57.839      36.485      38.427
   TOEIC
     N OF CASES               23          23          23
     MEAN                298.696     167.609     131.087
     STANDARD DEV         69.335      31.001      51.234


  Instructor Y

   Business
     N OF CASES               24          24          24
     MEAN                348.542     181.042     167.500
     STANDARD DEV         96.487      44.599      64.656
   General
     N OF CASES               27          27          27
     MEAN                313.333     168.333     145.000
     STANDARD DEV         53.042      22.997      41.067
   TOEIC
     N OF CASES               27          27          27
     MEAN                308.333     165.741     142.593
     STANDARD DEV         87.903      38.548      68.097


Non-Majors -- Post-Test (January 18, 1997)

                      TOTAL118   LIST118    READ118
  Both Instructors

    Business
      N OF CASES               53          53          53
      MEAN                336.400     169.473     168.266
      STANDARD DEV          9.896       6.270       5.601
    General
      N OF CASES               46          46          46
      MEAN                337.808     173.214     164.945
      STANDARD DEV         10.538       6.680       5.988
    TOEIC
      N OF CASES               50          50          50
      MEAN                369.633     175.702     192.188
      STANDARD DEV         10.229       6.447       5.798



  Instructor X

   Business
     N OF CASES               29          29          29 
     MEAN                321.379     158.103     163.276  
     STANDARD DEV         75.674      50.399      44.406
   General
     N OF CASES               19          19          19
     MEAN                338.947     170.000     168.947   
     STANDARD DEV         66.637      43.589      31.072
   TOEIC
     N OF CASES               23          23          23
     MEAN                323.478     154.348     169.130 
     STANDARD DEV         63.932      29.974      43.788

  Instructor Y

   Business
     N OF CASES               24          24          24
     MEAN                376.458     191.875     184.583
     STANDARD DEV        107.496      64.028      54.492
   General
     N OF CASES               27          27          27
     MEAN                340.556     174.630     165.926
     STANDARD DEV         80.160      42.289      55.575
   TOEIC
     N OF CASES               27          27          27
     MEAN                385.926     187.037     198.889
     STANDARD DEV         87.300      46.868      53.553


Majors

Pre-test (May 11, 1996)

                   TOTAL511   LIST511   READ511
   Business
     N OF CASES               60          60          60
     MEAN                436.083     223.250     212.833
     STANDARD DEV         83.184      53.622      46.243
   General
     N OF CASES               83          83          83
     MEAN                430.361     213.313     217.048
     STANDARD DEV         67.818      43.953      41.039
   TOEIC
     N OF CASES               73          73          73
     MEAN                411.370     205.000     205.959
     STANDARD DEV         68.345      37.352      48.309

Post-Test (January 18, 1997)

                     TOTAL118   LIST118    READ118
   Business
     N OF CASES               60          60          60
     MEAN                490.743     249.797     242.203
     STANDARD DEV         76.708      51.497      43.144
   General
     N OF CASES               83          83          83
     MEAN                482.711     242.108     240.602
     STANDARD DEV         69.235      44.394      39.820
   TOEIC
     N OF CASES               73          73          73
     MEAN                491.370     245.479     245.890
     STANDARD DEV         79.387      49.771      45.303

Appendix B

Follow-up Questionnaire (Translated into English)

Name___________________________  Student Number___________Day_____Period______

Please answer these questions about your course truthfully.  This information
will be used in order to make the course better in the future.  There is no
connection between this questionnaire and your grade for this course.


A. Questions about you.

  During this school year did you,

1.   belong to a club for studying English?                  Yes     No

2.   study English outside of school?                        Yes     No

3.   take any other 'general education' classes in English?  Yes     No

4.   speak often with an English-speaking friend?            Yes     No

5.  Before coming to this university did you,                Yes     No

     go abroad?     (Where?__________________ How long? ________________)


B. Questions about this course.



                                                      Agree         Disagree
  
 1. The pace of this class was too fast.                 5   4   3   2   1

 2. More time should have been spent on each exercise.   5   4   3   2   1

 3. I think that I can read more quickly now thanks      5   4   3   2   1
    to this course.

 4. I think that I can understand what I read better     5   4   3   2   1
    thanks to this course.

 5. My ability to understand spoken English improved     5   4   3   2   1
    thanks to this course. 

 6. The material was too difficult for me.               5   4   3   2   1

 7. There was too much homework.                         5   4   3   2   1

 8. I listened to the tapes at home regularly.           5   4   3   2   1

 9. What I learned will be useful to me in the future.   5   4   3   2   1

10. This class will help me get a higher score on the    5   4   3   2   1
    TOEIC test.

11. I could understand almost everything that the        5   4   3   2   1
    teacher said in English. 

12. The contents of the class were interesting.          5   4   3   2   1

13. The teacher spoke in Japanese too much.              5   4   3   2   1

14. The teacher spoke too fast for me to understand.     5   4   3   2   1


The results of the questionnaire showed that, for most items, there was very little difference between the groups in their responses. Some items did, indeed, result in a 'significant difference', but the magnitude of the absolute difference in the values is so small that the 'significant differences' have little import. For example, for item B-8, "I listened to the tapes at home regularly," we find the following results for the Non-Majors:

             N    Mean
   TOEIC     50    2.1
   REGULAR   46    2.4
   BUSINESS  51    1.9

The Fisher PLSD post hoc test reports a significant difference between the Business and Regular treatments, with p = .0045.

While this does show that students in the regular treatment probably listened to their tapes more than those in the other treatment groups, the absolute difference between these two groups is only 0.5 on a scale of 1 to 5. More importantly, all groups are below the mid-point and have disagreed to some extent with the statement.

The results of part B of the questionnaire are reported below.

 
      N Sizes
                           TOEIC   REGULAR  BUSINESS 
      Non-Majors           50       46       51
      Majors-List          80       80       80
      Majors-Read          76       84       66

1. The pace of this class was too fast.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.8        2.5        2.3      T > B  p = .0070
  Majors-List      2.5        2.5        2.7             ns
  Majors-Read      2.3        2.5        2.6      B > T  p = .0490


2. More time should have been spent on each exercise.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       3.1        2.8        2.9             ns
  Majors-List      3.4        3.1        2.9      T > B  p = .0318
  Majors-Read      2.9        3.2        3.0      R > T  p = .0214

Note: One of the three TOEIC Major listening sections received an average score of 3.9 as opposed to 3.3 and 2.9 for the other sections. Thus the pace appears to have been perceived as too fast in only one class, not across the treatment as a whole.

3. I think that I can read more quickly now thanks to this course.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.7        2.7        2.4             ns
  Majors-List     N/A                
  Majors-Read      3.5        3.4        3.5             ns

4. I think that I can understand what I read better thanks to this course.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.9        2.6        2.4      T > B  p = .0079
  Majors-List      N/A                
  Majors-Read      3.6        3.3        3.5      T > R  p = .0370

Note: The Non-Majors all scored below the half-way mark of 3 on the Agree/Disagree scale. Despite the fact that it was only in the reading section that the TOEIC treatment improved more than the other treatments, the students themselves apparently did not perceive themselves as having improved in their reading ability.

5. My ability to understand spoken English improved thanks to this course.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       3.3        3.3        3.0             ns
  Majors-List      3.4        3.7        2.9        R>T  p=.0069; R>B p<.0001
  Majors-Read      1.9        1.8        2.5        B>T  p<.0012; B>R p<.0001

Note: The Major Regular course included a greater emphasis on pronunciation and sound discrimination exercises which might have caused this difference in perceived improvement. The nature of the Major Reading Business treatment required much more "teacher talk" which explains the higher rating.

6. The material was too difficult for me.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.8        2.4        2.4      T > B  p = .0230
  Majors-List      2.6        2.4        2.6             ns
  Majors-Read      2.9        3.0        3.2             ns

7. There was too much homework.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.1        2.3        1.8      R > B  p = .0067
  Majors-List      2.1        1.9        1.8
  Majors-Read      3.8        4.0        3.7

Note: The high ratings for the Major-Reading courses are due to the large amount of outside reading required for all the students, regardless of the treatment. Generally speaking, only the in-class work was varied depending on the treatment.

8. I listened to the tapes at home regularly.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.1        2.4        1.9           R > B  p = .0045
  Majors-List      2.4        3.0        2.5        R>T  p=.0004; R>B p=.0046
  Majors-Read    N/A

9. What I learned will be useful to me in the future.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       3.6        3.3        3.4             ns
  Majors-List      3.7        4.0        3.0        R>B  p<.0001; T>B p<.0046
  Majors-Read      3.6        3.5        3.4             ns

10. This class will help me get a higher score on the TOEIC test.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       3.8        2.9        3.2        T>B  p=.0007; T>R p<.0001
  Majors-List      3.7        2.5        2.8        T>B  p=.0009; T>R p<.0001
  Majors-Read      3.6        3.2        3.0        T>B  p=.0013; T>R p=.0114

Note: Predictably, those who were directly studying TOEIC materials believed that this material was helpful, despite our experimental evidence to the contrary.

11. I could understand almost everything that the teacher said in English.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       2.6        2.8        2.7             ns
  Majors-List      2.9        3.3        2.8        R>B  p=.0066; R>T p=.0145
  Majors-Read      3.2        3.3        3.3

Note: The instructor for one of the Major-Listening Regular sections conducted his class completely in Japanese and instructed his students not to respond to this question. Only two sections, with an N of 49, are included here.

12. The contents of the class were interesting.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       3.2        3.3        3.4             ns
  Majors-List      3.4        3.4        2.7        T>B  p=.0006; R>B p<.0001
  Majors-Read      3.1        2.8        2.7        T>B  p=.0044; T>R p=.0155

Note: Despite the fact that the bulk of the English Major graduates find themselves working in business, it appears that they do not find studying Business English interesting.

13. The teacher spoke in Japanese too much.

                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       1.6        1.7        1.5             ns
  Majors-List      3.0        2.7        2.8        
  Majors-Read      2.0        1.2        1.1        T>B  p<.0001; T>R p<.0001

Note: This item suffered from the same problem as item 11; thus only two sections were tabulated for the Majors-Listening Regular treatment.

14. The teacher spoke too fast for me to understand.


                  TOEIC     REGULAR    BUSINESS      Significance
  Non-Majors       3.2        2.9        2.8             ns
  Majors-List      2.0        2.2        1.9             ns
  Majors-Read      2.2        2.5        2.5           R > T  p = .0096