Bibliographic record - detail view
Author | Attali, Yigal |
---|---|
Title | Construct Validity of "e-rater"® in Scoring TOEFL® Essays. Research Report. ETS RR-07-21 |
Source | In: ETS Research Report Series, (2007), 26 pages |
Full text | PDF |
Language | English |
Document type | print; online; journal article |
ISSN | 2330-8516 |
Keywords | Construct Validity; Computer Assisted Testing; Scoring; English (Second Language); Language Tests; Second Language Learning; Essay Tests; Correlation; Test Reliability; Bias; Factor Analysis; Prediction; Comparative Analysis; Regression (Statistics); Models; Weighted Scores; True Scores; Comparative Education; Prompting; Test of English as a Foreign Language; Assessment; Second Language Acquisition; Written Language Use; Regression Analysis; Analogy Model; User Guidance |
Abstract | This study examined the construct validity of the "e-rater"® automated essay scoring engine as an alternative to human scoring in the context of TOEFL® essay writing. Analyses were based on a sample of students who repeated the TOEFL within a short time period. Two "e-rater" scores were investigated in this study, the first based on optimally predicting the human essay score and the second based on equal weights for the different features of "e-rater." Within a multitrait-multimethod approach, the correlations and reliabilities of human and "e-rater" scores were analyzed together with TOEFL subscores (structured writing, reading, and listening) and with essay length. Possible biases between human and "e-rater" scores were examined with respect to differences in performance across countries of origin and differences in difficulty across prompts. Finally, a factor analysis was conducted on the "e-rater" features to investigate the interpretability of their internal structure and determine which of the two "e-rater" scores reflects this structure more closely. Results showed that the "e-rater" score based on optimally predicting the human score measures essentially the same construct as human-based essay scores with significantly higher reliability and consequently higher correlations with related language scores. The equal-weights "e-rater" score showed the same high reliability but significantly lower correlation with essay length. It is also aligned with the 3-factor hierarchical (word use, grammar, and discourse) structure that was discovered in the factor analysis. Both "e-rater" scores also successfully replicate human score differences between countries and prompts. (As Provided). |
Notes | Educational Testing Service. Rosedale Road, MS19-R Princeton, NJ 08541. Tel: 609-921-9000; Fax: 609-734-5410; e-mail: RDweb@ets.org; Web site: https://www.ets.org/research/policy_research_reports/ets |
Indexed by | ERIC (Education Resources Information Center), Washington, DC |
Update | 2020/01/01 |
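The abstract contrasts two ways of aggregating e-rater features into a single essay score: one using weights fit to optimally predict the human score, and one giving every feature equal weight. A minimal sketch of that distinction, with hypothetical feature names and weight values (none taken from the report, and not ETS's actual implementation):

```python
# Illustrative sketch only: two ways of combining standardized essay-feature
# values into one score, mirroring the two e-rater variants in the abstract.
# Feature names and numeric values here are hypothetical.

def weighted_score(features, weights):
    """Linear combination of feature values, as if weights were fit by
    regression to predict the human essay score."""
    return sum(weights[name] * value for name, value in features.items())

def equal_weight_score(features):
    """Equal-weights variant: each of the k features contributes 1/k."""
    return sum(features.values()) / len(features)

# Hypothetical standardized feature values for one essay, loosely named
# after the three factors (word use, grammar, discourse) in the abstract.
essay = {"word_use": 1.1, "grammar": 0.4, "discourse": -0.2}

# Hypothetical regression weights for the optimal-prediction variant.
fitted = {"word_use": 0.3, "grammar": 0.5, "discourse": 0.2}

print(weighted_score(essay, fitted))   # optimal-prediction variant: 0.49
print(equal_weight_score(essay))       # equal-weights variant: ~0.433
```

The study's finding can be read off this structure: the first variant inherits whatever the fitted weights emphasize (and so tracks the human score, including its correlation with essay length), while the second imposes a uniform weighting that stays closer to the features' own factor structure.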