Improving the Reliability of the New HSK Writing Test: A Generalizability Theory Approach

Abstract

The New HSK is the official Chinese proficiency test for speakers of other languages, administered by China's Hanban (i.e., the National Office for Teaching Chinese as a Foreign Language). It was first launched in 2010 and is held five times annually. Because the test is relatively new, research and development aimed at improving its quality is essential. Writing scores on the New HSK, like those of any other language proficiency test, are the most vulnerable to criticism on reliability grounds. This study collected writing samples (n = 52) from two sets of mock tests of the writing section of the New HSK, Level 5. The data were analyzed from the perspective of Generalizability Theory using the GENOVA and mGENOVA software. Variance components were estimated for the possible effects of items, raters, rating speed, and their interactions. The dependability coefficient (Phi) was estimated for the current test settings, and D studies with various rating patterns were conducted to explore ways of raising the Phi coefficient. The major findings are as follows. (a) Under the current allocation of items to each item type, the item types rank in descending order of Phi coefficient as: writing based on given keywords, writing based on a given photo, and ordering of inner-sentence components. (b) For the Phi coefficient to reach at least 0.8 for every item type, the ordering items need to increase from 8 to 20, while the other two item types each need one additional item. (c) If item quantities are increased as specified in (b), the ordering items need only one rater, writing based on keywords needs two, and writing based on a photo needs three; however, if writing based on a photo has three items, two raters are enough to keep the Phi coefficient at or above 0.8. (d) Under the current allocation of items to each item type, if the composite writing score is calculated with weights proportional to the raw scores, the Phi coefficient for the writing test can marginally reach 0.8. The study also explored various approaches to reaching a Phi coefficient of at least 0.85 at relatively low cost (for details, see Section 3.2.2 of this paper); these estimates were obtained with the Solver function of Microsoft Excel. (e) The study did not find a significant effect of rating speed. However, this conclusion is limited to the two speeds investigated: a speed at which each individual rater feels comfortable, and a speed at which raters feel somewhat pressed but remain confident in the quality of their ratings. Further evidence, especially real data from professional raters, is needed to support this conclusion. The study recommends investigating the effect of rating speed on the generalizability of writing test scores with more rigorous designs and more mature data collection techniques, and calls for more attention to the reliability of writing tests.
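The D-study logic summarized in the abstract can be illustrated with a minimal sketch. It assumes a fully crossed persons x items x raters random-effects design for a single item type, which may be simpler than the design actually analyzed in the paper. The variance components are hypothetical placeholders, not the study's estimates. The function names, the cost measure (total ratings per examinee, items times raters), and the exhaustive grid search are all assumptions made for illustration; the study itself reports using the Solver function of Microsoft Excel.

```python
from itertools import product


def phi_coefficient(var, n_items, n_raters):
    """Dependability (Phi) for a fully crossed p x i x r random-effects design.

    `var` maps effect names to variance components:
    p, i, r, pi, pr, ir, and pir_e (three-way interaction plus residual).
    """
    # Absolute error variance: every component except the person effect,
    # each divided by the number of conditions it is averaged over.
    abs_error = (
        var["i"] / n_items
        + var["r"] / n_raters
        + var["pi"] / n_items
        + var["pr"] / n_raters
        + var["ir"] / (n_items * n_raters)
        + var["pir_e"] / (n_items * n_raters)
    )
    return var["p"] / (var["p"] + abs_error)


def cheapest_design(var, target, max_items=30, max_raters=5):
    """Search all (n_items, n_raters) pairs up to the given limits.

    Returns (n_items, n_raters, phi) for the pair that reaches `target`
    with the fewest total ratings (n_items * n_raters), or None if no
    pair within the limits reaches it. The cost measure is an assumption.
    """
    feasible = []
    for n_i, n_r in product(range(1, max_items + 1), range(1, max_raters + 1)):
        value = phi_coefficient(var, n_i, n_r)
        if value >= target:
            feasible.append((n_i * n_r, n_i, n_r, value))
    if not feasible:
        return None
    _, n_i, n_r, value = min(feasible)
    return n_i, n_r, value


# Hypothetical variance components, NOT estimates from the study.
HYPOTHETICAL_VAR = {
    "p": 0.50, "i": 0.05, "r": 0.02,
    "pi": 0.30, "pr": 0.04, "ir": 0.01, "pir_e": 0.20,
}

if __name__ == "__main__":
    print(phi_coefficient(HYPOTHETICAL_VAR, n_items=2, n_raters=2))
    print(cheapest_design(HYPOTHETICAL_VAR, target=0.8))
```

Calling `phi_coefficient` over a range of item and rater counts mirrors what a D study does: the person variance stays fixed while each error component shrinks as more items or raters are averaged. Under this model, adding items or raters can only raise Phi, which is why the findings are stated as minimum counts.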

Keywords: New HSK, writing assessment, reliability, Generalizability Theory

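Abstract (translated from the Chinese): From the perspective of Generalizability Theory, this study collected response and rating data from mock writing items of the New HSK Level 5, estimated variance components for potential influencing effects such as item type, number of items, number of raters, and rating speed, examined the reliability of New HSK writing scores, and explored ways of improving that reliability. Data analysis based on Generalizability Theory and solver-based optimization identified adjustment schemes for item quantities as well as optimal combinations of item type, item quantity, and number of raters. The analysis of rating speed in this study is an exploratory theoretical investigation, whereas the other results may benefit decision-making aimed at improving the quality of the test.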

Keywords (translated from the Chinese): New HSK, writing assessment, reliability, Generalizability Theory

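Zhu Yu, Feng Ruilong, Xin Tao. A generalizability theory analysis of factors influencing the reliability of New HSK writing scores [J]. 心理科学, 2013, 36(2): 479-483.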
