Many learning technologies have been developed for English learners, but comparable tools for learners of Asian languages remain relatively rare. In response, the NLPTEA (Natural Language Processing Techniques for Educational Applications) workshops were organized to provide a forum where international participants can share knowledge on computer-assisted techniques for Asian language learning and teaching. The NLPTEA workshops have hosted a series of shared tasks on Chinese grammatical error diagnosis (Yu et al., 2014; Lee et al., 2015b; Lee et al., 2016). The goal of these tasks is to develop NLP techniques that automatically diagnose grammatical errors in Chinese sentences written by learners of Chinese as a foreign language. The grammatical errors are broadly categorized into four types: word ordering, redundant, missing, and incorrect selection of linguistic components (also called PADS error types, denoting errors of Permutation, Addition, Deletion, and Selection, respectively). In the first edition, at the NLPTEA 2014 workshop, each input sentence contains either no grammatical error or exactly one, and the system should indicate whether the sentence contains an error and, if so, identify the error type. In the second edition, at the NLPTEA 2015 workshop, the system should additionally indicate the range in which the error occurs. In the third edition, at the NLPTEA 2016 workshop, the task is basically the same, except that a sentence may contain more than one error and the HSK learner corpus is also included.
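
The differences between the three editions can be made concrete with a small sketch. The class and function names below (`ErrorType`, `Diagnosis`, `report`) and the dictionary layout are hypothetical illustrations, not the official submission format of any CGED task; the position offsets are likewise an assumed representation of the "range" of an error.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class ErrorType(Enum):
    """The four PADS categories named in the task description."""
    PERMUTATION = "word ordering"
    ADDITION = "redundant"
    DELETION = "missing"
    SELECTION = "incorrect selection"


@dataclass(frozen=True)
class Diagnosis:
    """One diagnosed error (hypothetical record, not the official format)."""
    error_type: ErrorType
    start: Optional[int] = None  # assumed position of the first erroneous unit
    end: Optional[int] = None    # assumed position of the last erroneous unit


def report(edition: int, diagnoses: List[Diagnosis]) -> Dict:
    """Return what a system would have to output for a given edition."""
    if edition not in (2014, 2015, 2016):
        raise ValueError("unknown edition")
    # The 2014 and 2015 tasks assume at most one error per sentence.
    if edition < 2016 and len(diagnoses) > 1:
        raise ValueError("at most one error per sentence before 2016")
    errors = []
    for d in diagnoses:
        entry = {"type": d.error_type.name}
        # From 2015 onward the range of the error is also required.
        if edition >= 2015:
            entry["range"] = (d.start, d.end)
        errors.append(entry)
    return {"has_error": bool(diagnoses), "errors": errors}
```

For example, `report(2014, [Diagnosis(ErrorType.ADDITION, 3, 4)])` yields only the detection flag and the error type, while the same call with `2015` also carries the range.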
Shared Task | NLPTEA 2014 Task (Yu et al., 2014) | NLPTEA 2015 Task (Lee et al., 2015) | NLPTEA 2016 Task (Lee et al., 2016) |
---|---|---|---|
Examples | Example 1: | Example 1: | Example 1: |
Data Source | TOCFL Learner Corpus | TOCFL Learner Corpus | TOCFL Learner Corpus & HSK Dynamic Composition Corpus |
Training Set | 1,506 writings (5,607 errors) | 2,205 sentences (2,205 errors) | TOCFL Track: |
Test Set | This set consists of 1,750 test sentences. Half of these sentences contained no grammatical errors; the other half included one grammatical error. | This set consists of 1,000 test sentences. Half of these sentences contained no grammatical errors; the other half included one grammatical error. | TOCFL Track: |
Technical Challenges | Error Detection | Error Detection (binary class categorization problem) | |
Evaluation Metrics | False Positive Rate | False Positive Rate | |
Registered Teams | 13 teams (4 from Taiwan, 3 from China, and the remaining 6 from Japan, USA, UK, Germany, New Zealand, and Russia) | 13 teams (4 from Taiwan, 3 from China, 2 from USA, 2 from UK, 1 from Japan, and 1 from Poland) | 15 teams (8 from China, 4 from Taiwan, 1 from Dublin, 1 from Germany, and 1 private firm) |
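
The table lists False Positive Rate as an evaluation metric and describes error detection as a binary classification problem. A minimal sketch of that metric follows, assuming the standard definition FP / (FP + TN) over sentence-level decisions. The function name and the boolean list inputs are illustrative assumptions; this is not the official evaluation tool.

```python
from typing import Sequence


def false_positive_rate(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Share of error-free sentences that a system wrongly flags as erroneous.

    gold[i] is True when sentence i really contains a grammatical error;
    predicted[i] is True when the system reports an error for sentence i.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Only sentences that are actually correct can produce FP or TN.
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    if fp + tn == 0:
        raise ValueError("no error-free sentences in the gold data")
    return fp / (fp + tn)


# Two correct sentences (one wrongly flagged) and two erroneous ones.
gold = [False, False, True, True]
predicted = [True, False, True, False]
print(false_positive_rate(gold, predicted))  # 0.5
```

Because half of each test set in the 2014 and 2015 tasks consists of error-free sentences, a system that flags every sentence as erroneous would score a false positive rate of 1.0 under this definition.
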
CGED data sets and evaluation tools
Lung-Hao Lee
Associate Professor
Department of Electrical Engineering
National Central University
lhlee@ee.ncu.edu.tw