Chinese spelling errors frequently arise from confusion among multiple-character words that are phonologically and
visually similar but semantically distinct. The first Chinese Spelling Check (CSC) bakeoff was organized as part
of the Seventh SIGHAN (Special Interest Group on Chinese Language Processing of the Association for Computational
Linguistics) workshop (Wu et al., 2013). This shared task was the first open evaluation of automatic Chinese
spelling checkers. A second edition of the bakeoff was co-located with the Third CIPS-SIGHAN Joint Conference
on Chinese Language Processing (Yu et al., 2014), and a third was organized in conjunction with the Eighth SIGHAN
workshop (Tseng et al., 2015). The main purpose of these bakeoffs was to provide a common setting in which researchers
who approach the task with different linguistic factors and computational techniques can compare their results.
Unlike the initial edition at SIGHAN 2013, which used data written by native Chinese speakers,
the second and third bakeoffs used the TOCFL Learner Corpus as the data source. Detecting and correcting
spelling errors in this data is considered a greater challenge, because sentences written by learners of
Chinese as a foreign language can contain errors that are hard to predict. The goal is the same in all three
editions: to evaluate the capability of a Chinese spelling checker. The input is a sentence consisting of
several clauses, with or without spelling errors. The checker should return the locations of the incorrect
characters and suggest the correct characters. Each character or punctuation mark occupies one position when
counting locations.
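The location counting described above can be illustrated with a minimal Python sketch. It is not the official evaluation tool: the function name `find_spelling_errors`, the choice to count from 1, and the corrected sentence used for comparison are all assumptions made for illustration. The sketch only handles substitution errors, where the original and corrected sentences have the same length.

```python
def find_spelling_errors(original: str, corrected: str) -> list[tuple[int, str]]:
    """Return (location, correct_character) pairs for every differing position.

    Every character or punctuation mark occupies one position.
    Counting from 1 is an assumption of this sketch.
    """
    if len(original) != len(corrected):
        raise ValueError("this sketch only compares sentences of equal length")
    return [
        (position, right)
        for position, (wrong, right) in enumerate(zip(original, corrected), start=1)
        if wrong != right
    ]


# Input sentence from the SIGHAN 2015 example (pid=B2-1670-2).
original = "在日本,大學生打工的情況是相當普偏的。"
# Hypothetical corrected sentence, supplied here only for illustration.
corrected = "在日本,大學生打工的情況是相當普遍的。"

errors = find_spelling_errors(original, corrected)
print(errors)
```

Under these assumptions the sketch reports a single error at position 17, with 遍 as the suggested character; the comma at position 4 and the final full stop are counted like any other character.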
Shared Task | SIGHAN 2013 (Wu et al., 2013) | SIGHAN 2014 (Yu et al., 2014) | SIGHAN 2015 (Tseng et al., 2015) |
---|---|---|---|
Examples | Input: (NID=88888) 擁有六百一十年歷史的崇禮門,象微著南韓人的精神,在一夕之門,被火燒得精光。 | Input: (pid=B1-0201-1) 我一身中的貴人就是我姨媽,從我回來台灣的時候,她一只都很照顧我,也很觀心我。 | Input: (pid=B2-1670-2) 在日本,大學生打工的情況是相當普偏的。 |
Data Source | Chinese native speakers’ data | TOCFL Learner Corpus | TOCFL Learner Corpus |
Training Set | This set included 700 samples selected from students’ essays. In addition, a set of Chinese characters with similar shapes or pronunciations was provided. | This set included 1,301 sentences with a total of 5,284 spelling errors. | This set included 970 sentences with a total of 3,143 spelling errors. |
Test Set | This set consisted of 2,000 testing sentences for both subtasks, each with an average of 70 characters. The total number of erroneous characters is 1,641. | This set consisted of 1,062 testing sentences, each with an average of 50 characters. Half of these sentences contained no spelling errors, while the other half included at least one spelling error each, for a total of 792 spelling errors. | This set consisted of 1,100 testing sentences. Half of these sentences contained no spelling errors, while the other half included at least one spelling error each. |
Evaluation Metrics | False Positive Rate, Detection-level Accuracy/Precision/Recall/F1, | | |
Registered Teams | 17 teams (9 from Taiwan, 5 from China, and 3 from Singapore, Japan, and the UK) | 19 teams (10 from China, 8 from Taiwan, and 1 private firm) | 9 teams (4 from China, 4 from Taiwan, and 1 private firm) |
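
The metrics named in the table can be sketched with the standard confusion-matrix definitions. This is a simplified illustration, not the official scoring script: it assumes each test sentence is reduced to a binary decision (contains an error or not), and the function name `detection_metrics` and its inputs are hypothetical. The matching criteria used by the official evaluation tools are not described here and may be stricter.

```python
def detection_metrics(gold_has_error: list[bool], flagged: list[bool]) -> dict[str, float]:
    """Compute sentence-level detection metrics from binary decisions.

    gold_has_error[i] is True if sentence i really contains a spelling error;
    flagged[i] is True if the checker reported an error in sentence i.
    """
    if len(gold_has_error) != len(flagged):
        raise ValueError("gold and system outputs must cover the same sentences")

    pairs = list(zip(gold_has_error, flagged))
    tp = sum(1 for gold, sys in pairs if gold and sys)          # error, flagged
    fn = sum(1 for gold, sys in pairs if gold and not sys)      # error, missed
    fp = sum(1 for gold, sys in pairs if not gold and sys)      # correct, flagged
    tn = sum(1 for gold, sys in pairs if not gold and not sys)  # correct, untouched

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 instead of dividing by zero when a class is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1_denominator = precision + recall
    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / f1_denominator if f1_denominator else 0.0,
    }


# Toy data: two sentences with errors, two without.
scores = detection_metrics(
    gold_has_error=[True, True, False, False],
    flagged=[True, False, True, False],
)
print(scores)
```

Because half of the test sentences in the 2014 and 2015 sets contain no errors, the false positive rate in such a setup would show how often a checker flags sentences that are already correct.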
CSC data sets and evaluation tools
Lung-Hao Lee
Associate Professor
Department of Electrical Engineering
National Central University
lhlee@ee.ncu.edu.tw