Chinese Spell Checking

Introduction

Chinese spelling errors frequently arise from confusion among multiple-character words that are phonologically and visually similar but semantically distinct. The first Chinese Spelling Check (CSC) bakeoff was organized as part of the Seventh SIGHAN (Special Interest Group on Chinese Language Processing of the Association for Computational Linguistics) workshop (Wu et al., 2013). This shared task was the first open evaluation of automatic Chinese spelling checkers. A second edition of the bakeoff was co-located with the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing (Yu et al., 2014), and a third was organized in conjunction with the Eighth SIGHAN workshop (Tseng et al., 2015). The main purpose of these bakeoffs was to provide a common setting in which researchers who approach the task with different linguistic factors and computational techniques can compare their results.
Unlike the initial edition at SIGHAN 2013, which used data sets written by native Chinese speakers, the second and third bakeoffs used the TOCFL Learner Corpus as the data source. This makes detecting and correcting spelling errors more challenging, because sentences written by learners of Chinese as a foreign language can contain errors that are hard to predict. The goal remained the same: to evaluate the capability of a Chinese spelling checker. The input is a sentence consisting of several clauses, with or without spelling errors. The checker should return the locations of the incorrect characters and suggest the correct characters. Each character or punctuation mark occupies one position when counting locations.
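The location counting described above can be sketched in a few lines of Python. This is only an illustration, not the official evaluation tool: the function names `find_errors` and `format_output` are hypothetical, the sketch assumes spelling errors are pure character substitutions (so the erroneous and corrected sentences have equal length), and returning `0` for an error-free sentence is an assumption rather than something stated on this page.

```python
def find_errors(original: str, corrected: str) -> list[tuple[int, str]]:
    """Return (location, correct_character) pairs, counting locations from 1.

    Every character and punctuation mark occupies one position.
    Assumption: errors are substitutions only, so both strings have equal length.
    """
    if len(original) != len(corrected):
        raise ValueError("substitution-only comparison needs equal-length sentences")
    return [
        (location, good)
        for location, (bad, good) in enumerate(zip(original, corrected), start=1)
        if bad != good
    ]


def format_output(sentence_id: str, errors: list[tuple[int, str]]) -> str:
    """Build a result line of the form 'id, location, character, ...'."""
    if not errors:
        # Assumption: an error-free sentence is reported with a single 0.
        return f"{sentence_id}, 0"
    fields = [sentence_id]
    for location, character in errors:
        fields += [str(location), character]
    return ", ".join(fields)


# The SIGHAN 2013 example sentence from the summary, without layout spaces.
original = "擁有六百一十年歷史的崇禮門,象微著南韓人的精神,在一夕之門,被火燒得精光。"
corrected = "擁有六百一十年歷史的崇禮門,象徵著南韓人的精神,在一夕之間,被火燒得精光。"

result_line = format_output("88888", find_errors(original, corrected))
print(result_line)  # prints: 88888, 16, 徵, 29, 間
```

A real checker would not have the corrected sentence available; the comparison here only demonstrates how positions are numbered and how a result line is assembled.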


Summary

Shared Task

SIGHAN 2013 (Wu et al., 2013)
SIGHAN 2014 (Yu et al., 2014)
SIGHAN 2015 (Tseng et al., 2015)
Examples

SIGHAN 2013
Input: (NID=88888) 擁有六百一十年歷史的崇禮門,象微著南韓人的精神,在一夕之門,被火燒得精光。
Output: 88888, 16, 徵, 29, 間

SIGHAN 2014
Input: (pid=B1-0201-1) 我一身中的貴人就是我姨媽,從我回來台灣的時候,她一只都很照顧我,也很觀心我。
Output: B1-0201-1, 3, 生, 26, 直, 35, 關

SIGHAN 2015
Input: (pid=B2-1670-2) 在日本,大學生打工的情況是相當普偏的。
Output: B2-1670-2, 17, 遍

Data Source

SIGHAN 2013: Chinese native speakers’ data
SIGHAN 2014 and SIGHAN 2015: TOCFL Learner Corpus
Training Set

SIGHAN 2013: This set included 700 samples selected from students’ essays. In addition, a set of Chinese characters with similar shapes or pronunciations was provided.

SIGHAN 2014: This set included 1,301 sentences with a total of 5,284 spelling errors.

SIGHAN 2015: This set included 970 sentences with a total of 3,143 spelling errors.

Test Set

SIGHAN 2013: This set consisted of 2,000 test sentences for both subtasks, each with an average of 70 characters. The total number of erroneous characters was 1,641.

SIGHAN 2014: This set consisted of 1,062 test sentences, each with an average of 50 characters. Half contained no spelling errors, while the other half included at least one spelling error each, for a total of 792 spelling errors.

SIGHAN 2015: This set consisted of 1,100 test sentences. Half of these sentences contained no spelling errors, while the other half included at least one spelling error each.

Evaluation Metrics

False Positive Rate; Detection-level Accuracy/Precision/Recall/F1; Correction-level Accuracy/Precision/Recall/F1.

Registered Teams

SIGHAN 2013: 17 teams (9 from Taiwan, 5 from China, and 3 from Singapore, Japan, and the UK)
SIGHAN 2014: 19 teams (10 from China, 8 from Taiwan, and 1 private firm)
SIGHAN 2015: 9 teams (4 from China, 4 from Taiwan, and 1 private firm)
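The evaluation metrics named in the summary can be illustrated with a small Python sketch. This page does not define the metrics, so the definitions below are assumptions: scoring is done per sentence; at the detection level a sentence with errors counts as a true positive only when all predicted locations match the gold locations; at the correction level the suggested characters must match as well; and an error-free sentence that the checker flags counts as a false positive. The function names and the toy data are hypothetical, and the downloadable evaluation tools remain the authoritative reference.

```python
from typing import Dict, List, Tuple

Errors = List[Tuple[int, str]]  # (1-based location, suggested character)


def confusion_counts(gold: Dict[str, Errors], predicted: Dict[str, Errors],
                     level: str) -> Tuple[int, int, int, int]:
    """Count sentence-level TP, FP, FN, TN (assumed definitions, see lead-in)."""
    tp = fp = fn = tn = 0
    for sentence_id, gold_errors in gold.items():
        pred_errors = predicted.get(sentence_id, [])
        if level == "detection":
            # Only the locations have to agree.
            gold_key = {loc for loc, _ in gold_errors}
            pred_key = {loc for loc, _ in pred_errors}
        else:
            # Correction level: locations and characters have to agree.
            gold_key = set(gold_errors)
            pred_key = set(pred_errors)
        if gold_key:
            if pred_key == gold_key:
                tp += 1
            else:
                fn += 1
        elif pred_key:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Derive the rate-style metrics from the four counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
    }


# Hypothetical toy data: two sentences with errors, two without.
gold = {"s1": [(3, "生")], "s2": [], "s3": [(17, "遍")], "s4": []}
predicted = {"s1": [(3, "生")], "s2": [(5, "的")], "s3": [(17, "偏")], "s4": []}

detection = scores(*confusion_counts(gold, predicted, "detection"))
correction = scores(*confusion_counts(gold, predicted, "correction"))
print(detection)
print(correction)
```

In the toy data, sentence `s3` is located correctly but receives the wrong replacement character, so under these assumed definitions it counts as a hit at the detection level and as a miss at the correction level.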

Download

Click here to download all data sets and evaluation tools.


References


Contact

Lung-Hao Lee

Assistant Professor

Department of Electrical Engineering

National Central University

lhlee@ee.ncu.edu.tw