first commit
commit 3325ae49f4 (parent 632c3e725b)

README.md | +149 lines
# nlp_bart_text-error-correction_chinese_a13669743711481856463214
---
tasks:
- text-error-correction

widgets:
  - task: text-error-correction
    inputs:
      - type: text # allowed values: text|image|video|audio
        name:
        title:
        validator:
          max_words: 100
    examples:
      - name: 1
        title: Example 1 # shown in the frontend; falls back to name if omitted
        inputs:
          - name:
            data: 这洋的话,下一年的福气来到自己身上。
      - name: 2
        title: Example 2 # shown in the frontend; falls back to name if omitted
        inputs:
          - name:
            data: 在拥挤时间,为了让人们尊守交通规律,派至少两个警察或者交通管理者。
      - name: 3
        title: Example 3 # shown in the frontend; falls back to name if omitted
        inputs:
          - name:
            data: 因为世界的多元化,不时会引起人们产生了兴趣。
    inferencespec:
      cpu: 2 # number of CPUs
      memory: 4000 # in MB
      gpu: 1 # number of GPUs
      gpu_memory: 16000 # in MB

model-type:
- seq2seq(BART)

domain:
- nlp

frameworks:
- pytorch

backbone:
- transformer

containers:

metrics:
- M2Scorer P/R/F0.5

license: Apache License 2.0

language:
- ch

tags:
- transformer
- Alibaba
- text-error-correction

datasets:
  train:
  - Lang8
  - HSK
  test:
  - NLPCC18 testset
---

# Chinese Text Error Correction Model

Given a Chinese sentence as input, text error correction automatically fixes spelling, grammatical, and semantic errors and outputs the corrected sentence. The mainstream approaches are seq2seq and seq2edits; commonly used Chinese correction datasets include NLPCC18 and CGED. Our latest work additionally provides MuCGEC, a high-quality test set with multiple reference answers.

## Model Description

As shown in the figure below, we model text error correction as a transformer-based seq2seq task. For training, we start from a pretrained Chinese BART model and finetune it on the Lang8 and HSK training data. Without introducing any additional resources, the model achieves SOTA on the NLPCC18 test set.

<p align="center">
    <img src="./description/model.jpg" alt="model architecture" />
</p>

Example of the model's behavior:

Input: 这洋的话,下一年的福气来到自己身上。

Output: 这**样**的话,下一年的福气**就会**来到自己身上。
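
To make the seq2seq formulation concrete, here is a minimal sketch using Hugging Face transformers with a generic Chinese BART checkpoint. The checkpoint name `fnlp/bart-base-chinese` is an assumption for illustration only; it is a pretrained model, not the finetuned correction model released here, so without finetuning on Lang8/HSK it will not actually fix errors:

```python
from transformers import BertTokenizer, BartForConditionalGeneration

# Assumption: a generic pretrained Chinese BART checkpoint;
# this particular one ships with a BERT-style tokenizer.
tokenizer = BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
model = BartForConditionalGeneration.from_pretrained('fnlp/bart-base-chinese')

src = '这洋的话,下一年的福气来到自己身上。'
input_ids = tokenizer(src, return_tensors='pt').input_ids
# The decoder generates the (ideally corrected) sentence token by token.
out_ids = model.generate(input_ids, max_length=128, num_beams=5)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```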

## Intended Use and Scope

This model diagnoses errors in Chinese text and outputs text that satisfies spelling and grammar requirements. It is a sentence-level model: its quality is affected by input length and sentence-splitting granularity, so we recommend feeding it one sentence at a time. See the code examples below for how to call it.

### How to Use

After installing the ModelScope library, you can use the text-error-correction capability as follows.

#### Code Example

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

model_id = 'damo/nlp_bart_text-error-correction_chinese'
text = '这洋的话,下一年的福气来到自己身上。'

# Build the correction pipeline once, then reuse it for many inputs.
corrector = pipeline(Tasks.text_error_correction, model=model_id)
result = corrector(text)
print(result['output'])
```
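
Because the model is sentence-level, longer passages are best corrected one sentence at a time. A minimal sketch, assuming the `corrector` pipeline from the example above; the splitting regex is an illustrative assumption, so adapt it to your data:

```python
import re

def correct_long_text(corrector, text: str) -> str:
    # Split after Chinese sentence-final punctuation (。!?),
    # correct each sentence independently, then rejoin.
    sentences = [s for s in re.split(r'(?<=[。!?])', text) if s]
    return ''.join(corrector(s)['output'] for s in sentences)
```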

### Model Limitations and Possible Bias

The model is trained on the Lang8 and HSK datasets and may show some bias in other scenarios. Please evaluate it on your own use case before deciding how to use it.

## Training Data

The training data are Lang8 (1,092,285 sentences) and HSK (95,320 sentences). Lang8 can be [downloaded here](http://tcci.ccf.org.cn/conference/2018/taskdata.php); HSK cannot be redistributed for copyright reasons and must be obtained on your own.

## Training Procedure

Training within ModelScope is not supported at this time.

## Evaluation and Results

On the NLPCC18 test set, evaluated with [M2Scorer](https://github.com/nusnlp/m2scorer) (the official NLPCC18 evaluation tool), this model achieves SOTA among models of comparable size and training data.

|                               | P     | R     | F0.5  |
|-------------------------------|-------|-------|-------|
| Tang et al., 2021<sup>1</sup> | 47.41 | 23.72 | 39.51 |
| Sun et al., 2021<sup>2</sup>  | 45.33 | 27.61 | 40.17 |
| Ours<sup>3</sup>              | 48.89 | 32.80 | 44.53 |
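
F0.5 combines precision and recall while weighting precision more heavily, since a wrong "correction" is usually more harmful than a missed one. A quick sketch of the standard F-beta formula, sanity-checked against the "Ours" row above:

```python
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R);
    # beta = 0.5 favors precision over recall.
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(f_beta(48.89, 32.80), 2))  # ~44.52, matching the table up to rounding
```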

Referenced work:

1. Soochow University: Tang et al. Chinese Grammatical Error Correction Enhanced by Data Augmentation from Word and Character Levels. 2021.
2. PKU & MSRA & CUHK: Sun et al. A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model. 2021.
3. Ours: Zhang et al. MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction. 2022.

### Citation

If our model helps you, please cite our paper:

```BibTeX
@inproceedings{zhang-etal-2022-mucgec,
    title = "{M}u{CGEC}: a Multi-Reference Multi-Source Evaluation Dataset for {C}hinese Grammatical Error Correction",
    author = "Zhang, Yue and
      Li, Zhenghua and
      Bao, Zuyi and
      Li, Jiacheng and
      Zhang, Bo and
      Li, Chen and
      Huang, Fei and
      Zhang, Min",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.227",
    pages = "3118--3130"
}
```

BART文本纠错-中文-通用领域-large (BART text error correction, Chinese, general domain, large)
The commit also adds the model's pipeline configuration (presumably ModelScope's standard configuration.json):

```json
{
    "framework": "pytorch",
    "task": "text-error-correction",
    "model": {
        "type": "bart"
    },
    "pipeline": {
        "type": "text-error-correction"
    }
}
```
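
A minimal sketch of reading such a config, assuming it is saved as configuration.json at the root of the model directory (the pipeline factory resolves the framework, task, and model type from these fields):

```python
import json

# Assumption: configuration.json sits at the root of the model directory.
with open('configuration.json', encoding='utf-8') as f:
    cfg = json.load(f)

# These fields tell the runtime which pipeline and model classes to build.
print(cfg['framework'], cfg['task'], cfg['model']['type'])
```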