Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level
http://hdl.handle.net/10061/0002001061
| Item type | Conference Paper(1) | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Release date | 2025-07-23 | |||||||||||||||||||
| Title | ||||||||||||||||||||
| Title | Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level | |||||||||||||||||||
| Language | ||||||||||||||||||||
| Language | eng | |||||||||||||||||||
| Keyword | ||||||||||||||||||||
| Subject scheme | Other | |||||||||||||||||||
| Subject | Pashto Paraphrase Detection | |||||||||||||||||||
| Keyword | ||||||||||||||||||||
| Subject scheme | Other | |||||||||||||||||||
| Subject | Corpus Collection | |||||||||||||||||||
| Keyword | ||||||||||||||||||||
| Subject scheme | Other | |||||||||||||||||||
| Subject | Low Resource NLP | |||||||||||||||||||
| Resource type | ||||||||||||||||||||
| Resource type | conference paper | |||||||||||||||||||
| Access rights | ||||||||||||||||||||
| Access rights | open access | |||||||||||||||||||
| Authors | Ali, Iqra; 上垣外, 英剛; 渡辺, 太郎 | |||||||||||||||||||
| Abstract | ||||||||||||||||||||
| Description type | Abstract | |||||||||||||||||||
| Description | Paraphrase detection is a task to identify if two sentences are semantically similar or not. It plays an important role in maintaining the integrity of written work such as plagiarism detection and text reuse detection. Formerly, researchers focused on developing large corpora for English. However, no research has been conducted on sentence-level paraphrase detection in low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus collected from authentic cases in journalism covering 10 different domains, including Sports, Health, Environment, and more. Our proposed corpus contains 6,727 sentences, encompassing 3,687 paraphrased and 3,040 non-paraphrased. Experimental findings reveal that our proposed corpus is sufficient to train XLM-RoBERTa to accurately detect paraphrased sentence pairs in Pashto with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to the Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively. This result indicates that the quality of our corpus is not less than commonly used datasets. It's a pioneering contribution to the field. We will publicize a subset of 1,800 instances from our corpus, free from any licensing issues. | |||||||||||||||||||
| Bibliographic information | en : Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), p. 11574-11581, 8 pages, issued 2024-05 | |||||||||||||||||||
| Conference information | ||||||||||||||||||||
| Conference name | The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) | |||||||||||||||||||
| Start year | 2024 | |||||||||||||||||||
| Start month | 05 | |||||||||||||||||||
| Start day | 20 | |||||||||||||||||||
| End year | 2024 | |||||||||||||||||||
| End month | 05 | |||||||||||||||||||
| End day | 25 | |||||||||||||||||||
| Conference dates | 2024-05-20 - 2024-05-25 | |||||||||||||||||||
| Venue | Torino, Italia | |||||||||||||||||||
| Country | ITA | |||||||||||||||||||
| Publisher | ||||||||||||||||||||
| Publisher | ELRA and ICCL | |||||||||||||||||||
| Publisher version URI | ||||||||||||||||||||
| Relation type | isReplacedBy | |||||||||||||||||||
| Identifier type | URI | |||||||||||||||||||
| Related identifier | https://aclanthology.org/2024.lrec-main.1011/ | |||||||||||||||||||
| Rights | ||||||||||||||||||||
| Rights resource | https://creativecommons.org/licenses/by-nc/4.0/ | |||||||||||||||||||
| Rights statement | © 2024 ELRA Language Resource Association: CC BY-NC 4.0 | |||||||||||||||||||
| Author version flag | ||||||||||||||||||||
| Publication type | NA | |||||||||||||||||||