Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level

http://hdl.handle.net/10061/0002001061
bdb7cced-0e8a-47dc-b1ae-aafc4f3f4518
Item type Conference Paper(1)
Publication date 2025-07-23
Title
Title Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level
Language
Language eng
Keyword
Subject scheme Other
Subject Pashto Paraphrase Detection
Keyword
Subject scheme Other
Subject Corpus Collection
Keyword
Subject scheme Other
Subject Low Resource NLP
Resource type
Resource type conference paper
Access rights
Access rights open access
Authors
en Ali, Iqra

ja 上垣外, 英剛
ja-Kana カミガイト, ヒデタカ
en Kamigaito, Hidetaka

ja 渡辺, 太郎
ja-Kana ワタナベ, タロウ
en Watanabe, Taro
Abstract
Description type Abstract
Description Paraphrase detection is the task of identifying whether two sentences are semantically similar. It plays an important role in maintaining the integrity of written work, for example in plagiarism detection and text reuse detection. Previously, researchers focused on developing large corpora for English. However, no research has been conducted on sentence-level paraphrase detection in the low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus, collected from authentic cases in journalism covering 10 different domains, including Sports, Health, Environment, and more. Our proposed corpus contains 6,727 sentences, comprising 3,687 paraphrased and 3,040 non-paraphrased. Experimental findings reveal that our proposed corpus is sufficient to train XLM-RoBERTa to accurately detect paraphrased sentence pairs in Pashto, with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively. This result indicates that the quality of our corpus is not lower than that of commonly used datasets. It is a pioneering contribution to the field. We will publicly release a subset of 1,800 instances from our corpus, free from any licensing issues.
Bibliographic information en : Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

p. 11574-11581, number of pages 8, issue date 2024-05
Conference information
Conference name The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Start year 2024
Start month 05
Start day 20
End year 2024
End month 05
End day 25
Conference dates 2024-05-20 - 2024-05-25
Venue Torino, Italia
Country ITA
Publisher
Publisher ELRA and ICCL
Publisher version URI
Relation type isReplacedBy
Identifier type URI
Related identifier https://aclanthology.org/2024.lrec-main.1011/
Rights
Rights information resource https://creativecommons.org/licenses/by-nc/4.0/
Rights information © 2024 ELRA Language Resource Association: CC BY-NC 4.0
Author version flag
Publication type NA
Versions

Ver.1 2025-07-23 04:34:18.228389