A clinical specific BERT developed using a huge Japanese clinical text corpus

Kawazoe, Yoshimasa; Shibata, Daisaku; Shinohara, Emiko; Aramaki, Eiji (荒牧, 英治); Ohe, Kazuhiko

doi:https://doi.org/10.1371/journal.pone.0259763

PubMed ID

34752490

Publisher version URI

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0259763

Rights

© 2021 Kawazoe et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

NAIST ID

74652181

http://hdl.handle.net/10061/14620

Item type

Journal Article(1)

Publication date

2022-02-10

Title

A clinical specific BERT developed using a huge Japanese clinical text corpus

Language

eng

Resource type

journal article

Access rights

metadata only access

Authors

Kawazoe, Yoshimasa
Shibata, Daisaku
Shinohara, Emiko
Aramaki, Eiji (ja: 荒牧, 英治; ja-Kana: アラマキ, エイジ), WEKO 21, e-Rad 70401073
Ohe, Kazuhiko

Abstract

Description type

Abstract

Description

Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical specific BERT model with a huge amount of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb that has fake Twitter messages regarding medical concerns with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as our dataset. The BERT-base was pre-trained using the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than the other BERT models that were pre-trained with Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed.

Bibliographic information

PLoS ONE (en)

Volume 16, Issue 11, Issue date 2021-11-09

Publisher

Public Library of Science

EISSN/PISSN

Source identifier type

ISSN

Source identifier

1932-6203
