Publication and Updates of Entry Data Files #

This section covers the content and specific examples of Entry Data Files, as well as methods for their publication and updates using GitHub.

Examples of Entry Data Files #

A list of the publicly available data files is provided in the Overview of Public Data. Here, we will explain the content of krm_main, which is the core Entry Data File, using it as an example.

Let’s consider as examples the three Entries that were presented as specific illustrations in the Entry Data Structure section: ‘加復’, ‘ー之’, ‘助’ (along with its variant characters (itaiji)), and ‘功’.

The data in TSV format is shown below. A “No.” column has been added on the far left for explanatory purposes.

No  entry_id	hanzi_id	kazama_location	tenri_location	volume_name	radical_name	volume_radical_index	hanzi_entry	original_entry	definition
1	F25133	S31605	K08084810	Tc090810	僧上	力	v8#83	功	〇	音工（L-R）「コウ（_N）」「クウ（_N）」　續也　事也　成也　タシカニ（LHLH）　𭃄歟
2	F25062	S31507	K08081810	Tc087810	僧上	力	v8#83	助	⿰目力	鉏據反　タスク（LL_）　マサル（HH_）　ハサム　和自ヨ（_L）
3	F25063	S31508	K08081821	Tc087821	僧上	力	v8#83	𦔳／助	■／〇	今正
4	F25121	S31590	K08084411	Tc090411	僧上	力	v8#83	加／復	〇／〇	シカノミナラス
5	F25122	S31592	K08084421	Tc090421	僧上	力	v8#83	ー（加）／之	〇／〇	同

The following is an explanation of the sample data shown above:

Entry 1, ‘Headword’ ‘功’, is an example of the Single Character Form. This particular Entry has complex content within its Original Glosses; its Phonetic Gloss includes multiple Tone marks (shōten), Sino-Japanese readings in kana (仮名字音, kana jion), and nasal sound symbols. (In the provided transcription format, details such as the distinction between circle and star types of Tone marks (shōten), or the use of red versus black ink to differentiate Tone marks (shōten) from Sino-Japanese readings in kana, are not represented.) The Semantic Gloss in Chinese ‘續也’ is a scribal error for ‘績也’. The Japanese Native Reading (wakun) ‘タシカニ’ (tashikani) is questionable as it does not correspond to the semantic meaning of the Headword ‘功’. Immediately following this wakun, there is a note ‘𭃄歟’ (setsu ka; “perhaps 𭃄?”), and indeed, the character ‘切’ has the wakun ‘タシカニ’ (tashikani). The character ‘功’ has a variant character (itaiji) ‘㓛’, which is graphically similar to ‘切’ and its variant character (itaiji) ‘𭃄’. ‘切’ is a graphically similar character to the Headword ‘功’, and this particular wakun is thought to have resulted from confusion between these two graphically similar characters. However, instead of altering the content of the Original Glosses, the note ‘𭃄歟’ (“perhaps 𭃄?”) was added by the compiler or scribe. Such notes appended with ‘歟’ (ka), like ‘𭃄歟’, are treated as ’editorial notes’ (ango, 案語) by the original compiler or a later scribe of the Myōgishō.

Entry 2, ‘Headword’ ‘助’ (whose original form is ⿰目力 in this Entry), is in Single Character Form and provides a Phonetic Gloss and Japanese Native Readings (wakun). The subsequent Entry 3, ‘Headword’ ‘𦔳／助’, is in Multi-Character Form and indicates variant characters (itaiji) through a Note on Character Form (“今正”). The ‘／’ (full-width slash) is a separator used for Multi-Character Form Headwords. The number of characters in such a Headword can be determined from the number of segments separated by these slashes (e.g., one slash indicates two characters).

Entry 4, ‘Headword’ ‘加／復’, presents a Japanese Native Reading (wakun) for a compound term. Entry 5, ‘Headword’ ‘ー（加）／之’, indicates that it shares the same wakun (‘同’). The ‘ー’ in ‘ー（加）／之’ is a symbol used to concisely represent ‘加’, the first character of the Headword in the preceding Entry (Entry 4, ‘加／復’).

The same content is shown below in JSON format. While this format enhances readability, it also increases the data volume, so loading the file may take some time even when using a high-performance editor like VS Code.

[
    {
        "entry_id": "F25133",
        "hanzi_id": "S31605",
        "kazama_location": "K08084810",
        "tenri_location": "Tc090810",
        "volume_name": "僧上",
        "radical_name": "力",
        "volume_radical_index": "v8#83",
        "hanzi_entry": "功",
        "original_entry": "〇",
        "definition": "音工（L-R）「コウ（_N）」「クウ（_N）」　續也　事也　成也　タシカニ（LHLH）　𭃄歟"
    },
    {
        "entry_id": "F25062",
        "hanzi_id": "S31507",
        "kazama_location": "K08081810",
        "tenri_location": "Tc087810",
        "volume_name": "僧上",
        "radical_name": "力",
        "volume_radical_index": "v8#83",
        "hanzi_entry": "助",
        "original_entry": "⿰目力",
        "definition": "鉏據反　タスク（LL_）　マサル（HH_）　ハサム　和自ヨ（_L）"
    },
    {
        "entry_id": "F25063",
        "hanzi_id": "S31508",
        "kazama_location": "K08081821",
        "tenri_location": "Tc087821",
        "volume_name": "僧上",
        "radical_name": "力",
        "volume_radical_index": "v8#83",
        "hanzi_entry": "𦔳／助",
        "original_entry": "■／〇",
        "definition": "今正"
    },
    {
        "entry_id": "F25121",
        "hanzi_id": "S31590",
        "kazama_location": "K08084411",
        "tenri_location": "Tc090411",
        "volume_name": "僧上",
        "radical_name": "力",
        "volume_radical_index": "v8#83",
        "hanzi_entry": "加／復",
        "original_entry": "〇／〇",
        "definition": "シカノミナラス"
    },
    {
        "entry_id": "F25122",
        "hanzi_id": "S31592",
        "kazama_location": "K08084421",
        "tenri_location": "Tc090421",
        "volume_name": "僧上",
        "radical_name": "力",
        "volume_radical_index": "v8#83",
        "hanzi_entry": "ー（加）／之",
        "original_entry": "〇／〇",
        "definition": "同"
    },
]

Description of Columns (Headers) in Entry Data Files #

Data for the Entries of the Myōgishō is stored in krm_main.tsv and krm_main.json. These files constitute the primary data.

For details on column names and their descriptions, please refer to the Overview of Public Data.

Publication and Updates via GitHub #

The Integrated Database of Hanzi Dictionaries in Early Japan (HDIC) has been publicly available via GitHub since October 2015. The repository can be accessed at https://github.com/shikeda.

A summary of the Chinese character dictionaries included in the HDIC and the initial publication dates of their full-text databases is as follows:

Sōhon Gyokuhen (宋本玉篇, Songben Yupian; abbr. SYP) – First published: October 20, 2015
Kōsan-ji manuscript Tenrei Banshō Meigi (篆隷万象名義; abbr. KTB) – First published: September 1, 2016
Tenji manuscript Shinsen Jikyō (新撰字鏡; abbr. TSJ) – First published: June 28, 2018
Kanchi-in manuscript Ruiju Myōgisho (類聚名義抄; abbr. KRM) – First published: March 11, 2022

An explanation of what GitHub is and the significance of publishing research data through this system can be summarized as follows:

GitHub is widely used as a platform for managing and publishing software source code. In recent years, however, it has also been utilized in various research fields, including the humanities, for sharing and publishing research data. GitHub is built upon a version control system called “Git,” which records the entire editing history and clearly preserves the an audit trail of changes. This makes it possible to track who made what changes and when, thereby enhancing the transparency and reproducibility of research data.

Furthermore, GitHub facilitates collaborative editing among multiple individuals. Features such as pull requests and issues allow for dialogue and peer-review-like interactions with other researchers to be recorded. Its appeal also lies in features like document creation using Markdown notation and the ability to view file-specific revision histories, making it relatively easy to use even for humanities researchers who do not write programs.

The significance of placing data on GitHub extends beyond mere storage. It enables “open science” practices, where research progress is incrementally published, and improvements are made based on external feedback. It is particularly well-suited for structured humanities data, such as data on Sino-Japanese character readings (like Kan-on and Go-on based on historical sources), transcribed texts, and lexicographical information, with many existing examples of its use.

Moreover, data on GitHub can be linked with Zenodo, a research data repository, to assign a formal DOI (Digital Object Identifier) and publish it in an academically stable manner. For instance, the “Database of Historical Sino-Japanese Readings” (DHSJR) utilizes GitHub for data construction, and this data is then registered with Zenodo and published with a DOI, making it internationally citable and reusable. In this way, GitHub is playing a significant role as a foundation for the long-term sharing and utilization of digital resources in the humanities.