日本語
Headwords, Entry Structure, and ID System

Under preparation.

Headwords, Entry Structure, and ID System #

This section explains the structure of Entries in the Myōgishō and the ID system used to identify them.

Definitions of Headwords and Entries #

A Headword refers to the Chinese character or string of Chinese characters that forms the heading. An Entry is the basic unit of the dictionary, typically consisting of a Headword and its associated Original Glosses.

In the Myōgishō, the Headword of an Entry consists of either a single Chinese character or multiple Chinese characters (a multi-character compound).

Entries are presented in two forms based on their Headwords:

  • Single Character Form: When the Headword is a single Chinese character.
  • Multi-Character Form: When the Headword consists of multiple Chinese characters.

Multi-Character Form Entries include those for idioms (or compounds) and those that co-list variant characters (itaiji).

Below are some examples of these two forms:

Examples:

  • Single Character Form: 人, 何
  • Multi-Character Form (idioms/compounds): 一人, 二人, 何如, 如何
  • Multi-Character Form (co-listing variant characters (itaiji)): 爲為, 羱𦍘

The distinction between Multi-Character Form Entries for idioms/compounds and those co-listing variant characters (itaiji) is determined by the content of the Original Glosses. For example, in ‘爲為’, which co-lists variant characters (itaiji), the Original Glosses contain the Form Classification Tag ‘正今’ (sei kin). This indicates that ‘爲’ is the ‘standard’ (sei) form and ‘為’ is the ‘current/modern’ (kin) form, thus establishing their relationship as variants. Similarly, for ‘羱𦍘’, the Original Glosses contain the Form Classification Tag ‘正俗’ (sei zoku), indicating that ‘羱’ is the ‘standard’ (sei) form and ‘𦍘’ is the ‘popular/vulgar’ (zoku) form, confirming their variant relationship.

Principles and Exceptions in the Arrangement of Entries #

A single page in the Myōgishō is typically composed of 8 lines (rows) and 4 segments (columns per line). In other words, each page has a layout of 8 vertical lines and 4 horizontal segments, totaling 32 cells (or blocks) for entries.

The overwhelming majority of Entries are “single-segment, single-entry” (一段一項目, ichidan ikkōmoku), meaning one Entry is written within a single cell.

When an explanation (i.e., the Original Glosses) is lengthy, it may occupy two or more cells, extending over multiple segments or multiple lines. Such “multi-segment, single-entry” (多段一項目, tadan ikkōmoku) instances also appear frequently. However, even with lengthy explanations, an Entry never extends beyond a single page. Thus, the general principle is that a single cell contains one Entry or less.

As an exception, two Entries may occasionally be found within a single segment. These “single-segment, multi-entry” (一段多項目, ichidan takōmoku) cases account for less than 1% of the total.

Bearing these principles and exceptions in mind, we will now explain the ID system for the Myōgishō. The complexity of the Myōgishō’s ID system stems largely from this occasional practice of recording two Entries within a single segment, a point to which users of this data should pay particular attention.

ID System of the Myōgishō #

Types of IDs #

In the Myōgishō data, Entries and individual character positions within Headwords are primarily managed by the following columns:

  • Entry ID (entry_id - e.g., F00001) - F-format
  • Headword Character ID (hanzi_id - e.g., S00001) - S-format
  • Kazama Edition Location (kazama_location - e.g., K0100131) - K-format
  • Tenri Edition Location (tenri_location - e.g., Ta023310) - T-format

Relationship Between Columns #

The Entry ID (entry_id) identifies one or more Headword Character IDs (hanzi_id) that constitute the Headword of that Entry.

The Headword Character ID (hanzi_id) is the primary key that uniquely identifies each individual character position within the data structure (particularly in files like krm_headword_chars).

Each Headword Character ID (hanzi_id) is linked to corresponding location IDs (for the Kazama and Tenri editions).

The establishment of these four types of IDs accommodates the diversity of Entries in the Myōgishō and facilitates the use of multiple facsimile editions. For this reason, explanations of the F-format, S-format, K-format, and T-format IDs are provided repeatedly as needed, even if it involves some redundancy.

Detailed Format of Primary IDs #

Entry ID (entry_id / F-format) #

Format: The Entry ID (entry_id) is a 5-digit number prefixed with ‘F’, forming a sequential series from F00001 to F32604. For some additionally inserted Entries, a ‘b’ suffix is appended to the numeric part of the ID. It should be noted that while a ‘b’ suffix may also be appended to the Headword Character ID (hanzi_id described below), the ‘b’ suffix for an entry_id is assigned independently of any ‘b’ suffix on a hanzi_id.

Purpose: To uniquely identify each Entry in the Myōgishō.

Headword Character ID (hanzi_id / S-format) #

Format: The Headword Character ID (hanzi_id) is a 5-digit number prefixed with ‘S’, forming a sequential series from S00001 to S42328. This is the S-format. For some additionally inserted Headword characters, a ‘b’ suffix is appended to the numeric part of the ID. It should be noted that while a ‘b’ suffix may also be appended to the Entry ID (entry_id mentioned above), the ‘b’ suffix for a hanzi_id is assigned independently of any ‘b’ suffix on an entry_id.

Purpose: To serve as the primary key that uniquely identifies each individual Headword character (i.e., each character position) within the dataset.

Supplement: A separate data file lists all Headword Character IDs (hanzi_id), including those for the second and subsequent characters in multi-character Headwords. This file is krm_headword_chars, the details of which are described elsewhere.

Kazama Edition Location (kazama_location / K-format) #

Format: K + Volume (2 digits) + Kazama Edition Page number (3 digits) + Line number (1 digit) + Segment number (1 digit) + Character Order (1 digit). This is the K-format.

Character Order (1 digit): This digit is a number assigned based on the type of Entry (single-character or multi-character) and its order of appearance within that segment. While the “Character Order” is fundamentally a criterion based on individual character positions, it is used here to indicate the location of an Entry by referencing the position of the first character of that Entry within the segment.

The use of a character-position-based criterion for this “Character Order” digit is specifically to address the exceptional cases where two or more Entries are recorded within a single segment (which can also be thought of as a single cell or block in the layout).

To apply this character-position-based criterion representatively for the “Character Order,” the following rules for determining this digit have been established:

  • Case 1: If there is only one Entry in the segment
    • If the Entry is a single-character Entry, the Character Order is 0.
    • If the Entry is a multi-character Entry, the Character Order is 1.
  • Case 2: If there are two or more Entries in the segment
    • For the first Entry in the segment, the Character Order is 1.
    • For the second or subsequent Entries in the segment, the Character Order indicates the position of the first character of that Entry, counted sequentially from the beginning of the segment (where the first character position in the segment is counted as 1).

For example, consider a segment containing ‘A’ and then ‘BC’, where ‘A’ is the 1st character in the segment, ‘B’ is the 2nd, and ‘C’ is the 3rd. The Entry ‘BC’ is the second Entry in this segment. Its first character is ‘B’, which is the 2nd character from the beginning of the segment. Therefore, the Character Order for the Entry ‘BC’ is 2. Alternatively, consider a segment containing ‘AB’ and then ‘CD’, where ‘A’ is the 1st character, ‘B’ the 2nd, ‘C’ the 3rd, and ‘D’ the 4th. The Entry ‘CD’ is the second Entry in this segment. Its first character is ‘C’, which is the 3rd character from the beginning of the segment. Therefore, the Character Order for the Entry ‘CD’ is 3.

Examples of kazama_location IDs:

  • K01001310: (Indicates a single-character Entry, one Entry per segment) Volume 1, Page 1, Line 3, Segment 1, Character Order 0.
  • K08084411: (Indicates a multi-character Entry, one Entry per segment) Volume 8, Page 84, Line 4, Segment 1, Character Order 1.
  • K01004241: (Indicates the first Entry when multiple Entries are in the segment) Volume 1, Page 4, Line 2, Segment 4, Character Order 1.
  • K01004242: (Indicates an Entry starting from the 2nd character position within the segment, when multiple Entries are in the segment) Volume 1, Page 4, Line 2, Segment 4, Character Order 2.
  • K01008341: (Indicates the first Entry when multiple Entries are in the segment) Volume 1, Page 8, Line 3, Segment 4, Character Order 1.
  • K01008343: (Indicates an Entry starting from the 3rd character position within the segment, when multiple Entries are in the segment) Volume 1, Page 8, Line 3, Segment 4, Character Order 3.

Purpose: To indicate the location of an Entry in the Kazama Edition. This ID system is determined based on rules for indicating character position, designed to accommodate all various arrangement patterns of Entries: the primarily used “single-segment, single-entry” arrangement; the frequently occurring “multi-segment, single-entry” arrangement; and the rare “single-segment, multi-entry” arrangement.

Source: Based on Ruiju Myōgishō, Daiikkan (類聚名義抄 第一巻, Ruiju Myōgishō, Vol. 1), edited by Masamune Atsuo (Tokyo: Kazama Shobō, 1954).

Tenri Edition Location (tenri_location / T-format) #

The Tenri Edition Location (tenri_location) follows principles similar to those for determining the K-format of the Kazama Edition Location. Its format, Character Order, and purpose are defined as follows:

Format: T + Volume (a/b/c) + Page number (3 digits) + Line number (1 digit) + Segment number (1 digit) + Character Order (1 digit). This is the T-format.

Character Order (1 digit): This digit is a number assigned based on the type of Entry (single-character or multi-character) and its order of appearance within that segment. While the “Character Order” is fundamentally a criterion based on individual character positions, it is used here to indicate the location of an Entry by referencing the position of the first character of that Entry within the segment.

The use of a character-position-based criterion for this “Character Order” digit is specifically to address the exceptional cases where two or more Entries are recorded within a single segment (which can also be thought of as a single cell or block in the layout).

To apply this character-position-based criterion representatively for the “Character Order,” the following rules for determining this digit have been established:

  • Case 1: If there is only one Entry in the segment
    • If the Entry is a single-character Entry, the Character Order is 0.
    • If the Entry is a multi-character Entry, the Character Order is 1.
  • Case 2: If there are two or more Entries in the segment
    • For the first Entry in the segment, the Character Order is 1.
    • For the second or subsequent Entries in the segment, the Character Order indicates the position of the first character of that Entry, counted sequentially from the beginning of the segment (where the first character position in the segment is counted as 1).

The method for indicating the Tenri Edition Location is based on the same principles as that for the Kazama Edition Location. The examples used to explain the Kazama Edition Location, if shown as Tenri Edition Locations, would be as follows (using hypothetical Tenri IDs for illustration, actual examples below):

Examples of tenri_location IDs:

  • Ta023310: (Indicates a single-character Entry, one Entry per segment) Upper Volume (上巻), Page 23, Line 3, Segment 1, Character Order 0.
  • Tc090411: (Indicates a multi-character Entry, one Entry per segment) Lower Volume (下巻), Page 90, Line 4, Segment 1, Character Order 1.
  • Ta026241: (Indicates the first Entry when multiple Entries are in the segment) Upper Volume (上巻), Page 26, Line 2, Segment 4, Character Order 1.
  • Ta026242: (Indicates an Entry starting from the 2nd character position within the segment, when multiple Entries are in the segment) Upper Volume (上巻), Page 26, Line 2, Segment 4, Character Order 2.
  • Ta030341: (Indicates the first Entry when multiple Entries are in the segment) Upper Volume (上巻), Page 30, Line 3, Segment 4, Character Order 1.
  • Ta030343: (Indicates an Entry starting from the 3rd character position within the segment, when multiple Entries are in the segment) Upper Volume (上巻), Page 30, Line 3, Segment 4, Character Order 3.

Purpose: To indicate the location of an Entry in the Tenri Edition. This ID system is determined based on rules for indicating character position, designed to accommodate all various arrangement patterns of Entries: the primarily used “single-segment, single-entry” arrangement; the frequently occurring “multi-segment, single-entry” arrangement; and the rare “single-segment, multi-entry” arrangement.

Source: Based on Ruiju Myōgishō: Butsu, Hō, Sō (類聚名義抄 仏・法・僧; Tenri Toshokan Zenpon Sōsho, Washo no Bu, vols. 32-34; Tenri Daigaku Shuppanbu, distributed by Yagi Shoten).

Input Method for Headwords #

Headwords are input into the hanzi_entry column.

Entries have Headwords that can be in either Single Character Form or Multi-Character Form. As the Single Character Form generally does not pose particular input issues, this section will focus on the input method for Headwords in Multi-Character Form.

For Multi-Character Form Headwords, whether they represent a co-listing of variant characters (itaiji) or an idiom/compound, the constituent characters of the Headword are input separated by a ‘/’ (full-width slash, U+FF0F).

If a Headword in an Entry contains ‘/’, this indicates that the Headword is in Multi-Character Form. The number of characters in such a Headword can be determined by the number of segments separated by the ‘/’ (e.g., one slash results in two segments, indicating two characters).

Examples:

  • Co-listing of variant characters (itaiji): 翛/倐/倏/翛β
  • Idiom/compound: 一/人

For details on how to input characters using Unicode, please refer to the Character Encoding and Representation section.

Handling of IDs in Data Representation #

The main entry data for the Myōgishō (e.g., in krm_main.tsv) is published in TSV format. This subsection explains how each ID is represented in the TSV files, with a particular focus on the representation rules for Multi-Character Form Entries. While some explanations may overlap with those in other sections (such as “Detailed Format of Primary IDs,” “Input Method for Headwords,” and the section on krm_headword_chars which details individual headword characters and acts as a mapping table), this information is crucial for processing the TSV-formatted data and is therefore summarized again here.

TSV Columns and Corresponding IDs:

The primary TSV files store the following IDs in their respective columns:

  • entry_id: Entry ID (F-format)
  • hanzi_id: Headword Character ID (S-format)
  • kazama_location: Kazama Edition Location (K-format)
  • tenri_location: Tenri Edition Location (T-format)

IDs other than those listed above (e.g., the Headword Character IDs or location IDs for the second and subsequent characters within a multi-character Headword) are not directly stored in this primary file but can be referenced through a separate mapping table (krm_headword_chars.tsv).

Data Representation Rules for Multi-Character Form Entries

  • Multi-Character Form Headwords (such as those co-listing variant characters (itaiji) or representing idioms/compounds) are stored as a string in the hanzi_entry column, with constituent characters separated by a full-width slash (’/’).
  • Regarding the representation of IDs, the main TSV row corresponding to a Multi-Character Form Entry displays only the IDs related to the first character of that Headword.
  • IDs (S-format, K-format, T-format) related to the second and subsequent characters that constitute the Headword are omitted from this main row.

Example: Suppose there is an Entry with the Headword “AB” (composed of A + B), and their respective IDs are as follows:

  • Entry ID: F25121
  • Headword Character ID for A: S31590 (Kazama Edition Location: K08084411, Tenri Edition Location: Tc090411)
  • Headword Character ID for B: S31591 (Kazama Edition Location: K08084412, Tenri Edition Location: Tc090412)

When this Entry is represented in the TSV file, the main row would appear as follows (showing relevant columns only):

entry_id hanzi_id hanzi_entry kazama_location tenri_location
F25121 S31590 AB K08084411 Tc090411

It can be seen that this row contains only the Entry ID, the Headword Character ID for the first character, the complete Headword string, and the location IDs for the first character.

Deferring Detailed Headword Character Information to a Mapping Table #

All of the following detailed information pertaining to Headword characters is deferred to the mapping table krm_headword_chars:

  • The complete list of more detailed location information (K-format, T-format) corresponding to each Headword Character ID (S-format).
  • The cropped image file name corresponding to each individual Headword character (S-format).

Users who require this information will need to consult krm_headword_chars.tsv.