Under preparation.
Headwords, Entry Structure, and ID System #
This section explains the structure of Entries
in the Myōgishō and the ID system used to identify them.
Definitions of Headwords and Entries #
A Headword refers to the Chinese character or string of Chinese characters that forms the heading.
An Entry is the basic unit of the dictionary, typically consisting of a Headword and its associated Original Glosses.
In the Myōgishō, the Headword of an Entry consists of either a single Chinese character or multiple Chinese characters (a multi-character compound).
Entries are presented in two forms based on their Headwords:
- Single Character Form: When the Headword is a single Chinese character.
- Multi-Character Form: When the Headword consists of multiple Chinese characters.
Multi-Character Form Entries include those for idioms (or compounds) and those that co-list variant characters (itaiji).
Below are some examples of these two forms:
Examples:
- Single Character Form: 人, 何
- Multi-Character Form (idioms/compounds): 一人, 二人, 何如, 如何
- Multi-Character Form (co-listing variant characters (itaiji)): 爲為, 羱𦍘
The distinction between Multi-Character Form Entries for idioms/compounds and those co-listing variant characters (itaiji) is determined by the content of the Original Glosses.
For example, in ‘爲為’, which co-lists variant characters (itaiji), the Original Glosses contain the Form Classification Tag ‘正今’ (sei kin). This indicates that ‘爲’ is the ‘standard’ (sei) form and ‘為’ is the ‘current/modern’ (kin) form, thus establishing their relationship as variants.
Similarly, for ‘羱𦍘’, the Original Glosses contain the Form Classification Tag ‘正俗’ (sei zoku), indicating that ‘羱’ is the ‘standard’ (sei) form and ‘𦍘’ is the ‘popular/vulgar’ (zoku) form, confirming their variant relationship.
Principles and Exceptions in the Arrangement of Entries #
A single page in the Myōgishō is typically laid out as 8 vertical lines, each divided into 4 segments, giving a total of 32 cells (or blocks) for Entries.
The overwhelming majority of Entries are “single-segment, single-entry” (一段一項目, ichidan ikkōmoku), meaning one Entry is written within a single cell.
When an explanation (i.e., the Original Glosses) is lengthy, it may occupy two or more cells, extending over multiple segments or multiple lines. Such “multi-segment, single-entry” (多段一項目, tadan ikkōmoku) instances also appear frequently. However, even with lengthy explanations, an Entry never extends beyond a single page.
Thus, the general principle is that a single cell contains at most one Entry.
As an exception, two Entries may occasionally be found within a single segment. These “single-segment, multi-entry” (一段多項目, ichidan takōmoku) cases account for less than 1% of the total.
Bearing these principles and exceptions in mind, we will now explain the ID system for the Myōgishō.
The complexity of the Myōgishō’s ID system stems largely from this occasional practice of recording two Entries within a single segment, a point to which users of this data should pay particular attention.
ID System of the Myōgishō #
Types of IDs #
In the Myōgishō data, Entries and individual character positions within Headwords are primarily managed by the following columns:
- Entry ID (entry_id - e.g., F00001) - F-format
- Headword Character ID (hanzi_id - e.g., S00001) - S-format
- Kazama Edition Location (kazama_location - e.g., K01001310) - K-format
- Tenri Edition Location (tenri_location - e.g., Ta023310) - T-format
Relationship Between Columns #
The Entry ID (entry_id) identifies one or more Headword Character IDs (hanzi_id) that constitute the Headword of that Entry.
The Headword Character ID (hanzi_id) is the primary key that uniquely identifies each individual character position within the data structure (particularly in files like krm_headword_chars).
Each Headword Character ID (hanzi_id) is linked to corresponding location IDs (for the Kazama and Tenri editions).
The establishment of these four types of IDs accommodates the diversity of Entries in the Myōgishō and facilitates the use of multiple facsimile editions. For this reason, explanations of the F-format, S-format, K-format, and T-format IDs are provided repeatedly as needed, even if it involves some redundancy.
Detailed Format of Primary IDs #
Entry ID (entry_id / F-format) #
Format:
The Entry ID (entry_id) is a 5-digit number prefixed with ‘F’, forming a sequential series from F00001 to F32604.
For some additionally inserted Entries, a ‘b’ suffix is appended to the numeric part of the ID.
It should be noted that while a ‘b’ suffix may also be appended to the Headword Character ID (hanzi_id, described below), the ‘b’ suffix for an entry_id is assigned independently of any ‘b’ suffix on a hanzi_id.
Purpose:
To uniquely identify each Entry in the Myōgishō.
Headword Character ID (hanzi_id / S-format) #
Format:
The Headword Character ID (hanzi_id) is a 5-digit number prefixed with ‘S’, forming a sequential series from S00001 to S42328. This is the S-format.
For some additionally inserted Headword characters, a ‘b’ suffix is appended to the numeric part of the ID.
It should be noted that while a ‘b’ suffix may also be appended to the Entry ID (entry_id, mentioned above), the ‘b’ suffix for a hanzi_id is assigned independently of any ‘b’ suffix on an entry_id.
Purpose:
To serve as the primary key that uniquely identifies each individual Headword character (i.e., each character position) within the dataset.
Supplement:
A separate data file lists all Headword Character IDs (hanzi_id), including those for the second and subsequent characters in multi-character Headwords. This file is krm_headword_chars, the details of which are described elsewhere.
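The F-format and S-format described above can be checked mechanically. The following is a minimal sketch, not part of the published dataset tooling: the function name `parse_id` is hypothetical, and it assumes that the ‘b’ suffix, when present, comes directly after the five digits.

```python
import re

# One prefix letter (F or S), five digits, and an optional 'b' suffix.
# Assumption: the 'b' suffix directly follows the digits (e.g. "F00010b").
_ID_RE = re.compile(r"([FS])([0-9]{5})(b?)")


def parse_id(value):
    """Split an F-format or S-format ID into (column, number, has_b_suffix)."""
    m = _ID_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not an F-format or S-format ID: {value!r}")
    column = "entry_id" if m.group(1) == "F" else "hanzi_id"
    return column, int(m.group(2)), m.group(3) == "b"


print(parse_id("F00001"))
print(parse_id("S42328"))
```

Because the two ‘b’ suffixes are assigned independently, a suffixed entry_id does not imply that the corresponding hanzi_id is suffixed, so each ID should be parsed on its own.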
Kazama Edition Location (kazama_location / K-format) #
Format: K + Volume (2 digits) + Kazama Edition Page number (3 digits) + Line number (1 digit) + Segment number (1 digit) + Character Order (1 digit). This is the K-format.
Character Order (1 digit):
This digit is a number assigned based on the type of Entry (single-character or multi-character) and its order of appearance within that segment.
While the “Character Order” is fundamentally a measure of individual character positions, it is used here to indicate the location of an Entry by referencing the position of the first character of that Entry within the segment.
A character-position-based measure is used for this digit specifically to address the exceptional cases where two or more Entries are recorded within a single segment (which can also be thought of as a single cell or block in the layout).
The digit is determined by the following rules:
- Case 1: If there is only one Entry in the segment
  - If the Entry is a single-character Entry, the Character Order is 0.
  - If the Entry is a multi-character Entry, the Character Order is 1.
- Case 2: If there are two or more Entries in the segment
  - For the first Entry in the segment, the Character Order is 1.
  - For the second or subsequent Entries in the segment, the Character Order indicates the position of the first character of that Entry, counted sequentially from the beginning of the segment (where the first character position in the segment is counted as 1).
For example, consider a segment containing ‘A’ and then ‘BC’, where ‘A’ is the 1st character in the segment, ‘B’ is the 2nd, and ‘C’ is the 3rd. The Entry ‘BC’ is the second Entry in this segment. Its first character is ‘B’, which is the 2nd character from the beginning of the segment. Therefore, the Character Order for the Entry ‘BC’ is 2.
Alternatively, consider a segment containing ‘AB’ and then ‘CD’, where ‘A’ is the 1st character, ‘B’ the 2nd, ‘C’ the 3rd, and ‘D’ the 4th. The Entry ‘CD’ is the second Entry in this segment. Its first character is ‘C’, which is the 3rd character from the beginning of the segment. Therefore, the Character Order for the Entry ‘CD’ is 3.
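The two cases and the worked examples above can be expressed as a short function. This is an illustrative sketch only: `character_orders` is a hypothetical name, and it assumes that only Headword characters are counted when determining positions within a segment, as in the ‘A’/‘BC’ and ‘AB’/‘CD’ examples.

```python
def character_orders(headwords):
    """Return the Character Order digit for each Entry in one segment.

    `headwords` lists the Headwords in order of appearance, each given as a
    sequence of characters (a string or a list of single characters).
    """
    if len(headwords) == 1:
        # Case 1: a lone Entry gets 0 (single-character) or 1 (multi-character).
        return [0 if len(headwords[0]) == 1 else 1]
    # Case 2: each Entry gets the 1-based position of its first character,
    # so the first Entry in the segment always receives 1.
    orders = []
    position = 1
    for headword in headwords:
        orders.append(position)
        position += len(headword)
    return orders


print(character_orders(["A", "BC"]))   # Entries 'A' and 'BC' in one segment
print(character_orders(["AB", "CD"]))  # Entries 'AB' and 'CD' in one segment
```

Note that the digit 1 is ambiguous on its own: it marks both a lone multi-character Entry and the first Entry of a shared segment, so the two situations can only be told apart by looking at the other IDs in the same segment.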
Examples of kazama_location IDs:
- K01001310: (Indicates a single-character Entry, one Entry per segment) Volume 1, Page 1, Line 3, Segment 1, Character Order 0.
- K08084411: (Indicates a multi-character Entry, one Entry per segment) Volume 8, Page 84, Line 4, Segment 1, Character Order 1.
- K01004241: (Indicates the first Entry when multiple Entries are in the segment) Volume 1, Page 4, Line 2, Segment 4, Character Order 1.
- K01004242: (Indicates an Entry starting from the 2nd character position within the segment, when multiple Entries are in the segment) Volume 1, Page 4, Line 2, Segment 4, Character Order 2.
- K01008341: (Indicates the first Entry when multiple Entries are in the segment) Volume 1, Page 8, Line 3, Segment 4, Character Order 1.
- K01008343: (Indicates an Entry starting from the 3rd character position within the segment, when multiple Entries are in the segment) Volume 1, Page 8, Line 3, Segment 4, Character Order 3.
Purpose:
To indicate the location of an Entry in the Kazama Edition. The ID is built on rules for indicating character position so that it accommodates every arrangement pattern of Entries: the predominant “single-segment, single-entry” arrangement; the frequently occurring “multi-segment, single-entry” arrangement; and the rare “single-segment, multi-entry” arrangement.
Source: Based on Ruiju Myōgishō, Daiikkan (類聚名義抄 第一巻, Ruiju Myōgishō, Vol. 1), edited by Masamune Atsuo (Tokyo: Kazama Shobō, 1954).
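Since the K-format is a fixed-width string, it can be split into its fields with a single regular expression. The sketch below is one possible way to do this; `KazamaLocation` and `parse_kazama_location` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class KazamaLocation(NamedTuple):
    volume: int
    page: int
    line: int
    segment: int
    char_order: int


# K + volume (2 digits) + page (3) + line (1) + segment (1) + Character Order (1)
_K_RE = re.compile(r"K([0-9]{2})([0-9]{3})([0-9])([0-9])([0-9])")


def parse_kazama_location(value):
    """Split a K-format ID into its five numeric fields."""
    m = _K_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a K-format location: {value!r}")
    return KazamaLocation(*(int(g) for g in m.groups()))


loc = parse_kazama_location("K08084411")
print(loc.volume, loc.page, loc.line, loc.segment, loc.char_order)
```

A named tuple keeps the fields in reading order, so sorting a list of parsed locations orders them by volume, page, line, segment, and Character Order.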
Tenri Edition Location (tenri_location / T-format) #
The Tenri Edition Location (tenri_location) follows principles similar to those for determining the K-format of the Kazama Edition Location. Its format, Character Order, and purpose are defined as follows:
Format: T + Volume (a/b/c) + Page number (3 digits) + Line number (1 digit) + Segment number (1 digit) + Character Order (1 digit). This is the T-format.
Character Order (1 digit):
This digit is a number assigned based on the type of Entry (single-character or multi-character) and its order of appearance within that segment.
While the “Character Order” is fundamentally a measure of individual character positions, it is used here to indicate the location of an Entry by referencing the position of the first character of that Entry within the segment.
A character-position-based measure is used for this digit specifically to address the exceptional cases where two or more Entries are recorded within a single segment (which can also be thought of as a single cell or block in the layout).
The digit is determined by the following rules:
- Case 1: If there is only one Entry in the segment
  - If the Entry is a single-character Entry, the Character Order is 0.
  - If the Entry is a multi-character Entry, the Character Order is 1.
- Case 2: If there are two or more Entries in the segment
  - For the first Entry in the segment, the Character Order is 1.
  - For the second or subsequent Entries in the segment, the Character Order indicates the position of the first character of that Entry, counted sequentially from the beginning of the segment (where the first character position in the segment is counted as 1).
The method for indicating the Tenri Edition Location is based on the same principles as that for the Kazama Edition Location. The same kinds of cases used to explain the Kazama Edition Location are shown below as Tenri Edition Locations:
Examples of tenri_location IDs:
- Ta023310: (Indicates a single-character Entry, one Entry per segment) Upper Volume (上巻), Page 23, Line 3, Segment 1, Character Order 0.
- Tc090411: (Indicates a multi-character Entry, one Entry per segment) Lower Volume (下巻), Page 90, Line 4, Segment 1, Character Order 1.
- Ta026241: (Indicates the first Entry when multiple Entries are in the segment) Upper Volume (上巻), Page 26, Line 2, Segment 4, Character Order 1.
- Ta026242: (Indicates an Entry starting from the 2nd character position within the segment, when multiple Entries are in the segment) Upper Volume (上巻), Page 26, Line 2, Segment 4, Character Order 2.
- Ta030341: (Indicates the first Entry when multiple Entries are in the segment) Upper Volume (上巻), Page 30, Line 3, Segment 4, Character Order 1.
- Ta030343: (Indicates an Entry starting from the 3rd character position within the segment, when multiple Entries are in the segment) Upper Volume (上巻), Page 30, Line 3, Segment 4, Character Order 3.
Purpose:
To indicate the location of an Entry in the Tenri Edition. The ID is built on rules for indicating character position so that it accommodates every arrangement pattern of Entries: the predominant “single-segment, single-entry” arrangement; the frequently occurring “multi-segment, single-entry” arrangement; and the rare “single-segment, multi-entry” arrangement.
Source: Based on Ruiju Myōgishō: Butsu, Hō, Sō (類聚名義抄 仏・法・僧; Tenri Toshokan Zenpon Sōsho, Washo no Bu, vols. 32-34; Tenri Daigaku Shuppanbu, distributed by Yagi Shoten).
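The T-format differs from the K-format only in using a volume letter (a/b/c) in place of a two-digit volume number. The following sketch parses a T-format ID and rebuilds it from its fields as a round-trip check; `TenriLocation` and `parse_tenri_location` are hypothetical names, and the volume is kept as the raw letter rather than mapped to a volume name.

```python
import re
from dataclasses import dataclass

# T + volume letter (a/b/c) + page (3 digits) + line + segment + Character Order
_T_RE = re.compile(r"T([abc])([0-9]{3})([0-9])([0-9])([0-9])")


@dataclass(frozen=True)
class TenriLocation:
    volume: str  # 'a', 'b', or 'c', kept as written in the ID
    page: int
    line: int
    segment: int
    char_order: int

    def to_id(self):
        """Rebuild the T-format string from the parsed fields."""
        return (f"T{self.volume}{self.page:03d}"
                f"{self.line}{self.segment}{self.char_order}")


def parse_tenri_location(value):
    """Split a T-format ID into volume letter and four numeric fields."""
    m = _T_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a T-format location: {value!r}")
    volume, *numbers = m.groups()
    return TenriLocation(volume, *(int(n) for n in numbers))


loc = parse_tenri_location("Ta026242")
print(loc, loc.to_id())
```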
Input Method for Headwords #
Headwords are input into the hanzi_entry column.
Entries have Headwords that can be in either Single Character Form or Multi-Character Form. As the Single Character Form generally does not pose particular input issues, this section will focus on the input method for Headwords in Multi-Character Form.
For Multi-Character Form Headwords, whether they represent a co-listing of variant characters (itaiji) or an idiom/compound, the constituent characters of the Headword are input separated by a ‘/’ (full-width slash, U+FF0F).
If a Headword in an Entry contains ‘/’, this indicates that the Headword is in Multi-Character Form. The number of characters in such a Headword can be determined by the number of segments separated by the ‘/’ (e.g., one slash results in two segments, indicating two characters).
Examples:
- Co-listing of variant characters (itaiji): 翛/倐/倏/翛β
- Idiom/compound: 一/人
For details on how to input characters using Unicode, please refer to the Character Encoding and Representation section.
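The slash-based rule for Multi-Character Form Headwords can be sketched as follows. The helper names are hypothetical, and the separator is written as the escape `\uff0f` to make clear that the code assumes the full-width slash (U+FF0F) stated above; if a given export uses the ASCII slash instead, the constant would need to be changed.

```python
FULLWIDTH_SLASH = "\uff0f"  # U+FF0F, the separator named in this section


def split_headword(hanzi_entry):
    """Return the constituent parts of a Headword as a list.

    A Headword without the separator yields a one-element list.
    """
    return hanzi_entry.split(FULLWIDTH_SLASH)


def is_multi_character(hanzi_entry):
    """True when the Headword is in Multi-Character Form."""
    return FULLWIDTH_SLASH in hanzi_entry


compound = "一" + FULLWIDTH_SLASH + "人"
print(split_headword(compound), is_multi_character(compound))
print(split_headword("人"), is_multi_character("人"))
```

Splitting on the separator, rather than counting code points, matters for parts such as ‘翛β’, which occupy one character position but are written with more than one code point.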
Handling of IDs in Data Representation #
The main entry data for the Myōgishō (e.g., in krm_main.tsv) is published in TSV format.
This subsection explains how each ID is represented in the TSV files, with a particular focus on the representation rules for Multi-Character Form Entries.
While some explanations may overlap with those in other sections (such as “Detailed Format of Primary IDs,” “Input Method for Headwords,” and the section on krm_headword_chars, which details individual headword characters and acts as a mapping table), this information is crucial for processing the TSV-formatted data and is therefore summarized again here.
TSV Columns and Corresponding IDs:
The primary TSV files store the following IDs in their respective columns:
- entry_id: Entry ID (F-format)
- hanzi_id: Headword Character ID (S-format)
- kazama_location: Kazama Edition Location (K-format)
- tenri_location: Tenri Edition Location (T-format)
IDs other than those listed above (e.g., the Headword Character IDs or location IDs for the second and subsequent characters within a multi-character Headword) are not directly stored in this primary file but can be referenced through a separate mapping table (krm_headword_chars.tsv).
Data Representation Rules for Multi-Character Form Entries
- Multi-Character Form Headwords (such as those co-listing variant characters (itaiji) or representing idioms/compounds) are stored as a string in the hanzi_entry column, with constituent characters separated by a full-width slash (‘/’).
- Regarding the representation of IDs, the main TSV row corresponding to a Multi-Character Form Entry displays only the IDs related to the first character of that Headword.
- IDs (S-format, K-format, T-format) related to the second and subsequent characters that constitute the Headword are omitted from this main row.
Example:
Suppose there is an Entry with the Headword “AB” (composed of A + B), and their respective IDs are as follows:
- Entry ID: F25121
- Headword Character ID for A: S31590 (Kazama Edition Location: K08084411, Tenri Edition Location: Tc090411)
- Headword Character ID for B: S31591 (Kazama Edition Location: K08084412, Tenri Edition Location: Tc090412)
When this Entry is represented in the TSV file, the main row would appear as follows (showing relevant columns only):
| entry_id | hanzi_id | hanzi_entry | kazama_location | tenri_location |
|---|---|---|---|---|
| F25121 | S31590 | AB | K08084411 | Tc090411 |
It can be seen that this row contains only the Entry ID, the Headword Character ID for the first character, the complete Headword string, and the location IDs for the first character.
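A row like the one above can be read with Python's standard csv module. This is a sketch under assumptions: it supposes that the TSV file begins with a header row using the column names shown, and it reads from an in-memory sample built from the example rather than from the published file. `SAMPLE` and `read_main_rows` are hypothetical names.

```python
import csv
import io

# In-memory stand-in for the main TSV, built from the example row above.
# Assumption: the real file has a header row with these column names.
SAMPLE = (
    "entry_id\thanzi_id\thanzi_entry\tkazama_location\ttenri_location\n"
    "F25121\tS31590\tAB\tK08084411\tTc090411\n"
)


def read_main_rows(handle):
    """Read tab-separated rows into dictionaries keyed by column name."""
    # QUOTE_NONE keeps any quotation marks in the data as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_main_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["entry_id"], row["hanzi_id"], row["kazama_location"])
```

When reading an actual file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`; the encoding is an assumption to verify against the published data.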
Deferring Detailed Headword Character Information to a Mapping Table #
All of the following detailed information pertaining to Headword characters is deferred to the mapping table krm_headword_chars:
- The complete list of more detailed location information (K-format, T-format) corresponding to each Headword Character ID (S-format).
- The cropped image file name corresponding to each individual Headword character (S-format).
Users who require this information will need to consult krm_headword_chars.tsv.