# Lexid Morpheme Dataset (v1)

A cleaned, attested, plagiarism-free **lexicography** dataset of English
morphemes — affixes, combining forms, and roots — with senses, definitions,
example words, etymology, part-of-speech behavior, and attestation.

**13,659 rows.** Files:
- `lexid-morphemes-v1.csv`
- `lexid-morphemes-v1.jsonl` (one JSON object per line)
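
A minimal loading sketch using only the Python standard library. It assumes the files are UTF-8 and that the CSV starts with a header row; neither is stated above, so treat both as assumptions. The helper names `read_jsonl` and `read_csv` and the two sample rows are made up for illustration.

```python
import csv
import io
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


def read_csv(handle):
    """Parse a CSV file handle (header row assumed) into a list of dicts."""
    return list(csv.DictReader(handle))


# Hypothetical two-row sample standing in for the real files.
sample_jsonl = [
    '{"morpheme_key": "example-1", "morpheme": "il-", "morpheme_head": "in-(not)"}\n',
    "\n",
    '{"morpheme_key": "example-2", "morpheme": "im-", "morpheme_head": "in-(not)"}\n',
]
sample_csv = io.StringIO(
    "morpheme_key,morpheme,morpheme_head\n"
    "example-1,il-,in-(not)\n"
    "example-2,im-,in-(not)\n"
)

jsonl_rows = read_jsonl(sample_jsonl)
csv_rows = read_csv(sample_csv)
print(len(jsonl_rows), len(csv_rows))
```

With the real files, the same helpers could be called as `read_jsonl(open("lexid-morphemes-v1.jsonl", encoding="utf-8"))` and `read_csv(open("lexid-morphemes-v1.csv", encoding="utf-8", newline=""))`. Note that `csv.DictReader` returns every value as a string.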

## License
**CC BY-SA 4.0.** See `LICENSE` and `LICENSES/`. If you redistribute or adapt
this data, keep the attribution, link the license, note your changes, and
release adaptations under CC BY-SA 4.0.

## What's in it
This is the lexicography layer only. Game-design / internal columns from the
source project are deliberately excluded. Included columns:

`morpheme_key`, `morpheme`, `morpheme_head`, `affix_form`, `allomorphs`, `sense_label`,
`affix_type`, `morpheme_subtype`, `bound_free_or_root`,
`derivational_or_inflectional`, `parts_of_speech_this_morpheme_can_join_onto`,
`part_of_speech_created_by_this_suffix`, `meaning_definition`,
`example_words_list_1`, `example_words_list_2`, `example_words_corpus`,
`related_morphemes`, `morpheme_counterpart_opposite`, `opposing_meaning`,
`categorical_group_list`, `ancestor_etymon`, `etymology`,
`ancestor_etymon_language`, `etymological_family`, `register`, `pronunciation`,
`word_list_frequency`, `morpheme_length`, `source`, `attestation`.
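
A quick way to confirm that a loaded file carries these columns is sketched below. The column names are copied from the list above; the function name `check_header` and the demo header are hypothetical, and the check deliberately ignores column order because the on-disk order is not documented here.

```python
EXPECTED_COLUMNS = (
    "morpheme_key", "morpheme", "morpheme_head", "affix_form", "allomorphs",
    "sense_label", "affix_type", "morpheme_subtype", "bound_free_or_root",
    "derivational_or_inflectional",
    "parts_of_speech_this_morpheme_can_join_onto",
    "part_of_speech_created_by_this_suffix", "meaning_definition",
    "example_words_list_1", "example_words_list_2", "example_words_corpus",
    "related_morphemes", "morpheme_counterpart_opposite", "opposing_meaning",
    "categorical_group_list", "ancestor_etymon", "etymology",
    "ancestor_etymon_language", "etymological_family", "register",
    "pronunciation", "word_list_frequency", "morpheme_length", "source",
    "attestation",
)


def check_header(header):
    """Return (missing, unexpected) column names relative to EXPECTED_COLUMNS."""
    present = set(header)
    expected = set(EXPECTED_COLUMNS)
    missing = [c for c in EXPECTED_COLUMNS if c not in present]
    unexpected = [c for c in header if c not in expected]
    return missing, unexpected


# Hypothetical header: drops `attestation` and adds a made-up `notes` column.
demo_header = list(EXPECTED_COLUMNS[:-1]) + ["notes"]
missing, unexpected = check_header(demo_header)
print(missing, unexpected)
```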

## Notes on the data
- **Scope:** this catalogs **affixes, combining forms, and roots** — far more
  than an affix-only dictionary. The suffix layer (~750 distinct) matches the
  standard scholarly affix count; the volume beyond that is roots/combining
  forms and productive free morphemes.
- **Curation (what was excluded):** standalone complete words that do not
  productively combine ("lexemes") were filtered out — this is a *morpheme*
  dataset, not a word list. Exact-duplicate rows were collapsed.
- **`morpheme_head`:** a grouping key. For most rows it equals the morpheme's own
  spelling; for closed-class allomorph families it is the shared head, so e.g.
  `il-/im-/in-/ir-` ("not") all carry `morpheme_head = in-(not)`, while
  homographs (privative `a-` vs. directional `a-`) remain distinct. Group by
  `morpheme_head` for morpheme-level analysis, or by `morpheme` for form-level.
- **Disambiguation:** a `morpheme` may have multiple senses; rows are
  disambiguated by `sense_label` (and the unique `morpheme_key`).
- **Attestation** (`attestation`): one of `both` (attested in both MorphoLex and
  Wiktionary), `wiktionary`, `wikidata`, or `morpholex`. Every published row is
  attested in at least one open source; unverified candidates are excluded from v1.
- **Definitions** were rewritten into original wording where they derived from
  copyrighted or unknown-copyright references; facts (spellings, frequencies,
  attestation) are preserved as-is.
- **Example words** in `example_words_corpus` were regenerated from an open
  Wiktionary word corpus (real attested words).
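
The grouping, disambiguation, and attestation notes above can be sketched as follows. The six rows are invented for illustration and are not copied from the dataset; in particular the `morpheme_key` values, the `sense_label` strings, and the two `a-` head values are guesses at what such rows might look like. The sketch also reads `both` as "attested in both MorphoLex and Wiktionary", so it counts as a Wiktionary attestation.

```python
from collections import defaultdict

# Hypothetical rows; only the column names and `in-(not)` come from the README.
rows = [
    {"morpheme_key": "k1", "morpheme": "il-", "morpheme_head": "in-(not)",
     "sense_label": "not", "attestation": "both"},
    {"morpheme_key": "k2", "morpheme": "im-", "morpheme_head": "in-(not)",
     "sense_label": "not", "attestation": "wiktionary"},
    {"morpheme_key": "k3", "morpheme": "in-", "morpheme_head": "in-(not)",
     "sense_label": "not", "attestation": "both"},
    {"morpheme_key": "k4", "morpheme": "ir-", "morpheme_head": "in-(not)",
     "sense_label": "not", "attestation": "morpholex"},
    {"morpheme_key": "k5", "morpheme": "a-", "morpheme_head": "a-(without)",
     "sense_label": "without", "attestation": "wiktionary"},
    {"morpheme_key": "k6", "morpheme": "a-", "morpheme_head": "a-(toward)",
     "sense_label": "toward", "attestation": "wikidata"},
]


def group_keys(rows, column):
    """Map each value of `column` to the list of morpheme_key values sharing it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["morpheme_key"])
    return dict(groups)


def senses_of(rows, form):
    """List (sense_label, morpheme_key) pairs for one spelling of a morpheme."""
    return [(r["sense_label"], r["morpheme_key"]) for r in rows if r["morpheme"] == form]


by_head = group_keys(rows, "morpheme_head")  # morpheme-level: allomorphs merged
by_form = group_keys(rows, "morpheme")       # form-level: one group per spelling

# Rows with a Wiktionary attestation, under the reading of `both` stated above.
in_wiktionary = [r["morpheme_key"] for r in rows
                 if r["attestation"] in {"both", "wiktionary"}]

print(len(by_head), len(by_form), senses_of(rows, "a-"), in_wiktionary)
```

Grouping by `morpheme_head` merges the four allomorphs into one family, while grouping by `morpheme` keeps each spelling separate but puts the two `a-` senses in the same group, which is why `sense_label` or `morpheme_key` is needed to tell them apart.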

## Attribution
Includes material from **Wiktionary** and **Wikipedia** (CC BY-SA 4.0; modified),
**Kaikki.org / wiktextract**, **Open English WordNet** (CC BY 4.0, incl. the
Princeton WordNet notice), and **Wikidata Lexemes** (CC0). Compiled by Logan Park
for the Lexid project. Full source accounting: `SOURCES_AND_LICENSING.md`.
