Contributing Data

Methexis rewards people who contribute useful, lawful, and high‑quality datasets for training open‑source AI models. Data is screened by an automated pipeline and stake‑weighted committees; accepted datasets become eligible for rewards when they are used in training rounds.

Principles: consent & compliance → quality → transparency. The chain records hashes, statuses, and rewards; the data itself is stored off‑chain (IPFS/Arweave).


What kinds of data?

Supported modalities (initially):

  • Text (natural language, documentation, code snippets)

  • Code (repos, problems/solutions, test cases)

  • Images (with licenses & usage rights suitable for model training)

  • Audio (transcripts, speech, effects; with consent)

  • Synthetic data (must disclose provenance and generator)

Start narrow if needed (e.g., text + code) and expand by governance vote.


What not to upload

  • Personal data without consent (names, emails, phone numbers, addresses, medical or financial records)

  • Content restricted by terms of service or non‑redistributable licenses

  • Copyrighted material without an explicit license permitting ML training & redistribution

  • CSAM, violent/graphic abuse, hate speech, or harassment datasets

  • Hidden malware, tracking pixels, or executable payloads

  • Mass‑duplicated web scrapes with little originality (fails dedup/novelty checks)

Violations may lead to slashing of the submitter’s bond and blacklisting of associated keys.


Licensing

Choose an allowed license when you submit. Examples:

  • CC‑BY 4.0, CC‑BY‑SA 4.0, CC0, MIT/Apache‑2.0 (for code)

  • OpenRAIL (where terms allow training & redistribution)

Your manifest must include:

  • the license identifier (SPDX or link),

  • the rights statement (you own it / have permission),

  • any attribution requirements,

  • and a consent declaration if people are identifiable (or an explicit statement that PII has been removed/anonymized).


Data hygiene & PII

Before uploading:

  • Remove or mask PII via automated NER + regex (emails, SSNs, phone numbers); a redaction sketch follows this list.

  • Strip EXIF from images; blur faces if needed.

  • Normalize encodings (UTF‑8), line endings, file paths.

  • Provide dedup‑resistant chunking (e.g., minhash/simhash) so novelty can be scored.

  • Include a small validation sample for the committee to review quickly.
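
For the redaction step, here is a minimal Python sketch of the regex half. It is illustrative only: the patterns and placeholder labels are assumptions, not protocol requirements, and a real pipeline would pair this with an NER model for names and addresses.

import re

# Sketch: regex catches structured PII only; pair with NER (e.g., a spaCy
# pipeline) for names, organizations, and street addresses.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each structured-PII match with a typed placeholder.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# -> "Contact [EMAIL] or [PHONE]."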


Submission flow

  1. Prepare the dataset

    • Organize files (e.g., data/), include a manifest and an attribution file if required.

  2. Create manifest (see spec below)

    • A single YAML/JSON describing license, size, modality, languages, provenance, and consent.

  3. Package & hash

    • Create a .tar.zst (or .tar.gz) archive.

    • Compute the SHA‑256 over the archive → this becomes your contentHash.

  4. Upload to IPFS/Arweave

    • Pin via your node or a pinning service; note the CID or transaction ID.

  5. Register on‑chain

    • Call registerDataset(contentHash, metaURI) where metaURI points to the manifest (e.g., ipfs://<manifestCID>); a scripted sketch follows this list.

    • Post the data bond (see below).

  6. Validation

    • Automated filters run first; then a randomized committee reviews samples, licensing, novelty, and policy adherence.

    • Committee publishes DatasetStatusChanged(datasetId, status).

  7. Rewards

    • If Accepted, the dataset is eligible for rewards when used in training rounds.

    • Use is measured via the training pipeline (see Reward Model).
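
Steps 3 and 5 can be scripted. The sketch below assumes a web3.py client; the RPC endpoint, registry address, and minimal ABI are placeholders, and only the registerDataset(contentHash, metaURI) call is taken from the flow above. The deployed signature, and how the bond is posted, may differ.

import hashlib
from web3 import Web3

# Placeholders: endpoint and address are hypothetical; the ABI assumes only
# the registerDataset call named in step 5.
RPC_URL = "http://127.0.0.1:8545"
REGISTRY_ADDRESS = "0x0000000000000000000000000000000000000000"
REGISTRY_ABI = [{
    "name": "registerDataset", "type": "function", "stateMutability": "payable",
    "inputs": [{"name": "contentHash", "type": "bytes32"},
               {"name": "metaURI", "type": "string"}],
    "outputs": [],
}]

def archive_sha256(path: str) -> bytes:
    # Step 3: hash the packed archive; the digest becomes contentHash.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

w3 = Web3(Web3.HTTPProvider(RPC_URL))
registry = w3.eth.contract(address=REGISTRY_ADDRESS, abi=REGISTRY_ABI)
tx_hash = registry.functions.registerDataset(
    archive_sha256("cleancode-docs-v1.tar.zst"),  # contentHash (32 bytes)
    "ipfs://<manifestCID>",                       # metaURI
).transact({"from": w3.eth.accounts[0]})          # bond posting is deployment-specific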


Data bond (anti‑spam)

To discourage spam, each dataset registration posts a refundable bond:

  • Proposed formula: bond = base + k * sizeGB (e.g., base 50 MTHX + 10 MTHX/GB); see the sketch after this list.

  • If Accepted → bond is returned.

  • If Rejected (policy/licensing violation or clear junk) → bond is burned or partially slashed.

  • If Rejected (quality borderline) → bond returned minus a review fee (DAO‑tunable).
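
A one-function sketch of the proposed formula, using the example parameter values above (both are DAO‑tunable):

BASE_MTHX = 50       # example base bond (DAO-tunable)
K_MTHX_PER_GB = 10   # example per-GB rate (DAO-tunable)

def data_bond(size_gb: float) -> float:
    # bond = base + k * sizeGB, the proposed formula above
    return BASE_MTHX + K_MTHX_PER_GB * size_gb

print(data_bond(3.0))  # 80.0 MTHX for a 3 GB archive (cf. --bond in the CLI example)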


Manifest specification

A minimal, machine‑readable record (YAML example):

# ipfs://bafy.../manifest.yaml
dataset:
  name: "CleanCodeDocs-v1"
  version: "1.0.0"
  modality: ["text", "code"]
  languages: ["en"]
  size_bytes: 128734987
  files: 1203
  archive_sha256: "0xA3F5...7C1E"
  archive_uri: "ipfs://bafy.../cleancode-docs-v1.tar.zst"

provenance:
  created_by: "0xSUBMITTER"
  collected_at: "2025-10-15T12:20:00Z"
  source_types: ["original_docs", "license-permitted_repos"]
  synthetic: false
  notes: "Docs curated from repos with permissive licenses. Attributions included."

license:
  spdx: "CC-BY-4.0"
  attribution_required: true
  attribution_file: "ipfs://bafy.../ATTRIBUTION.txt"
  redistribution_ok: true
  training_ok: true

consent_and_safety:
  contains_pii: false
  pii_process: "ner+regex redaction"
  adult_content: false
  safety_notes: "removed EXIF; normalized encoding"

quality_signals:
  novelty_hash: "simhash:7f3a-..."
  duplicate_rate_estimate: 0.08   # lower is better
  sample_cid: "ipfs://bafy.../committee-sample.jsonl"

metadata_uri: "ipfs://bafy.../README.md"
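
Before pinning, it is worth checking the manifest against the required fields. A sketch using PyYAML, where the required-field set is inferred from the example above rather than from a normative schema:

import yaml  # PyYAML

# Assumed required fields, inferred from the example manifest above.
REQUIRED = {
    "dataset": ["name", "version", "modality", "size_bytes",
                "archive_sha256", "archive_uri"],
    "license": ["spdx", "redistribution_ok", "training_ok"],
    "consent_and_safety": ["contains_pii"],
}

def check_manifest(path: str) -> list:
    # Returns a list of problems; an empty list means the check passed.
    with open(path) as f:
        m = yaml.safe_load(f)
    problems = [f"missing {sec}.{field}"
                for sec, fields in REQUIRED.items()
                for field in fields
                if field not in (m.get(sec) or {})]
    lic = m.get("license") or {}
    if not (lic.get("training_ok") and lic.get("redistribution_ok")):
        problems.append("license must permit training and redistribution")
    return problems

print(check_manifest("manifest.yaml") or "manifest looks complete")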

CLI examples (illustrative)

# Pack & hash
tar -I 'zstd -19 -T0' -cf cleancode-docs-v1.tar.zst data/
sha256sum cleancode-docs-v1.tar.zst

# Pin to IPFS (local node)
ipfs add cleancode-docs-v1.tar.zst
ipfs add manifest.yaml

# Register on-chain (placeholder CLI)
mhx data register \
  --content-hash 0xA3F5...7C1E \
  --meta-uri ipfs://bafy.../manifest.yaml \
  --bond 80 \
  --from 0xSUBMITTER

Validation criteria

Automated pre‑filters:

  • License & policy checks against the manifest + heuristics on file contents

  • Format checks (parsing, encoding, schema / filetype allow‑list)

  • Anomaly checks (PII leakage, profanity if disallowed, adversarial content)

  • Dedup/novelty scoring (simhash/minhash, overlap with accepted corpora)
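
For intuition on the dedup/novelty scoring, a compact simhash over word 3‑grams; production pipelines tune shingle size and use banding/LSH, so this is a sketch, not the protocol's exact algorithm:

import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # 64-bit simhash over word 3-grams: near-duplicate texts land at
    # small Hamming distance, enabling cheap novelty scoring.
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for s in shingles:
        h = int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # small distance ~= near-duplicate

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))  # low for near-duplicates; ~32 expected for unrelated text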

Committee review (stake‑weighted):

  • Sampled quality inspection, license verification, novelty sanity check

  • Final vote sets Accepted or Rejected on‑chain


Reward model (proposed)

Let R_d be the reward attributable to datasets in a round. Each accepted dataset i earns:

$$\text{reward}_i = R_d \cdot \frac{\alpha \cdot \text{usage}_i + \beta \cdot \text{novelty}_i + \gamma \cdot \text{quality}_i}{\sum_j \left( \alpha \cdot \text{usage}_j + \beta \cdot \text{novelty}_j + \gamma \cdot \text{quality}_j \right)}$$

  • usage: actual tokens/batches consumed from the dataset

  • novelty: uniqueness vs. corpus (higher = rarer content)

  • quality: committee score (readability, cleanliness, license clarity)

  • Default weights: α=0.6, β=0.25, γ=0.15 (governed)
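
The formula transcribed directly into Python with the default weights; the per-dataset signal values are placeholders for illustration:

ALPHA, BETA, GAMMA = 0.60, 0.25, 0.15  # default weights (DAO-governed)

def dataset_rewards(R_d, signals):
    # signals: dataset id -> (usage, novelty, quality); reward is split
    # pro-rata by the weighted score, as in the formula above.
    score = {k: ALPHA * u + BETA * n + GAMMA * q for k, (u, n, q) in signals.items()}
    total = sum(score.values())
    return {k: R_d * s / total for k, s in score.items()}

# Placeholder signals normalized to [0, 1], purely for illustration:
print(dataset_rewards(1000.0, {
    "CleanCodeDocs-v1": (0.9, 0.7, 0.8),   # heavily used, fairly novel
    "WebScrapeMisc-v2": (0.4, 0.1, 0.5),   # low novelty drags its share down
}))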

Rewards only flow to Accepted datasets; malicious submissions can be slashed (bond + future eligibility).


Updates & versioning

  • Use semver for dataset versions (vMAJOR.MINOR.PATCH).

  • New versions → new archive and new content hash; register as a new dataset and reference the prior one in your manifest.

  • Deprecations are recorded on‑chain (status → deprecated); artifacts remain pinned for auditability.


Retractions & takedowns

  • Submit a tombstone request if you discover a licensing/privacy issue; governance may flag a dataset to deny future use.

  • Note that IPFS/Arweave are decentralized—full removal is not guaranteed. The protocol can block usage and deny rewards going forward.

  • A DMCA or legal takedown can be attached to the dataset record for auditability.


Common pitfalls

  • License mismatch between manifest and files.

  • Datasets with high duplication vs. the accepted corpus.

  • PII not fully redacted (names/emails in text, EXIF in images).

  • Archive hash doesn’t match the registered contentHash.
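
The last pitfall is cheap to catch before (or after) registering. A small verification sketch:

import hashlib

def verify_archive(path: str, registered_hex: str) -> bool:
    # Recompute the archive SHA-256 and compare with the registered
    # contentHash (accepts the hash with or without a 0x prefix).
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == registered_hex.lower().removeprefix("0x")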


Pre‑flight checklist

  • License is on the allowed list and declared in the manifest (SPDX or link).

  • PII removed or masked; consent declaration included if people are identifiable.

  • Manifest complete: license, size, modality, languages, provenance, consent.

  • Archive packed, hashed, and pinned; archive_sha256 matches the contentHash you register.

  • Committee validation sample included; bond ready to post.