Contributing Data

Methexis rewards people who contribute useful, lawful, and high‑quality datasets for training open‑source AI models. Data is screened by an automated pipeline and stake‑weighted committees; accepted datasets become eligible for rewards when they are used in training rounds.

Principles: consent & compliance → quality → transparency. The chain records hashes, statuses, and rewards; the data itself is stored off‑chain (IPFS/Arweave).


What kinds of data?

Supported modalities (initially):

  • Text (natural language, documentation, code snippets)

  • Code (repos, problems/solutions, test cases)

  • Images (with licenses & usage rights suitable for model training)

  • Audio (transcripts, speech, effects; with consent)

  • Synthetic data (must disclose provenance and generator)

Start narrow if needed (e.g., text + code) and expand by governance vote.


What not to upload

  • Personal data without consent (names, emails, phone numbers, addresses, medical or financial records)

  • Content restricted by terms of service or non‑redistributable licenses

  • Copyrighted material without an explicit license permitting ML training & redistribution

  • CSAM, violent/graphic abuse, hate speech, or harassment datasets

  • Hidden malware, tracking pixels, or executable payloads

  • Mass‑duplicated web scrapes with little originality (fails dedup/novelty checks)

Violations may lead to slashing of the submitter’s bond and blacklisting of associated keys.


Licensing

Choose an allowed license when you submit. Examples:

  • CC‑BY 4.0, CC‑BY‑SA 4.0, CC0, MIT/Apache‑2.0 (for code)

  • OpenRAIL (where terms allow training & redistribution)

Your manifest must include:

  • the license identifier (SPDX or link),

  • the rights statement (you own it / have permission),

  • any attribution requirements,

  • and a consent declaration if people are identifiable (or an explicit statement that PII has been removed/anonymized).


Data hygiene & PII

Before uploading:

  • Remove or mask PII via automated NER + regex (emails, SSNs, phone numbers); a redaction sketch follows this list.

  • Strip EXIF from images; blur faces if needed.

  • Normalize encodings (UTF‑8), line endings, file paths.

  • Provide dedup‑resistant chunking (e.g., minhash/simhash) so novelty can be scored.

  • Include a small validation sample for the committee to review quickly.
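
For the redaction step, here is a minimal Python sketch of the regex half. It is illustrative only: the patterns and placeholder labels are assumptions, not protocol requirements, and a real pipeline would pair this with an NER model for names and addresses.

import re

# Sketch: regex catches structured PII only; pair with NER (e.g., a spaCy
# pipeline) for names, organizations, and street addresses.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each structured-PII match with a typed placeholder.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# -> "Contact [EMAIL] or [PHONE]."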


Submission flow

  1. Prepare the dataset

    • Organize files (e.g., data/), include a manifest and an attribution file if required.

  2. Create manifest (see spec below)

    • A single YAML/JSON describing license, size, modality, languages, provenance, and consent.

  3. Package & hash

    • Create a .tar.zst (or .tar.gz) archive.

    • Compute the SHA‑256 over the archive → this becomes your contentHash.

  4. Upload to IPFS/Arweave

    • Pin via your node or a pinning service; note the CID or transaction ID.

  5. Register on‑chain

    • Call registerDataset(contentHash, metaURI) where metaURI points to the manifest (e.g., ipfs://<manifestCID>); a scripted sketch follows this list.

    • Post the data bond (see below).

  6. Validation

    • Automated filters run first; then a randomized committee reviews samples, licensing, novelty, and policy adherence.

    • Committee publishes DatasetStatusChanged(datasetId, status).

  7. Rewards

    • If Accepted, the dataset is eligible for rewards when used in training rounds.

    • Use is measured via the training pipeline (see Reward Model).
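
Steps 3 and 5 can be scripted. The sketch below assumes a web3.py client; the RPC endpoint, registry address, and minimal ABI are placeholders, and only the registerDataset(contentHash, metaURI) call is taken from the flow above. The deployed signature, and how the bond is posted, may differ.

import hashlib
from web3 import Web3

# Placeholders: endpoint and address are hypothetical; the ABI assumes only
# the registerDataset call named in step 5.
RPC_URL = "http://127.0.0.1:8545"
REGISTRY_ADDRESS = "0x0000000000000000000000000000000000000000"
REGISTRY_ABI = [{
    "name": "registerDataset", "type": "function", "stateMutability": "payable",
    "inputs": [{"name": "contentHash", "type": "bytes32"},
               {"name": "metaURI", "type": "string"}],
    "outputs": [],
}]

def archive_sha256(path: str) -> bytes:
    # Step 3: hash the packed archive; the digest becomes contentHash.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

w3 = Web3(Web3.HTTPProvider(RPC_URL))
registry = w3.eth.contract(address=REGISTRY_ADDRESS, abi=REGISTRY_ABI)
tx_hash = registry.functions.registerDataset(
    archive_sha256("cleancode-docs-v1.tar.zst"),  # contentHash (32 bytes)
    "ipfs://<manifestCID>",                       # metaURI
).transact({"from": w3.eth.accounts[0]})          # bond posting is deployment-specific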


Data bond (anti‑spam)

To discourage spam, each dataset registration posts a refundable bond:

  • Proposed formula: bond = base + k * sizeGB (e.g., base 50 MTHX + 10 MTHX/GB); see the sketch after this list.

  • If Accepted → bond is returned.

  • If Rejected (policy/licensing violation or clear junk) → bond is burned or partially slashed.

  • If Rejected (quality borderline) → bond returned minus a review fee (DAO‑tunable).
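
A one-function sketch of the proposed formula, using the example parameter values above (both are DAO‑tunable):

BASE_MTHX = 50       # example base bond (DAO-tunable)
K_MTHX_PER_GB = 10   # example per-GB rate (DAO-tunable)

def data_bond(size_gb: float) -> float:
    # bond = base + k * sizeGB, the proposed formula above
    return BASE_MTHX + K_MTHX_PER_GB * size_gb

print(data_bond(3.0))  # 80.0 MTHX for a 3 GB archive (cf. --bond in the CLI example)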


Manifest specification

A minimal, machine‑readable record (YAML example):

# ipfs://bafy.../manifest.yaml
dataset:
  name: "CleanCodeDocs-v1"
  version: "1.0.0"
  modality: ["text", "code"]
  languages: ["en"]
  size_bytes: 128734987
  files: 1203
  archive_sha256: "0xA3F5...7C1E"
  archive_uri: "ipfs://bafy.../cleancode-docs-v1.tar.zst"

provenance:
  created_by: "0xSUBMITTER"
  collected_at: "2025-10-15T12:20:00Z"
  source_types: ["original_docs", "license-permitted_repos"]
  synthetic: false
  notes: "Docs curated from repos with permissive licenses. Attributions included."

license:
  spdx: "CC-BY-4.0"
  attribution_required: true
  attribution_file: "ipfs://bafy.../ATTRIBUTION.txt"
  redistribution_ok: true
  training_ok: true

consent_and_safety:
  contains_pii: false
  pii_process: "ner+regex redaction"
  adult_content: false
  safety_notes: "removed EXIF; normalized encoding"

quality_signals:
  novelty_hash: "simhash:7f3a-..."
  duplicate_rate_estimate: 0.08   # lower is better
  sample_cid: "ipfs://bafy.../committee-sample.jsonl"

metadata_uri: "ipfs://bafy.../README.md"
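
Before pinning, it is worth checking the manifest against the required fields. A sketch using PyYAML, where the required-field set is inferred from the example above rather than from a normative schema:

import yaml  # PyYAML

# Assumed required fields, inferred from the example manifest above.
REQUIRED = {
    "dataset": ["name", "version", "modality", "size_bytes",
                "archive_sha256", "archive_uri"],
    "license": ["spdx", "redistribution_ok", "training_ok"],
    "consent_and_safety": ["contains_pii"],
}

def check_manifest(path: str) -> list:
    # Returns a list of problems; an empty list means the check passed.
    with open(path) as f:
        m = yaml.safe_load(f)
    problems = [f"missing {sec}.{field}"
                for sec, fields in REQUIRED.items()
                for field in fields
                if field not in (m.get(sec) or {})]
    lic = m.get("license") or {}
    if not (lic.get("training_ok") and lic.get("redistribution_ok")):
        problems.append("license must permit training and redistribution")
    return problems

print(check_manifest("manifest.yaml") or "manifest looks complete")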

CLI examples (illustrative)

# Pack & hash
tar -I 'zstd -19 -T0' -cf cleancode-docs-v1.tar.zst data/
sha256sum cleancode-docs-v1.tar.zst

# Pin to IPFS (local node)
ipfs add cleancode-docs-v1.tar.zst
ipfs add manifest.yaml

# Register on-chain (placeholder CLI)
mhx data register \
  --content-hash 0xA3F5...7C1E \
  --meta-uri ipfs://bafy.../manifest.yaml \
  --bond 80 \
  --from 0xSUBMITTER

Validation criteria

Automated pre‑filters:

  • License & policy checks against the manifest + heuristics on file contents

  • Format checks (parsing, encoding, schema / filetype allow‑list)

  • Anomaly checks (PII leakage, profanity if disallowed, adversarial content)

  • Dedup/novelty scoring (simhash/minhash, overlap with accepted corpora)
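
For intuition on the dedup/novelty scoring, a compact simhash over word 3‑grams; production pipelines tune shingle size and use banding/LSH, so this is a sketch, not the protocol's exact algorithm:

import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # 64-bit simhash over word 3-grams: near-duplicate texts land at
    # small Hamming distance, enabling cheap novelty scoring.
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for s in shingles:
        h = int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # small distance ~= near-duplicate

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))  # low for near-duplicates; ~32 expected for unrelated text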

Committee review (stake‑weighted):

  • Sampled quality inspection, license verification, novelty sanity check

  • Final vote sets Accepted or Rejected on‑chain


Reward model (proposed)

Let R_d be the reward attributable to datasets in a round. Each accepted dataset i earns:

$$\text{reward}_i = R_d \cdot \frac{\alpha \cdot \text{usage}_i + \beta \cdot \text{novelty}_i + \gamma \cdot \text{quality}_i}{\sum_j \left( \alpha \cdot \text{usage}_j + \beta \cdot \text{novelty}_j + \gamma \cdot \text{quality}_j \right)}$$

  • usage: actual tokens/batches consumed from the dataset

  • novelty: uniqueness vs. corpus (higher = rarer content)

  • quality: committee score (readability, cleanliness, license clarity)

  • Default weights: α=0.6, β=0.25, γ=0.15 (governed)
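
The formula transcribed directly into Python with the default weights; the per-dataset signal values are placeholders for illustration:

ALPHA, BETA, GAMMA = 0.60, 0.25, 0.15  # default weights (DAO-governed)

def dataset_rewards(R_d, signals):
    # signals: dataset id -> (usage, novelty, quality); reward is split
    # pro-rata by the weighted score, as in the formula above.
    score = {k: ALPHA * u + BETA * n + GAMMA * q for k, (u, n, q) in signals.items()}
    total = sum(score.values())
    return {k: R_d * s / total for k, s in score.items()}

# Placeholder signals normalized to [0, 1], purely for illustration:
print(dataset_rewards(1000.0, {
    "CleanCodeDocs-v1": (0.9, 0.7, 0.8),   # heavily used, fairly novel
    "WebScrapeMisc-v2": (0.4, 0.1, 0.5),   # low novelty drags its share down
}))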

Rewards only flow to Accepted datasets; malicious submissions can be slashed (bond + future eligibility).


Updates & versioning

  • Use semver for dataset versions (vMAJOR.MINOR.PATCH).

  • New versions → new archive and new content hash; register as a new dataset and reference the prior one in your manifest.

  • Deprecations are recorded on‑chain (status → deprecated); artifacts remain pinned for auditability.


Retractions & takedowns

  • Submit a tombstone request if you discover a licensing/privacy issue; governance may flag a dataset to deny future use.

  • Note that IPFS/Arweave are decentralized—full removal is not guaranteed. The protocol can block usage and deny rewards going forward.

  • A DMCA or legal takedown can be attached to the dataset record for auditability.


Common pitfalls

  • License mismatch between manifest and files.

  • Datasets with high duplication vs. the accepted corpus.

  • PII not fully redacted (names/emails in text, EXIF in images).

  • Archive hash doesn’t match the registered contentHash.
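
The last pitfall is cheap to catch before (or after) registering. A small verification sketch:

import hashlib

def verify_archive(path: str, registered_hex: str) -> bool:
    # Recompute the archive SHA-256 and compare with the registered
    # contentHash (accepts the hash with or without a 0x prefix).
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == registered_hex.lower().removeprefix("0x")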


Pre‑flight checklist

  • License is on the allowed list and declared in the manifest (SPDX or link).

  • PII removed or masked; consent declaration included if people are identifiable.

  • Manifest complete: license, size, modality, languages, provenance, consent.

  • Archive packed, hashed, and pinned; archive_sha256 matches the contentHash you register.

  • Committee validation sample included; bond ready to post.