# Contributing Data
Methexis rewards people who contribute useful, lawful, and high‑quality datasets for training open‑source AI models. Data is screened by an automated pipeline and stake‑weighted committees; accepted datasets earn rewards when they are used in training rounds.

Principles: consent & compliance → quality → transparency. The chain records hashes, status, and rewards; the data itself is stored off‑chain (IPFS/Arweave).
## What kinds of data?

Supported modalities (initially):

- Text (natural language, documentation, code snippets)
- Code (repos, problems/solutions, test cases)
- Images (with licenses and model‑suitable usage rights)
- Audio (transcripts, speech, effects; with consent)
- Synthetic data (must disclose provenance and generator)

Start narrow if needed (e.g., text + code) and expand by governance vote.
## What not to upload

- Personal data without consent (names, emails, phone numbers, addresses, medical or financial records)
- Content restricted by terms of service or non‑redistributable licenses
- Copyrighted material without an explicit license permitting ML training and redistribution
- CSAM, violent/graphic abuse, hate speech, or harassment datasets
- Hidden malware, tracking pixels, or executable payloads
- Mass‑duplicated web scrapes with little originality (these fail dedup/novelty checks)

Violations may lead to slashing of the submitter's bond and blacklisting of associated keys.
## Licensing & consent

Choose an allowed license when you submit. Examples:

- CC‑BY 4.0, CC‑BY‑SA 4.0, CC0, MIT/Apache‑2.0 (for code)
- OpenRAIL (where terms allow training and redistribution)

Your manifest must include:

- the license identifier (SPDX or link),
- the rights statement (you own it / have permission),
- any attribution requirements,
- and a consent declaration if people are identifiable (or an explicit statement that PII has been removed/anonymized).
## Data hygiene & PII

Before uploading:

- Remove or mask PII via automated NER + regex (emails, SSNs, phone numbers); a minimal redaction sketch follows this list.
- Strip EXIF from images; blur faces if needed.
- Normalize encodings (UTF‑8), line endings, and file paths.
- Provide dedup‑resistant chunking (e.g., minhash/simhash) so novelty can be scored.
- Include a small validation sample for the committee to review quickly.
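For the regex half of PII masking, a minimal sketch (the patterns and placeholder format are illustrative; production redaction should pair regexes with an NER model for names and addresses):

```python
import re

# Illustrative patterns only; tune and extend for your data.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-867-5309."))
# -> Contact [EMAIL_REDACTED] or [PHONE_REDACTED].
```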
## Submission flow

1. **Prepare the dataset.** Organize files (e.g., `data/`) and include a manifest and an attribution file if required.
2. **Create the manifest** (see the spec below): a single YAML/JSON file describing license, size, modality, languages, provenance, and consent.
3. **Package & hash.** Create a `.tar.zst` (or `.tar.gz`) archive and compute the SHA‑256 over it; this becomes your `contentHash`.
4. **Upload to IPFS/Arweave.** Pin via your node or a pinning service; note the CID or transaction ID.
5. **Register on‑chain.** Call `registerDataset(contentHash, metaURI)`, where `metaURI` points to the manifest (e.g., `ipfs://<manifestCID>`). Post the data bond (see below).
6. **Validation.** Automated filters run first; then a randomized committee reviews samples, the license, novelty, and policy adherence. The committee publishes `DatasetStatusChanged(datasetId, status)`.
7. **Rewards.** If Accepted, the dataset is eligible for rewards when used in training rounds; use is measured via the training pipeline (see Reward Model).
## Data bond (anti‑spam)

To discourage spam, each dataset registration posts a refundable bond. Proposed formula: `bond = base + k * sizeGB` (e.g., base 50 MTHX + 10 MTHX/GB).

- If Accepted → the bond is returned.
- If Rejected for a policy/licensing violation or clear junk → the bond is burned or partially slashed.
- If Rejected as a borderline quality call → the bond is returned minus a review fee (DAO‑tunable).
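A worked example of the proposed formula, using the illustrative parameters above (both are DAO‑tunable, not fixed protocol constants):

```python
# Proposed bond formula: bond = base + k * sizeGB.
# base = 50 MTHX and k = 10 MTHX/GB are the example values from the text.
BASE_MTHX = 50.0
K_MTHX_PER_GB = 10.0

def data_bond(size_bytes: int) -> float:
    """Refundable registration bond in MTHX for an archive of the given size."""
    size_gb = size_bytes / 1e9
    return BASE_MTHX + K_MTHX_PER_GB * size_gb

# The ~129 MB archive from the manifest example below:
print(f"{data_bond(128_734_987):.1f} MTHX")  # -> 51.3 MTHX
```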
## Manifest specification

A minimal, machine‑readable record (YAML example):

```yaml
# ipfs://bafy.../manifest.yaml
dataset:
  name: "CleanCodeDocs-v1"
  version: "1.0.0"
  modality: ["text", "code"]
  languages: ["en"]
  size_bytes: 128734987
  files: 1203
  archive_sha256: "0xA3F5...7C1E"
  archive_uri: "ipfs://bafy.../cleancode-docs-v1.tar.zst"
provenance:
  created_by: "0xSUBMITTER"
  collected_at: "2025-10-15T12:20:00Z"
  source_types: ["original_docs", "license-permitted_repos"]
  synthetic: false
  notes: "Docs curated from repos with permissive licenses. Attributions included."
license:
  spdx: "CC-BY-4.0"
  attribution_required: true
  attribution_file: "ipfs://bafy.../ATTRIBUTION.txt"
  redistribution_ok: true
  training_ok: true
consent_and_safety:
  contains_pii: false
  pii_process: "ner+regex redaction"
  adult_content: false
  safety_notes: "removed EXIF; normalized encoding"
quality_signals:
  novelty_hash: "simhash:7f3a-..."
  duplicate_rate_estimate: 0.08  # lower is better
  sample_cid: "ipfs://bafy.../committee-sample.jsonl"
metadata_uri: "ipfs://bafy.../README.md"
```
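Before registering, it is worth validating the manifest locally. A sketch using PyYAML (the required‑key list mirrors the example above and is an assumption, not a published schema):

```python
import yaml  # pip install pyyaml

# Illustrative minimum, mirroring the example manifest; not the canonical schema.
REQUIRED = {
    "dataset": ["name", "version", "modality", "archive_sha256", "archive_uri"],
    "license": ["spdx", "redistribution_ok", "training_ok"],
    "consent_and_safety": ["contains_pii"],
}

def check_manifest(path: str) -> list[str]:
    """Return a list of missing sections/keys; empty means the basics are present."""
    with open(path, encoding="utf-8") as f:
        manifest = yaml.safe_load(f)
    errors = []
    for section, keys in REQUIRED.items():
        block = manifest.get(section) or {}
        if not block:
            errors.append(f"missing section: {section}")
            continue
        errors += [f"{section}.{key} missing" for key in keys if key not in block]
    return errors

problems = check_manifest("manifest.yaml")
print("OK" if not problems else "\n".join(problems))
```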
## CLI examples (illustrative)

```bash
# Pack & hash
tar -I 'zstd -19 -T0' -cf cleancode-docs-v1.tar.zst data/
sha256sum cleancode-docs-v1.tar.zst

# Pin to IPFS (local node)
ipfs add cleancode-docs-v1.tar.zst
ipfs add manifest.yaml

# Register on-chain (placeholder CLI)
mhx data register \
  --content-hash 0xA3F5...7C1E \
  --meta-uri ipfs://bafy.../manifest.yaml \
  --bond 80 \
  --from 0xSUBMITTER
```
## Validation criteria

Automated pre‑filters:

- License & policy checks against the manifest, plus heuristics on file contents
- Format checks (parsing, encoding, schema / filetype allow‑list)
- Anomaly checks (PII leakage, profanity if disallowed, adversarial content)
- Dedup/novelty scoring (simhash/minhash, overlap with accepted corpora); a simhash sketch follows this list
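To make the novelty signal concrete, a minimal simhash sketch (the whitespace tokenizer and 64‑bit fingerprint are illustrative; the pipeline's actual fingerprinting scheme is not specified here):

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit simhash over whitespace tokens (illustrative fingerprint)."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Bit distance between fingerprints; near-duplicates land within a few bits."""
    return (a ^ b).bit_count()

a = simhash64("clean code documentation for parsing utilities")
b = simhash64("clean code documentation for parser utilities")
print(hamming(a, b))  # small distance -> likely near-duplicate
```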
Committee review (stake‑weighted):

- Sampled quality inspection, license verification, novelty sanity check
- Final vote sets Accepted or Rejected on‑chain
## Reward model (proposed)

Let $R_d$ be the reward attributable to datasets in a round. Each accepted dataset $i$ earns:

$$\text{reward}_i = R_d \cdot \frac{\alpha \cdot \text{usage}_i + \beta \cdot \text{novelty}_i + \gamma \cdot \text{quality}_i}{\sum_j \left(\alpha \cdot \text{usage}_j + \beta \cdot \text{novelty}_j + \gamma \cdot \text{quality}_j\right)}$$

- **usage**: actual tokens/batches consumed from the dataset
- **novelty**: uniqueness vs. the corpus (higher = rarer content)
- **quality**: committee score (readability, cleanliness, license clarity)
- Default weights: α = 0.6, β = 0.25, γ = 0.15 (governed)

Rewards flow only to Accepted datasets; malicious submissions can be slashed (bond plus future eligibility).
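The allocation is a weighted pro‑rata split, so it can be computed directly. A minimal sketch with the default weights (the dataset names and scores are made up for illustration, and the [0, 1] score normalization is an assumption):

```python
# Default governed weights (alpha, beta, gamma) from above.
ALPHA, BETA, GAMMA = 0.60, 0.25, 0.15

def dataset_rewards(r_d: float, scores: dict[str, tuple[float, float, float]]) -> dict[str, float]:
    """Split r_d across accepted datasets by weighted (usage, novelty, quality).

    Assumes the three signals are pre-normalized to comparable [0, 1] scales;
    the pipeline's actual normalization is not specified here.
    """
    weighted = {
        name: ALPHA * usage + BETA * novelty + GAMMA * quality
        for name, (usage, novelty, quality) in scores.items()
    }
    total = sum(weighted.values())
    return {name: r_d * w / total for name, w in weighted.items()}

# Hypothetical round: 10,000 MTHX attributable to two accepted datasets.
print(dataset_rewards(10_000, {
    "CleanCodeDocs-v1": (0.7, 0.9, 0.8),  # heavily used, novel, clean
    "WebDocs-v3":       (0.4, 0.2, 0.5),  # less used, low novelty
}))  # -> roughly 6770 and 3230 MTHX
```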
## Updates & versioning

- Use semver for dataset versions (`vMAJOR.MINOR.PATCH`).
- New versions require a new archive and a new content hash; register as a new dataset and reference the prior one in your manifest (see the snippet below).
- Deprecations are recorded on‑chain (status → deprecated); artifacts remain pinned for auditability.
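One way to point a new version at its predecessor in the manifest (the `supersedes` field is a hypothetical convention, not part of the spec above):

```yaml
dataset:
  name: "CleanCodeDocs-v1"
  version: "1.1.0"                      # new archive, new content hash
provenance:
  supersedes: "ipfs://bafy.../manifest-v1.0.0.yaml"  # hypothetical field: prior version's manifest
```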
## Retractions & takedowns

- Submit a tombstone request if you discover a licensing or privacy issue; governance may flag the dataset to deny future use.
- IPFS/Arweave are decentralized, so full removal is not guaranteed; the protocol can block usage and deny rewards going forward.
- A DMCA or other legal takedown notice can be attached to the dataset record for auditability.
## Common pitfalls

- License mismatch between the manifest and the files.
- High duplication versus the already‑accepted corpus.
- PII not fully redacted (names/emails in text, EXIF in images).
- An archive hash that doesn't match the registered `contentHash`; verify locally before registering, as in the sketch below.
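A quick local check that catches the hash‑mismatch pitfall before `registerDataset` is called (the file name follows the CLI example; the `0x` prefix matches the manifest's convention):

```python
import hashlib

def archive_sha256(path: str) -> str:
    """SHA-256 of the packed archive, streamed so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "0x" + h.hexdigest()

digest = archive_sha256("cleancode-docs-v1.tar.zst")
print(digest)
# Compare against archive_sha256 in manifest.yaml and against the
# contentHash you pass to registerDataset before posting the bond.
```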
## Pre‑flight checklist

- [ ] License chosen from the allowed list and recorded in the manifest (SPDX)
- [ ] PII removed or masked; consent declared where people are identifiable
- [ ] Manifest complete: license, size, modality, languages, provenance, consent
- [ ] Archive packed; SHA‑256 computed and matching the manifest's `archive_sha256`
- [ ] Archive and manifest pinned to IPFS/Arweave; CIDs noted
- [ ] Committee validation sample included
- [ ] Data bond ready to post with `registerDataset`