Running a Validator

Methexis supports two validator roles:

  1. Training Validators (compute providers) – run training jobs, co‑sign model updates, and submit them on‑chain.

  2. Data Validators (validation workers) – screen datasets with automated checks and stake‑weighted committee voting.

Both roles earn $MTHX; both can be slashed for misbehavior or chronic downtime.


Hardware & OS Requirements

These are recommended baselines for testnet/mainnet. You can run bigger machines for higher rewards. Values may be tuned via governance.

Training Validator (GPU)

  • GPU: NVIDIA 24 GB+ VRAM (e.g., RTX 3090/4090, A5000/A6000). Multiple GPUs supported.

  • CPU: 8+ cores (x86_64).

  • RAM: 32–64 GB.

  • Storage: 1 TB NVMe SSD (model checkpoints + cache).

  • Network: ≥ 100 Mbps up/down, low jitter; public IP or properly forwarded ports.

  • OS: Ubuntu 22.04 LTS or 24.04 LTS.

  • Drivers: NVIDIA driver + CUDA/cuDNN matching the container image.
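
Before joining, it is worth verifying the GPU stack end to end: run nvidia-smi on the host, then inside a CUDA container via the NVIDIA Container Toolkit (the CUDA tag below is illustrative; match it to the validator image):

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi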

Data Validator (CPU)

  • CPU: 4+ cores.

  • RAM: 16 GB.

  • Storage: 200–500 GB SSD (dataset cache + logs).

  • Network: ≥ 50 Mbps up/down.

  • OS: Ubuntu 22.04/24.04 LTS.

Optional: co‑located IPFS node with generous pinning space improves latency for both roles.
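
If you run a local node, a minimal Kubo setup looks like this (ports and the /data/ipfs volume are Kubo's Docker defaults; size the volume for your pinning needs):

docker run -d --name ipfs \
  -v /opt/ipfs:/data/ipfs \
  -p 4001:4001 \
  -p 8080:8080 \
  ipfs/kubo:latest

Point ipfs.gateway in config.yaml at http://localhost:8080 to use it.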


Software Prerequisites

  • Docker or another OCI‑compatible runtime (recommended for reproducibility).

  • NVIDIA Container Toolkit (for GPU nodes).

  • Git (for pulling configs).

  • Systemd (or another process supervisor) for reliability.

  • NTP/chrony for accurate time sync (prevents consensus issues).
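
On Ubuntu, most of the above is available from the standard repositories (the NVIDIA Container Toolkit requires NVIDIA's apt repo; follow their install guide). Verify time sync once chrony is running:

sudo apt-get update && sudo apt-get install -y docker.io git chrony
chronyc tracking    # "Leap status : Normal" and a small offset indicate healthy sync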


Keys & Wallet Safety

  • Create a dedicated validator wallet. Keep your treasury/cold funds separate.

  • Prefer a hardware wallet or remote signing.

  • Never paste seed phrases into terminal sessions on shared machines.

  • Back up the validator keystore + configs (encrypted) and store off‑site.
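
One way to produce an encrypted backup ready for off‑site storage (symmetric GPG shown for simplicity; substitute your own key management):

tar czf - /opt/methexis/config | gpg --symmetric --cipher-algo AES256 \
  -o methexis-config-$(date +%F).tar.gz.gpg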


Quick Start (Testnet)

Below are example commands with placeholder names. Replace hostnames and image references with the published values once the official repositories are available.

1) Pull the validator image

# Training validator (GPU)
docker pull ghcr.io/methexisxyz/mhx-training-validator:latest

# Data validator (CPU)
docker pull ghcr.io/methexisxyz/mhx-data-validator:latest

2) Create a working directory

mkdir -p /opt/methexis/{config,data,logs}
cd /opt/methexis

3) Generate or import validator keys

# Example CLI (placeholder)
docker run --rm -it -v "$PWD":/work ghcr.io/methexisxyz/mhx-training-validator:latest \
  mhx keys import --keystore /work/config/validator.json
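
After importing, restrict the keystore so only the service user can read it:

chmod 600 /opt/methexis/config/validator.json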

4) Configure the node

Create config.yaml:

# /opt/methexis/config/config.yaml
role: training            # training | data
network: testnet
eth_rpc: "https://rpc.testnet.example.org"
staking_contract: "0xTEST_STAKING"
rewards_contract: "0xTEST_REWARDS"
data_registry_contract: "0xTEST_DATA"
validator_key: "/opt/methexis/config/validator.json"
ipfs:
  gateway: "https://ipfs.io"
  pin: true
compute:
  gpus: "all"             # or "0,1"
  mixed_precision: true
metrics:
  enable: true
  listen_addr: "0.0.0.0:9100"

5) Stake test MTHX (Proposed)

# illustrative CLI – will be replaced with final tool
mhx stake --rpc $ETH_RPC --amount 100000 --from <validator-address>

6) Start the service

Docker (recommended)

docker run -d --name mhx-validator \
  --gpus all \
  -v /opt/methexis/config:/app/config \
  -v /opt/methexis/data:/app/data \
  -v /opt/methexis/logs:/app/logs \
  -p 9100:9100 \
  ghcr.io/methexisxyz/mhx-training-validator:latest \
  mhx validator start --config /app/config/config.yaml
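
Confirm the container is healthy and the metrics endpoint answers (the /metrics path follows the usual Prometheus convention; adjust if the image differs):

docker logs -f mhx-validator                    # Ctrl-C to detach
curl -s http://localhost:9100/metrics | head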

systemd (optional)

# /etc/systemd/system/mhx-validator.service
[Unit]
Description=Methexis Validator
Wants=network-online.target
After=network-online.target docker.service

[Service]
Restart=always
RestartSec=5
Environment=CONFIG=/opt/methexis/config/config.yaml
ExecStart=/usr/bin/docker run --rm --name mhx-validator --gpus all \
  -v /opt/methexis/config:/app/config \
  -v /opt/methexis/data:/app/data \
  -v /opt/methexis/logs:/app/logs \
  -p 9100:9100 \
  ghcr.io/methexisxyz/mhx-training-validator:latest \
  mhx validator start --config ${CONFIG}

[Install]
WantedBy=multi-user.target

Enable and start it:

systemctl daemon-reload
systemctl enable --now mhx-validator
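
To follow logs when running under systemd:

journalctl -u mhx-validator -f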

Operating Modes

Training Validator

  • Subscribes to round events, fetches latest approved datasets & checkpoint, executes training, signs the update, and submits the result on‑chain.

  • Parameters you can tune: batch_size, accumulation_steps, mixed_precision, gpu_ids, max_job_runtime (see the config sketch after this section).

Data Validator

  • Pulls pending datasets, runs license/format/anomaly/duplication checks, participates in stake‑weighted committee votes, and writes results back (Accepted/Rejected).

  • Parameters you can tune: max_filesize, accepted_mime_types, policy_ruleset, vote_quorum (see the config sketch below).
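
A combined sketch of the tunables above as they might appear in config.yaml (key names mirror the parameter lists; the exact schema may differ from the earlier example, e.g. gpus vs. gpu_ids):

# Training validator tuning (illustrative)
compute:
  gpu_ids: "0,1"
  batch_size: 16
  accumulation_steps: 4
  mixed_precision: true
  max_job_runtime: 3600          # seconds

# Data validator tuning (illustrative)
validation:
  max_filesize: "10GB"
  accepted_mime_types: ["text/plain", "application/json"]
  policy_ruleset: "default"
  vote_quorum: 0.66              # fraction of committee stake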


Monitoring & Metrics

  • Built‑in Prometheus metrics at :9100 by default.

  • Recommended dashboard: CPU/GPU usage, VRAM, I/O, latency to IPFS gateway, round participation, success rate.

  • Logs: JSON to stdout; send to Loki/ELK or journald.

  • Health checks: mhx status (peer count, last round, signatures submitted).
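
A minimal Prometheus scrape config for the built‑in endpoint (job name is arbitrary):

# prometheus.yml (excerpt)
scrape_configs:
  - job_name: mhx-validator
    static_configs:
      - targets: ["localhost:9100"]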

SLO targets (suggested minimums):

  • Uptime: ≥ 97% over a rolling 30‑day window.

  • Participation: ≥ 90% of eligible rounds.

  • Attestation agreement: ≥ 99% (training results within consensus bounds).

Falling below SLOs risks slashing once parameters are finalized.


Rewards: How You Earn

Let R be the round reward (in MTHX) after any maintenance skim.

  • Training Validators: receive ~58% of R, split pro rata by verified compute share.

  • Data Providers: ~35% allocated across accepted datasets.

  • Validation Committees: ~7% split by honest participation.

Example: If your validator contributes 15% of the verified compute for a round and R = 10,000 MTHX, you earn ≈ 0.58 * 0.15 * 10000 = 870 MTHX.

(Exact splits and formulas are governed; values above are current targets.)
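
A quick back‑of‑the‑envelope helper for the example above (0.58 is the current training‑validator target; adjust if governance changes the split):

awk -v R=10000 -v share=0.15 -v split=0.58 \
  'BEGIN { printf "%.2f MTHX\n", split * share * R }'    # → 870.00 MTHX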


Slashing & Risk Management

You may be slashed for:

  • Submitting invalid or dishonest computation (training validators).

  • Colluding or voting against evidence in committee decisions (data validators).

  • Chronic downtime or missing required attestations.

  • Double‑signing / equivocating during a round.

Mitigations:

  • Run a sentry architecture (expose a public node; keep the validator behind a firewall).

  • Use remote signer / hardware wallet; isolate keys from the worker.

  • Configure auto‑shutdown if time sync or GPU health checks fail (see the watchdog sketch after this list).

  • Keep NTP/chrony active; monitor clock drift.

  • Maintain reliable power/UPS and redundant network links where possible.
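
A minimal watchdog sketch for the auto‑shutdown mitigation above (thresholds and checks are illustrative; run it from cron or a systemd timer):

#!/usr/bin/env bash
# mhx-watchdog.sh – stop the validator if clock drift or GPU health checks fail
set -euo pipefail

MAX_DRIFT=0.5   # seconds

# chronyc reports "System time : X seconds fast/slow of NTP time"
drift=$(chronyc tracking | awk '/System time/ {print $4}')
if awk -v d="$drift" -v m="$MAX_DRIFT" 'BEGIN { exit !(d > m) }'; then
  echo "clock drift ${drift}s exceeds ${MAX_DRIFT}s; stopping validator"
  docker stop mhx-validator
  exit 1
fi

if ! nvidia-smi > /dev/null 2>&1; then
  echo "GPU health check failed; stopping validator"
  docker stop mhx-validator
  exit 1
fi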


Upgrades

  • Container images are versioned (semver).

  • Always drain before upgrade:

mhx validator drain --grace 300   # stop taking new jobs, finish current one
docker pull ghcr.io/...:vX.Y.Z
docker stop mhx-validator && docker rm mhx-validator
# start again with the new image

  • Read release notes for any config schema changes.


Troubleshooting

  • High GPU memory usage: lower batch_size or enable mixed_precision.

  • Slow downloads: switch ipfs.gateway to a closer mirror; run your own IPFS node.

  • Frequent timeouts: check ISP packet loss/jitter; tune max_job_runtime.

  • Attestation mismatch: compare container versions; purge cache; re‑sync to the latest checkpoint.

  • RPC failures: fail over to a secondary eth_rpc endpoint.
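
A simple liveness probe before failing over (standard Ethereum JSON‑RPC; ETH_RPC_BACKUP is a hypothetical secondary endpoint):

curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ETH_RPC" || echo "primary down – switch eth_rpc to $ETH_RPC_BACKUP"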


Unstaking & Exit

  • Unbonding period (Proposed): 14–28 days.

  • Exit steps: mhx validator drain → wait for round completion → mhx unstake → halt the service after the unbonding period.
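
The exit steps above as a command sequence (flags follow the illustrative mhx CLI used throughout and may change):

mhx validator drain --grace 300                        # stop taking new work
mhx status                                             # wait for the current round to complete
mhx unstake --rpc $ETH_RPC --from <validator-address>
# after the unbonding period elapses:
docker stop mhx-validator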


Security Best Practices (Checklist)
