Running a Validator

Methexis supports two validator roles:

  1. Training Validators (compute providers) – run training jobs, co‑sign model updates, and submit them on‑chain.

  2. Data Validators (validation workers) – screen datasets with automated checks and stake‑weighted committee voting.

Both roles earn $MTHX; both can be slashed for misbehavior or chronic downtime.


Hardware & OS Requirements

These are recommended baselines for testnet and mainnet. Larger machines can earn more by contributing a larger compute share. Values may be tuned via governance.

Training Validator (GPU)

  • GPU: NVIDIA 24 GB+ VRAM (e.g., RTX 3090/4090, A5000/A6000). Multiple GPUs supported.

  • CPU: 8+ cores (x86_64).

  • RAM: 32–64 GB.

  • Storage: 1 TB NVMe SSD (model checkpoints + cache).

  • Network: ≥ 100 Mbps up/down, low jitter; public IP or properly forwarded ports.

  • OS: Ubuntu 22.04 LTS or 24.04 LTS.

  • Drivers: NVIDIA driver + CUDA/cuDNN matching the container image.
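Before registering, it is worth confirming that both the driver stack and the container runtime can see the GPU. A quick sanity check (the CUDA image tag here is only an example; match it to the validator image you run):

```shell
# Report GPU model, driver version, and VRAM.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

# Confirm the container runtime has GPU access (requires the NVIDIA Container Toolkit).
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```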

Data Validator (CPU)

  • CPU: 4+ cores.

  • RAM: 16 GB.

  • Storage: 200–500 GB SSD (dataset cache + logs).

  • Network: ≥ 50 Mbps up/down.

  • OS: Ubuntu 22.04/24.04 LTS.

Optional: a co‑located IPFS node with generous pinning space improves latency for both roles.


Software Prerequisites

  • Docker or another OCI‑compatible runtime (recommended for reproducibility).

  • NVIDIA Container Toolkit (for GPU nodes).

  • Git (for pulling configs).

  • Systemd (or another process supervisor) for reliability.

  • NTP/chrony for accurate time sync (prevents consensus issues).
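To verify that time sync is actually healthy (chrony shown; systemd-timesyncd has an equivalent check):

```shell
# Leap status should be "Normal" and the system-time offset small (milliseconds).
chronyc tracking | grep -E 'Leap status|System time'

# On systemd-timesyncd hosts instead:
timedatectl show --property=NTPSynchronized --value   # prints "yes" when synced
```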


Keys & Wallet Safety

  • Create a dedicated validator wallet. Keep your treasury/cold funds separate.

  • Prefer hardware wallet or remote signing.

  • Never paste seed phrases into terminal sessions on shared machines.

  • Back up the validator keystore + configs (encrypted) and store off‑site.


Quick Start (Testnet)

Below are example commands with placeholder names. Replace hostnames and image names with the published values once the repositories are public.

1) Pull the validator image
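For example (registry and tag are placeholders, as noted above):

```shell
docker pull registry.example.com/methexis/validator:testnet-latest
```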

2) Create a working directory
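A simple layout that the examples below assume:

```shell
mkdir -p ~/methexis/keys ~/methexis/data ~/methexis/logs
cd ~/methexis
```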

3) Generate or import validator keys
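A sketch using the mhx CLI; the subcommand names and flags are assumptions until the CLI is published:

```shell
# Generate a fresh validator key (encrypted keystore)...
mhx keys generate --keystore ~/methexis/keys/validator.json

# ...or import an existing backup.
mhx keys import --file /path/to/backup/validator.json
```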

4) Configure the node

Create config.yaml:
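A minimal example; field names are illustrative and the endpoints are placeholders:

```yaml
role: training                  # or: data
network: testnet
eth_rpc: https://rpc.testnet.example.org
ipfs:
  gateway: https://ipfs.example.org
keys:
  keystore: /home/ubuntu/methexis/keys/validator.json
metrics:
  listen: 0.0.0.0:9100          # Prometheus endpoint (see Monitoring & Metrics)
training:
  gpu_ids: [0]
  batch_size: 8
  mixed_precision: true
```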

5) Stake test MTHX (Proposed)
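For example (the amount, flags, and faucet flow are placeholders for the proposed design):

```shell
# Request test MTHX from the testnet faucet first, then bond it:
mhx stake --amount 1000 --role training --keystore ~/methexis/keys/validator.json
```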

6) Start the service

Docker (recommended)
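A sketch (image name is a placeholder; drop --gpus all for data validators):

```shell
docker run -d --name mhx-validator \
  --gpus all \
  --restart unless-stopped \
  -p 9100:9100 \
  -v ~/methexis:/data \
  registry.example.com/methexis/validator:testnet-latest \
  --config /data/config.yaml
```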

systemd (optional)
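A minimal unit that supervises the container above (service and container names are assumptions):

```ini
# /etc/systemd/system/mhx-validator.service
[Unit]
Description=Methexis validator
After=network-online.target docker.service
Requires=docker.service

[Service]
ExecStart=/usr/bin/docker start -a mhx-validator
ExecStop=/usr/bin/docker stop -t 30 mhx-validator
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now mhx-validator`.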


Operating Modes

Training Validator

  • Subscribes to round events, fetches latest approved datasets & checkpoint, executes training, signs the update, and submits the result on‑chain.

  • Parameters you can tune: batch_size, accumulation_steps, mixed_precision, gpu_ids, max_job_runtime.
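As a config sketch (values are illustrative starting points, not tuned recommendations):

```yaml
training:
  batch_size: 8
  accumulation_steps: 4     # effective batch = batch_size * accumulation_steps
  mixed_precision: true     # reduces VRAM pressure
  gpu_ids: [0, 1]
  max_job_runtime: 3600     # seconds; jobs exceeding this are abandoned
```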

Data Validator

  • Pulls pending datasets, runs license/format/anomaly/duplication checks, participates in stake‑weighted committee votes, and writes results back (Accepted/Rejected).

  • Parameters you can tune: max_filesize, accepted_mime_types, policy_ruleset, vote_quorum.
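As a config sketch (values illustrative):

```yaml
validation:
  max_filesize: 2147483648                      # bytes (2 GiB)
  accepted_mime_types: [text/csv, application/json, text/plain]
  policy_ruleset: default
  vote_quorum: 0.66                             # fraction of committee stake
```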


Monitoring & Metrics

  • Built‑in Prometheus metrics at :9100 by default.

  • Recommended dashboard: CPU/GPU usage, VRAM, I/O, latency to IPFS gateway, round participation, success rate.

  • Logs: JSON to stdout; send to Loki/ELK or journald.

  • Health checks: mhx status (peer count, last round, signatures submitted).
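A matching Prometheus scrape job (job and host names assumed):

```yaml
scrape_configs:
  - job_name: mhx-validator
    static_configs:
      - targets: ["validator-host:9100"]
```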

SLO targets (suggested minimums):

  • Uptime: ≥ 97% over a rolling 30‑day window.

  • Participation: ≥ 90% of eligible rounds.

  • Attestation agreement: ≥ 99% (training results within consensus bounds).

Falling below SLOs risks slashing once parameters are finalized.


Rewards: How You Earn

Let R be the round reward (in MTHX) after any maintenance skim.

  • Training Validators: ~58% of R, allocated by verified compute share.

  • Data Providers: ~35% of R, allocated across accepted datasets.

  • Validation Committees: ~7% of R, split by honest participation.

Example: If your validator contributes 15% of the verified compute for a round and R = 10,000 MTHX, you earn ≈ 0.58 * 0.15 * 10000 = 870 MTHX.
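The same arithmetic as a shell one-liner (awk for floating point):

```shell
# reward = pool_split * compute_share * round_reward
awk -v r=10000 -v share=0.15 -v pool=0.58 \
  'BEGIN { printf "%.0f MTHX\n", pool * share * r }'   # prints: 870 MTHX
```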

(Exact splits and formulas are governed; values above are current targets.)


Slashing & Risk Management

You may be slashed for:

  • Submitting invalid or dishonest computation (training validators).

  • Colluding or voting against evidence in committee decisions (data validators).

  • Chronic downtime or missing required attestations.

  • Double‑signing / equivocating during a round.

Mitigations:

  • Run a sentry architecture (expose public sentry nodes; keep the validator itself behind a firewall).

  • Use remote signer / hardware wallet; isolate keys from the worker.

  • Configure auto‑shutdown if time sync or GPU health checks fail.

  • Keep NTP/chrony active; monitor clock drift.

  • Maintain reliable power/UPS and redundant network links where possible.


Upgrades

  • Container images are versioned (semver).

  • Always drain before upgrading.

  • Read release notes for any config schema changes.
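A drain-then-upgrade sequence might look like this (image tag and CLI flags are assumptions):

```shell
mhx validator drain                 # stop accepting new rounds
mhx status                          # wait until the in-flight round completes
docker pull registry.example.com/methexis/validator:testnet-v1.3.0
docker stop mhx-validator && docker rm mhx-validator
# re-run the `docker run` command from Quick Start with the new tag
```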


Troubleshooting

  • High GPU memory usage: lower batch_size or enable mixed_precision.

  • Slow downloads: switch ipfs.gateway to a closer mirror; run your own IPFS node.

  • Frequent timeouts: check ISP packet loss/jitter; tune max_job_runtime.

  • Attestation mismatch: compare container versions; purge cache; re‑sync to the latest checkpoint.

  • RPC failures: fail over to a secondary eth_rpc endpoint.


Unstaking & Exit

  • Unbonding period (Proposed): 14–28 days.

  • Exit steps: mhx validator drain → wait for round completion → mhx unstake → halt the service after the unbonding period.
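As commands (flags are assumptions for the proposed flow):

```shell
mhx validator drain
mhx status                          # confirm the final round has completed
mhx unstake --amount all
# after the unbonding period elapses:
sudo systemctl disable --now mhx-validator
```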


Security Best Practices (Checklist)
