Kshitij Tyagi

01 Thesis

Kshitij Tyagi · Software Engineer — Trading Automation & Real-Time Systems

I build systems that assume failure is the default state — latency budgets, recovery paths, and confirmation pipelines come before features.

Most of my work runs where being slow is the same as being wrong: trading on Solana, real-time sensor streams, high-volume data pipelines. The interesting engineering isn't the happy path — it's what the system does when the market spikes, the RPC degrades, or a worker dies mid-order.

/ The claim, and where it’s already shipped

A trade isn't done when you send it.
Multi-stage confirmation on EZ-Wallet: gRPC → RPC fallback → timeout watchdog, so an order is never reported settled until it provably is.
Throughput means nothing without ordering.
Tuned Kafka consumer groups for a 5M+ msg/day pipeline, lifting throughput ~30% without breaking the ordering the ML models depend on.
Recovery is a feature, not an afterthought.
Event-driven 6-worker daemon with simulation mode and liquidity-rug protection — the kill switches were designed in, not bolted on.

02 Trajectory

How I got to failure-first — the moments a belief changed, in order.

  1. T-00 · Apr 2024 · Systaldyn Consultancy · Full-Stack Web Developer

    Shipping features that worked on my machine

    before: I optimized for shipping: build the dashboard, wire the API, make the demo work. Reliability was something you added if a bug forced you to.

    shift: Real users and real data showed up. 'Works' and 'keeps working under load' turned out to be completely different problems.

  2. T-01 · 2024 · Systaldyn · Kafka data pipeline

    Throughput was the goal — until ordering broke the ML

    before: More messages per second was the only number I watched. Scale the consumers, push the graph up.

    shift: At 5M+ points/day, raw throughput started corrupting downstream ML because event order wasn't guaranteed. I learned to design the consumer, not just feed it.

    5M+ msgs / day
  3. T-02 · 2025 · PHAROS · real-time power & thermal monitoring

    Real-time isn't a UI problem

    before: I thought 'real-time' meant making the chart update fast: a front-end concern with WebSockets and animation.

    shift: Operators were making safety decisions on this data across live / simulation / maintenance modes. Sub-second wasn't a polish target; stale or mode-bled data was a hazard. Correctness moved to the center.

    50+ sensor streams
  4. T-03 · Dec 2025 · Spizen Technologies · Software Engineer

    Latency became the product

    before: Performance was a thing you profiled later, once features were in.

    shift: On Solana, the market moves inside a ~400ms slot and does not wait. A late order is a wrong order. I started designing backward from the millisecond that mattered and defending it through every layer.

    <1s hot-path SLA
  5. T-04 · Now · How I build today

    Failure-first by default

    before: Earlier, the happy path was the design and failure handling was the patch.

    shift: Now the failure modes are the design. Latency budget, recovery path, and confirmation pipeline get specified before the feature does, across trading, real-time data, and the infra under both.

    ~99% slot-accurate exec

03 Systems

Three systems, read as post-mortems: the problem, the tradeoffs, and what I’d change.

★ Flagship

EZ-Wallet — Solana Trading Automation

Dec 2025 — Mar 2026

Autonomous, risk-managed order execution across 5+ Solana DEXs, built so a degraded RPC or a dead worker doesn't become a missed or doubled trade.

  • Rust
  • Axum
  • Tokio
  • Redis Streams
  • ClickHouse
  • SQLite
  • Kubernetes
  • gRPC
  • Jito

Problem

Run trading strategies 24/7 with no human in the loop, on a chain where the market moves inside a ~400ms slot and RPC endpoints degrade exactly when volatility is highest. A late order is a wrong order; a double-sent order is worse.

Constraints

  • Hot path budget: sub-second end-to-end (~100–250ms typical; exact split still to be confirmed).
  • Solana slot ≈400ms — execution must be slot-accurate, not just 'fast'.
  • RPC reliability drops under the exact volatility the strategies trade.
  • Multi-wallet: encrypted key storage, no cross-wallet bleed.
  • No missed trades and no double-executes — correctness over throughput.

Shape

  1. CREATED
  2. PENDING
  3. CONFIRMED
  4. SETTLED
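
The four stages above can be read as a one-way state machine. This is a minimal sketch, not the EZ-Wallet code, and it assumes an order only ever moves forward one stage at a time. `OrderState`, `ALLOWED`, and `advance` are illustrative names.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = 1
    PENDING = 2
    CONFIRMED = 3
    SETTLED = 4


# Assumed rule: an order may only move to the next stage, never skip or go back.
ALLOWED = {
    OrderState.CREATED: OrderState.PENDING,
    OrderState.PENDING: OrderState.CONFIRMED,
    OrderState.CONFIRMED: OrderState.SETTLED,
}


def advance(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise if the transition is not allowed."""
    if ALLOWED.get(current) is not target:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting a jump such as PENDING to SETTLED in one place is what keeps an unconfirmed order from ever being reported as settled.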

Decisions & tradeoffs

Rust + Axum/Tokio for the backend (not Node)

Predictable tail latency and real concurrency for the daemon and 60+ APIs.

tradeoff: Slower feature velocity and a smaller ecosystem to lean on.

Event-driven 6-worker daemon (not request/response)

Sub-millisecond trigger evaluation across PumpFun, PumpSwap, Raydium, Meteora.

tradeoff: Real coordination/observability complexity, because failures are now distributed.

3-tier data: SQLite + Redis Streams + ClickHouse

Each access pattern (transactional state / event flow / time-series audit) gets the right tool.

tradeoff: Three systems to operate and keep consistent instead of one.

gRPC → RPC fallback → timeout watchdog confirmation

An order is only reported settled when it provably is — false settles are unacceptable.

tradeoff: Added confirmation latency into an already tight budget.
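
A rough sketch of that confirmation order, under stated assumptions: the two probes stand in for a gRPC stream check and an RPC status check, and every name here (`Probe`, `confirm`, the default timings) is hypothetical rather than taken from the real system. The property it illustrates is that an unknown outcome is reported as a timeout, never as a settle.

```python
import time
from typing import Callable

# Hypothetical probe: returns True (confirmed) or False (not yet),
# and raises if the transport itself is unavailable.
Probe = Callable[[str], bool]


def confirm(signature: str, primary: Probe, fallback: Probe,
            timeout_s: float = 1.0, poll_s: float = 0.05,
            clock: Callable[[], float] = time.monotonic,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Return 'CONFIRMED' only on a positive answer, else 'TIMED_OUT'."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        for probe in (primary, fallback):
            try:
                if probe(signature):
                    return "CONFIRMED"
                break  # source answered "not yet"; no need to ask the fallback
            except Exception:
                continue  # source failed; try the next one
        sleep(poll_s)
    return "TIMED_OUT"  # watchdog: unknown is never reported as settled
```

The clock and sleep functions are injected so the watchdog path can be exercised in a test without waiting in real time.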

Outcome

1,000s of orders / hour
~99% slot-accurate exec
<1s hot-path SLA
~0 missed trades, steady state

Incident

◇ Incident report

What happened
During a sharp volatility spike, RPC latency climbed and the daemon's trigger evaluation fell behind the slot clock. A batch of conditional orders evaluated late and missed their intended slot — the system was fast enough on an average minute and not on the worst one.
What it changed
Average-case latency was a lie I'd been telling myself. I moved to budgeting against the worst minute, not the mean: I added RPC-degradation detection with fallback routing, plus queue-aware load shedding ahead of evaluation, so falling behind degrades gracefully instead of silently dropping slots.
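
To make the mean-versus-worst-case point concrete, here is a small sketch with made-up numbers. The latency samples, the 400ms budget used in the comparison, and the `percentile` and `should_shed` helpers are illustrative assumptions, not measurements or code from the incident.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def should_shed(queue_depth, per_item_ms, budget_ms):
    """Shed when the work already queued would overrun the latency budget."""
    return queue_depth * per_item_ms > budget_ms


# Illustrative latencies in ms: mostly quick, with a slow tail during a spike.
latencies_ms = [120] * 95 + [900] * 5
mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

SLOT_BUDGET_MS = 400
mean_fits = mean_ms < SLOT_BUDGET_MS  # the average looks healthy
tail_fits = p99_ms < SLOT_BUDGET_MS   # the tail does not
```

With these samples the mean sits well inside a 400ms slot while the 99th percentile does not, which is the gap a mean-based budget hides.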

What I'd do differently

SQLite for transactional state was the right bet for a single-node start and the wrong one as wallet count grows — I'd reach for Postgres earlier. I'd also have built the volatility load-shedding before the first incident forced it, not after.

PHAROS — Real-Time Power & Thermal Monitoring

Oct 2025 — Dec 2025

24/7 operations console where operators make safety calls on live sensor data — so stale or mode-confused data is a hazard, not a glitch.

  • React
  • D3.js
  • Three.js
  • Highcharts
  • Redux
  • WebSocket
  • MQTT

Problem

Stream 50+ sensor values into an operations UI used around the clock across distinct modes (live, simulation, maintenance, safety). If simulation data ever reads as live, or a value goes stale without saying so, an operator can make the wrong call.

Constraints

  • 50+ concurrent sensor streams over WebSocket + MQTT.
  • Sub-second UI updates, sustained 24×7.
  • Modes must be strictly isolated — no live/sim bleed, even across browser tabs.

Shape

  1. Live
  2. Simulation
  3. Maintenance
  4. Safety
  • sub-second UI
  • strict mode isolation
  • cross-tab sync

Decisions & tradeoffs

Redux global state with cross-tab sync

Mode is a safety-critical invariant; every tab must agree on it.

tradeoff: More state boilerplate than local component state.
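
One way to picture the mode invariant, sketched in Python rather than Redux and under the assumption that every reading carries the mode that produced it. `Reading`, `ConsoleState`, `apply_reading`, and `switch_mode` are hypothetical names, and cross-tab sync is left out.

```python
from dataclasses import dataclass, field

MODES = ("live", "simulation", "maintenance", "safety")


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float
    mode: str  # assumed: each reading is tagged with the mode that produced it


@dataclass
class ConsoleState:
    mode: str = "live"
    values: dict = field(default_factory=dict)


def apply_reading(state: ConsoleState, reading: Reading) -> bool:
    """Accept a reading only if it belongs to the active mode."""
    if reading.mode != state.mode:
        return False  # drop rather than display data from another mode
    state.values[reading.sensor_id] = reading.value
    return True


def switch_mode(state: ConsoleState, mode: str) -> None:
    """Change mode and clear displayed values so nothing carries over."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    state.mode = mode
    state.values.clear()
```

Keeping the check in a single update path means a simulation value cannot reach the screen while the console is in live mode, whichever stream delivered it.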

Custom D3 render path for live visualizations

Direct control over render performance at 50+ streams, sub-second.

tradeoff: Built and maintained what a chart library gives for free.

Three.js GLB waste-heat configurator

Spatial fidelity operators could actually reason about.

tradeoff: Real bundle weight on an already data-heavy app.

Outcome

50+ sensor streams
<1s UI freshness
4 isolated modes
24×7 operations

What I'd do differently

I'd put an explicit staleness/heartbeat indicator on every value from day one — 'no update in N seconds' is itself safety-critical information, and inferring it late is harder than designing it in.
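
A minimal sketch of what that indicator could rest on, assuming timestamps in seconds from a monotonic clock. `StalenessTracker` and the `max_age_s` threshold are hypothetical and not part of PHAROS.

```python
from typing import Dict


class StalenessTracker:
    """Tracks the last update time per sensor and flags values that have gone quiet."""

    def __init__(self, max_age_s: float) -> None:
        self.max_age_s = max_age_s
        self._last_seen: Dict[str, float] = {}

    def record(self, sensor_id: str, now_s: float) -> None:
        """Note that a fresh value arrived for this sensor."""
        self._last_seen[sensor_id] = now_s

    def is_stale(self, sensor_id: str, now_s: float) -> bool:
        """True when the sensor has been silent for longer than max_age_s."""
        last = self._last_seen.get(sensor_id)
        if last is None:
            return True  # never heard from: treat as stale, not as healthy
        return now_s - last > self.max_age_s
```

Treating a sensor that has never reported as stale is deliberate here: silence should read as a warning, not as a healthy default.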

Systaldyn — Real-Time Data Pipeline

Apr 2024 — Dec 2025

A Kafka pipeline moving 5M+ well-data points a day into ML, where lifting throughput could not be allowed to corrupt event order.

  • Apache Kafka
  • KafkaJS
  • Node.js
  • React
  • MySQL
  • Highcharts

Problem

Ingest 5M+ data points/day from field and IoT (ESP32) sources and feed ML models that assume ordered events. Naïve scaling of consumers raised throughput but reordered events and quietly degraded model output.

Constraints

  • 5M+ points/day sustained.
  • Per-key event ordering required for ML correctness.
  • JavaScript/Node consumer stack (team velocity constraint).

Shape

  1. Ingest
  2. Partition
  3. Consume
  4. ML
  • +30% throughput
  • ordering preserved

Decisions & tradeoffs

Tune consumer groups & partitioning (not just add instances)

Recovered throughput while preserving the ordering ML depended on.

tradeoff: Rebalance/partition-key complexity to reason about and operate.
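
The reason partition keys matter for ordering can be sketched without a broker. Kafka only guarantees order within a partition, so events that share a key must land on the same one. The sketch below uses `zlib.crc32` as a stand-in hash (the Java client's default partitioner uses murmur2), and `partition_for`, `route`, and the key names are illustrative.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash, so a key never moves."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, seq) event to its key's partition, in arrival order."""
    partitions = defaultdict(list)
    for key, seq in events:
        partitions[partition_for(key, num_partitions)].append((key, seq))
    return partitions


# Hypothetical keyed events from two sources, interleaved on arrival.
events = [("well-1", 1), ("well-2", 1), ("well-1", 2), ("well-2", 2), ("well-1", 3)]
by_partition = route(events, num_partitions=4)
```

Adding consumers scales the number of partitions read in parallel, while each key's events stay in one partition and keep their relative order.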

KafkaJS + Node for IoT/ESP32 ingestion

Matched the team's stack and shipped the integration fast.

tradeoff: Not the lowest-latency consumer runtime available.

Outcome

5M+ points / day
~30% throughput gain
~40% data-flow efficiency (AI-EASE)

What I'd do differently

I'd have made the ordering guarantee an explicit, tested contract early — a partition-key/ordering test in CI — instead of an assumption that only surfaced when model output drifted.
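
Such a contract check could be as small as the sketch below. It assumes each event carries a per-key sequence number that should strictly increase in consumption order. `order_violations` is a hypothetical helper, not something from the pipeline.

```python
def order_violations(consumed):
    """Return (key, highest_seq_seen, offending_seq) for every out-of-order event."""
    highest = {}
    violations = []
    for key, seq in consumed:
        if key in highest and seq <= highest[key]:
            violations.append((key, highest[key], seq))
        highest[key] = max(seq, highest.get(key, seq))
    return violations
```

A CI test would feed it the events a consumer actually processed and assert that the result is an empty list.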

04 Stack

Comfortable with low-latency, concurrent systems and the messy edges of production.

Core Languages

  • TypeScript
  • JavaScript
  • Rust
  • Java
  • Python
  • C / C++

Web & Frontend

  • React
  • Next.js
  • Tailwind CSS
  • D3.js
  • Highcharts
  • Three.js
  • Radix UI
  • HTML
  • CSS

Backend & Data

  • Node.js
  • Express.js
  • MongoDB
  • MySQL
  • SQLite
  • Redis Streams
  • ClickHouse

Distributed & Infra

  • Apache Kafka
  • Docker
  • Kubernetes
  • GitLab CI/CD
  • Prometheus
  • Grafana
  • Loki
  • OpenTelemetry

Under the Hood

  • Data Structures & Algorithms
  • Operating Systems
  • Computer Networks
  • DBMS
  • Cloud & Virtualization
  • Software Engineering
  • Business Intelligence
  • Mobile App Development

05 Education

Where the foundations were poured.

Degree
B.Tech, Computer Science & Engineering
Institution
Graphic Era (Deemed to be University) · 2024
CGPA
7.8
Notes
Strong foundation in OS, networks, and distributed systems.

06 Contact

Available for backend, platform, and trading-infra roles.