01 Thesis

Kshitij Tyagi · Software Engineer — Trading Automation & Real-Time Systems

I build systems that assume failure is the default state — latency budgets, recovery paths, and confirmation pipelines come before features.

Most of my work runs where being slow is the same as being wrong: trading on Solana, real-time sensor streams, high-volume data pipelines. The interesting engineering isn't the happy path — it's what the system does when the market spikes, the RPC degrades, or a worker dies mid-order.


The claim, and where it’s already shipped

A trade isn't done when you send it: Multi-stage confirmation on EZ-Wallet (gRPC → RPC fallback → timeout watchdog), so an order is never reported settled until it provably is.
Throughput means nothing without ordering: Tuned Kafka consumer groups for a 5M+ msg/day pipeline, lifting throughput ~30% without breaking the ordering the ML models depend on.
Recovery is a feature, not an afterthought: Event-driven 6-worker daemon with simulation mode and liquidity-rug protection. The kill switches were designed in, not bolted on.

02 Trajectory

How I got to failure-first — the moments a belief changed, in order.

T-00 · Apr 2024 · Systaldyn Consultancy · Full-Stack Web Developer
Shipping features that worked on my machine
Before: I optimized for shipping: build the dashboard, wire the API, make the demo work. Reliability was something you added if a bug forced you to.
Shift: Real users and real data showed up. 'Works' and 'keeps working under load' turned out to be completely different problems.
T-01 · 2024 · Systaldyn · Kafka data pipeline
Throughput was the goal, until ordering broke the ML
Before: More messages per second was the only number I watched. Scale the consumers, push the graph up.
Shift: At 5M+ points/day, raw throughput started corrupting downstream ML because event order wasn't guaranteed. I learned to design the consumer, not just feed it.
5M+ msgs / day
T-02 · 2025 · PHAROS · real-time power & thermal monitoring
Real-time isn't a UI problem
Before: I thought 'real-time' meant making the chart update fast: a front-end concern with WebSockets and animation.
Shift: Operators were making safety decisions on this data across live / simulation / maintenance modes. Sub-second wasn't a polish target; stale or mode-bled data was a hazard. Correctness moved to the center.
50+ sensor streams
T-03 · Dec 2025 · Spizen Technologies · Software Engineer
Latency became the product
Before: Performance was a thing you profiled later, once features were in.
Shift: On Solana, the market moves inside a ~400ms slot and does not wait. A late order is a wrong order. I started designing backward from the millisecond that mattered and defending it through every layer.
<1s hot-path SLA
T-04 · Now · How I build today
Failure-first by default
Before: The happy path was the design and failure handling was the patch.
Shift: Now the failure modes are the design. Latency budget, recovery path, and confirmation pipeline get specified before the feature does, across trading, real-time data, and the infra under both.
~99% slot-accurate exec

03 Systems

Three systems, read as post-mortems: the problem, the tradeoffs, and what I’d change.

★ Flagship

EZ-Wallet — Solana Trading Automation

Dec 2025 — Mar 2026

Autonomous, risk-managed order execution across 5+ Solana DEXs, built so a degraded RPC or a dead worker doesn't become a missed or doubled trade.

Rust
Axum
Tokio
Redis Streams
ClickHouse
SQLite
Kubernetes
gRPC
Jito

Problem

Run trading strategies 24/7 with no human in the loop, on a chain where the market moves inside a ~400ms slot and RPC endpoints degrade exactly when volatility is highest. A late order is a wrong order; a double-sent order is worse.

Constraints

Hot path budget sub-second end-to-end (~100–250ms typical). // TODO: confirm exact split
Solana slot ≈400ms — execution must be slot-accurate, not just 'fast'.
RPC reliability drops under the exact volatility the strategies trade.
Multi-wallet: encrypted key storage, no cross-wallet bleed.
No missed trades and no double-executes — correctness over throughput.

Shape

CREATED
PENDING
CONFIRMED
SETTLED
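
The four states above read as a one-way lifecycle. A minimal sketch of that idea in Python (the system itself is Rust), assuming each state may only advance to the next one. `ALLOWED` and `advance` are illustrative names, not taken from EZ-Wallet, and any failure or cancel states the real system has are left out:

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "CREATED"
    PENDING = "PENDING"
    CONFIRMED = "CONFIRMED"
    SETTLED = "SETTLED"


# Assumption: transitions are strictly forward; nothing skips a stage or goes back.
ALLOWED = {
    OrderState.CREATED: {OrderState.PENDING},
    OrderState.PENDING: {OrderState.CONFIRMED},
    OrderState.CONFIRMED: {OrderState.SETTLED},
    OrderState.SETTLED: set(),
}


def advance(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting a jump such as PENDING to SETTLED at this level is one way to make "reported settled before it provably is" a hard error rather than a silent bug.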

Decisions & tradeoffs

Rust + Axum/Tokio for the backend (not Node)

Predictable tail latency and real concurrency for the daemon and 60+ APIs.

Tradeoff: Slower feature velocity and a smaller ecosystem to lean on.

Event-driven 6-worker daemon (not request/response)

Sub-millisecond trigger evaluation across PumpFun, PumpSwap, Raydium, Meteora.

Tradeoff: Real coordination and observability complexity: failures are now distributed.

3-tier data: SQLite + Redis Streams + ClickHouse

Each access pattern (transactional state / event flow / time-series audit) gets the right tool.

Tradeoff: Three systems to operate and keep consistent instead of one.

gRPC → RPC fallback → timeout watchdog confirmation

An order is only reported settled when it provably is — false settles are unacceptable.

Tradeoff: Added confirmation latency into an already tight budget.
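
The confirmation chain in the last decision can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the production Rust code. `stream_check` stands in for the gRPC source and `rpc_check` for the RPC fallback; both are hypothetical callables that return True (confirmed), False (rejected), or None (no answer yet). The default deadline and poll interval are placeholders:

```python
import time
from typing import Callable, Optional

# Hypothetical check signature: True = confirmed, False = rejected, None = no answer yet.
Check = Callable[[str], Optional[bool]]


def confirm(signature: str, stream_check: Check, rpc_check: Check,
            deadline_s: float = 2.0, poll_s: float = 0.05,
            clock: Callable[[], float] = time.monotonic,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Return "CONFIRMED", "REJECTED", or "UNKNOWN"; never guess."""
    start = clock()
    while clock() - start < deadline_s:          # the watchdog: a hard deadline
        for check in (stream_check, rpc_check):  # primary source first, then fallback
            try:
                result = check(signature)
            except Exception:
                result = None                    # a failing source counts as "no answer"
            if result is True:
                return "CONFIRMED"
            if result is False:
                return "REJECTED"
        sleep(poll_s)
    return "UNKNOWN"                             # timed out: escalate, do not mark settled
```

The point of the third outcome is that a timeout is reported as "UNKNOWN" and handled separately, so silence from both sources is never read as success.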

Outcome

1,000s orders / hour

~99% slot-accurate exec

<1s hot-path SLA

~0 missed trades, steady state

Incident

◇ Incident report

What happened: During a sharp volatility spike, RPC latency climbed and the daemon's trigger evaluation fell behind the slot clock. A batch of conditional orders evaluated late and missed their intended slot — the system was fast enough on an average minute and not on the worst one.
What it changed: Average-case latency was a lie I'd been telling myself. I moved to budgeting against the worst minute, not the mean: added RPC-degradation detection with fallback routing, plus queue-aware load shedding ahead of evaluation, so falling behind degrades gracefully instead of silently dropping slots.
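
To make the mean-versus-worst-minute point concrete, here is a small Python sketch. The sample numbers are invented for illustration, the 400 ms budget simply reuses the slot length quoted above, and `percentile` and `should_shed` are hypothetical helpers rather than the daemon's actual logic:

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_shed(recent_latencies_ms: list[float], budget_ms: float,
                pct: float = 99.0) -> bool:
    """Shed or queue new evaluations when tail latency, not the mean, exceeds the budget."""
    return percentile(recent_latencies_ms, pct) > budget_ms


# Illustrative numbers only: 95 fast evaluations and 5 slow ones during a spike.
samples = [120.0] * 95 + [900.0] * 5
print(mean(samples))                          # 159.0, comfortably under the budget
print(percentile(samples, 99))                # 900.0, well over it
print(should_shed(samples, budget_ms=400.0))  # True
```

A mean of 159 ms looks healthy while one evaluation in twenty misses its slot, which is the failure described in the incident.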

What I'd do differently

SQLite for transactional state was the right bet for a single-node start and the wrong one as wallet count grows — I'd reach for Postgres earlier. I'd also have built the volatility load-shedding before the first incident forced it, not after.

PHAROS — Real-Time Power & Thermal Monitoring

Oct 2025 — Dec 2025

24/7 operations console where operators make safety calls on live sensor data — so stale or mode-confused data is a hazard, not a glitch.

React
D3.js
Three.js
Highcharts
Redux
WebSocket
MQTT

Problem

Stream 50+ sensor values into an operations UI used around the clock across distinct modes (live, simulation, maintenance, safety). If simulation data ever reads as live, or a value goes stale without saying so, an operator can make the wrong call.

Constraints

50+ concurrent sensor streams over WebSocket + MQTT.
Sub-second UI updates, sustained 24×7.
Modes must be strictly isolated — no live/sim bleed, even across browser tabs.

Shape

Live
Simulation
Maintenance
Safety

sub-second UI
strict mode isolation
cross-tab sync
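
Strict mode isolation, as listed above, can be illustrated with a small gate that tags every reading with the mode it was produced in and refuses anything that does not match the active mode. This is a Python sketch of the invariant only. The real console is React/Redux, and `Reading` and `ModeGate` are hypothetical names:

```python
from dataclasses import dataclass, field

MODES = {"live", "simulation", "maintenance", "safety"}


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float
    mode: str  # the mode the reading was produced in


@dataclass
class ModeGate:
    active_mode: str = "live"
    latest: dict = field(default_factory=dict)

    def switch(self, mode: str) -> None:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.active_mode = mode
        self.latest.clear()  # never carry values across a mode change

    def accept(self, reading: Reading) -> bool:
        """Store the reading only if it belongs to the active mode."""
        if reading.mode != self.active_mode:
            return False
        self.latest[reading.sensor_id] = reading.value
        return True
```

Clearing stored values on every switch is a deliberately blunt choice: the display briefly shows nothing rather than a simulation value under a live label.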

Decisions & tradeoffs

Redux global state with cross-tab sync

Mode is a safety-critical invariant; every tab must agree on it.

Tradeoff: More state boilerplate than local component state.

Custom D3 render path for live visualizations

Direct control over render performance at 50+ streams, sub-second.

Tradeoff: Built and maintained what a chart library gives for free.

Three.js GLB waste-heat configurator

Spatial fidelity operators could actually reason about.

Tradeoff: Real bundle weight on an already data-heavy app.

Outcome

50+ sensor streams

<1s UI freshness

4 isolated modes

24×7 operations

What I'd do differently

I'd put an explicit staleness/heartbeat indicator on every value from day one — 'no update in N seconds' is itself safety-critical information, and inferring it late is harder than designing it in.
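
A staleness indicator of that kind could start from something as small as the sketch below. It is written in Python for brevity, and `StalenessTracker` and `max_age_s` are illustrative names; the N-second threshold is whatever the operators decide, not a value from PHAROS:

```python
import time
from typing import Callable


class StalenessTracker:
    """Tracks the last update time per sensor and flags values that have gone quiet."""

    def __init__(self, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_age_s = max_age_s
        self._clock = clock
        self._last_seen: dict[str, float] = {}

    def touch(self, sensor_id: str) -> None:
        """Record that a fresh value just arrived for this sensor."""
        self._last_seen[sensor_id] = self._clock()

    def is_stale(self, sensor_id: str) -> bool:
        seen = self._last_seen.get(sensor_id)
        if seen is None:
            return True  # never heard from: treat as stale, not as fine
        return self._clock() - seen > self.max_age_s
```

Treating a sensor that has never reported as stale keeps the default on the safe side, and the injectable clock makes the behavior testable without waiting in real time.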

Systaldyn — Real-Time Data Pipeline

Apr 2024 — Dec 2025

A Kafka pipeline moving 5M+ well-data points a day into ML, where lifting throughput could not be allowed to corrupt event order.

Apache Kafka
KafkaJS
Node.js
React
MySQL
Highcharts

Problem

Ingest 5M+ data points/day from field and IoT (ESP32) sources and feed ML models that assume ordered events. Naïve scaling of consumers raised throughput but reordered events and quietly degraded model output.

Constraints

5M+ points/day sustained.
Per-key event ordering required for ML correctness.
JavaScript/Node consumer stack (team velocity constraint).

Shape

Ingest
Partition
Consume
ML

+30% throughput
ordering preserved

Decisions & tradeoffs

Tune consumer groups & partitioning (not just add instances)

Recovered throughput while preserving the ordering ML depended on.

Tradeoff: Rebalance/partition-key complexity to reason about and operate.

KafkaJS + Node for IoT/ESP32 ingestion

Matched the team's stack and shipped the integration fast.

Tradeoff: Not the lowest-latency consumer runtime available.

Outcome

5M+ points / day

~30% throughput gain

~40% data-flow efficiency (AI-EASE)

What I'd do differently

I'd have made the ordering guarantee an explicit, tested contract early — a partition-key/ordering test in CI — instead of an assumption that only surfaced when model output drifted.
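
One possible shape for such a CI check, sketched in Python rather than the pipeline's Node stack. The model is simplified: events are (key, sequence) pairs, each partition is an ordered log, and `partition_for` uses crc32 purely as a stand-in for Kafka's key hashing. All function names and the `well-N` keys are hypothetical:

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's key hashing (crc32 here, purely for illustration).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, seq) event to its key's partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, seq in events:
        partitions[partition_for(key, num_partitions)].append((key, seq))
    return partitions


def per_key_order_holds(partitions) -> bool:
    """True if every key lives in one partition and its seq numbers only increase."""
    home = {}
    last_seq = {}
    for pid, log in partitions.items():
        for key, seq in log:
            if home.setdefault(key, pid) != pid:
                return False  # same key spread across partitions
            if key in last_seq and seq <= last_seq[key]:
                return False  # key seen out of order
            last_seq[key] = seq
    return True
```

A test built on this would assert that key-based routing passes, and that round-robin routing of the same key fails, so the ordering assumption breaks a build instead of a model.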

04 Stack

Comfortable with low-latency, concurrent systems and the messy edges of production.

Core Languages

TypeScript
JavaScript
Rust
Java
Python
C / C++

Web & Frontend

React
Next.js
Tailwind CSS
D3.js
Highcharts
Three.js
Radix UI
HTML
CSS

Backend & Data

Node.js
Express.js
MongoDB
MySQL
SQLite
Redis Streams
ClickHouse

Distributed & Infra

Apache Kafka
Docker
Kubernetes
GitLab CI/CD
Prometheus
Grafana
Loki
OpenTelemetry

Under the Hood

Data Structures & Algorithms
Operating Systems
Computer Networks
DBMS
Cloud & Virtualization
Software Engineering
Business Intelligence
Mobile App Development

05 Education

Where the foundations were poured.

Degree: B.Tech, Computer Science & Engineering
Institution: Graphic Era (Deemed to be University) · 2024
CGPA: 7.8
Notes: Strong foundation in OS, networks, and distributed systems.

06 Contact

Available for backend, platform, and trading-infra roles.

Email: kshitijtyagi9@gmail.com
Phone: +91-8126269186
Links: GitHub · LinkedIn
