Data Generation V4: Chasing 1M+, Finding the Limit
With Version 3 dialed up to a solid 600K+ posts/sec after some late tweaks, I was riding high. BrandPulse’s 700k/sec goal was in the bag, but I couldn’t resist pushing further—could I hit 1M+ and really show off? Version 4 was my all-in bet: auto-scaling workers, hardcore Kafka configs, and a setup built to dominate. The result? A wild ride that peaked high but crashed hard, proving v3’s 600K+ was the real champ. Here’s the tale.
The Drive
V3 was a beast at 600K+ posts/sec—way past the 700k target—but I saw a chance to double down. Theoretical maxes hinted at millions, so I aimed for 1M+ sustained throughput. I’d need dynamic scaling, tighter error handling, and a system that could flex under pressure. Problem was, I was still on one machine, and v3 was already flexing its muscles. This was about testing the ceiling—of the hardware and my grit.
The Playbook
I went big with these moves:
- Auto-Tuning Workers: Scaled workers dynamically (8 to 16+) to chase 1M/sec based on live throughput.
- Balanced Configs: Dropped `maxInFlightRequests` to 200, added minimal retries, and cut batch size to 15K for stability.
- Memory Smarts: Pre-allocated batch pools and capped buffer memory at 70% of RAM.
- Resilience Boost: Built in reconnection logic and backoff for when things got messy (see the sketch right after this list).
- Metrics Overload: Tracked every detail—throughput, errors, progress—to pinpoint the breaking point.
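For the resilience bullet, here's a minimal sketch of the reconnect-with-backoff idea, reusing limits from the `stability` block in the config below; the error counting and the `produceForever` stand-in are illustrative, not the exact code from the repo.

```javascript
// Illustrative sketch of the resilience idea: wrap the produce loop so a flaky
// broker costs a pause and a reconnect, not a dead worker. The limits mirror
// the `stability` block in the config; `produceForever` stands in for the real
// send loop.
const stability = { maxErrors: 50, backoffDurationMs: 5000 };

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRecovery(producer, produceForever) {
  let errorCount = 0;
  while (errorCount < stability.maxErrors) {
    try {
      await producer.connect();
      await produceForever();                      // only returns or throws on failure
    } catch (err) {
      errorCount++;
      console.error(`Producer error #${errorCount}: ${err.message}`);
      await producer.disconnect().catch(() => {}); // best-effort cleanup
      await sleep(stability.backoffDurationMs);    // cool off before reconnecting
    }
  }
  console.error('Too many errors; giving up on this worker');
}
```

The point is simply that a wobbly broker should cost a pause, not a dead worker.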
The Code: V4 in Full Swing
Here’s what I unleashed—complex, bold, and a bit reckless:
```javascript
const { Kafka, Partitioners } = require('kafkajs');
const { Worker, isMainThread, parentPort, threadId } = require('worker_threads');
const dataSchema = require('../../schema/avroSchema');
const os = require('os');

const config = {
  kafka: {
    clientId: 'dataStorm-producer',
    brokers: ['localhost:9092'],
    topic: 'dataStorm-topic',
    retry: { retries: 2, initialRetryTime: 100, maxRetryTime: 500 },
    // Throughput-over-safety producer settings: no acks, no idempotence,
    // huge batches, and up to 70% of RAM earmarked for buffering.
    producer: {
      createPartitioner: Partitioners.LegacyPartitioner,
      transactionTimeout: 10000,
      compression: null,
      maxInFlightRequests: 200,
      idempotent: false,
      allowAutoTopicCreation: false,
      batchSize: 16 * 1024 * 1024,
      lingerMs: 5,
      acks: 0,
      bufferMemory: Math.min(8 * 1024 * 1024 * 1024, Math.floor(os.totalmem() * 0.7))
    }
  },
  batch: { size: 15000, maxBufferedBatches: 8 },
  // Start at 2x cores (min 8); auto-tuning may push up to 4x cores (min 16).
  workers: { count: Math.max(8, os.cpus().length * 2), maxCount: Math.max(16, os.cpus().length * 4), restartDelay: 1000 },
  metrics: { reportIntervalMs: 1000, syncIntervalMs: 20 },
  target: 50000000,
  stability: { connectionBackoffMs: 2000, maxErrors: 50, backoffDurationMs: 5000 }
};

if (isMainThread) {
  // Main thread: crank up the libuv thread pool and V8 heap, then spawn
  // workers and aggregate their throughput reports.
  process.env.UV_THREADPOOL_SIZE = config.workers.maxCount * 4;
  const v8 = require('v8');
  v8.setFlagsFromString('--max_old_space_size=8192');

  const workers = new Map();
  let totalRecordsSent = 0;
  let startTime = Date.now();

  const spawnWorker = (id) => {
    const worker = new Worker(__filename);
    workers.set(id, worker);
    worker.on('message', (msg) => {
      if (msg.type === 'metrics') totalRecordsSent += msg.count;
    });
  };

  for (let i = 0; i < config.workers.count; i++) spawnWorker(i + 1);
} else {
  // Worker thread: its own Kafka producer plus a pre-allocated pool of
  // batches, fired at the broker as fast as the event loop allows.
  const kafka = new Kafka({ ...config.kafka, clientId: `${config.kafka.clientId}-${process.pid}-${threadId}` });
  const producer = kafka.producer(config.kafka.producer);

  const batchPool = Array(config.batch.maxBufferedBatches).fill().map(() =>
    Array(config.batch.size).fill({ value: dataSchema.toBuffer({ id: 0, timestamp: '', value: 0 }) })
  );

  const runProducer = async () => {
    await producer.connect();
    let recordsSent = 0;
    let batchIndex = 0;
    while (true) {
      const batch = batchPool[batchIndex];
      batchIndex = (batchIndex + 1) % config.batch.maxBufferedBatches;
      // Refresh the payload at a random index each iteration (fill() reuses a
      // single message object, so every slot in the batch shares that record).
      const randomIndex = Math.floor(Math.random() * config.batch.size);
      batch[randomIndex].value = dataSchema.toBuffer({ id: threadId, timestamp: new Date().toISOString(), value: Math.random() * 100 });
      await producer.send({ topic: config.kafka.topic, messages: batch });
      recordsSent += config.batch.size;
      parentPort.postMessage({ type: 'metrics', count: config.batch.size });
      await new Promise(resolve => setImmediate(resolve)); // yield to the event loop
    }
  };

  runProducer().catch(err => console.error(`Worker error: ${err.message}`));
}
```
(Full code with auto-tuning and metrics in the repo—above is the core.)
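Since the auto-tuning piece isn't in the excerpt, here's roughly its shape: a timer on the main thread compares the measured rate against the 1M/sec goal and grows or shrinks the pool. The thresholds and the `spawnWorker`/`stopWorker` placeholders are stand-ins, not the repo's exact code.

```javascript
// Rough shape of the auto-tuning loop on the main thread: every second, compare
// the measured rate to the 1M/sec goal and grow or shrink the worker pool.
// Thresholds and the spawn/stop helpers are placeholders, not the repo code.
const TARGET_RATE = 1_000_000;   // posts/sec
const MIN_WORKERS = 8;
const MAX_WORKERS = 16;

let workerCount = MIN_WORKERS;
let recordsThisWindow = 0;       // bumped by each worker's 'metrics' message

const spawnWorker = (id) => console.log(`scaling up: worker ${id}`);   // placeholder
const stopWorker = (id) => console.log(`scaling down: worker ${id}`);  // placeholder

setInterval(() => {
  const rate = recordsThisWindow;   // 1-second window, so the count is the rate
  recordsThisWindow = 0;
  if (rate < TARGET_RATE * 0.9 && workerCount < MAX_WORKERS) {
    spawnWorker(++workerCount);     // falling short of 1M/sec: add a producer thread
  } else if (rate > TARGET_RATE * 1.1 && workerCount > MIN_WORKERS) {
    stopWorker(workerCount--);      // comfortably over: retire one to save RAM and CPU
  }
}, 1000);
```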
What I Got
- Peak Throughput: Spiked to 857K posts/sec—thrilling for a hot second!
- Sustained Average: Averaged ~400K posts/sec over 49 seconds, totaling 42M records.
- The Collapse: System froze in seconds—memory maxed out, workers stalled, game over.
Results Breakdown
| Metric | Value | Target | Progress |
|---|---|---|---|
| Peak Throughput | 857K/sec | 1M/sec | 85.7% |
| Sustained Avg | 400K/sec | 1M/sec | 40% |
| Total in 49s | 42M | 49M | 85.7% |
Where It Broke
- Memory Overload: 8GB buffer and 16+ workers pushed RAM past 90%—no headroom left.
- Worker Chaos: Scaling beyond 8 cores caused I/O thrashing and CPU bottlenecks.
- Serialization Drag: Avro's `toBuffer()` couldn't keep up with the pace, even pre-allocated (a quick sanity check is sketched after this list).
- Broker Limits: A single broker buckled under 200 in-flight requests with no `acks`.
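The serialization point is easy to sanity-check with a one-thread micro-benchmark along these lines; I'm assuming the schema module wraps avsc (which the `toBuffer()` call suggests), and the exact numbers will vary by machine.

```javascript
// Quick sanity check on serialization cost, assuming the schema module wraps
// avsc (which the toBuffer() call suggests). Numbers will vary by machine.
const avro = require('avsc');

const dataSchema = avro.Type.forSchema({
  type: 'record',
  name: 'Post',
  fields: [
    { name: 'id', type: 'int' },
    { name: 'timestamp', type: 'string' },
    { name: 'value', type: 'double' }
  ]
});

const N = 1_000_000;
const start = process.hrtime.bigint();
for (let i = 0; i < N; i++) {
  dataSchema.toBuffer({ id: i, timestamp: new Date().toISOString(), value: Math.random() * 100 });
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${N} encodes in ${elapsedMs.toFixed(0)} ms -> ~${Math.round(N / (elapsedMs / 1000))} records/sec on one thread`);
```

Whatever the number on your box, multiplying it by the worker count gives a rough ceiling on how fast fresh records can even be produced, before Kafka ever sees them.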
Key Takeaways
- Know Your Edge: V3’s 600K+ was sustainable; v4’s 857K peak was a tease—hardware calls the shots.
- Stability Wins: Chasing 1M/sec traded reliability for a flashy number. Real systems need balance.
- Scale Smart: Auto-tuning was cool, but over-scaling burned me—sometimes less is more.
Why It’s Valuable
This wasn’t a flop—it was a lesson in boundaries. I was taught to push limits with what we’ve got, and v4 was me doing just that—stretching too far to find the breaking point. It proved v3’s tuned 600K+ posts/sec was the practical powerhouse, ready for BrandPulse’s real-world needs. Recruiters see the drive; engineers see the wisdom in pulling back.
Back to V3
V4’s crash sent me back to v3, where I’d already hit 600K+ posts/sec with solid configs. That became the final producer—fast, stable, and proven. V4 was the wild experiment that showed me where to stop.