
Data Generation V4: Chasing 1M+, Finding the Limit

With Version 3 dialed up to a solid 600K+ posts/sec after some late tweaks, I was riding high. BrandPulse’s 700K/sec goal was in the bag, but I couldn’t resist pushing further—could I hit 1M+ and really show off? Version 4 was my all-in bet: auto-scaling workers, hardcore Kafka configs, and a setup built to dominate. The result? A wild ride that peaked high but crashed hard, proving v3’s 600K+ was the real champ. Here’s the tale.

The Drive

V3 was a beast at 600K+ posts/sec—way past the 700K target—but I saw a chance to double down. Theoretical maxes hinted at millions, so I aimed for 1M+ sustained throughput. I’d need dynamic scaling, tighter error handling, and a system that could flex under pressure. Problem was, I was still on one machine, and v3 was already flexing its muscles. This was about testing the ceiling—of the hardware and my grit.

The Playbook

I went big with these moves:

  1. Auto-Tuning Workers: Scaled workers dynamically (8 to 16+) to chase 1M/sec based on live throughput.
  2. Balanced Configs: Dropped maxInFlightRequests to 200, added minimal retries, and cut batch size to 15K for stability.
  3. Memory Smarts: Pre-allocated batch pools and capped buffer memory at 70% of RAM.
  4. Resilience Boost: Built in reconnection logic and backoff for when things got messy (sketched just after this list).
  5. Metrics Overload: Tracked every detail—throughput, errors, progress—to pinpoint the breaking point.
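Moves 1, 2, 3, and 5 all show up in the core listing in the next section; the reconnection-and-backoff logic from move 4 lives only in the repo. Here is a minimal sketch of the shape it takes, reusing the `config.stability` knobs from the listing below; the loop structure and thresholds are my assumptions, not the repo’s exact code.

```javascript
// Hypothetical resilience wrapper for a worker's hot loop: reconnect with
// backoff, and back off harder after a burst of errors. `runProducer`,
// `producer`, and `config.stability` refer to the core listing below.
const runForever = async () => {
  let errors = 0;
  while (true) {
    try {
      await runProducer(); // the hot send loop; reconnects on entry
    } catch (err) {
      errors += 1;
      console.error(`Worker error (${errors}): ${err.message}`);
      try { await producer.disconnect(); } catch {} // best-effort cleanup
      // Past the error budget, sit out longer before retrying.
      const delay = errors >= config.stability.maxErrors
        ? config.stability.backoffDurationMs
        : config.stability.connectionBackoffMs;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
};
```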

The Code: V4 in Full Swing

Here’s what I unleashed—complex, bold, and a bit reckless:

```javascript
const { Kafka, Partitioners } = require('kafkajs');
const { Worker, isMainThread, parentPort, threadId } = require('worker_threads');
const dataSchema = require('../../schema/avroSchema');
const os = require('os');

const config = {
  kafka: {
    clientId: 'dataStorm-producer',
    brokers: ['localhost:9092'],
    topic: 'dataStorm-topic',
    retry: { retries: 2, initialRetryTime: 100, maxRetryTime: 500 },
    producer: {
      createPartitioner: Partitioners.LegacyPartitioner,
      transactionTimeout: 10000,
      maxInFlightRequests: 200,
      idempotent: false,
      allowAutoTopicCreation: false,
      // The keys below mirror the Java client's tuning knobs. kafkajs does
      // not read them here (compression and acks are send()-level options,
      // and batchSize/lingerMs/bufferMemory have no kafkajs equivalent), so
      // they document intent rather than enforce limits.
      compression: null,
      batchSize: 16 * 1024 * 1024,
      lingerMs: 5,
      acks: 0,
      bufferMemory: Math.min(8 * 1024 * 1024 * 1024, Math.floor(os.totalmem() * 0.7))
    }
  },
  batch: { size: 15000, maxBufferedBatches: 8 },
  workers: { count: Math.max(8, os.cpus().length * 2), maxCount: Math.max(16, os.cpus().length * 4), restartDelay: 1000 },
  metrics: { reportIntervalMs: 1000, syncIntervalMs: 20 },
  target: 50000000,
  stability: { connectionBackoffMs: 2000, maxErrors: 50, backoffDurationMs: 5000 }
};

if (isMainThread) {
  // Caveat: UV_THREADPOOL_SIZE is only read when libuv first spins up its
  // pool, and heap-size flags passed to v8.setFlagsFromString() after startup
  // are not guaranteed to stick; both are safer on the command line.
  process.env.UV_THREADPOOL_SIZE = config.workers.maxCount * 4;
  const v8 = require('v8');
  v8.setFlagsFromString('--max_old_space_size=8192');

  const workers = new Map();
  let totalRecordsSent = 0;
  let startTime = Date.now();

  const spawnWorker = (id) => {
    const worker = new Worker(__filename);
    workers.set(id, worker);
    worker.on('message', (msg) => {
      if (msg.type === 'metrics') totalRecordsSent += msg.count;
    });
  };

  for (let i = 0; i < config.workers.count; i++) spawnWorker(i + 1);
} else {
  const kafka = new Kafka({ ...config.kafka, clientId: `${config.kafka.clientId}-${process.pid}-${threadId}` });
  const producer = kafka.producer(config.kafka.producer);
  // Pre-allocate the batch pool. Each slot needs its own message object:
  // Array.fill(obj) would share a single reference across the batch, so
  // refreshing one slot would silently rewrite every message in it.
  const template = dataSchema.toBuffer({ id: 0, timestamp: '', value: 0 });
  const batchPool = Array.from({ length: config.batch.maxBufferedBatches }, () =>
    Array.from({ length: config.batch.size }, () => ({ value: template }))
  );

  const runProducer = async () => {
    await producer.connect();
    let recordsSent = 0;
    let batchIndex = 0;

    while (true) {
      const batch = batchPool[batchIndex];
      batchIndex = (batchIndex + 1) % config.batch.maxBufferedBatches;
      // Refresh one slot per send so the payloads aren't perfectly static.
      const randomIndex = Math.floor(Math.random() * config.batch.size);
      batch[randomIndex].value = dataSchema.toBuffer({ id: threadId, timestamp: new Date().toISOString(), value: Math.random() * 100 });

      // In kafkajs, acks is a send()-level option; acks: 0 is fire-and-forget.
      await producer.send({ topic: config.kafka.topic, messages: batch, acks: 0 });
      recordsSent += config.batch.size;
      parentPort.postMessage({ type: 'metrics', count: config.batch.size });
      await new Promise(resolve => setImmediate(resolve));
    }
  };

  runProducer().catch(err => console.error(`Worker error: ${err.message}`));
}
```

(Full code with auto-tuning and metrics in the repo—above is the core.)
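The auto-tuner is also repo-only, so here is a minimal sketch of the idea wired into the main-thread code above. The scale-by-one policy and the 1.2x shrink threshold are illustrative assumptions; `totalRecordsSent`, `workers`, and `spawnWorker` are the names from the core listing.

```javascript
// Sketch of the main-thread auto-tuner: sample throughput once per report
// interval, add a worker while below target, retire one when comfortably over.
const TARGET_PER_SEC = 1_000_000;
let lastTotal = 0;

setInterval(() => {
  const perSec = totalRecordsSent - lastTotal; // records in the last ~1s interval
  lastTotal = totalRecordsSent;

  if (perSec < TARGET_PER_SEC && workers.size < config.workers.maxCount) {
    spawnWorker(workers.size + 1); // under target: scale out by one
  } else if (perSec > TARGET_PER_SEC * 1.2 && workers.size > config.workers.count) {
    const [id, worker] = workers.entries().next().value;
    worker.terminate(); // over target with headroom: shrink by one
    workers.delete(id);
  }
}, config.metrics.reportIntervalMs);
```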

What I Got

  • Peak Throughput: Spiked to 857K posts/sec—thrilling for a hot second!
  • Sustained Average: ~400K posts/sec across the 49-second run, totaling 42M records.
  • The Collapse: Once memory maxed out, the system froze within seconds: workers stalled, game over.

Results Breakdown

| Metric          | Value    | Target | Progress |
| --------------- | -------- | ------ | -------- |
| Peak Throughput | 857K/sec | 1M/sec | 85.7%    |
| Sustained Avg   | 400K/sec | 1M/sec | 40%      |
| Total in 49s    | 42M      | 49M    | 85.7%    |

Where It Broke

  • Memory Overload: 8GB buffer and 16+ workers pushed RAM past 90%—no headroom left.
  • Worker Chaos: Scaling beyond 8 cores caused I/O thrashing and CPU bottlenecks.
  • Serialization Drag: Avro’s toBuffer() couldn’t keep up with the pace, even with the pre-allocated pools.
  • Broker Limits: Single broker buckled under 200 in-flight requests with no acks.
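To put a number on the memory point, here is a back-of-envelope sketch. The ~64 bytes per encoded Avro record is an assumption (the real figure depends on the schema); the rest comes straight from the config above. The punchline: the pools themselves are cheap, so the blowup had to come from the send path.

```javascript
// Back-of-envelope: how much RAM do the pre-allocated batch pools pin?
// ASSUMPTION: ~64 bytes per encoded Avro record, worst case where every
// slot holds its own buffer (the pool actually shares a template buffer
// until a slot is refreshed, so reality is even cheaper).
const bytesPerRecord = 64;
const workerCount = 16;        // scaled-up worker count
const pooledBatches = 8;       // config.batch.maxBufferedBatches
const recordsPerBatch = 15000; // config.batch.size

const pooledBytes = workerCount * pooledBatches * recordsPerBatch * bytesPerRecord;
console.log(`${(pooledBytes / 1024 / 1024).toFixed(0)} MB pinned in pools`); // ~117 MB
// Modest. That points the finger at the send path: with acks: 0, no
// backpressure, and up to 200 requests in flight per worker, payloads
// queue inside the client faster than a single broker can drain them.
```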

Key Takeaways

  • Know Your Edge: V3’s 600K+ was sustainable; v4’s 857K peak was a tease—hardware calls the shots.
  • Stability Wins: Chasing 1M/sec traded reliability for a flashy number. Real systems need balance.
  • Scale Smart: Auto-tuning was cool, but over-scaling burned me—sometimes less is more.

Why It’s Valuable

This wasn’t a flop—it was a lesson in boundaries. I was taught to push limits with what we’ve got, and v4 was me doing just that—stretching too far to find the breaking point. It proved v3’s tuned 600K+ posts/sec was the practical powerhouse, ready for BrandPulse’s real-world needs. Recruiters see the drive; engineers see the wisdom in pulling back.

Back to V3

V4’s crash sent me back to v3, where I’d already hit 600K+ posts/sec with solid configs. That became the final producer—fast, stable, and proven. V4 was the wild experiment that showed me where to stop.
