Data Generation V1: Starting Simple, Then Kicking It Up
Alright, here’s where the rubber meets the road—my first crack at generating fake social media posts for BrandPulse. I didn’t jump straight into the fancy stuff; I started basic, hit some walls, and then got scrappy with it. This is the story of going from a single-threaded “eh, it works” setup (call it Version 0) to a beefier parallel version (Version 1) that started pushing the needle toward real throughput.
Version 0: The Naive First Shot
So, I kicked things off with something simple—just me, some Node.js, and a Kafka producer. The goal? Generate a bunch of fake “SuperCoffee” tweets and shove them into Kafka. I figured 100K records in a batch sounded cool, but my machine laughed at me and choked hard. Scaled it back to 20K, and here’s what I ended up with:
```javascript
const { Kafka, Partitioners } = require('kafkajs');
const dataSchema = require('../../schema/avroSchema');

const kafka = new Kafka({
  clientId: 'dataStorm-producer',
  brokers: ['localhost:9092'],
});

const producer = kafka.producer({
  createPartitioner: Partitioners.LegacyPartitioner,
});

// Build one 20K-record batch, Avro-serialized for Kafka
const generateBatch = () => {
  return Array.from({ length: 20000 }, () => {
    const record = {
      id: Math.floor(Math.random() * 100000),
      timestamp: new Date().toISOString(),
      value: Math.random() * 100,
    };
    return { value: dataSchema.toBuffer(record) }; // Serialize with Avro
  });
};

const runProducer = async () => {
  await producer.connect();
  while (true) {
    const batch = generateBatch();
    await producer.send({
      topic: 'dataStorm-topic',
      messages: batch,
    });
    console.log('✅ Sent 20K records to Kafka');
  }
};

runProducer().catch(console.error);
```
What Happened?
- Throughput: I was clocking about 20K posts/sec on my laptop. Not bad for a first swing, right?
- The Catch: That `generateBatch` function? It’s a CPU hog. Generating 20K records and serializing them with Avro was blocking the hell out of Node’s event loop. The producer was twiddling its thumbs waiting for data, and I knew 20K/sec wasn’t gonna cut it for BrandPulse’s 700K/sec dream.
- Gut Check: I could’ve just cranked up the batch size, but I’d hit memory limits fast. Plus, it felt like I was patching a leaky boat instead of building a better one.
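To see that blocking for yourself, here’s a minimal, Kafka-free sketch that times a synchronous batch build. The plain-object `buildBatch` below is a stand-in I made up for the Avro-serializing `generateBatch`; the timing idea is the point, not the exact numbers.

```javascript
// Time how long a synchronous 20K-record batch build holds the event loop.
// buildBatch mimics generateBatch's shape, minus the Avro dependency.
const buildBatch = (size) =>
  Array.from({ length: size }, () => ({
    id: Math.floor(Math.random() * 100000),
    timestamp: new Date().toISOString(),
    value: Math.random() * 100,
  }));

const start = process.hrtime.bigint();
const batch = buildBatch(20000);
const blockedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`Built ${batch.length} records; event loop blocked ~${blockedMs.toFixed(1)} ms`);
```

Every millisecond spent in that loop is a millisecond the producer can’t flush anything.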
Version 1: Going Parallel
Alright, 20K/sec wasn’t gonna fly—I needed more juice. My brain went, “Hey, Node’s single-threaded, but I’ve got CPU cores sitting there twiddling their thumbs. Let’s use ‘em!” So, I cooked up a plan: spread the load across workers with the `cluster` module or `worker_threads`. I landed on `worker_threads` because it’s cleaner for this CPU-bound generation task, and Kafka’s async producer could stay unblocked. Here’s what I rolled out:
```javascript
const { Kafka, Partitioners } = require('kafkajs');
const { Worker, isMainThread, parentPort } = require('worker_threads');
const dataSchema = require('../../schema/avroSchema');

const kafka = new Kafka({
  clientId: 'dataStorm-producer',
  brokers: ['localhost:9092'],
});

const generateBatch = (batchSize) => {
  return Array.from({ length: batchSize }, () => {
    const record = {
      id: Math.floor(Math.random() * 100000),
      timestamp: new Date().toISOString(),
      value: Math.random() * 100,
    };
    return { value: dataSchema.toBuffer(record) };
  });
};

// Worker threads: each one owns its own producer and send loop
if (!isMainThread) {
  const producer = kafka.producer({
    createPartitioner: Partitioners.LegacyPartitioner,
  });

  const runProducer = async () => {
    await producer.connect();
    while (true) {
      const batch = generateBatch(20000);
      await producer.send({
        topic: 'dataStorm-topic',
        messages: batch,
      });
      parentPort.postMessage('✅ Sent 20K records');
    }
  };

  runProducer().catch(console.error);
}

// Main thread: spawn one worker per core and relay their messages
if (isMainThread) {
  const WORKER_COUNT = 8; // Match my CPU cores
  console.log(`Master process is running... Spawning ${WORKER_COUNT} workers`);
  for (let i = 0; i < WORKER_COUNT; i++) {
    const worker = new Worker(__filename);
    worker.on('message', (msg) => console.log(`[Worker ${i + 1}] ${msg}`));
    worker.on('error', (err) => console.error(`[Worker ${i + 1}] Error:`, err));
  }
}
```
What’s the Deal Here?
- The Plan: Spin up 8 workers (one per core on my machine), each generating and sending 20K-record batches in parallel. Kafka’s producer is async anyway, so I offloaded the heavy lifting—batch generation and Avro serialization—to worker threads.
- How It Works: The main thread just spawns workers and chills. Each worker runs its own producer, connects to Kafka, and pumps out 20K posts in a loop. No more event loop bottlenecks!
- Throughput Jump: With 8 workers, I’m looking at ~160K posts/sec (20K × 8). That’s a hell of a leap from 20K, and it’s starting to feel like I’m in the game.
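The math behind that jump is nothing fancy—per-worker rate times worker count, using the numbers from my runs:

```javascript
// Back-of-the-envelope throughput for Version 1.
const perWorkerRate = 20000; // records/sec each worker sustained
const workerCount = 8;       // one worker per CPU core
const totalRate = perWorkerRate * workerCount;

console.log(`${totalRate} records/sec`); // 160000 records/sec
```

It also shows the ceiling: on this machine, more throughput has to come from bigger or cheaper batches, not more workers.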
Issues I Hit
- Worker Overlap: Early on, I had workers stepping on each other’s toes—duplicate `clientId`s or something funky with Kafka connections. Fixed it by letting each worker manage its own producer.
- Memory Creep: Generating 20K records per worker still spiked memory a bit. Had to keep an eye on that for later versions.
What I Learned
- Parallelism Rocks: One thread wasn’t cutting it. Workers let me use all my CPU cores, and bam—throughput soared.
- CPU vs. I/O: Generation’s a CPU hog, but Kafka sending is I/O. Splitting those up was the key to keeping things smooth.
- Baby Steps: Going from 20K to 160K felt good, but I knew 700K was the real target. This was proof I was on the right track.
Next Up
Version 1 got me to 160K posts/sec, but BrandPulse needs 700K. Time to tweak batch sizes, add compression, and maybe throw in some Kafka partitioning. Check out Version 2 for how I pushed it further.
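As a taste of the kind of knob that’s coming: kafkajs supports per-send compression out of the box via `CompressionTypes`. A config-level fragment—not wired into the code above—looks like this:

```javascript
// Config fragment: GZIP-compress each batch on send (built into kafkajs).
const { CompressionTypes } = require('kafkajs');

await producer.send({
  topic: 'dataStorm-topic',
  messages: batch,
  compression: CompressionTypes.GZIP, // trades some CPU for smaller network payloads
});
```

Whether that CPU-for-bandwidth trade actually helps here is exactly the kind of thing Version 2 has to measure.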