Data Generation V1: Starting Simple, Then Kicking It Up
Alright, here’s where the rubber meets the road—my first crack at generating fake social media posts for BrandPulse. I didn’t jump straight into the fancy stuff; I started basic, hit some walls, and then got scrappy with it. This is the story of going from a single-threaded “eh, it works” setup (call it Version 0) to a beefier parallel version (Version 1) that started pushing the needle toward real throughput.
Version 0: The Naive First Shot
So, I kicked things off with something simple—just me, some Node.js, and a Kafka producer. The goal? Generate a bunch of fake “SuperCoffee” tweets and shove them into Kafka. I figured 100K records in a batch sounded cool, but my machine laughed at me and choked hard. Scaled it back to 20K, and here’s what I ended up with:
```javascript
const { Kafka, Partitioners } = require('kafkajs');
const dataSchema = require('../../schema/avroSchema');

const kafka = new Kafka({
  clientId: 'dataStorm-producer',
  brokers: ['localhost:9092'],
});

const producer = kafka.producer({
  createPartitioner: Partitioners.LegacyPartitioner,
});

// Build one 20K-record batch, Avro-serialized for Kafka
const generateBatch = () => {
  return Array.from({ length: 20000 }, () => {
    const record = {
      id: Math.floor(Math.random() * 100000),
      timestamp: new Date().toISOString(),
      value: Math.random() * 100,
    };
    return { value: dataSchema.toBuffer(record) }; // Serialize with Avro
  });
};

const runProducer = async () => {
  await producer.connect();
  while (true) {
    const batch = generateBatch();
    await producer.send({
      topic: 'dataStorm-topic',
      messages: batch,
    });
    console.log('✅ Sent 20K records to Kafka');
  }
};

runProducer().catch(console.error);
```
What Happened?
- Throughput: I was clocking about 20K posts/sec on my laptop. Not bad for a first swing, right?
- The Catch: That `generateBatch` function? It’s a CPU hog. Generating 20K records and serializing them with Avro was blocking the hell out of Node’s event loop. The producer was twiddling its thumbs waiting for data, and I knew 20K/sec wasn’t gonna cut it for BrandPulse’s 700K/sec dream.
- Gut Check: I could’ve just cranked up the batch size, but I’d hit memory limits fast. Plus, it felt like I was patching a leaky boat instead of building a better one.
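To see that blocking for yourself, here’s a minimal, Kafka-free sketch that times a synchronous batch build. The plain-object `buildBatch` below is a stand-in I made up for the Avro-serializing `generateBatch`; the timing idea is the point, not the exact numbers.

```javascript
// Time how long a synchronous 20K-record batch build holds the event loop.
// buildBatch mimics generateBatch's shape, minus the Avro dependency.
const buildBatch = (size) =>
  Array.from({ length: size }, () => ({
    id: Math.floor(Math.random() * 100000),
    timestamp: new Date().toISOString(),
    value: Math.random() * 100,
  }));

const start = process.hrtime.bigint();
const batch = buildBatch(20000);
const blockedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`Built ${batch.length} records; event loop blocked ~${blockedMs.toFixed(1)} ms`);
```

Every millisecond spent in that loop is a millisecond the producer can’t flush anything.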
Version 1: Going Parallel
Alright, 20K/sec wasn’t gonna fly—I needed more juice. My brain went, “Hey, Node’s single-threaded, but I’ve got CPU cores sitting there twiddling their thumbs. Let’s use ‘em!” So, I cooked up a plan: spread the load across workers with the `cluster` module or `worker_threads`. I landed on `worker_threads` because it’s cleaner for this CPU-bound generation task, and Kafka’s async producer could stay unblocked. Here’s what I rolled out:
```javascript
const { Kafka, Partitioners } = require('kafkajs');
const { Worker, isMainThread, parentPort } = require('worker_threads');
const dataSchema = require('../../schema/avroSchema');

const kafka = new Kafka({
  clientId: 'dataStorm-producer',
  brokers: ['localhost:9092'],
});

const generateBatch = (batchSize) => {
  return Array.from({ length: batchSize }, () => {
    const record = {
      id: Math.floor(Math.random() * 100000),
      timestamp: new Date().toISOString(),
      value: Math.random() * 100,
    };
    return { value: dataSchema.toBuffer(record) };
  });
};

// Worker threads: each one owns its own producer and send loop
if (!isMainThread) {
  const producer = kafka.producer({
    createPartitioner: Partitioners.LegacyPartitioner,
  });

  const runProducer = async () => {
    await producer.connect();
    while (true) {
      const batch = generateBatch(20000);
      await producer.send({
        topic: 'dataStorm-topic',
        messages: batch,
      });
      parentPort.postMessage('✅ Sent 20K records');
    }
  };

  runProducer().catch(console.error);
}

// Main thread: spawn one worker per core and relay their messages
if (isMainThread) {
  const WORKER_COUNT = 8; // Match my CPU cores
  console.log(`Master process is running... Spawning ${WORKER_COUNT} workers`);
  for (let i = 0; i < WORKER_COUNT; i++) {
    const worker = new Worker(__filename);
    worker.on('message', (msg) => console.log(`[Worker ${i + 1}] ${msg}`));
    worker.on('error', (err) => console.error(`[Worker ${i + 1}] Error:`, err));
  }
}
```
What’s the Deal Here?
- The Plan: Spin up 8 workers (one per core on my machine), each generating and sending 20K-record batches in parallel. Kafka’s producer is async anyway, so I offloaded the heavy lifting—batch generation and Avro serialization—to worker threads.
- How It Works: The main thread just spawns workers and chills. Each worker runs its own producer, connects to Kafka, and pumps out 20K posts in a loop. No more event loop bottlenecks!
- Throughput Jump: With 8 workers, I’m looking at ~160K posts/sec (20K × 8). That’s a hell of a leap from 20K, and it’s starting to feel like I’m in the game.
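The math behind that jump is nothing fancy—per-worker rate times worker count, using the numbers from my runs:

```javascript
// Back-of-the-envelope throughput for Version 1.
const perWorkerRate = 20000; // records/sec each worker sustained
const workerCount = 8;       // one worker per CPU core
const totalRate = perWorkerRate * workerCount;

console.log(`${totalRate} records/sec`); // 160000 records/sec
```

It also shows the ceiling: on this machine, more throughput has to come from bigger or cheaper batches, not more workers.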
Issues I Hit
- Worker Overlap: Early on, I had workers stepping on each other’s toes—duplicate `clientId`s or something funky with Kafka connections. Fixed it by letting each worker manage its own producer.
- Memory Creep: Generating 20K records per worker still spiked memory a bit. Had to keep an eye on that for later versions.
What I Learned
- Parallelism Rocks: One thread wasn’t cutting it. Workers let me use all my CPU cores, and bam—throughput soared.
- CPU vs. I/O: Generation’s a CPU hog, but Kafka sending is I/O. Splitting those up was the key to keeping things smooth.
- Baby Steps: Going from 20K to 160K felt good, but I knew 700K was the real target. This was proof I was on the right track.
Next Up
Version 1 got me to 160K posts/sec, but BrandPulse needs 700K. Time to tweak batch sizes, add compression, and maybe throw in some Kafka partitioning. Check out Version 2 for how I pushed it further.
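As a taste of the kind of knob that’s coming: kafkajs supports per-send compression out of the box via `CompressionTypes`. A config-level fragment—not wired into the code above—looks like this:

```javascript
// Config fragment: GZIP-compress each batch on send (built into kafkajs).
const { CompressionTypes } = require('kafkajs');

await producer.send({
  topic: 'dataStorm-topic',
  messages: batch,
  compression: CompressionTypes.GZIP, // trades some CPU for smaller network payloads
});
```

Whether that CPU-for-bandwidth trade actually helps here is exactly the kind of thing Version 2 has to measure.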