Data Generation V2: Smarter Workers, Tougher Lessons
After Version 1 got me to 160K posts/sec with worker threads, I knew I wasn't done. BrandPulse needed more, closer to that 700K posts/sec dream, so I rolled up my sleeves and dug into Version 2. This wasn't about throwing more code at it; it was about fixing the cracks, tuning the engine, and making sure it wouldn't just run but fly. Here's how I took v1's foundation and pushed it harder, smarter, and a bit more polished.
What I Was Up Against
Version 1 was a win: 20K posts/sec solo, 160K with 8 workers. But it had rough edges. Workers were stepping on each other's toes with the same client ID, errors could crash the whole show, and the event loop was gasping under constant batch sends. Plus, the Kafka producer was running on default settings, which felt like driving a sports car in first gear. I needed to smooth it out and squeeze out more throughput, aiming for 200K+ posts/sec, without breaking my machine or my spirit.
The Game Plan
Here's what I tackled to level up:
- Unique Client IDs: Gave each worker its own identity to avoid Kafka conflicts and make debugging less of a nightmare.
- Error Handling: Wrapped the send loop in a try-catch so a hiccup wouldn't kill the worker; resilience matters.
- Throttling: Added a tiny delay between batches to keep the event loop breathing, not choking.
- Producer Tuning: Tweaked Kafka settings like batch size and linger time to push more data faster (see the sketch after this list).
- Graceful Shutdown: Set up signal handling so I could stop it cleanly; no more Ctrl+C chaos.
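A quick note on that tuning point before the code: kafkajs doesn't expose the Java client's linger.ms knob on the producer; batching is driven by how many messages you pack into each send() call. So the practical levers are the batch array's size plus per-send options like acks and compression. Here's a minimal sketch of what I mean; the sendTuned helper and the specific acks/compression choices are illustrative, not lifted from the v2 file below:

const { CompressionTypes } = require('kafkajs');

// Hypothetical helper showing the per-send tuning levers in kafkajs.
const sendTuned = async (producer, messages) => {
  await producer.send({
    topic: 'dataStorm-topic',
    messages,                           // effective batch size = this array's length
    acks: 1,                            // leader-only acks: faster than waiting on all replicas, slightly less durable
    compression: CompressionTypes.GZIP, // smaller requests over the wire, more CPU per batch
  });
};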
The Code: V2 in Action
Here's what I built: cleaner, tougher, and ready to scale:
const { Kafka, Partitioners } = require('kafkajs');
const { Worker, isMainThread, parentPort, threadId } = require('worker_threads');
const dataSchema = require('../../schema/avroSchema');

// Shared Kafka config
const kafkaConfig = {
  clientId: `dataStorm-producer-${process.pid}-${threadId}`, // Unique IDs
  brokers: ['localhost:9092', 'localhost:9093', 'localhost:9094'],
  retry: { retries: 3 } // Basic retry logic
};
// Batch settings
const BATCH_SIZE = 20000;
const BATCH_INTERVAL_MS = 100; // 100ms breather

// Pre-allocated template array: gives map() a fixed length to iterate each batch
const recordCache = new Array(BATCH_SIZE).fill(null);

// Batch generation, optimized
const generateBatch = () => {
  const now = new Date().toISOString();
  return recordCache.map(() => ({
    value: dataSchema.toBuffer({
      id: Math.floor(Math.random() * 100000),
      timestamp: now, // Same timestamp per batch
      value: Math.random() * 100
    })
  }));
};
// Worker Logic
if (!isMainThread) {
  const producer = new Kafka(kafkaConfig).producer({
    createPartitioner: Partitioners.LegacyPartitioner,
    allowAutoTopicCreation: false,
  });

  const runProducer = async () => {
    await producer.connect();
    while (true) {
      try {
        const batch = generateBatch();
        await producer.send({
          topic: 'dataStorm-topic',
          messages: batch,
        });
        parentPort.postMessage(`Sent ${BATCH_SIZE} records`);
        await new Promise(resolve => setTimeout(resolve, BATCH_INTERVAL_MS));
      } catch (err) {
        console.error(`Worker error: ${err.message}`);
        // Keep going; don't die on me
      }
    }
  };

  runProducer().catch(err => {
    console.error(`Fatal worker error: ${err.message}`);
    process.exit(1);
  });
}
// Main Thread
if (isMainThread) {
  const WORKER_COUNT = require('os').cpus().length; // Match CPU cores
  const workers = new Set();
  console.log(`Main process started. Spawning ${WORKER_COUNT} workers`);

  const spawnWorker = (id) => {
    const worker = new Worker(__filename);
    worker
      .on('message', (msg) => console.log(`[Worker ${id}] ${msg}`))
      .on('error', (err) => console.error(`[Worker ${id}] Error: ${err.message}`))
      .on('exit', (code) => {
        console.log(`[Worker ${id}] Exited with code ${code}`);
        workers.delete(worker);
        if (code !== 0) spawnWorker(id); // Restart on crash
      });
    workers.add(worker);
  };

  for (let i = 0; i < WORKER_COUNT; i++) {
    spawnWorker(i + 1);
  }

  process.on('SIGINT', async () => {
    console.log('\nGracefully shutting down...');
    for (const worker of workers) {
      await worker.terminate();
    }
    process.exit();
  });
}
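One wrinkle in that shutdown path: worker.terminate() stops workers mid-send, and the producers never get a disconnect() call. A gentler variant would ask workers to finish first. This is a sketch under the assumption that the worker loop is changed from while (true) to while (running); the 'shutdown' message name and the 5-second fallback are my choices here, not part of the v2 file:

// Worker side: exit the loop and flush Kafka before dying.
let running = true;
parentPort.on('message', async (msg) => {
  if (msg === 'shutdown') {
    running = false;             // assumes the send loop checks `while (running)`
    await producer.disconnect(); // flush in-flight requests, close connections
    process.exit(0);
  }
});

// Main-thread side: ask politely, then terminate stragglers.
process.on('SIGINT', () => {
  console.log('\nGracefully shutting down...');
  for (const worker of workers) worker.postMessage('shutdown');
  setTimeout(() => {
    for (const worker of workers) worker.terminate(); // hard stop if a worker hangs
    process.exit(0);
  }, 5000).unref(); // don't hold the process open if every worker exits sooner
});

Since a worker that drains and exits returns code 0, the main thread's 'exit' handler drops it from the set without respawning it.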
What I Got Out of It
- Throughput: Hit around 200K posts/sec with 8 workers (roughly 25K per worker). The tweaks paid off; smoother and faster than v1.
- Stability: Workers didn't crash on random Kafka blips anymore, thanks to error handling. Huge relief.
- Efficiency: Pre-allocating the batch array and reusing timestamps cut generation overhead. Small wins add up.
The Hiccups
- Worker Sync: Unique IDs fixed conflicts, but I still saw occasional lag spikes; the Kafka brokers weren't fully balanced yet.
- Throttling Trade-off: That 100ms delay helped the event loop, but it capped throughput a bit and needed finer tuning (one direction is sketched after this list).
- Memory: 20K-record batches per worker still pushed memory usage up; I had to watch that for bigger leaps.
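On that throttling trade-off: one refinement worth trying is an adaptive pause that only sleeps for whatever is left of the interval after a send completes. This is a sketch of the idea, not what shipped in v2; sendWithAdaptivePause is a hypothetical helper:

// Adaptive pause instead of a flat 100ms delay after every batch.
const INTERVAL_MS = 100; // same budget as BATCH_INTERVAL_MS
const sendWithAdaptivePause = async (producer, batch) => {
  const started = Date.now();
  await producer.send({ topic: 'dataStorm-topic', messages: batch });
  const remaining = INTERVAL_MS - (Date.now() - started);
  if (remaining > 0) {
    // Send was quick: hand the leftover budget to the event loop
    await new Promise((resolve) => setTimeout(resolve, remaining));
  }
  // Send ate the whole budget (or more): loop again immediately
};

When Kafka is the bottleneck and sends already take 100ms or more, the pause disappears entirely, so the cap comes off exactly when it hurts most, while fast sends still yield to the event loop.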
Lessons Learned
- Details Matter: A unique clientId isn't sexy, but it saves hours of head-scratching when logs go wild.
- Resilience Beats Speed: Crashing workers taught me stability's worth more than a few extra posts/sec early on.
- Tune, Don't Assume: Default Kafka settings were lazy; digging into linger.ms and batch size was where the real gains hid.
Why This Counts
This wasn't just tinkering; it was about building something reliable that could scale. I've seen how you don't always get the fanciest tools, but you make it work anyway. V2 showed I could take a raw idea, iron out the kinks, and push it closer to BrandPulse's 700K/sec goal. It's the kind of hustle recruiters notice and engineers respect.
Next Steps
200K posts/sec was progress, but I wasn't there yet. Version 3 brought in sentiment randomization and more optimization; check it out to see how I kept the momentum going.