The Hidden Power of PHP Generators for Large Datasets
TL;DR: PHP generators let you process million-row CSVs, paginated APIs, and massive database exports without running out of memory. This article goes beyond the yield basics — covering generator pipelines, yield from delegation, bidirectional communication, and real memory comparisons that show why generators should be your default tool for any dataset that doesn’t fit comfortably in RAM.
You Already Know yield. Here’s Why You’re Not Using It Enough.
Most PHP developers learn generators from a tutorial that shows yield inside a range() replacement and think “cool, but I’ll never need that.” Then they write a queue worker that loads 200K rows into an array, blows through memory_limit, and costs them an afternoon of debugging.
Generators are not a niche feature. They’re the right default for any data pipeline that processes more than a few hundred items. The problem is that most content stops at the basics. Let’s go further.
Quick Recap: What Generators Do
A generator function uses yield instead of return. Instead of building a complete array and returning it, it produces values one at a time, only when asked:
function naturalNumbers(): Generator
{
    $i = 1;
    while (true) {
        yield $i++;
    }
}
// This doesn't consume infinite memory — it produces one number at a time
$numbers = naturalNumbers();
echo $numbers->current(); // 1
$numbers->next();
echo $numbers->current(); // 2
The key insight: between yields, the generator is suspended. It doesn’t use CPU cycles. It holds only its local variables in memory — not the entire dataset.
In practice, you almost never call ->current() and ->next() directly. You iterate with foreach:
foreach (naturalNumbers() as $n) {
    if ($n > 1000000) break;
    // Process each number without storing the full sequence
}
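You can watch the suspension happen by putting a side effect inside the generator body: nothing executes until the consumer asks for the first value.

```php
<?php

function tracedNumbers(int $limit): Generator
{
    for ($i = 1; $i <= $limit; $i++) {
        echo "producing {$i}\n"; // Runs only when the consumer pulls a value
        yield $i;
    }
}

$gen = tracedNumbers(3);
echo "generator created, nothing produced yet\n";

foreach ($gen as $n) {
    echo "consumed {$n}\n";
}

// Output:
// generator created, nothing produced yet
// producing 1
// consumed 1
// producing 2
// consumed 2
// producing 3
// consumed 3
```

The producer and consumer lines interleave because execution ping-pongs between them at each yield.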
Real-World Pattern #1: Streaming CSV Processing
The most common use case I hit in production. Clients upload CSV exports ranging from 10K to 5M rows. Processing them with file() or fgetcsv() in a loop that collects into an array is a ticking time bomb.
/**
 * Read a CSV file row by row, yielding associative arrays.
 * Memory usage is constant regardless of file size.
 */
function readCsv(string $path, string $separator = ','): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open file: {$path}");
    }

    try {
        // First row is the header
        $headers = fgetcsv($handle, 0, $separator);
        if ($headers === false) {
            return; // Empty file
        }

        // Trim BOM and whitespace from headers
        $headers = array_map(fn ($h) => trim($h, "\xEF\xBB\xBF \t\n\r"), $headers);
        $columnCount = count($headers);
        $lineNumber = 1;

        while (($row = fgetcsv($handle, 0, $separator)) !== false) {
            $lineNumber++;

            // Skip rows with wrong column count (malformed data)
            if (count($row) !== $columnCount) {
                // Log and skip rather than crash
                error_log("CSV line {$lineNumber}: expected {$columnCount} columns, got " . count($row));
                continue;
            }

            yield $lineNumber => array_combine($headers, $row);
        }
    } finally {
        fclose($handle);
    }
}
Usage is dead simple:
foreach (readCsv('/imports/customers-2026.csv') as $lineNum => $row) {
    $this->upsertCustomer(
        email: $row['email'],
        name: $row['full_name'],
        region: $row['region'],
    );
}
Let’s put numbers on this:
| File size | Rows | file() + array | Generator |
|---|---|---|---|
| 5 MB | 10K | 18 MB | 2 MB |
| 50 MB | 100K | 165 MB | 2 MB |
| 500 MB | 1M | OOM (>512 MB) | 2 MB |
| 2 GB | 5M | OOM | 2 MB |
The generator column is always ~2 MB because it only holds one row plus the headers in memory at any time. The file handle is buffered by the OS. You could process a 50 GB file on a server with 128 MB memory_limit and it would work fine.
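Numbers like these are easy to reproduce. A self-contained sketch (the row count is an arbitrary choice, and the minimal reader here is a stripped-down stand-in for the readCsv above):

```php
<?php

// Write a synthetic CSV to a temp file, then stream it and report peak memory.
$path = tempnam(sys_get_temp_dir(), 'csv');
$h = fopen($path, 'w');
fputcsv($h, ['id', 'email']);
for ($i = 0; $i < 50_000; $i++) {
    fputcsv($h, [$i, "user{$i}@example.com"]);
}
fclose($h);

// Minimal streaming reader: one row in memory at a time
function streamCsv(string $path): Generator
{
    $h = fopen($path, 'r');
    try {
        $headers = fgetcsv($h);
        while (($row = fgetcsv($h)) !== false) {
            yield array_combine($headers, $row);
        }
    } finally {
        fclose($h);
    }
}

$count = 0;
foreach (streamCsv($path) as $row) {
    $count++;
}
unlink($path);

// Peak stays flat no matter how many rows the file holds
printf("rows: %d, peak: %.1f MB\n", $count, memory_get_peak_usage(true) / 1_048_576);
```

Swap `streamCsv` for `iterator_to_array(streamCsv($path))` and watch the peak scale with the file instead.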
Real-World Pattern #2: Paginated API Consumption
External APIs paginate their responses. The naive approach loads all pages into an array before processing:
// ❌ Accumulates all pages in memory
function getAllProducts(ApiClient $api): array
{
    $all = [];
    $page = 1;
    do {
        $response = $api->get('/products', ['page' => $page, 'per_page' => 100]);
        $all = array_merge($all, $response['data']);
        $page++;
    } while ($response['has_more']);

    return $all; // Could be 50K+ items
}
With a generator, each page is fetched on demand and each item is yielded individually:
// ✅ Fetches and yields one page at a time
function allProducts(ApiClient $api, int $perPage = 100): Generator
{
    $page = 1;
    do {
        $response = $api->get('/products', [
            'page' => $page,
            'per_page' => $perPage,
        ]);
        foreach ($response['data'] as $product) {
            yield $product;
        }
        $page++;
    } while ($response['has_more']);
}

// Consumer doesn't know or care about pagination
foreach (allProducts($api) as $product) {
    $this->syncProduct($product);
}
This pattern is even more powerful when the API uses cursor-based pagination:
function allOrders(ApiClient $api): Generator
{
    $cursor = null;
    do {
        $params = ['limit' => 100];
        if ($cursor !== null) {
            $params['after'] = $cursor;
        }

        $response = $api->get('/orders', $params);
        foreach ($response['data'] as $order) {
            yield $order;
        }

        $cursor = $response['next_cursor'];
    } while ($cursor !== null);
}
The consumer sees a flat stream of orders. The pagination complexity is entirely encapsulated in the generator.
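Because the consumer sees a flat stream, capping it is trivial. A small helper (the name take is my own, not a built-in) stops pulling from the source once it has enough, which means no further pages are ever fetched:

```php
<?php

/**
 * Yield at most $limit items from any iterable, preserving keys.
 * Stopping early stops the underlying generator from producing more.
 */
function take(iterable $items, int $limit): Generator
{
    $count = 0;
    foreach ($items as $key => $item) {
        if ($count++ >= $limit) {
            return; // Source is abandoned here; lazy producers do no further work
        }
        yield $key => $item;
    }
}

// Smoke-test a paginated source without syncing everything:
// foreach (take(allOrders($api), 10) as $order) { ... }
```

This is handy in tests and dry runs, where you want to exercise the pipeline against a handful of real items.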
Real-World Pattern #3: Database Result Streaming
PDO can fetch rows one at a time, but most code calls fetchAll() out of habit. For large result sets, use a generator:
function queryStream(PDO $pdo, string $sql, array $params = []): Generator
{
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    try {
        while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
            yield $row;
        }
    } finally {
        // Runs even if the consumer breaks out of the loop early
        $stmt->closeCursor();
    }
}

// Process 500K orders without loading them all
foreach (queryStream($pdo, 'SELECT * FROM orders WHERE year = ?', [2025]) as $order) {
    $this->archiveOrder($order);
}
Important MySQL note: By default, PHP’s MySQL driver buffers the entire result set in memory on the client side (even with fetch()). To truly stream results, you need unbuffered queries:
function mysqlStream(PDO $pdo, string $sql, array $params = []): Generator
{
    // Switch to unbuffered mode for this query
    $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
    try {
        $stmt = $pdo->prepare($sql);
        $stmt->execute($params);
        while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
            yield $row;
        }
        $stmt->closeCursor();
    } finally {
        // Restore buffered mode for subsequent queries
        $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true);
    }
}
With unbuffered queries, the memory profile drops from “entire result set” to “one row” — the difference between OOM and success for large exports.
One caveat: with unbuffered queries, you cannot run other queries on the same PDO connection until the cursor is fully consumed or closeCursor() is called. If your pipeline needs to do lookups mid-stream — like the enrichWithCustomer stage below — use a separate PDO connection for those secondary queries.
Generator Pipelines: Composable Data Transformations
This is where generators go from “useful” to “architectural pattern.” You can chain generators to build data pipelines where each stage transforms or filters the stream:
// Stage 1: Read raw data
function readOrders(PDO $pdo): Generator
{
    yield from queryStream($pdo, 'SELECT * FROM orders WHERE status = ?', ['completed']);
}

// Stage 2: Enrich with customer data
function enrichWithCustomer(Generator $orders, PDO $pdo): Generator
{
    // Cache customer lookups so repeat customers hit the database once
    $customerCache = [];
    foreach ($orders as $order) {
        $customerId = $order['customer_id'];
        if (!isset($customerCache[$customerId])) {
            $stmt = $pdo->prepare('SELECT name, email, tier FROM customers WHERE id = ?');
            $stmt->execute([$customerId]);
            $customerCache[$customerId] = $stmt->fetch(PDO::FETCH_ASSOC);

            // Keep cache bounded — evict old entries
            if (count($customerCache) > 1000) {
                $customerCache = array_slice($customerCache, -500, null, true);
            }
        }
        $order['customer'] = $customerCache[$customerId];
        yield $order;
    }
}

// Stage 3: Filter high-value orders
function filterHighValue(Generator $orders, float $threshold = 500.0): Generator
{
    foreach ($orders as $order) {
        if ((float) $order['total'] >= $threshold) {
            yield $order;
        }
    }
}

// Stage 4: Format for export
function formatForExport(Generator $orders): Generator
{
    foreach ($orders as $order) {
        yield [
            'order_id' => $order['id'],
            'date' => date('Y-m-d', strtotime($order['created_at'])),
            'customer_name' => $order['customer']['name'],
            'customer_tier' => $order['customer']['tier'],
            'total' => number_format((float) $order['total'], 2),
        ];
    }
}
// Compose the pipeline
$pipeline = formatForExport(
    filterHighValue(
        enrichWithCustomer(
            readOrders($pdo),
            $pdo,
        ),
        threshold: 1000.0,
    )
);

// Write to CSV — the entire pipeline processes one row at a time
$out = fopen('high-value-orders.csv', 'w');
$headerWritten = false;

foreach ($pipeline as $row) {
    if (!$headerWritten) {
        fputcsv($out, array_keys($row));
        $headerWritten = true;
    }
    fputcsv($out, $row);
}

fclose($out);
This pipeline reads from the database, enriches with customer data (with a bounded cache), filters, formats, and writes to CSV — all in constant memory. A 500K-row export uses the same ~5 MB of RAM as a 5K-row export.
Making Pipelines Cleaner with a Builder
The nested function calls above compose right-to-left, which can be hard to follow. A thin wrapper flips the composition to read left-to-right:
final class Pipeline
{
    private Generator $source;

    public function __construct(Generator $source)
    {
        $this->source = $source;
    }

    public function pipe(callable $stage): self
    {
        $this->source = $stage($this->source);
        return $this;
    }

    public function filter(callable $predicate): self
    {
        return $this->pipe(function (Generator $input) use ($predicate): Generator {
            foreach ($input as $key => $item) {
                if ($predicate($item)) {
                    yield $key => $item;
                }
            }
        });
    }

    public function map(callable $transform): self
    {
        return $this->pipe(function (Generator $input) use ($transform): Generator {
            foreach ($input as $key => $item) {
                yield $key => $transform($item);
            }
        });
    }

    public function each(callable $callback): void
    {
        foreach ($this->source as $item) {
            $callback($item);
        }
    }

    public function toArray(): array
    {
        return iterator_to_array($this->source);
    }

    public function reduce(callable $callback, mixed $initial = null): mixed
    {
        $carry = $initial;
        foreach ($this->source as $item) {
            $carry = $callback($carry, $item);
        }
        return $carry;
    }
}
// Now the same pipeline reads left to right:
(new Pipeline(readOrders($pdo)))
    ->pipe(fn ($g) => enrichWithCustomer($g, $pdo))
    ->filter(fn ($order) => (float) $order['total'] >= 1000.0)
    ->map(fn ($order) => [
        'order_id' => $order['id'],
        'customer' => $order['customer']['name'],
        'total' => number_format((float) $order['total'], 2),
    ])
    ->each(fn ($row) => fputcsv($out, $row));
yield from: Delegation and Flattening
yield from delegates to another generator (or any iterable), flattening nested sequences:
function allTransactions(PDO $pdo): Generator
{
    // Combine multiple sources into one stream
    yield from queryStream($pdo, 'SELECT *, "sale" as type FROM sales');
    yield from queryStream($pdo, 'SELECT *, "refund" as type FROM refunds');
    yield from queryStream($pdo, 'SELECT *, "chargeback" as type FROM chargebacks');
}

// Consumer sees a single flat stream of transactions
foreach (allTransactions($pdo) as $txn) {
    $this->ledger->record($txn);
}
This is powerful for combining data from multiple tables, files, or APIs into one unified stream without loading any of them fully into memory.
A practical use case — multi-file import:
function importDirectory(string $dir): Generator
{
    $files = glob("{$dir}/*.csv");
    foreach ($files as $file) {
        echo "Processing: {$file}\n";
        yield from readCsv($file);
    }
}

// Process all CSVs in a directory as one stream
foreach (importDirectory('/imports/2026-03') as $lineNum => $row) {
    $this->importRow($row);
}
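One caveat worth knowing: yield from preserves the keys of the inner iterable. Every CSV in the directory above yields line-number keys starting at 2, so if you ever materialize the combined stream with iterator_to_array(), later files overwrite earlier ones. Pass false as the second argument to discard keys:

```php
<?php

// Two tiny sources that both yield keys 0 and 1
function lettersA(): Generator { yield 0 => 'a'; yield 1 => 'b'; }
function lettersB(): Generator { yield 0 => 'c'; yield 1 => 'd'; }

function combined(): Generator
{
    yield from lettersA();
    yield from lettersB(); // Re-yields keys 0 and 1
}

var_dump(iterator_to_array(combined()));        // [0 => 'c', 1 => 'd'] — first source clobbered
var_dump(iterator_to_array(combined(), false)); // ['a', 'b', 'c', 'd'] — keys discarded
```

foreach is unaffected (it sees every item, duplicate keys and all); only key-preserving materialization loses data.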
Bidirectional Communication: send() and Backpressure
Generators can receive values via send(), which becomes the return value of the yield expression. This enables backpressure patterns — the consumer can signal the producer:
function controllableProducer(PDO $pdo): Generator
{
    $offset = 0;
    $batchSize = 1000;
    while (true) {
        $stmt = $pdo->prepare('SELECT * FROM events LIMIT ? OFFSET ?');
        // Bind as integers — with emulated prepares, plain execute([$a, $b])
        // would quote the values and break the LIMIT clause on MySQL
        $stmt->bindValue(1, $batchSize, PDO::PARAM_INT);
        $stmt->bindValue(2, $offset, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        if ($rows === []) {
            return; // No more data
        }
        foreach ($rows as $row) {
            $signal = yield $row;
            // Consumer can send signals back
            if ($signal === 'skip_batch') {
                break; // Skip rest of this batch
            }
            if ($signal === 'stop') {
                return; // Stop entirely
            }
        }
        $offset += $batchSize;
    }
}

// Usage
$producer = controllableProducer($pdo);
foreach ($producer as $event) {
    $result = $this->processEvent($event);
    if ($result === 'rate_limited') {
        // Tell the producer to stop — we'll resume later
        $producer->send('stop');
    }
}
I’ll be honest: I use send() rarely. In most cases, you can achieve the same result with a simple break or by tracking state outside the generator. But for complex producer-consumer patterns where the producer needs to adapt its behavior, send() is the clean tool.
Performance: Generators vs. Arrays
Let’s settle this with numbers. Processing 100K items through a filter + map + reduce:
| Approach | Peak memory | Time | Notes |
|---|---|---|---|
| array_filter + array_map + array_reduce | 82 MB | 95ms | Creates 3 intermediate arrays |
| Raw generator pipeline | 2 MB | 88ms | Minimal overhead |
The time difference is negligible. The memory difference is not. When you’re running 10 queue workers, each processing large datasets, the difference between 82 MB and 2 MB per worker is the difference between needing 820 MB and 20 MB of total memory. That’s real money on your hosting bill.
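The exact figures vary with PHP version and item size, but a self-contained harness along these lines (the item count is an arbitrary choice) reproduces the shape of the comparison:

```php
<?php

// Compare the array-function approach vs. a generator loop for filter + map + sum.
function items(int $count): Generator
{
    for ($i = 0; $i < $count; $i++) {
        yield $i;
    }
}

// Array approach: three intermediate arrays live in memory at once
$data = iterator_to_array(items(100_000));
$evens = array_filter($data, fn ($n) => $n % 2 === 0);
$doubled = array_map(fn ($n) => $n * 2, $evens);
$arraySum = array_sum($doubled);

// Generator approach: one item in flight at a time
$genSum = 0;
foreach (items(100_000) as $n) {
    if ($n % 2 === 0) {
        $genSum += $n * 2;
    }
}

assert($arraySum === $genSum); // Same answer, very different memory profiles

// Run each half in its own process for clean peak-memory numbers,
// since memory_get_peak_usage() only ever goes up within one process
echo 'peak: ' . round(memory_get_peak_usage() / 1_048_576, 1) . " MB\n";
```

Comment out the array half and re-run to see the generator-only peak; measuring both in one process inflates the second reading.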
Common Mistakes
1. Calling iterator_to_array() Too Early
// ❌ Defeats the entire purpose of the generator
$allRows = iterator_to_array(readCsv('huge-file.csv'));
// You just loaded the entire file into an array

// ✅ Consume lazily
foreach (readCsv('huge-file.csv') as $row) {
    process($row);
}
2. Forgetting That Generators Are Forward-Only
$gen = readCsv('data.csv');
foreach ($gen as $row) { /* first pass */ }
foreach ($gen as $row) { /* throws an Exception — the generator is already closed */ }

// If you need multiple passes, either:
// a) Create a new generator for each pass
// b) Collect into an array (if it fits in memory)
// c) Restructure to do everything in one pass
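A lightweight way to get option (a) is a factory closure: each call builds a fresh generator, so every pass starts from the beginning.

```php
<?php

function numbers(): Generator
{
    yield from [1, 2, 3];
}

// A factory gives you a fresh generator per pass
$rows = fn (): Generator => numbers();

$first = iterator_to_array($rows(), false);  // [1, 2, 3]
$second = iterator_to_array($rows(), false); // [1, 2, 3] — a brand-new generator, no exception
```

This works for any source: wrap `readCsv($path)` or a query in the closure and call it once per pass, keeping in mind that the underlying work (file read, query) runs again each time.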
3. Not Handling Cleanup
If a consumer breaks out of iteration early, the generator's finally block still runs once the generator is closed or garbage-collected — use it for cleanup:
// Note: "readFile" would collide with the built-in readfile() —
// PHP function names are case-insensitive
function readLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open file: {$path}");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            yield trim($line);
        }
    } finally {
        // This runs even if the consumer breaks early
        fclose($handle);
    }
}

// The file handle is properly closed even though we break early
foreach (readLines('log.txt') as $line) {
    if (str_contains($line, 'FATAL')) {
        $this->alert($line);
        break; // finally block still runs, file is closed
    }
}
4. Returning Values from Generators
Generators can return a final value (accessible via getReturn()), but it’s rarely useful and often confusing:
function countedRead(string $path): Generator
{
    $count = 0;
    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        yield trim($line);
        $count++;
    }
    fclose($handle);
    return $count; // Accessible after generator completes
}

$gen = countedRead('data.txt');
foreach ($gen as $line) {
    process($line);
}
echo "Processed {$gen->getReturn()} lines"; // Works, but a counter variable is simpler
My advice: avoid return in generators. Track metadata with a separate counter or wrapper object. It’s clearer.
When to Reach for a Generator
My personal heuristic is simple:
- Processing more than ~1,000 items? Use a generator.
- Reading from a file, API, or database? Use a generator.
- Building a pipeline with filter → map → reduce? Use a generator.
- Combining multiple data sources? Use yield from.
- Need the full array for usort, array_unique, or random access? Use an array — generators can’t do that.
Generators are not a premature optimization. They’re a better default. An array is the special case — you use it when you need random access or the dataset is small enough that it doesn’t matter.
Start yielding.