The Hidden Power of PHP Generators for Large Datasets
TL;DR: PHP generators let you process million-row CSVs, paginated APIs, and massive database exports without running out of memory. This article goes beyond the yield basics — covering generator pipelines, yield from delegation, bidirectional communication, and real memory comparisons that show why generators should be your default tool for any dataset that doesn’t fit comfortably in RAM.
You Already Know yield. Here’s Why You’re Not Using It Enough.
Most PHP developers learn generators from a tutorial that shows yield inside a range() replacement and think “cool, but I’ll never need that.” Then they write a queue worker that loads 200K rows into an array, blows through memory_limit, and costs them an afternoon of debugging.
Generators are not a niche feature. They’re the right default for any data pipeline that processes more than a few hundred items. The problem is that most content stops at the basics. Let’s go further.
Quick Recap: What Generators Do
A generator function uses yield instead of return. Instead of building a complete array and returning it, it produces values one at a time, only when asked:
function naturalNumbers(): Generator
{
    $i = 1;
    while (true) {
        yield $i++;
    }
}
// This doesn't consume infinite memory — it produces one number at a time
$numbers = naturalNumbers();
echo $numbers->current(); // 1
$numbers->next();
echo $numbers->current(); // 2
The key insight: between yields, the generator is suspended. It doesn’t use CPU cycles. It holds only its local variables in memory — not the entire dataset.
In practice, you almost never call ->current() and ->next() directly. You iterate with foreach:
foreach (naturalNumbers() as $n) {
    if ($n > 1000000) break;
    // Process each number without storing the full sequence
}
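You can watch the suspension happen by putting a side effect inside the generator body: nothing executes until the consumer asks for the first value.

```php
<?php

function tracedNumbers(int $limit): Generator
{
    for ($i = 1; $i <= $limit; $i++) {
        echo "producing {$i}\n"; // Runs only when the consumer pulls a value
        yield $i;
    }
}

$gen = tracedNumbers(3);
echo "generator created, nothing produced yet\n";

foreach ($gen as $n) {
    echo "consumed {$n}\n";
}

// Output:
// generator created, nothing produced yet
// producing 1
// consumed 1
// producing 2
// consumed 2
// producing 3
// consumed 3
```

The producer and consumer lines interleave because execution ping-pongs between them at each yield.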
Real-World Pattern #1: Streaming CSV Processing
The most common use case I hit in production. Clients upload CSV exports ranging from 10K to 5M rows. Processing them with file() or fgetcsv() in a loop that collects into an array is a ticking time bomb.
/**
 * Read a CSV file row by row, yielding associative arrays.
 * Memory usage is constant regardless of file size.
 */
function readCsv(string $path, string $separator = ','): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open file: {$path}");
    }

    try {
        // First row is the header
        $headers = fgetcsv($handle, 0, $separator);
        if ($headers === false) {
            return; // Empty file
        }

        // Trim BOM and whitespace from headers
        $headers = array_map(fn ($h) => trim($h, "\xEF\xBB\xBF \t\n\r"), $headers);
        $columnCount = count($headers);
        $lineNumber = 1;

        while (($row = fgetcsv($handle, 0, $separator)) !== false) {
            $lineNumber++;

            // Skip rows with wrong column count (malformed data)
            if (count($row) !== $columnCount) {
                // Log and skip rather than crash
                error_log("CSV line {$lineNumber}: expected {$columnCount} columns, got " . count($row));
                continue;
            }

            yield $lineNumber => array_combine($headers, $row);
        }
    } finally {
        fclose($handle);
    }
}
Usage is dead simple:
foreach (readCsv('/imports/customers-2026.csv') as $lineNum => $row) {
    $this->upsertCustomer(
        email: $row['email'],
        name: $row['full_name'],
        region: $row['region'],
    );
}
Let’s put numbers on this:
| File size | Rows | file() + array | Generator |
|---|---|---|---|
| 5 MB | 10K | 18 MB | 2 MB |
| 50 MB | 100K | 165 MB | 2 MB |
| 500 MB | 1M | OOM (>512 MB) | 2 MB |
| 2 GB | 5M | OOM | 2 MB |
The generator column is always ~2 MB because it only holds one row plus the headers in memory at any time. The file handle is buffered by the OS. You could process a 50 GB file on a server with 128 MB memory_limit and it would work fine.
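Numbers like these are easy to reproduce. A self-contained sketch (the row count is an arbitrary choice, and the minimal reader here is a stripped-down stand-in for the readCsv above):

```php
<?php

// Write a synthetic CSV to a temp file, then stream it and report peak memory.
$path = tempnam(sys_get_temp_dir(), 'csv');
$h = fopen($path, 'w');
fputcsv($h, ['id', 'email']);
for ($i = 0; $i < 50_000; $i++) {
    fputcsv($h, [$i, "user{$i}@example.com"]);
}
fclose($h);

// Minimal streaming reader: one row in memory at a time
function streamCsv(string $path): Generator
{
    $h = fopen($path, 'r');
    try {
        $headers = fgetcsv($h);
        while (($row = fgetcsv($h)) !== false) {
            yield array_combine($headers, $row);
        }
    } finally {
        fclose($h);
    }
}

$count = 0;
foreach (streamCsv($path) as $row) {
    $count++;
}
unlink($path);

// Peak stays flat no matter how many rows the file holds
printf("rows: %d, peak: %.1f MB\n", $count, memory_get_peak_usage(true) / 1_048_576);
```

Swap `streamCsv` for `iterator_to_array(streamCsv($path))` and watch the peak scale with the file instead.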
Real-World Pattern #2: Paginated API Consumption
External APIs paginate their responses. The naive approach loads all pages into an array before processing:
// ❌ Accumulates all pages in memory
function getAllProducts(ApiClient $api): array
{
    $all = [];
    $page = 1;
    do {
        $response = $api->get('/products', ['page' => $page, 'per_page' => 100]);
        $all = array_merge($all, $response['data']);
        $page++;
    } while ($response['has_more']);

    return $all; // Could be 50K+ items
}
With a generator, each page is fetched on demand and each item is yielded individually:
// ✅ Fetches and yields one page at a time
function allProducts(ApiClient $api, int $perPage = 100): Generator
{
    $page = 1;
    do {
        $response = $api->get('/products', [
            'page' => $page,
            'per_page' => $perPage,
        ]);
        foreach ($response['data'] as $product) {
            yield $product;
        }
        $page++;
    } while ($response['has_more']);
}

// Consumer doesn't know or care about pagination
foreach (allProducts($api) as $product) {
    $this->syncProduct($product);
}
This pattern is even more powerful when the API uses cursor-based pagination:
function allOrders(ApiClient $api): Generator
{
    $cursor = null;
    do {
        $params = ['limit' => 100];
        if ($cursor !== null) {
            $params['after'] = $cursor;
        }

        $response = $api->get('/orders', $params);
        foreach ($response['data'] as $order) {
            yield $order;
        }

        $cursor = $response['next_cursor'];
    } while ($cursor !== null);
}
The consumer sees a flat stream of orders. The pagination complexity is entirely encapsulated in the generator.
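Because the consumer sees a flat stream, capping it is trivial. A small helper (the name take is my own, not a built-in) stops pulling from the source once it has enough, which means no further pages are ever fetched:

```php
<?php

/**
 * Yield at most $limit items from any iterable, preserving keys.
 * Stopping early stops the underlying generator from producing more.
 */
function take(iterable $items, int $limit): Generator
{
    $count = 0;
    foreach ($items as $key => $item) {
        if ($count++ >= $limit) {
            return; // Source is abandoned here; lazy producers do no further work
        }
        yield $key => $item;
    }
}

// Smoke-test a paginated source without syncing everything:
// foreach (take(allOrders($api), 10) as $order) { ... }
```

This is handy in tests and dry runs, where you want to exercise the pipeline against a handful of real items.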
Real-World Pattern #3: Database Result Streaming
PDO can fetch rows one at a time, but most code calls fetchAll() out of habit. For large result sets, use a generator:
function queryStream(PDO $pdo, string $sql, array $params = []): Generator
{
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    try {
        while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
            yield $row;
        }
    } finally {
        // Runs even if the consumer breaks out of the loop early
        $stmt->closeCursor();
    }
}

// Process 500K orders without loading them all
foreach (queryStream($pdo, 'SELECT * FROM orders WHERE year = ?', [2025]) as $order) {
    $this->archiveOrder($order);
}
Important MySQL note: By default, PHP’s MySQL driver buffers the entire result set in memory on the client side (even with fetch()). To truly stream results, you need unbuffered queries:
function mysqlStream(PDO $pdo, string $sql, array $params = []): Generator
{
    // Switch to unbuffered mode for this query
    $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
    try {
        $stmt = $pdo->prepare($sql);
        $stmt->execute($params);
        while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
            yield $row;
        }
        $stmt->closeCursor();
    } finally {
        // Restore buffered mode for subsequent queries
        $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true);
    }
}
With unbuffered queries, the memory profile drops from “entire result set” to “one row” — the difference between OOM and success for large exports.
One caveat: with unbuffered queries, you cannot run other queries on the same PDO connection until the cursor is fully consumed or closeCursor() is called. If your pipeline needs to do lookups mid-stream — like the enrichWithCustomer stage below — use a separate PDO connection for those secondary queries.
Generator Pipelines: Composable Data Transformations
This is where generators go from “useful” to “architectural pattern.” You can chain generators to build data pipelines where each stage transforms or filters the stream:
// Stage 1: Read raw data
function readOrders(PDO $pdo): Generator
{
    yield from queryStream($pdo, 'SELECT * FROM orders WHERE status = ?', ['completed']);
}

// Stage 2: Enrich with customer data
function enrichWithCustomer(Generator $orders, PDO $pdo): Generator
{
    // Cache customer lookups so repeat customers hit the database once
    $customerCache = [];
    foreach ($orders as $order) {
        $customerId = $order['customer_id'];
        if (!isset($customerCache[$customerId])) {
            $stmt = $pdo->prepare('SELECT name, email, tier FROM customers WHERE id = ?');
            $stmt->execute([$customerId]);
            $customerCache[$customerId] = $stmt->fetch(PDO::FETCH_ASSOC);

            // Keep cache bounded — evict old entries
            if (count($customerCache) > 1000) {
                $customerCache = array_slice($customerCache, -500, null, true);
            }
        }
        $order['customer'] = $customerCache[$customerId];
        yield $order;
    }
}

// Stage 3: Filter high-value orders
function filterHighValue(Generator $orders, float $threshold = 500.0): Generator
{
    foreach ($orders as $order) {
        if ((float) $order['total'] >= $threshold) {
            yield $order;
        }
    }
}

// Stage 4: Format for export
function formatForExport(Generator $orders): Generator
{
    foreach ($orders as $order) {
        yield [
            'order_id' => $order['id'],
            'date' => date('Y-m-d', strtotime($order['created_at'])),
            'customer_name' => $order['customer']['name'],
            'customer_tier' => $order['customer']['tier'],
            'total' => number_format((float) $order['total'], 2),
        ];
    }
}
// Compose the pipeline
$pipeline = formatForExport(
    filterHighValue(
        enrichWithCustomer(
            readOrders($pdo),
            $pdo,
        ),
        threshold: 1000.0,
    )
);

// Write to CSV — the entire pipeline processes one row at a time
$out = fopen('high-value-orders.csv', 'w');
$headerWritten = false;

foreach ($pipeline as $row) {
    if (!$headerWritten) {
        fputcsv($out, array_keys($row));
        $headerWritten = true;
    }
    fputcsv($out, $row);
}

fclose($out);
This pipeline reads from the database, enriches with customer data (with a bounded cache), filters, formats, and writes to CSV — all in constant memory. A 500K-row export uses the same ~5 MB of RAM as a 5K-row export.
Making Pipelines Cleaner with a Builder
The nested function calls above compose right-to-left, which can be hard to follow. A thin wrapper flips the composition to read left-to-right:
final class Pipeline
{
    private Generator $source;

    public function __construct(Generator $source)
    {
        $this->source = $source;
    }

    public function pipe(callable $stage): self
    {
        $this->source = $stage($this->source);
        return $this;
    }

    public function filter(callable $predicate): self
    {
        return $this->pipe(function (Generator $input) use ($predicate): Generator {
            foreach ($input as $key => $item) {
                if ($predicate($item)) {
                    yield $key => $item;
                }
            }
        });
    }

    public function map(callable $transform): self
    {
        return $this->pipe(function (Generator $input) use ($transform): Generator {
            foreach ($input as $key => $item) {
                yield $key => $transform($item);
            }
        });
    }

    public function each(callable $callback): void
    {
        foreach ($this->source as $item) {
            $callback($item);
        }
    }

    public function toArray(): array
    {
        return iterator_to_array($this->source);
    }

    public function reduce(callable $callback, mixed $initial = null): mixed
    {
        $carry = $initial;
        foreach ($this->source as $item) {
            $carry = $callback($carry, $item);
        }
        return $carry;
    }
}
// Now the same pipeline reads left to right:
(new Pipeline(readOrders($pdo)))
    ->pipe(fn ($g) => enrichWithCustomer($g, $pdo))
    ->filter(fn ($order) => (float) $order['total'] >= 1000.0)
    ->map(fn ($order) => [
        'order_id' => $order['id'],
        'customer' => $order['customer']['name'],
        'total' => number_format((float) $order['total'], 2),
    ])
    ->each(fn ($row) => fputcsv($out, $row));
yield from: Delegation and Flattening
yield from delegates to another generator (or any iterable), flattening nested sequences:
function allTransactions(PDO $pdo): Generator
{
    // Combine multiple sources into one stream
    yield from queryStream($pdo, 'SELECT *, "sale" as type FROM sales');
    yield from queryStream($pdo, 'SELECT *, "refund" as type FROM refunds');
    yield from queryStream($pdo, 'SELECT *, "chargeback" as type FROM chargebacks');
}

// Consumer sees a single flat stream of transactions
foreach (allTransactions($pdo) as $txn) {
    $this->ledger->record($txn);
}
This is powerful for combining data from multiple tables, files, or APIs into one unified stream without loading any of them fully into memory.
A practical use case — multi-file import:
function importDirectory(string $dir): Generator
{
    $files = glob("{$dir}/*.csv");
    foreach ($files as $file) {
        echo "Processing: {$file}\n";
        yield from readCsv($file);
    }
}

// Process all CSVs in a directory as one stream
foreach (importDirectory('/imports/2026-03') as $lineNum => $row) {
    $this->importRow($row);
}
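One caveat worth knowing: yield from preserves the keys of the inner iterable. Every CSV in the directory above yields line-number keys starting at 2, so if you ever materialize the combined stream with iterator_to_array(), later files overwrite earlier ones. Pass false as the second argument to discard keys:

```php
<?php

// Two tiny sources that both yield keys 0 and 1
function lettersA(): Generator { yield 0 => 'a'; yield 1 => 'b'; }
function lettersB(): Generator { yield 0 => 'c'; yield 1 => 'd'; }

function combined(): Generator
{
    yield from lettersA();
    yield from lettersB(); // Re-yields keys 0 and 1
}

var_dump(iterator_to_array(combined()));        // [0 => 'c', 1 => 'd'] — first source clobbered
var_dump(iterator_to_array(combined(), false)); // ['a', 'b', 'c', 'd'] — keys discarded
```

foreach is unaffected (it sees every item, duplicate keys and all); only key-preserving materialization loses data.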
Bidirectional Communication: send() and Backpressure
Generators can receive values via send(), which becomes the return value of the yield expression. This enables backpressure patterns — the consumer can signal the producer:
function controllableProducer(PDO $pdo): Generator
{
    $offset = 0;
    $batchSize = 1000;
    while (true) {
        $stmt = $pdo->prepare('SELECT * FROM events LIMIT ? OFFSET ?');
        // Bind as integers — with emulated prepares, plain execute([$a, $b])
        // would quote the values and break the LIMIT clause on MySQL
        $stmt->bindValue(1, $batchSize, PDO::PARAM_INT);
        $stmt->bindValue(2, $offset, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        if ($rows === []) {
            return; // No more data
        }
        foreach ($rows as $row) {
            $signal = yield $row;
            // Consumer can send signals back
            if ($signal === 'skip_batch') {
                break; // Skip rest of this batch
            }
            if ($signal === 'stop') {
                return; // Stop entirely
            }
        }
        $offset += $batchSize;
    }
}

// Usage
$producer = controllableProducer($pdo);
foreach ($producer as $event) {
    $result = $this->processEvent($event);
    if ($result === 'rate_limited') {
        // Tell the producer to stop — we'll resume later
        $producer->send('stop');
    }
}
I’ll be honest: I use send() rarely. In most cases, you can achieve the same result with a simple break or by tracking state outside the generator. But for complex producer-consumer patterns where the producer needs to adapt its behavior, send() is the clean tool.
Performance: Generators vs. Arrays
Let’s settle this with numbers. Processing 100K items through a filter + map + reduce:
| Approach | Peak memory | Time | Notes |
|---|---|---|---|
| array_filter + array_map + array_reduce | 82 MB | 95ms | Creates 3 intermediate arrays |
| Raw generator pipeline | 2 MB | 88ms | Minimal overhead |
The time difference is negligible. The memory difference is not. When you’re running 10 queue workers, each processing large datasets, the difference between 82 MB and 2 MB per worker is the difference between needing 820 MB and 20 MB of total memory. That’s real money on your hosting bill.
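The exact figures vary with PHP version and item size, but a self-contained harness along these lines (the item count is an arbitrary choice) reproduces the shape of the comparison:

```php
<?php

// Compare the array-function approach vs. a generator loop for filter + map + sum.
function items(int $count): Generator
{
    for ($i = 0; $i < $count; $i++) {
        yield $i;
    }
}

// Array approach: three intermediate arrays live in memory at once
$data = iterator_to_array(items(100_000));
$evens = array_filter($data, fn ($n) => $n % 2 === 0);
$doubled = array_map(fn ($n) => $n * 2, $evens);
$arraySum = array_sum($doubled);

// Generator approach: one item in flight at a time
$genSum = 0;
foreach (items(100_000) as $n) {
    if ($n % 2 === 0) {
        $genSum += $n * 2;
    }
}

assert($arraySum === $genSum); // Same answer, very different memory profiles

// Run each half in its own process for clean peak-memory numbers,
// since memory_get_peak_usage() only ever goes up within one process
echo 'peak: ' . round(memory_get_peak_usage() / 1_048_576, 1) . " MB\n";
```

Comment out the array half and re-run to see the generator-only peak; measuring both in one process inflates the second reading.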
Common Mistakes
1. Calling iterator_to_array() Too Early
// ❌ Defeats the entire purpose of the generator
$allRows = iterator_to_array(readCsv('huge-file.csv'));
// You just loaded the entire file into an array

// ✅ Consume lazily
foreach (readCsv('huge-file.csv') as $row) {
    process($row);
}
2. Forgetting That Generators Are Forward-Only
$gen = readCsv('data.csv');
foreach ($gen as $row) { /* first pass */ }
foreach ($gen as $row) { /* throws an Exception — the generator is already closed */ }

// If you need multiple passes, either:
// a) Create a new generator for each pass
// b) Collect into an array (if it fits in memory)
// c) Restructure to do everything in one pass
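A lightweight way to get option (a) is a factory closure: each call builds a fresh generator, so every pass starts from the beginning.

```php
<?php

function numbers(): Generator
{
    yield from [1, 2, 3];
}

// A factory gives you a fresh generator per pass
$rows = fn (): Generator => numbers();

$first = iterator_to_array($rows(), false);  // [1, 2, 3]
$second = iterator_to_array($rows(), false); // [1, 2, 3] — a brand-new generator, no exception
```

This works for any source: wrap `readCsv($path)` or a query in the closure and call it once per pass, keeping in mind that the underlying work (file read, query) runs again each time.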
3. Not Handling Cleanup
If a consumer breaks out of iteration early, the generator's finally block still runs once the generator is closed or garbage-collected — use it for cleanup:
// Note: "readFile" would collide with the built-in readfile() —
// PHP function names are case-insensitive
function readLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open file: {$path}");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            yield trim($line);
        }
    } finally {
        // This runs even if the consumer breaks early
        fclose($handle);
    }
}

// The file handle is properly closed even though we break early
foreach (readLines('log.txt') as $line) {
    if (str_contains($line, 'FATAL')) {
        $this->alert($line);
        break; // finally block still runs, file is closed
    }
}
4. Returning Values from Generators
Generators can return a final value (accessible via getReturn()), but it’s rarely useful and often confusing:
function countedRead(string $path): Generator
{
    $count = 0;
    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        yield trim($line);
        $count++;
    }
    fclose($handle);
    return $count; // Accessible after generator completes
}

$gen = countedRead('data.txt');
foreach ($gen as $line) {
    process($line);
}
echo "Processed {$gen->getReturn()} lines"; // Works, but a counter variable is simpler
My advice: avoid return in generators. Track metadata with a separate counter or wrapper object. It’s clearer.
When to Reach for a Generator
My personal heuristic is simple:
- Processing more than ~1,000 items? Use a generator.
- Reading from a file, API, or database? Use a generator.
- Building a pipeline with filter → map → reduce? Use a generator.
- Combining multiple data sources? Use yield from.
- Need the full array for usort, array_unique, or random access? Use an array — generators can’t do that.
Generators are not a premature optimization. They’re a better default. An array is the special case — you use it when you need random access or the dataset is small enough that it doesn’t matter.
Start yielding.