How does PHP handle large JSON data?

I'm working on a project now where I need to ingest and output large amounts of data between systems, without direct access to the databases. This has come up on past projects with CSV files, but in this case I am using JSON for various reasons, which changes things quite a bit.

CSV files are somewhat easier to work with when dealing with large amounts of data because each record is on its own line. Thus it's easy to create a basic file parser that does the job by just reading one line at a time (see the sketch below). With JSON, however, the file could be formatted in multiple different ways: a single object may span multiple lines, or the whole file may be one massive line of data containing all the objects.
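For illustration, a minimal CSV streamer only ever needs to hold one record in memory at a time (the filename and the processRow() handler below are hypothetical):

<?php

// Stream a CSV file one record at a time.
$handle = fopen('large-file.csv', 'r');

while (($row = fgetcsv($handle)) !== false)
{
    // Only one record is ever held in memory, no matter how big the file is.
    processRow($row); // hypothetical handler
}

fclose($handle);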

I could have tried to write my own tool to handle this issue, but luckily somebody else has already solved it for us. In this case I am going to demonstrate the use of the JSON Machine PHP package to process an extremely large JSON file.

Setup

First we need to create an artificially large JSON file to simulate our issue. One could use something like an online JSON generator, but my browser would crash when I set a really high number of objects to create. Hence I used the following basic script to simulate my use-case of a massive array of items that have a depth of 1 (i.e. just name/value pairs).

<?php

$numItems = 1000000;
$items = [];

// Fill every field with a random-looking string; realistic values
// don't matter for a file-size stress test.
for ($i = 0; $i < $numItems; $i++)
{
    $items[] = [
        "uuid" => md5(rand()),
        "isActive" => md5(rand()),
        "balance" => md5(rand()),
        "picture" => md5(rand()),
        "age" => md5(rand()),
        "eyeColor" => md5(rand()),
        "name" => md5(rand()),
        "gender" => md5(rand()),
        "company" => md5(rand()),
        "email" => md5(rand()),
        "phone" => md5(rand()),
        "address" => md5(rand()),
        "about" => md5(rand()),
        "registered" => md5(rand()),
        "latitude" => md5(rand()),
        "longitude" => md5(rand()),
    ];
}

print json_encode($items, JSON_PRETTY_PRINT);

I made sure to test this with and without the use of JSON_PRETTY_PRINT. This results in the generated file having different formatting, but the end result of this tutorial is exactly the same.
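For reference, the compact variant just drops the flag on the final line:

print json_encode($items);

This puts everything on one massive line, which is exactly the harder case described earlier.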

This generated an 843 MB file of one million items, which I feel is suitably large for stress testing.
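One way to produce the test file is to redirect the script's output (assuming it is saved as generate-items.php, a name I have made up):

php generate-items.php > large-file.json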


Running

Now that we have a suitably large file, we need to process it.

First we need to install the JSON Machine package:

composer require halaxa/json-machine

Then we can use it in a script like so:

<?php

require_once(__DIR__ . '/vendor/autoload.php');
$products = JsonMachine\JsonMachine::fromFile('large-file.json');

foreach ($products as $product)
{
    $productData = json_encode($product, JSON_PRETTY_PRINT);
    print($productData . PHP_EOL);
}
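As a side note, the snippet above uses the pre-1.0 API that was current when this article was written. In newer releases of JSON Machine (1.0 and later) the entry point was renamed, so the equivalent would be roughly:

<?php

require_once(__DIR__ . '/vendor/autoload.php');

use JsonMachine\Items;

// JSON Machine 1.0+ replaced the JsonMachine class with Items.
$products = Items::fromFile('large-file.json');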

This script doesn't actually do anything that useful; it just prints out each object one by one. However, it does demonstrate that we can safely loop over all the items in the JSON file one at a time without running out of memory. We could take this further and write some code to batch insert them 1,000 at a time into a database, or perform some other operation before outputting to another file, as sketched below.
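For example, here is a rough sketch of the batching idea. The PDO connection details, the items table, and its columns are all hypothetical, so treat this as an outline rather than a drop-in implementation:

<?php

require_once(__DIR__ . '/vendor/autoload.php');

// Hypothetical connection details, purely for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

$products = JsonMachine\JsonMachine::fromFile('large-file.json');
$batch = [];

foreach ($products as $product)
{
    // Each $product is an associative array, as in the loop above.
    $batch[] = $product;

    if (count($batch) >= 1000)
    {
        insertBatch($pdo, $batch);
        $batch = [];
    }
}

// Don't forget the final partial batch.
if (count($batch) > 0)
{
    insertBatch($pdo, $batch);
}

// Build one multi-row INSERT per batch. The 'items' table and its
// columns are made up; adjust to your own schema.
function insertBatch(PDO $pdo, array $rows): void
{
    $placeholders = [];
    $values = [];

    foreach ($rows as $row)
    {
        $placeholders[] = '(?, ?)';
        $values[] = $row['uuid'];
        $values[] = $row['name'];
    }

    $sql = 'INSERT INTO items (uuid, name) VALUES ' . implode(', ', $placeholders);
    $pdo->prepare($sql)->execute($values);
}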

Last updated: 18th March 2021
First published: 18th March 2021

I know that the JSON streaming parser https://github.com/salsify/jsonstreamingparser has already been mentioned. But as I have recently(ish) added a new listener to it to try and make it easier to use out of the box, I thought I would (for a change) put some information out about what it does...

There is a very good write-up about the basic parser at https://www.salsify.com/blog/engineering/json-streaming-parser-for-php, but the issue I had with the standard setup was that you always had to write a listener to process a file. This is not always a simple task, and it can also take a certain amount of maintenance if/when the JSON changes. So I wrote the RegexListener.

The basic principle is to let you say which elements you are interested in (via a regular expression) and give it a callback saying what to do when it finds the data. Whilst reading the JSON, it keeps track of the path to each component - similar to a directory structure - so /name/forename, or for arrays /items/item/2/partid. This is what the regex matches against.

An example (from the source on GitHub)...

<?php

require_once(__DIR__ . '/vendor/autoload.php');

use JsonStreamingParser\Listener\RegexListener;
use JsonStreamingParser\Parser;

$filename = __DIR__.'/../tests/data/example.json';
$listener = new RegexListener([
    '/1/name' => function ($data): void {
        echo PHP_EOL."Extract the second 'name' element...".PHP_EOL;
        echo '/1/name='.print_r($data, true).PHP_EOL;
    },
    '(/\d*)' => function ($data, $path): void {
        echo PHP_EOL."Extract each base element and print 'name'...".PHP_EOL;
        echo $path.'='.$data['name'].PHP_EOL;
    },
    '(/.*/nested array)' => function ($data, $path): void {
        echo PHP_EOL."Extract 'nested array' element...".PHP_EOL;
        echo $path.'='.print_r($data, true).PHP_EOL;
    },
]);
$parser = new Parser(fopen($filename, 'r'), $listener);
$parser->parse();

Just a couple of explanations...

'/1/name' => function ($data)

So /1 is the second element in an array (0-based), which allows accessing particular instances of elements. /name is the name element. The value is then passed to the closure as $data.

"(/\d*)" => function ($data, $path )

This will select each element of an array and pass them through one at a time. Because it uses a capture group, the matched path is passed to the closure as $path. This means that when a set of records is present in a file, you can process each item individually, and also know which element you are on without having to keep track yourself.
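To make that concrete: given a top-level array such as [{"name": "Alice"}, {"name": "Bob"}] (a made-up document), the paths visited include /0, /0/name, /1 and /1/name. The pattern (/\d*) matches /0 and /1, so the closure receives each whole record in turn together with its path.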

The last one

'(/.*/nested array)' => function ($data, $path):

effectively scans for any element called 'nested array' and passes each one along, together with where it is in the document.

Another useful feature I found is that if, in a large JSON file, you just want the summary details at the top, you can grab those bits and then simply stop...

$filename = __DIR__.'/../tests/data/ratherBig.json';
$listener = new RegexListener();
$parser = new Parser(fopen($filename, 'rb'), $listener);
$listener->setMatch(["/total_rows" => function ($data) use ($parser) {
    echo "/total_rows=".$data.PHP_EOL;
    $parser->stop();
}]);
$parser->parse();

This saves time when you are not interested in the remaining content.

One thing to note is that these callbacks react to the content as it streams: each one is triggered when the end of its matching content is found, so they may fire in various orders. Also, the parser only keeps track of the content you are interested in and discards everything else.

If you find any interesting features (sometimes horribly known as bugs), please let me know or report an issue on the GitHub page.


