# Working with Large JSON Files: Techniques and Optimizations

Large JSON files (> 100 MB) call for specific techniques. This guide shows how to handle them efficiently.

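Common problems

1. Insufficient memory

const fs = require('fs');

// ❌ Loads the entire file into memory
const data = JSON.parse(fs.readFileSync('large.json', 'utf8'));
// Typically fails on files larger than about 500 MB: V8 limits the length
// of a single string, and the parsed objects can exhaust the heap

2. Slow parsing

Parsing a 1 GB file can take several minutes.

3. Blocked application

JSON.parse is synchronous, so parsing blocks the main thread until it finishes.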

Solutions

Streaming (Node.js)

JSONStream

const fs = require('fs');
const JSONStream = require('JSONStream');

// Stream the file and emit each element of the top-level "items" array
const stream = fs.createReadStream('large.json')
  .pipe(JSONStream.parse('items.*'));

stream.on('data', (item) => {
  // Process each item individually
  processItem(item);
});

stream.on('end', () => {
  console.log('Done');
});

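Manual parser

const fs = require('fs');
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Splits a top-level JSON array of objects into individual objects
class JSONArrayParser extends Transform {
  constructor() {
    super({ readableObjectMode: true });
    this.decoder = new StringDecoder('utf8'); // handles multi-byte characters split across chunks
    this.buffer = '';
    this.pos = 0;      // next character to scan
    this.start = -1;   // index of the current object's opening brace
    this.depth = 0;
    this.inString = false;
    this.escaped = false;
  }

  _transform(chunk, encoding, callback) {
    this.buffer += this.decoder.write(chunk);
    try {
      while (this.pos < this.buffer.length) {
        const char = this.buffer[this.pos];
        if (this.inString) {
          // Inside a string, only escapes and the closing quote matter
          if (this.escaped) this.escaped = false;
          else if (char === '\\') this.escaped = true;
          else if (char === '"') this.inString = false;
        } else if (char === '"') {
          this.inString = true;
        } else if (char === '{') {
          if (this.depth === 0) this.start = this.pos;
          this.depth++;
        } else if (char === '}') {
          this.depth--;
          if (this.depth === 0) {
            // Complete object found
            this.push(JSON.parse(this.buffer.slice(this.start, this.pos + 1)));
            this.buffer = this.buffer.slice(this.pos + 1);
            this.pos = 0;
            this.start = -1;
            continue;
          }
        }
        this.pos++;
      }
      // Drop separators scanned outside any object so the buffer stays small
      if (this.depth === 0) {
        this.buffer = '';
        this.pos = 0;
      }
      callback();
    } catch (err) {
      callback(err);
    }
  }
}

// Usage
fs.createReadStream('large.json')
  .pipe(new JSONArrayParser())
  .on('data', (obj) => console.log(obj));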

Python streaming

ijson

import ijson

# Iterative parsing: yield each element of the top-level "items" array
with open('large.json', 'rb') as f:
    for item in ijson.items(f, 'items.item'):
        process_item(item)

# Low-level events, filtered by prefix
with open('large.json', 'rb') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == 'items.item.name':
            print(f"Name: {value}")

pandas chunking

import pandas as pd

# chunksize requires lines=True, so the input must be in JSON Lines format
chunk_size = 1000
chunks = []

for chunk in pd.read_json('large.jsonl', lines=True, chunksize=chunk_size):
    # Filter each chunk before keeping it
    processed = chunk[chunk['age'] > 18]
    chunks.append(processed)

# Combine the results
result = pd.concat(chunks, ignore_index=True)

JSON Lines (JSONL)

Format

Each line is a complete, independent JSON value:

{"id": 1, "nom": "Alice", "age": 30}
{"id": 2, "nom": "Bob", "age": 25}
{"id": 3, "nom": "Charlie", "age": 35}

Advantages

Line-by-line processing

Easy appends

Lower memory use

Easy to parallelize
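The append advantage can be illustrated with a minimal sketch: adding a record only means writing one more line, with no need to re-read or rewrite the existing data. The helper names `append_record` and `read_records` are assumptions made for this example, not part of any library.

```python
import json

def append_record(path, record):
    """Append one record to a JSON Lines file as a single line."""
    # json.dumps escapes newlines inside strings, so each record
    # is guaranteed to stay on one physical line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_records(path):
    """Yield records one at a time, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

With a regular JSON array, the same operation would require parsing the whole file, appending to the list, and writing everything back.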

Efficient reading

# Python
import json

with open('data.jsonl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        process_item(item)

// Node.js
const readline = require('readline');
const fs = require('fs');

const rl = readline.createInterface({
  input: fs.createReadStream('data.jsonl'),
  crlfDelay: Infinity
});

rl.on('line', (line) => {
  const item = JSON.parse(line);
  // Not named process(): that would shadow Node's global process object
  processItem(item);
});

Compression

gzip

# Compress (creates large.json.gz and replaces large.json)
gzip large.json

# Decompress
gunzip large.json.gz

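Streaming with compression

const fs = require('fs');
const zlib = require('zlib');
const JSONStream = require('JSONStream');

// Read compressed JSON: decompress and parse as a stream,
// emitting each element of the top-level array
fs.createReadStream('large.json.gz')
  .pipe(zlib.createGunzip())
  .pipe(JSONStream.parse('*'))
  .on('data', (item) => processItem(item));

import gzip
import json

# Python: decompression is streamed, but json.load still builds
# the whole document in memory
with gzip.open('large.json.gz', 'rt', encoding='utf-8') as f:
    data = json.load(f)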
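Calling json.load on a gzip file still holds the entire document in memory. When the compressed file is in JSON Lines format, it can be decompressed and parsed one record at a time instead. A minimal sketch using only the standard library; the function name `iter_gzip_jsonl` is an assumption made for this example.

```python
import gzip
import json

def iter_gzip_jsonl(path):
    """Yield one parsed record per line from a gzip-compressed JSON Lines file."""
    # In text mode, gzip.open decompresses incrementally as lines are read,
    # so only the current line has to fit in memory.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Because the function is a generator, a caller can stop early (for example after finding one record) without decompressing the rest of the file.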

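Indexing

Creating the index

import json

# Build an index mapping each id to its (start, end) byte offsets.
# Assumes a top-level array of objects. The file is read once in full here;
# only the later lookups are cheap.
index = {}
decoder = json.JSONDecoder()
with open('large.json', 'rb') as f:
    buffer = f.read().decode('utf-8')

pos = 0        # character offset in the decoded text
byte_pos = 0   # matching byte offset in the file
while pos < len(buffer):
    # Skip whitespace and array punctuation between objects
    if buffer[pos] in ' \n\t\r,[]':
        pos += 1
        byte_pos += 1   # these characters are all one byte in UTF-8
        continue
    try:
        obj, end_pos = decoder.raw_decode(buffer, pos)
    except json.JSONDecodeError:
        break
    # Convert the object's length from characters to bytes for seek()
    byte_len = len(buffer[pos:end_pos].encode('utf-8'))
    index[obj['id']] = (byte_pos, byte_pos + byte_len)
    pos, byte_pos = end_pos, byte_pos + byte_len

# Save the index
with open('index.json', 'w') as f:
    json.dump(index, f)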

Using the index

# Load the index
with open('index.json') as f:
    index = json.load(f)

# Direct access: read only the bytes of the requested object
def get_item_by_id(item_id):
    # JSON object keys are always strings, hence str(item_id)
    key = str(item_id)
    if key in index:
        start, end = index[key]
        with open('large.json', 'rb') as f:
            f.seek(start)
            data = f.read(end - start)
            return json.loads(data)
    return None

item = get_item_by_id(123)

Databases

SQLite for large JSON

import sqlite3
import json

# Create the database
conn = sqlite3.connect('data.db')
cursor = conn.cursor()

cursor.execute('''
    CREATE TABLE IF NOT EXISTS items (
        id INTEGER PRIMARY KEY,
        data TEXT
    )
''')

# Import the JSON (a top-level array of objects)
with open('large.json', encoding='utf-8') as f:
    data = json.load(f)

cursor.executemany(
    'INSERT INTO items (data) VALUES (?)',
    ((json.dumps(item),) for item in data)
)
conn.commit()

# Filter on a JSON field inside SQLite instead of loading every row into Python
cursor.execute("SELECT data FROM items WHERE json_extract(data, '$.age') > 30")
for row in cursor:
    item = json.loads(row[0])
    print(item)
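The import above calls json.load, which reads the whole file before inserting anything. If the source is available as JSON Lines, rows can be inserted in bounded batches instead. The following is a sketch under that assumption; the function name `import_jsonl` and its `batch_size` parameter are illustrative, and the table layout mirrors the one used above.

```python
import json
import sqlite3
from itertools import islice

def import_jsonl(conn, path, batch_size=1000):
    """Insert each line of a JSON Lines file as one row, batch by batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, data TEXT)"
    )
    total = 0
    with open(path, encoding="utf-8") as f:
        lines = (line for line in f if line.strip())
        while True:
            # json.loads validates each line; json.dumps stores a normalized copy
            batch = [
                (json.dumps(json.loads(line)),)
                for line in islice(lines, batch_size)
            ]
            if not batch:
                break
            # One transaction per batch: at most batch_size rows are held in memory
            with conn:
                conn.executemany("INSERT INTO items (data) VALUES (?)", batch)
            total += len(batch)
    return total
```

Committing once per batch rather than once per row keeps the import fast while bounding how much work is lost if it is interrupted.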

MongoDB

const { MongoClient } = require('mongodb');
const fs = require('fs');
const JSONStream = require('JSONStream');

async function importJSON() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('mydb');
  const collection = db.collection('items');

  // Import in batches
  const batchSize = 1000;
  let batch = [];

  const stream = fs.createReadStream('large.json')
    .pipe(JSONStream.parse('*'));

  stream.on('data', async (item) => {
    batch.push(item);

    if (batch.length >= batchSize) {
      // Pause the stream so no items arrive while the batch is being written
      stream.pause();
      const toInsert = batch;
      batch = [];
      await collection.insertMany(toInsert);
      stream.resume();
    }
  });

  stream.on('end', async () => {
    // Insert whatever is left in the last, partial batch
    if (batch.length > 0) {
      await collection.insertMany(batch);
    }
    await client.close();
  });
}

Memory optimizations

Worker threads (Node.js)

// main.js
const { Worker } = require('worker_threads');

function processChunk(chunk) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./processor.js', { workerData: chunk });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

// processor.js
const { parentPort, workerData } = require('worker_threads');

const result = workerData.map((item) => {
  // CPU-intensive processing
  return processItem(item);
});
parentPort.postMessage(result);

Multiprocessing (Python)

from multiprocessing import Pool
import json

def process_chunk(chunk):
    return [item for item in chunk if item['age'] > 18]

# The guard is required on platforms that start workers with "spawn"
if __name__ == '__main__':
    # Split the data into chunks (json.load still reads the whole file)
    chunk_size = 1000
    with open('large.json') as f:
        data = json.load(f)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Process in parallel
    with Pool(4) as pool:
        results = pool.map(process_chunk, chunks)

    # Combine
    final_result = [item for chunk in results for item in chunk]

Dask for big data

import dask.dataframe as dd

# Read a set of JSON files with Dask
ddf = dd.read_json('large_data/*.json')

# Lazy operations: nothing is computed yet
result = ddf[ddf['age'] > 30].groupby('ville')['age'].mean()

# Compute the result
output = result.compute()
print(output)

Performance monitoring

Node.js

const v8 = require('v8');

console.log('Heap statistics:', v8.getHeapStatistics());

// Log heap usage every second
setInterval(() => {
  const used = process.memoryUsage();
  console.log(`Heap used: ${Math.round(used.heapUsed / 1024 / 1024)} MB`);
}, 1000);

Python

import tracemalloc
import time

tracemalloc.start()

# Code to profile
start = time.perf_counter()
process_large_json()
elapsed = time.perf_counter() - start

current, peak = tracemalloc.get_traced_memory()
print(f"Time: {elapsed:.2f}s")
print(f"Current memory: {current / 1024 / 1024:.2f} MB")
print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")

tracemalloc.stop()

Best practices

1. Choose the right format

Small files (< 10 MB): standard JSON
Medium (10-100 MB): JSON with streaming
Large (> 100 MB): JSONL or a database
Huge (> 1 GB): Parquet, Avro, or a database
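The size thresholds above can be turned into a small helper that inspects a file before deciding how to open it. This is only a sketch: the function names and the returned labels are assumptions for illustration, and the boundaries are the guide's rules of thumb rather than hard limits.

```python
import os

MB = 1024 * 1024

def suggest_strategy(size_bytes):
    """Map a file size to the approach suggested by the thresholds above."""
    if size_bytes < 10 * MB:
        return "standard JSON"
    if size_bytes <= 100 * MB:
        return "JSON with streaming"
    if size_bytes <= 1024 * MB:
        return "JSONL or database"
    return "Parquet, Avro, or database"

def suggest_for_file(path):
    """Same decision, based on the size of a file on disk."""
    return suggest_strategy(os.path.getsize(path))
```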

2. Always stream

Never load a file larger than 50 MB entirely into memory.

3. Use compression

gzip typically reduces the size of JSON files by 70-90%.
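The actual saving depends heavily on how repetitive the data is, so it is worth measuring on a sample of your own records. A minimal sketch with the standard library; the helper name `gzip_ratio` and the sample records are made up for this example.

```python
import gzip
import json

def gzip_ratio(data: bytes) -> float:
    """Return the fraction of bytes saved by gzip (0.0 means no gain)."""
    compressed = gzip.compress(data)
    return 1 - len(compressed) / len(data)

# Hypothetical sample: repetitive records, as is common in exported JSON
sample = json.dumps(
    [{"id": i, "nom": "Alice", "ville": "Paris", "age": 30} for i in range(10_000)]
).encode("utf-8")

print(f"Saved: {gzip_ratio(sample):.0%}")
```

Repeated keys and values compress very well, which is why JSON tends to benefit more from gzip than already compact binary formats do.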

4. Index for random access

Build an index for frequently repeated lookups.
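With JSON Lines, such an index can be as simple as a mapping from a key to the byte offset where the record's line starts. The sketch below assumes one record per line and a unique `id` field; `build_offset_index` and `get_record` are illustrative names, not library functions.

```python
import json

def build_offset_index(path, key="id"):
    """Map each record's key to the byte offset of its line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()      # position before reading the line
            line = f.readline()
            if not line:
                break              # end of file
            if line.strip():
                index[json.loads(line)[key]] = offset
    return index

def get_record(path, index, value):
    """Seek straight to the record's line and parse only that line."""
    offset = index.get(value)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.readline())
```

The file is opened in binary mode on purpose: offsets are counted in bytes, so they stay valid for seek() even when records contain multi-byte characters.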

5. Batch processing

Process data in batches to keep memory use under control.
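A small generator is enough to turn any stream of records into fixed-size batches. A minimal sketch; the name `batched` is chosen here for illustration (Python 3.12 ships a similar `itertools.batched` that yields tuples).

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls only the next `size` items, so the source is never
        # materialized as a whole
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Combined with a streaming reader, a loop such as `for batch in batched(records, 1000)` holds at most 1000 records in memory at any time.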

6. Alternative formats

For analytics workloads, consider Parquet or Arrow.

Conclusion

Key techniques:

Streaming for large files
JSONL for line-oriented data
Compression with gzip
Indexing for fast access
Databases for complex queries
Parallelization for performance

Recommended tools:

Node.js: JSONStream
Python: ijson, pandas, Dask
CLI: jq in streaming mode
DB: MongoDB, PostgreSQL JSON

With these techniques, you can handle JSON files of any size efficiently.

Big JSON Team
