# JSON in Data Science

JSON is one of the most widely used formats in data science for exchanging, storing, and analyzing data. This guide shows how to use JSON effectively in data science projects.

Why JSON in Data Science?

Advantages

1. Ubiquity

Supported by virtually all modern web APIs
Common output format for web scraping
Easy integration with NoSQL databases

2. Flexible structure

Handles nested data
Schema-less, so it suits unstructured and semi-structured data
Easy to transform

3. Interoperability

Python supports JSON in its standard library; R and Julia support it through well-established packages
Mature, high-performance libraries
Integration with big data tools

Disadvantages

❌ Less efficient than binary formats (Parquet, Arrow)
❌ Slow to parse for very large files
❌ Takes more space than compressed formats
❌ Schema is not enforced
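
The size trade-off listed above can be made concrete with a small standard-library sketch. It is only an illustration under stated assumptions: the `records` data is invented for the example, and the fixed-width `struct` layout is a stand-in for a real binary format such as Parquet, not an equivalent of it.

```python
import gzip
import json
import struct

# Hypothetical sample data: 1,000 small records (illustrative only).
records = [{"id": i, "value": i * 0.5} for i in range(1000)]

# Plain JSON text, encoded as UTF-8 bytes.
json_bytes = json.dumps(records).encode("utf-8")

# The same JSON text after gzip compression.
gzip_bytes = gzip.compress(json_bytes)

# A fixed-width binary layout: a 4-byte int and an 8-byte float per record.
binary_bytes = b"".join(
    struct.pack("<id", record["id"], record["value"]) for record in records
)

print("JSON:", len(json_bytes), "bytes")
print("gzip:", len(gzip_bytes), "bytes")
print("binary:", len(binary_bytes), "bytes")
```

The JSON text repeats every key name in every record, which is why both the compressed and the fixed-width versions come out smaller here.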

JSON with Pandas

Loading data

Simple JSON file:

from io import StringIO

import pandas as pd

# Read a JSON file
df = pd.read_json('data.json')

# From a string (wrapped in StringIO, since passing a literal JSON string is deprecated in recent pandas)
json_str = '{"nome":["Marco","Laura"],"età":[30,25]}'
df = pd.read_json(StringIO(json_str))

print(df)
#     nome  età
# 0  Marco   30
# 1  Laura   25

Nested JSON:

# data.json
# [
#   {"nome": "Marco", "skills": ["Python", "SQL"]},
#   {"nome": "Laura", "skills": ["R", "Julia"]}
# ]

df = pd.read_json('data.json')

# Expand the array column (one row per list element)
df_expanded = df.explode('skills')
print(df_expanded)
#     nome skills
# 0  Marco Python
# 0  Marco    SQL
# 1  Laura      R
# 1  Laura  Julia

JSON with deep nesting:

import json

# Load complex JSON
with open('complex.json') as f:
    data = json.load(f)

# Normalize the nested structure
df = pd.json_normalize(
    data,
    record_path=['users', 'posts'],  # Path to the nested array
    meta=[
        'company',
        ['users', 'name'],
        ['users', 'email']
    ],
    sep='_'
)

json_normalize example:

data = {
    "company": "TechCorp",
    "users": [
        {
            "name": "Marco",
            "email": "marco@example.com",
            "posts": [
                {"id": 1, "title": "Post 1"},
                {"id": 2, "title": "Post 2"}
            ]
        },
        {
            "name": "Laura",
            "email": "laura@example.com",
            "posts": [
                {"id": 3, "title": "Post 3"}
            ]
        }
    ]
}

df = pd.json_normalize(
    data,
    record_path=['users', 'posts'],
    meta=['company', ['users', 'name'], ['users', 'email']],
    sep='_'
)

print(df)
#    id   title   company users_name        users_email
# 0   1  Post 1  TechCorp      Marco  marco@example.com
# 1   2  Post 2  TechCorp      Marco  marco@example.com
# 2   3  Post 3  TechCorp      Laura  laura@example.com
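
To show the idea behind flattening nested objects, here is a minimal pure-Python sketch. It is not how pandas implements `json_normalize`; the `flatten` helper and the sample record are hypothetical, and the sketch only handles nested objects, not arrays of records.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Hypothetical helper for illustration; lists are left untouched.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative record (invented for this example).
record = {
    "name": "Marco",
    "address": {"city": "Roma", "geo": {"lat": 41.9, "lon": 12.5}},
}

flat = flatten(record)
print(flat)
# {'name': 'Marco', 'address_city': 'Roma', 'address_geo_lat': 41.9, 'address_geo_lon': 12.5}
```

Each flattened key corresponds to one column in a tabular view, which is the same naming scheme the `sep='_'` argument produces above.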

Exporting from Pandas

# DataFrame → JSON
df = pd.DataFrame({
    'nome': ['Marco', 'Laura'],
    'età': [30, 25],
    'città': ['Roma', 'Milano']
})

# Orient: records (array of objects)
# force_ascii=False keeps characters such as "à" unescaped in the output
df.to_json('output.json', orient='records', indent=2, force_ascii=False)
# [
#   {"nome": "Marco", "età": 30, "città": "Roma"},
#   {"nome": "Laura", "età": 25, "città": "Milano"}
# ]

# Orient: columns (one {index: value} object per column; shown pretty-printed)
df.to_json('output.json', orient='columns', force_ascii=False)
# {
#   "nome": {"0": "Marco", "1": "Laura"},
#   "età": {"0": 30, "1": 25},
#   "città": {"0": "Roma", "1": "Milano"}
# }

# Orient: index (one object per row, keyed by the index label)
df.to_json('output.json', orient='index', indent=2, force_ascii=False)
# {
#   "0": {"nome": "Marco", "età": 30, "città": "Roma"},
#   "1": {"nome": "Laura", "età": 25, "città": "Milano"}
# }
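
To make the difference between the three layouts easier to see, the following sketch builds each shape by hand with plain Python. It mirrors the shapes pandas documents for `records`, `index`, and `columns`, but it does not call pandas; the variable names are assumptions made for the example.

```python
import json

# Two illustrative rows, matching the example data above.
rows = [
    {"nome": "Marco", "età": 30},
    {"nome": "Laura", "età": 25},
]

# records: a list with one object per row.
as_records = rows

# index: {row label -> {column -> value}}.
as_index = {str(i): row for i, row in enumerate(rows)}

# columns: {column -> {row label -> value}}.
as_columns = {
    column: {str(i): row[column] for i, row in enumerate(rows)}
    for column in rows[0]
}

# ensure_ascii=False keeps "età" readable instead of escaping it.
print(json.dumps(as_columns, ensure_ascii=False))
```

`records` is usually the most convenient shape for APIs and JSON Lines, while `index` and `columns` preserve the row labels.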

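API Data Collection

Fetching from a REST API

import requests
import pandas as pd

# Call the API (a timeout avoids hanging on an unresponsive server)
response = requests.get('https://api.example.com/users', timeout=10)
response.raise_for_status()  # Raise an exception on HTTP error status codes
data = response.json()

# Convert to a DataFrame
df = pd.DataFrame(data)

# Save
df.to_csv('users.csv', index=False)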

API pagination

def fetch_all_pages(base_url, params=None):
    """Fetch all records from a paginated API."""
    all_data = []
    page = 1

    while True:
        # Request the current page, merging the caller's params with the page number
        response = requests.get(
            base_url,
            params={**(params or {}), 'page': page}
        )
        data = response.json()

        # No results means we have reached the end
        if not data.get('results'):
            break

        all_data.extend(data['results'])
        page += 1

        # Safety limit
        if page > 100:
            break

    return pd.DataFrame(all_data)

# Usage
df = fetch_all_pages('https://api.example.com/products')
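
The pagination loop can be exercised without a network by swapping the HTTP call for a stub. The sketch below assumes the same response shape as above (a `results` list per page); `fetch_page` and `collect_all` are hypothetical names used only for this illustration.

```python
def fetch_page(page, page_size=3, total=7):
    """Hypothetical stand-in for an HTTP call: returns one page of fake results."""
    start = (page - 1) * page_size
    stop = min(start + page_size, total)
    return {"results": [{"id": i} for i in range(start, stop)]}


def collect_all(fetch, max_pages=100):
    """Call `fetch(page)` until a page comes back empty or max_pages is reached."""
    all_items = []
    for page in range(1, max_pages + 1):
        results = fetch(page).get("results")
        if not results:
            # An empty or missing "results" list marks the end of the data.
            break
        all_items.extend(results)
    return all_items


items = collect_all(fetch_page)
print(len(items))  # 7
```

Passing the fetch function as an argument keeps the loop logic testable separately from the real API client.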

Handling rate limiting

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_session_with_retry():
    """Session with automatic retries."""
    session = requests.Session()
    retry = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# Usage with rate limiting
session = get_session_with_retry()

for page in range(1, 11):
    response = session.get(f'https://api.example.com/data?page={page}')
    data = response.json()

    # Process data (process_page is a placeholder for your own logic)
    process_page(data)

    # Wait to avoid hitting the rate limit
    time.sleep(1)
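
The retry-with-backoff idea can also be sketched by hand, which makes the growing delays visible. This is a generic illustration, not the delay formula that `Retry` uses internally; `call_with_backoff` and `flaky` are hypothetical names, and the sleep function is injected so the example runs instantly.

```python
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on RuntimeError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RuntimeError:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Hypothetical flaky call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated 429 Too Many Requests")
    return {"ok": True}


delays = []  # Record the requested delays instead of actually sleeping.
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # {'ok': True} [1.0, 2.0]
```

In real code the default `time.sleep` would be used, and the caught exception would be whatever the HTTP client raises for a rate-limit response.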

Machine Learning

Preprocessing JSON data

import json

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load data
with open('training_data.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)

# Handle nested fields
# e.g. {"user": {"age": 30, "city": "Roma"}}
if 'user' in df.columns:
    user_df = pd.json_normalize(df['user'])
    df = pd.concat([df.drop('user', axis=1), user_df], axis=1)

# Encode categories (assumes the columns 'città', 'età', 'income' and 'target' exist at this point)
le = LabelEncoder()
df['città_encoded'] = le.fit_transform(df['città'])

# Scale numeric columns
scaler = StandardScaler()
df[['età', 'income']] = scaler.fit_transform(df[['età', 'income']])

# Split features/target
X = df.drop('target', axis=1)
y = df['target']

Feature extraction from JSON

import json

import pandas as pd

def extract_features(json_record):
    """Extract features from a complex JSON record."""
    features = {}

    # Basic features
    features['age'] = json_record.get('age', 0)
    features['income'] = json_record.get('income', 0)

    # Count nested items
    features['num_skills'] = len(json_record.get('skills', []))
    features['num_projects'] = len(json_record.get('projects', []))

    # Boolean features
    features['has_degree'] = 'degree' in json_record.get('education', {})

    # Text length features
    bio = json_record.get('bio', '')
    features['bio_length'] = len(bio)
    features['bio_words'] = len(bio.split())

    return features

# Apply to the dataset
with open('users.json') as f:
    users = json.load(f)

features_list = [extract_features(user) for user in users]
df = pd.DataFrame(features_list)

Saving models

import json

import joblib
from sklearn.ensemble import RandomForestClassifier

# Train the model (X_train, y_train, X_test, y_test are assumed to exist already)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Save the model
joblib.dump(model, 'model.joblib')

# Save metadata as JSON
metadata = {
    'model_type': 'RandomForestClassifier',
    'features': list(X_train.columns),
    'accuracy': float(model.score(X_test, y_test)),
    'timestamp': '2026-01-26',
    'params': model.get_params()
}

with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
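
Metadata often contains values that `json.dump` cannot serialize directly, such as `datetime` objects or sets. One possible approach, sketched here with the standard library, is to pass a `default=` function; the `to_jsonable` helper and the metadata fields are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_jsonable(value):
    """Fallback for json.dumps: convert types JSON has no native form for."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Illustrative metadata (field names are assumptions for this example).
metadata = {
    "model_type": "RandomForestClassifier",
    "trained_at": datetime(2026, 1, 26, 12, 0, tzinfo=timezone.utc),
    "labels": {"b", "a"},
}

text = json.dumps(metadata, default=to_jsonable, indent=2)
loaded = json.loads(text)
print(loaded["trained_at"])  # 2026-01-26T12:00:00+00:00
```

Raising `TypeError` for unknown types keeps unexpected values from being written silently in a lossy form.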

Time Series from JSON

Parsing dates

from io import StringIO

import pandas as pd

# JSON with ISO 8601 dates
json_data = '''
[
  {"timestamp": "2026-01-20T10:00:00Z", "value": 100},
  {"timestamp": "2026-01-21T10:00:00Z", "value": 150},
  {"timestamp": "2026-01-22T10:00:00Z", "value": 120}
]
'''

df = pd.read_json(StringIO(json_data))

# Convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set as index
df.set_index('timestamp', inplace=True)

# Resample to daily averages
daily_avg = df.resample('D').mean()
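
For small inputs, the same parse-and-aggregate steps can be sketched with the standard library alone. The sample readings below are invented for the example, and the trailing `Z` is rewritten to `+00:00` because `datetime.fromisoformat` only accepts `Z` directly on newer Python versions.

```python
import json
from collections import defaultdict
from datetime import datetime

# Illustrative data: two readings on the first day, one on the second.
json_data = """
[
  {"timestamp": "2026-01-20T10:00:00Z", "value": 100},
  {"timestamp": "2026-01-20T16:00:00Z", "value": 140},
  {"timestamp": "2026-01-21T10:00:00Z", "value": 150}
]
"""

values_by_day = defaultdict(list)
for row in json.loads(json_data):
    # Parse the ISO 8601 timestamp into a timezone-aware datetime.
    ts = datetime.fromisoformat(row["timestamp"].replace("Z", "+00:00"))
    values_by_day[ts.date().isoformat()].append(row["value"])

# Average the values recorded on each calendar day (UTC).
daily_avg = {day: sum(vals) / len(vals) for day, vals in values_by_day.items()}
print(daily_avg)  # {'2026-01-20': 120.0, '2026-01-21': 150.0}
```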

Nested time series

# JSON with a nested time series
data = {
    "sensor_id": "temp_01",
    "location": "Roma",
    "readings": [
        {"time": "2026-01-20T08:00:00Z", "temp": 18.5},
        {"time": "2026-01-20T09:00:00Z", "temp": 19.2},
        {"time": "2026-01-20T10:00:00Z", "temp": 20.1}
    ]
}

# Extract the time series
readings_df = pd.json_normalize(data, 'readings')
readings_df['time'] = pd.to_datetime(readings_df['time'])
readings_df.set_index('time', inplace=True)

# Add metadata
readings_df['sensor_id'] = data['sensor_id']
readings_df['location'] = data['location']

Visualizing JSON data

Plotly from JSON

import json

import pandas as pd
import plotly.express as px

# Load data
with open('sales.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)

# Bar chart
fig = px.bar(
    df,
    x='product',
    y='sales',
    color='region',
    title='Sales by Product and Region'
)

fig.show()

# Export as JSON (for the web)
fig.write_json('chart.json')

Interactive dashboard

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Multi-chart dashboard (df is assumed to have 'month', 'sales' and 'revenue' columns)
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Sales', 'Revenue', 'Customers', 'Growth')
)

# Add charts
fig.add_trace(
    go.Bar(x=df['month'], y=df['sales']),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=df['month'], y=df['revenue'], mode='lines+markers'),
    row=1, col=2
)

# Save as HTML with the data embedded as JSON
fig.write_html('dashboard.html')

Big Data and JSON

Streaming JSON

Large files: read in chunks

import pandas as pd

# Read in chunks (chunksize requires lines=True, i.e. JSON Lines input)
chunks = []
for chunk in pd.read_json('large.json', lines=True, chunksize=10000):
    # Process chunk (process_chunk is a placeholder for your own logic)
    processed = process_chunk(chunk)
    chunks.append(processed)

# Combine results
result = pd.concat(chunks, ignore_index=True)
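
The chunking idea itself does not depend on pandas. As a rough sketch, the following standard-library generator yields fixed-size batches of parsed records from any iterable of JSON lines; `iter_batches` is a hypothetical helper and the in-memory file stands in for a large file on disk.

```python
import io
import json
from itertools import islice


def iter_batches(lines, size):
    """Yield lists of up to `size` parsed records from an iterable of JSON lines."""
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, size))
        if not batch:
            return
        yield batch


# Illustrative in-memory "file" holding five JSON Lines records.
fake_file = io.StringIO(
    "".join(json.dumps({"id": i, "value": i * 10}) + "\n" for i in range(5))
)

batch_sizes = []
running_total = 0
for batch in iter_batches(fake_file, size=2):
    # Only one batch is held in memory at a time.
    batch_sizes.append(len(batch))
    running_total += sum(record["value"] for record in batch)

print(batch_sizes, running_total)  # [2, 2, 1] 100
```

Aggregating per batch, rather than collecting every batch in a list, keeps memory usage bounded by the batch size.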

JSON Lines (JSONL):

# Each line is a JSON object
# {"id": 1, "value": 100}
# {"id": 2, "value": 200}

# Read
df = pd.read_json('data.jsonl', lines=True)

# Write
df.to_json('output.jsonl', orient='records', lines=True)
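
Because every line is an independent JSON document, the format is also easy to read and write by hand. A minimal standard-library sketch follows; `write_jsonl` and `read_jsonl` are hypothetical helpers, and an in-memory buffer replaces a real file.

```python
import io
import json


def write_jsonl(records, fh):
    """Write one JSON object per line to an open text file handle."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Lazily yield one parsed object per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


buffer = io.StringIO()  # Stands in for open('data.jsonl', 'w') in this sketch.
write_jsonl([{"id": 1, "value": 100}, {"id": 2, "value": 200}], buffer)

buffer.seek(0)
total = sum(record["value"] for record in read_jsonl(buffer))
print(total)  # 300
```

Appending new records only requires writing more lines, which is one reason the format is popular for logs and incremental exports.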

Spark and JSON

from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder.appName("JSONApp").getOrCreate()

# Read JSON (by default Spark expects one JSON object per line;
# for a single multi-line document, set the "multiLine" option to true)
df = spark.read.json("large_data.json")

# The schema is inferred automatically
df.printSchema()

# SQL-like query
df.createOrReplaceTempView("data")
result = spark.sql("""
    SELECT category, AVG(price) as avg_price
    FROM data
    WHERE available = true
    GROUP BY category
""")

# Convert to Pandas for plotting
pandas_df = result.toPandas()

Dask for large JSON files

import dask.dataframe as dd

# Lazy JSON read (by default Dask expects line-delimited records)
ddf = dd.read_json('big_data_.json')

# Lazy operations (not executed immediately)
result = ddf[ddf['value'] > 100].groupby('category').mean()

# Compute (runs the operations)
final_result = result.compute()

Best Practices

1. Validate the schema

import json

from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age"]
}

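def validate_record(record):
    # Note: "format" keywords such as "email" are only enforced
    # when a format_checker is passed to validate()
    try:
        validate(instance=record, schema=schema)
        return True, None
    except ValidationError as e:
        return False, str(e)

# Validate the dataset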
with open('data.json') as f:
    data = json.load(f)

valid_records = []
errors = []

for i, record in enumerate(data):
    is_valid, error = validate_record(record)
    if is_valid:
        valid_records.append(record)
    else:
        errors.append({'index': i, 'error': error})

print(f"Valid: {len(valid_records)}, Invalid: {len(errors)}")
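
When adding a dependency is not an option, a much smaller check can cover the most common problems (missing keys and wrong types). The sketch below is a dependency-free illustration, not a replacement for JSON Schema; `check_record` and its arguments are hypothetical.

```python
def check_record(record, required, types):
    """Return a list of problems found in `record` (empty list means it passed)."""
    errors = []
    for key in required:
        if key not in record:
            errors.append(f"missing required field: {key}")
    for key, expected in types.items():
        if key in record and not isinstance(record[key], expected):
            errors.append(f"{key}: unexpected type {type(record[key]).__name__}")
    return errors


# Illustrative rules, loosely mirroring the schema above.
required = ["name", "age"]
types = {"name": str, "age": (int, float)}

good = check_record({"name": "Marco", "age": 30}, required, types)
bad = check_record({"age": "thirty"}, required, types)
print(good)  # []
print(bad)   # ['missing required field: name', 'age: unexpected type str']
```

Unlike a real schema validator, this does not check ranges, formats, or nested structures.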

2. Handle missing values

# Load the data (missing fields become NaN)
df = pd.read_json('data.json')

# Identify missing values
print(df.isnull().sum())

# Strategies:
# 1. Drop rows with missing values
df_clean = df.dropna()

# 2. Fill with a statistic (here, the median)
df['age'] = df['age'].fillna(df['age'].median())

# 3. Fill with a placeholder value
df['category'] = df['category'].fillna('unknown')
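
The median-fill strategy can be sketched on raw JSON records before they ever reach a DataFrame. The records below are invented for the example; both an explicit `null` and an absent key are treated as missing.

```python
from statistics import median

# Illustrative records: one explicit null and one absent "age" key.
rows = [
    {"name": "Marco", "age": 30},
    {"name": "Laura", "age": None},
    {"name": "Anna", "age": 40},
    {"name": "Luca"},
]

# Compute the median from the values that are actually present.
known_ages = [row["age"] for row in rows if row.get("age") is not None]
fill_value = median(known_ages)

# Build cleaned copies, leaving the original records untouched.
cleaned = [
    {**row, "age": row["age"] if row.get("age") is not None else fill_value}
    for row in rows
]

print([row["age"] for row in cleaned])  # [30, 35.0, 40, 35.0]
```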

3. Optimize memory

# Convert types to save memory
df['category'] = df['category'].astype('category')
df['age'] = pd.to_numeric(df['age'], downcast='unsigned')

# Check memory usage
print(df.memory_usage(deep=True))

4. Use an efficient storage format

# JSON → Parquet (more efficient)
df.to_parquet('data.parquet', compression='snappy')

# Read Parquet (typically much faster)
df = pd.read_parquet('data.parquet')

# JSON vs Parquet benchmark
import time

# JSON
start = time.perf_counter()
df_json = pd.read_json('large.json')
json_time = time.perf_counter() - start

# Parquet
start = time.perf_counter()
df_parquet = pd.read_parquet('large.parquet')
parquet_time = time.perf_counter() - start

print(f"JSON: {json_time:.2f}s, Parquet: {parquet_time:.2f}s")
# Parquet is often 10-100x faster, though the exact factor depends on the data and hardware.

Conclusion

JSON in Data Science:

✅ Excellent for APIs and web scraping
✅ Flexible for semi-structured data
✅ Excellent support in Python/Pandas
⚠️ Not ideal for very large datasets
⚠️ Consider Parquet/Arrow for performance

Recommended workflow:

Collect: JSON from APIs

Process: Pandas/Python

Store: Parquet for efficiency

Analyze: SQL/Pandas

Visualize: Plotly/Matplotlib

Use JSON where it makes sense, but don't hesitate to convert to more efficient formats for heavy analysis!

JSON in Data Science: Analysis and Processing with Python

Big JSON Team
