
JSON in Data Science: Python and Pandas Guide

A complete guide to JSON in data science workflows. Learn to process JSON with Python and Pandas and integrate it into ML pipelines.

Big JSON Team · 13 min read · programming

JSON in Data Science

JSON is ubiquitous in data science: it is the default format for API responses, NoSQL document stores, and configuration files.

Loading JSON with Pandas

import pandas as pd
from io import StringIO

# Simple JSON file
df = pd.read_json('data.json')

# JSON Lines format
df = pd.read_json('data.jsonl', lines=True)

# From a string (recent pandas versions expect a file-like object,
# so wrap the string in StringIO)
df = pd.read_json(StringIO(json_string))

Handling Nested JSON

import pandas as pd

# Nested data
data = {
    "users": [
        {"name": "Alice", "address": {"city": "NYC"}},
        {"name": "Bob", "address": {"city": "LA"}}
    ]
}

# Normalize nested structure
df = pd.json_normalize(data['users'])
# Columns: name, address.city
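
json_normalize also accepts sep and max_level parameters, which help when the dotted column names get unwieldy or you only want partial flattening. A minimal sketch (the nested "profile" field here is made up for illustration):

data = {
    "users": [
        {"name": "Alice", "profile": {"address": {"city": "NYC", "zip": "10001"}}}
    ]
}

# sep= changes the delimiter used in flattened column names
df = pd.json_normalize(data['users'], sep='_')
# Columns: name, profile_address_city, profile_address_zip

# max_level= stops flattening below a given depth
df = pd.json_normalize(data['users'], max_level=1)
# Columns: name, profile.address (the inner dict is left intact)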

Loading from APIs

import requests
import pandas as pd

response = requests.get('https://api.example.com/data')
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()
df = pd.DataFrame(data)
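
Many APIs page their results. A common pattern is to collect each page into a list of frames and concatenate at the end. A sketch, assuming a hypothetical page query parameter and a results key in the payload:

import requests
import pandas as pd

frames = []
page = 1
while True:
    # the 'page' parameter and 'results' key are assumptions about this API
    resp = requests.get('https://api.example.com/data', params={'page': page})
    resp.raise_for_status()
    payload = resp.json()
    if not payload['results']:
        break
    frames.append(pd.DataFrame(payload['results']))
    page += 1

df = pd.concat(frames, ignore_index=True)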

Handling Missing Data

data = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"},
    {"name": "Bob", "age": 25},  # No email
]

df = pd.DataFrame(data)
df['email'] = df['email'].fillna('unknown')
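
Before filling, it helps to see how much is actually missing. isna().sum() gives a per-column count, and dropna() handles fields you can't sensibly impute:

# Count missing values per column
print(df.isna().sum())

# Drop rows missing a required field instead of filling
# ('name' stands in for whatever field your data can't do without)
df = df.dropna(subset=['name'])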

JSON Lines for Large Files

# Process in chunks
chunks = pd.read_json('large.jsonl', lines=True, chunksize=10000)
for chunk in chunks:
    process(chunk)
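
Chunked reading pays off when you reduce each chunk as it arrives rather than holding everything in memory. A sketch that accumulates per-status counts (the 'status' column is an assumption about the file):

from collections import Counter
import pandas as pd

counts = Counter()
for chunk in pd.read_json('large.jsonl', lines=True, chunksize=10000):
    # Counter.update adds the counts from each chunk to the running total
    counts.update(chunk['status'].value_counts().to_dict())

print(dict(counts))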

Streaming with ijson

import ijson

def process_large_json(filename):
    # Stream records one at a time instead of loading the whole file;
    # the prefix 'items.item' walks each element of a top-level "items" array
    with open(filename, 'rb') as f:
        for item in ijson.items(f, 'items.item'):
            yield item

for record in process_large_json('data.json'):
    print(record)
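
To get streamed records back into pandas without loading the whole file, you can batch the generator into small DataFrames. A minimal sketch built on the generator above:

import itertools
import pandas as pd

def batched_frames(filename, batch_size=5000):
    records = process_large_json(filename)
    while True:
        # islice pulls at most batch_size records off the generator
        batch = list(itertools.islice(records, batch_size))
        if not batch:
            break
        yield pd.DataFrame(batch)

for frame in batched_frames('data.json'):
    print(len(frame))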

Data Analysis

# Basic analysis
df.info()
df.describe()
df['status'].value_counts()

# Group by
summary = df.groupby('category').agg({
    'price': ['mean', 'sum'],
    'rating': 'mean'
})
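
Aggregating with multiple functions produces MultiIndex columns like ('price', 'mean'), which are awkward to export back to JSON. Joining the levels flattens them:

# Flatten ('price', 'mean') -> 'price_mean' for easier export
summary.columns = ['_'.join(col) for col in summary.columns]
print(summary.reset_index())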

Exporting Results

# To JSON
df.to_json('output.json', orient='records', indent=2)

# To Excel
df.to_excel('output.xlsx', index=False)

# To CSV
df.to_csv('output.csv', index=False)
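
With orient='records', the exported file is a plain JSON array of objects, so it round-trips cleanly through read_json:

# Round-trip check: records written with orient='records'
# read back into an equivalent frame
df.to_json('output.json', orient='records', indent=2)
df_back = pd.read_json('output.json', orient='records')
print(df.shape == df_back.shape)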

ML Pipeline Integration

import json
from datetime import datetime

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature engineering
df = pd.json_normalize(data)
df_encoded = pd.get_dummies(df, columns=['category'])

# Split data
X = df_encoded.drop('target', axis=1)
y = df_encoded['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale features (fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save metadata
metadata = {
    "features": list(X.columns),
    "samples": len(df),
    "date": datetime.now().isoformat()
}

with open('metadata.json', 'w') as f:
    json.dump(metadata, f)
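
The saved metadata earns its keep at inference time: reloading the feature list and reindexing incoming data guarantees the same columns, in the same order, that the model saw during training. A sketch, assuming new JSON records arrive in a hypothetical new_data list:

import json
import pandas as pd

with open('metadata.json') as f:
    metadata = json.load(f)

# new_data is a hypothetical batch of incoming records
new_df = pd.json_normalize(new_data)
new_encoded = pd.get_dummies(new_df, columns=['category'])

# Align to the training feature set; categories unseen in new_data become 0
X_new = new_encoded.reindex(columns=metadata['features'], fill_value=0)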

Complex Transformations

# Extract nested arrays
df = pd.json_normalize(
    data,
    record_path=['items'],
    meta=['user_id', 'timestamp']
)

# Multiple levels
df = pd.json_normalize(
    data,
    record_path=['orders', 'items'],
    meta=['customer_id', ['orders', 'order_id']]
)
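
If some records lack one of the meta keys, json_normalize raises a KeyError by default; passing errors='ignore' fills the gap with NaN instead:

# Tolerate records that are missing a meta key
df = pd.json_normalize(
    data,
    record_path=['items'],
    meta=['user_id', 'timestamp'],
    errors='ignore'  # missing meta keys become NaN
)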

Best Practices

  • Validate JSON before processing (a minimal sketch follows this list)
  • Use UTF-8 encoding
  • Handle missing data appropriately
  • Stream large files
  • Document data schemas
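
For the first bullet, the simplest guard is to parse with the standard library inside a try/except before handing anything to pandas:

import json

def load_valid_json(path):
    # UTF-8 is the encoding JSON is normally exchanged in
    with open(path, encoding='utf-8') as f:
        try:
            return json.load(f)
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON in {path}: {e}")

data = load_valid_json('data.json')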
Performance Tips

# Use orient='records' for better performance
df.to_json('output.json', orient='records')

# Compression
df.to_json('output.json.gz', compression='gzip')

# Specify dtypes
df = pd.read_json('data.json', dtype={'id': int, 'value': float})

Conclusion

JSON is essential in data science. Master pd.json_normalize() for nested data and ijson for large files!
