JSON in Data Science: Python and Pandas Guide
A complete guide to JSON in data science workflows. Learn to process JSON with Python and pandas, and integrate it into ML pipelines.
Big JSON Team
JSON in Data Science
JSON is ubiquitous in data science for API data, NoSQL databases, and configuration files.
Loading JSON with Pandas
import pandas as pd
# Simple JSON file
df = pd.read_json('data.json')
# JSON Lines format
df = pd.read_json('data.jsonl', lines=True)
# From string
df = pd.read_json(json_string)
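Note that recent pandas versions deprecate passing a raw JSON string directly to read_json; wrapping it in StringIO is the forward-compatible form. A minimal, self-contained sketch (the payload is made-up sample data):

```python
import io
import pandas as pd

# A small JSON payload in "records" orientation (hypothetical sample data)
payload = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

# StringIO makes the string file-like, avoiding the literal-string deprecation
df = pd.read_json(io.StringIO(payload))

print(df.shape)  # (2, 2)
```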
Handling Nested JSON
import pandas as pd
# Nested data
data = {
    "users": [
        {"name": "Alice", "address": {"city": "NYC"}},
        {"name": "Bob", "address": {"city": "LA"}}
    ]
}
# Normalize nested structure
df = pd.json_normalize(data['users'])
# Columns: name, address.city
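If dotted column names are awkward downstream, json_normalize's sep parameter controls the separator. A runnable sketch of the same shape of data (the zip field is an added illustration):

```python
import pandas as pd

data = {
    "users": [
        {"name": "Alice", "address": {"city": "NYC", "zip": "10001"}},
        {"name": "Bob", "address": {"city": "LA", "zip": "90001"}},
    ]
}

# Flatten with underscores instead of the default dots
df = pd.json_normalize(data["users"], sep="_")
print(sorted(df.columns))  # ['address_city', 'address_zip', 'name']
```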
Loading from APIs
import requests
import pandas as pd
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()
df = pd.DataFrame(data)
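The endpoint above is a placeholder, so this sketch simulates the response body locally. records_to_df is a hypothetical helper that assumes the API returns a JSON array of record objects:

```python
import json
import pandas as pd

def records_to_df(payload: str) -> pd.DataFrame:
    """Parse a JSON API body (assumed to be a list of records) into a DataFrame."""
    return pd.DataFrame(json.loads(payload))

# Simulated response body standing in for response.text
body = '[{"id": 1, "value": 10.5}, {"id": 2, "value": 7.25}]'
df = records_to_df(body)
print(len(df))  # 2
```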
Handling Missing Data
data = [
{"name": "Alice", "age": 30, "email": "alice@example.com"},
{"name": "Bob", "age": 25}, # No email
]
df = pd.DataFrame(data)
df['email'] = df['email'].fillna('unknown')
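It is worth counting the gaps before filling them, so you know how much data was imputed. A self-contained version of the example above:

```python
import pandas as pd

data = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"},
    {"name": "Bob", "age": 25},  # missing key becomes NaN in the DataFrame
]
df = pd.DataFrame(data)

missing = int(df["email"].isna().sum())  # count gaps before imputing
df["email"] = df["email"].fillna("unknown")
print(missing, df.loc[1, "email"])  # 1 unknown
```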
JSON Lines for Large Files
# Process in chunks
chunks = pd.read_json('large.jsonl', lines=True, chunksize=10000)
for chunk in chunks:
    process(chunk)
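Chunked reading only works with lines=True, since JSON Lines can be split at line boundaries. An end-to-end sketch that writes a small JSONL file to a temporary directory and reads it back in chunks (the file name and chunk size are illustrative):

```python
import json
import os
import tempfile
import pandas as pd

# Write a small JSON Lines file: one object per line
records = [{"id": i, "value": i * 2} for i in range(25)]
path = os.path.join(tempfile.mkdtemp(), "large.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# chunksize returns an iterator of DataFrames instead of one big frame
total_rows = 0
with pd.read_json(path, lines=True, chunksize=10) as reader:
    for chunk in reader:
        total_rows += len(chunk)
print(total_rows)  # 25
```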
Streaming with ijson
import ijson
def process_large_json(filename):
    with open(filename, 'rb') as f:
        for item in ijson.items(f, 'items.item'):
            yield item

for record in process_large_json('data.json'):
    print(record)
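ijson shines when you have one huge JSON document. When the data is already JSON Lines, the standard library alone can stream it with constant memory. A stdlib-only sketch (file path and contents are illustrative):

```python
import json
import os
import tempfile

def stream_jsonl(filename):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo file
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

ids = [rec["id"] for rec in stream_jsonl(path)]
print(ids)  # [1, 2]
```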
Data Analysis
# Basic analysis
df.info()
df.describe()
df['status'].value_counts()
# Group by
summary = df.groupby('category').agg({
    'price': ['mean', 'sum'],
    'rating': 'mean'
})
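Passing a dict of column-to-function mappings produces MultiIndex columns like ('price', 'mean'). A runnable sketch with made-up numbers showing how to index into the result:

```python
import pandas as pd

df = pd.DataFrame([
    {"category": "a", "price": 10.0, "rating": 4.0},
    {"category": "a", "price": 30.0, "rating": 5.0},
    {"category": "b", "price": 20.0, "rating": 3.0},
])

summary = df.groupby("category").agg({
    "price": ["mean", "sum"],
    "rating": "mean",
})

# Columns are a MultiIndex: select with (column, aggregation) tuples
print(summary.loc["a", ("price", "mean")])  # 20.0
```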
Exporting Results
# To JSON
df.to_json('output.json', orient='records', indent=2)
# To Excel (requires the openpyxl package)
df.to_excel('output.xlsx', index=False)
# To CSV
df.to_csv('output.csv', index=False)
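A quick sanity check after exporting is to read the file back and compare. A round-trip sketch using a temporary directory (the data is illustrative):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
out = os.path.join(tempfile.mkdtemp(), "output.json")

# orient='records' writes a JSON array of row objects
df.to_json(out, orient="records", indent=2)

df2 = pd.read_json(out)
print(df2.shape)  # (2, 2)
```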
ML Pipeline Integration
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Feature engineering
df = pd.json_normalize(data)
df_encoded = pd.get_dummies(df, columns=['category'])
# Split data
X = df_encoded.drop('target', axis=1)
y = df_encoded['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for reproducibility
import json
from datetime import datetime

# Save metadata
metadata = {
    "features": list(X.columns),
    "samples": len(df),
    "date": datetime.now().isoformat()
}
with open('metadata.json', 'w') as f:
    json.dump(metadata, f)
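The encoding and metadata steps need only pandas and the standard library, so they can be sketched without scikit-learn. A self-contained version with made-up data (column names are illustrative):

```python
import json
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "b", "a"],
    "value": [1.0, 2.0, 3.0],
    "target": [0, 1, 0],
})

# One-hot encode the categorical column
df_encoded = pd.get_dummies(df, columns=["category"])
X = df_encoded.drop("target", axis=1)

# Record which features the model was trained on, and when
metadata = {
    "features": list(X.columns),
    "samples": len(df),
    "date": datetime.now().isoformat(),
}
blob = json.dumps(metadata)
print(sorted(metadata["features"]))
```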
Complex Transformations
# Extract nested arrays
df = pd.json_normalize(
    data,
    record_path=['items'],
    meta=['user_id', 'timestamp']
)

# Multiple levels
df = pd.json_normalize(
    data,
    record_path=['orders', 'items'],
    meta=['customer_id', ['orders', 'order_id']]
)
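record_path explodes the nested array into one row per element, while meta repeats parent-level fields on each row. A runnable sketch of the single-level case with illustrative data:

```python
import pandas as pd

data = [
    {
        "user_id": 1,
        "timestamp": "2024-01-01",
        "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
    },
    {
        "user_id": 2,
        "timestamp": "2024-01-02",
        "items": [{"sku": "C", "qty": 5}],
    },
]

# One row per item; user_id and timestamp are repeated via meta=
df = pd.json_normalize(data, record_path=["items"], meta=["user_id", "timestamp"])
print(len(df))  # 3
```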
Best Practices
Performance Tips
# orient='records' writes a flat array of row objects: fast to serialize and easy for other tools to consume
df.to_json('output.json', orient='records')
# Compression
df.to_json('output.json.gz', compression='gzip')
# Specify dtypes
df = pd.read_json('data.json', dtype={'id': int, 'value': float})
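Compression round-trips transparently: pandas infers gzip from a .gz extension on write and read, or you can pass compression='gzip' explicitly. A quick sketch with illustrative data:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [1.5, 2.5, 3.5]})
path = os.path.join(tempfile.mkdtemp(), "output.json.gz")

# compression can be set explicitly or inferred from the .gz extension
df.to_json(path, orient="records", compression="gzip")

df2 = pd.read_json(path, compression="gzip")
print(df2.shape)  # (3, 2)
```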
Conclusion
JSON is essential in data science. Master pd.json_normalize() for nested data and ijson for large files!