
JSON in Data Science: A Python and Pandas Guide

A complete guide to JSON in data science workflows. Learn how to process JSON with Python and Pandas, and how to integrate it into ML pipelines.


Big JSON Team

Technical Writer

Expert in JSON data manipulation, API development, and web technologies. Passionate about creating tools that make developers' lives easier.

13 min read

JSON in Data Science

JSON is everywhere in data science: API responses, NoSQL databases, and configuration files.

Loading JSON with Pandas

import io
import pandas as pd

# A simple JSON file
df = pd.read_json('data.json')

# JSON Lines format
df = pd.read_json('data.jsonl', lines=True)

# From a string (pandas 2.x expects a file-like object)
df = pd.read_json(io.StringIO(json_string))

Handling Nested JSON

import pandas as pd

# Nested data
data = {
    "users": [
        {"name": "Alice", "address": {"city": "NYC"}},
        {"name": "Bob", "address": {"city": "LA"}}
    ]
}

# Normalize the nested structure
df = pd.json_normalize(data['users'])
# Columns: name, address.city
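By default, json_normalize joins nested keys with a dot. The sep and max_level parameters (both part of the pandas API) let you control the separator and how deep the flattening goes; the data below is invented for illustration:

```python
import pandas as pd

data = [
    {"name": "Alice", "address": {"city": "NYC", "geo": {"lat": 40.7}}},
    {"name": "Bob", "address": {"city": "LA", "geo": {"lat": 34.1}}},
]

# Use an underscore separator instead of the default dot
df = pd.json_normalize(data, sep="_")
# Columns: name, address_city, address_geo_lat

# Stop flattening after one level; deeper dicts stay as Python objects
shallow = pd.json_normalize(data, max_level=1)
# Columns: name, address.city, address.geo
```

Keeping deeper levels unflattened (max_level) is handy when a nested object should be stored as-is rather than exploded into many columns.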

Loading from an API

import requests
import pandas as pd

response = requests.get('https://api.example.com/data')
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()
df = pd.DataFrame(data)

Handling Missing Data

data = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"},
    {"name": "Bob", "age": 25},  # No email
]

df = pd.DataFrame(data)
df['email'] = df['email'].fillna('unknown')
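fillna only helps if the column exists at all; when no record carries a key, the DataFrame simply lacks that column. The standard reindex method can guarantee the expected schema first — a minimal sketch:

```python
import pandas as pd

# No record has an "email" key at all
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

df = pd.DataFrame(data)

# Guarantee the schema even when a key is absent from every record
df = df.reindex(columns=["name", "age", "email"])
df["email"] = df["email"].fillna("unknown")
```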

JSON Lines for Large Files

# Process in chunks
chunks = pd.read_json('large.jsonl', lines=True, chunksize=10000)

for chunk in chunks:
    process(chunk)
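A self-contained sketch of the same pattern, writing a small throwaway JSON Lines file first (the file and its contents are purely illustrative):

```python
import json
import tempfile
import pandas as pd

# Write a small JSON Lines file to demonstrate chunked reading
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for i in range(25):
        f.write(json.dumps({"id": i, "value": i * 2}) + "\n")
    path = f.name

total_rows = 0
with pd.read_json(path, lines=True, chunksize=10) as reader:
    for chunk in reader:
        total_rows += len(chunk)  # stand-in for real per-chunk work
```

Each chunk is an ordinary DataFrame of at most chunksize rows, so memory stays bounded no matter how large the file is.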

Streaming with ijson

import ijson

def process_large_json(filename):
    with open(filename, 'rb') as f:
        for item in ijson.items(f, 'items.item'):
            yield item

for record in process_large_json('data.json'):
    print(record)

Data Analysis

# Basic analysis
df.info()
df.describe()
df['status'].value_counts()

# Group by
summary = df.groupby('category').agg({
    'price': ['mean', 'sum'],
    'rating': 'mean'
})

Exporting Results

# To JSON
df.to_json('output.json', orient='records', indent=2)

# To Excel
df.to_excel('output.xlsx', index=False)

# To CSV
df.to_csv('output.csv', index=False)
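A quick round-trip check of the JSON export: orient='records' writes a plain list of objects that read_json can load straight back (the temp-file path here is just for the demo):

```python
import tempfile
import pandas as pd

df = pd.DataFrame([
    {"name": "Alice", "score": 90},
    {"name": "Bob", "score": 85},
])

with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
    path = f.name

# orient='records' writes a plain list of objects: [{...}, {...}]
df.to_json(path, orient="records", indent=2)

# Columns and values survive the round trip
restored = pd.read_json(path)
```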

ML Pipeline Integration

import json
from datetime import datetime

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature engineering
df = pd.json_normalize(data)
df_encoded = pd.get_dummies(df, columns=['category'])

# Split the data
X = df_encoded.drop('target', axis=1)
y = df_encoded['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Save metadata
metadata = {
    "features": list(X.columns),
    "samples": len(df),
    "date": datetime.now().isoformat()
}

with open('metadata.json', 'w') as f:
    json.dump(metadata, f)

Complex Transformations

# Extract a nested array
df = pd.json_normalize(
    data,
    record_path=['items'],
    meta=['user_id', 'timestamp']
)

# Multiple levels
df = pd.json_normalize(
    data,
    record_path=['orders', 'items'],
    meta=['customer_id', ['orders', 'order_id']]
)
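With some sample data (invented for illustration), record_path plus meta produces one row per nested item, with the parent-level fields repeated onto each row:

```python
import pandas as pd

# Illustrative input: one record with a nested list of items
data = [
    {
        "user_id": "u1",
        "timestamp": "2024-01-01",
        "items": [
            {"sku": "A", "qty": 2},
            {"sku": "B", "qty": 1},
        ],
    }
]

df = pd.json_normalize(data, record_path=["items"], meta=["user_id", "timestamp"])
# One row per item; user_id and timestamp are repeated on each row
```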

Best Practices

  • Validate JSON before processing
  • Use UTF-8 encoding
  • Handle missing data appropriately
  • Stream large files
  • Document the data schema
Performance Tips

# Use orient='records' for better performance
df.to_json('output.json', orient='records')

# Compression
df.to_json('output.json.gz', compression='gzip')

# Specify dtypes
df = pd.read_json('data.json', dtype={'id': int, 'value': float})
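The "validate JSON before processing" practice from the best-practices list can be as simple as a json.loads guard (standard library only; the helper name is our own):

```python
import json

def load_valid_json(text):
    """Return parsed data, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

good = load_valid_json('{"id": 1}')
bad = load_valid_json('{"id": 1,}')  # trailing comma is invalid JSON
```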

Conclusion

JSON is essential in data science. Master pd.json_normalize() for nested data and ijson for large files!
