
Processing Large Datasets in Python

wangzf / 2022-07-24



Mainstream Python Tools for Data Processing

When working with large datasets, two concerns need special attention: how to read the data quickly within limited RAM, and how to store it using less disk space.
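One common way to shrink RAM usage is to downcast numeric columns to the smallest type that holds their values, which is what the explicit `dtypes` dicts below do by hand. A minimal sketch with `pd.to_numeric` (the column names here are made up for illustration):

```python
# Downcasting numeric columns to reduce pandas memory usage.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": np.arange(1000, dtype="int64"),       # fits in int16
    "score": np.random.rand(1000).astype("float64"), # float32 is enough here
})
before = df.memory_usage(deep=True).sum()

df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
df["score"] = pd.to_numeric(df["score"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(before, after)  # the downcast copy is noticeably smaller
```

`downcast="integer"` picks the narrowest integer type that can represent the column, so the saving depends on the actual value range.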

pandas

# jupyter lab/notebook
import pandas as pd

%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

df = pd.read_csv("data/train.csv", dtype=dtypes)
print("Train size:", df.shape)
df.head()
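When even a well-typed DataFrame does not fit in RAM, `pd.read_csv` can stream the file in chunks so only one chunk is resident at a time. A sketch using a small in-memory CSV instead of the `train.csv` above:

```python
# Chunked CSV reading keeps peak memory bounded by the chunk size.
import io
import pandas as pd

csv_text = "user_id,answered_correctly\n" + "\n".join(
    f"{i},{i % 2}" for i in range(10)
)

total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)  # process each chunk, then let it be freed

print("rows:", total_rows)  # → rows: 10
```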

Dask

# jupyter lab/notebook
import dask.dataframe as dd

%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

df = dd.read_csv("data/train.csv", dtype=dtypes).compute()  # .compute() materializes the lazy Dask graph into a pandas DataFrame
print("Train size:", df.shape)
df.head()

datatable

# jupyter lab/notebook
import datatable as dt

%%time
df = dt.fread("data/train.csv") 
print("Train size:", df.shape)
df.head()

rapids

# jupyter lab/notebook

# offline RAPIDS installation from a pre-packaged archive (make sure GPU is turned on)
import sys
!cp ../input/rapids/rapids.0.16.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path

import cudf

%%time
df = cudf.read_csv("data/train.csv") 
print("Train size:", df.shape)
df.head()


Mainstream Python Data Storage Formats

csv

# jupyter lab/notebook
import pandas as pd

%%time
train_df = pd.read_csv("data/train.csv")
train_df.info()

Converting csv to pickle/feather/parquet/jay/h5

import pandas as pd
import datatable as dt

train_df = dt.fread("data/train.csv").to_pandas()
train_df.to_csv("data/train.csv", index=False)
train_df.to_pickle("data/train.pkl.gzip")
train_df.to_feather("data/train.feather")
train_df.to_parquet("data/train.parquet")
train_df.to_hdf("data/train.h5", key="train")
dt.Frame(train_df).to_jay("data/train.jay")
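To see what each format actually saves, compare the on-disk sizes after writing. A rough sketch covering only csv and pickle (so it needs nothing beyond pandas), with a small synthetic frame standing in for `train_df`:

```python
# Comparing on-disk footprints of two storage formats.
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": range(10_000), "b": [0.5] * 10_000})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    pkl_path = os.path.join(tmp, "train.pkl")
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    sizes = {p: os.path.getsize(p) for p in (csv_path, pkl_path)}

print(sizes)
```

Which format wins depends on the data: binary formats avoid number-to-text conversion, and columnar formats like parquet add compression on top.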

pickle

# jupyter lab/notebook
import pandas as pd

%%time
train_pickle = pd.read_pickle("data/train.pkl.gzip")
train_pickle.info()
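Note that pandas infers compression from extensions such as `.gz`, but not `.gzip`, so a file named `train.pkl.gzip` is written and read uncompressed unless `compression` is passed explicitly. A sketch with a throwaway frame:

```python
# Passing compression explicitly, since ".gzip" is not an inferred extension.
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"x": list(range(1000)) * 10})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.pkl.gzip")
    df.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")

print(restored.equals(df))  # → True
```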

feather

# jupyter lab/notebook
import pandas as pd

%%time
train_feather = pd.read_feather("data/train.feather")
train_feather.info()

parquet

# jupyter lab/notebook
import pandas as pd

%%time
train_parquet = pd.read_parquet("data/train.parquet")
train_parquet.info()

jay

# jupyter lab/notebook
import datatable as dt

%%time
train_jay = dt.fread("data/train.jay")
train_jay.shape


pandas


datatable

Installation

$ pip install datatable
import datatable as dt
print(dt.__version__)

Core Concepts

Core Methods

Best Practices


Dask

Installation

$ pip install 'dask[complete]'  # Install everything

$ pip install dask  # Install only core parts of dask

$ pip install 'dask[array]'  # Install requirements for dask array
$ pip install 'dask[dataframe]'  # Install requirements for dask dataframe
$ pip install 'dask[diagnostics]'  # Install requirements for dask diagnostics
$ pip install 'dask[distributed]'  # Install requirements for distributed dask


RAPIDS

tqdm

PySpark