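Working around the `if __name__ == "__main__":` restriction of multiprocessing on Windows to read multiple files in parallel with Pandas
Writing this down as a note to self.
The title is way too long...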
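As the Stack Overflow question linked here describes, multiprocessing does not work on Windows unless you write a main guard, because each child process starts a fresh interpreter and re-imports the main script: stackoverflow.com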
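For reference, here is a minimal sketch of what the main guard looks like. The names `square` and `main` are made up for illustration and do not come from the linked question. The point is only that anything creating a `Pool` sits behind the guard, so the re-import in each worker does not try to start workers again:

```python
from multiprocessing import Pool


def square(x):
    # Work done inside each worker process.
    return x * x


def main():
    # The Pool is created only here, never at import time.
    with Pool(2) as pool:
        return pool.map(square, range(5))


if __name__ == "__main__":
    # Runs only in the parent process, not when a worker re-imports this file.
    print(main())  # prints [0, 1, 4, 9, 16]
```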
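Let's try to get around this somehow and read files with Pandas in multiple processes. Concretely, as shown below, the script launches a separate Python process and receives the data back via pickle.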
# -*- coding: utf-8 -*-
import pickle
import subprocess
import sys

import pandas as pd
from sklearn import datasets

# Source of the worker script. It has its own main guard, so Pool works there.
mp = \
'''# -*- coding: utf-8 -*-
from multiprocessing import Pool
import pandas as pd
import sys
import pickle


def read_(f, header):
    return pd.read_csv(f, header=header)


def read(x):
    return read_(x[0], x[1])


def read_csv(fs):
    with Pool() as p:
        return p.map(read, fs)


if __name__ == "__main__":
    f = sys.argv[1]
    n1 = int(sys.argv[2])
    n2 = int(sys.argv[3])
    header = int(sys.argv[4])
    fs = []
    for i in range(n1, n2):
        fs.append((f.replace('{}', str(i)), header))
    dfs = read_csv(fs)
    # Send the list of DataFrames back to the parent through stdout.
    binary = pickle.dumps(dfs)
    sys.stdout.buffer.write(binary)
'''

with open('mp.py', mode='w') as f:
    f.write(mp)


def read_csv(f_name, start=0, end=30, header=0):
    s_str = str(start)
    e_str = str(end)
    h_str = str(header)
    # sys.executable runs the worker with the same interpreter as this script.
    result = subprocess.run([sys.executable, 'mp.py', f_name, s_str, e_str, h_str],
                            stdout=subprocess.PIPE)
    return pickle.loads(result.stdout)


# Write 30 identical CSV files to have something to read.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
for i in range(30):
    df.to_csv('data'+str(i)+'.csv')

dfs = read_csv('data{}.csv', 0, 30)
print('len:', len(dfs))
print('===============================================')
print('head:')
print(dfs[0].head())
print('===============================================')
print('info:')
dfs[0].info()
Output
len: 30
===============================================
head:
   Unnamed: 0  sepal length (cm)  sepal width (cm)  petal length (cm)  \
0           0                5.1               3.5                1.4
1           1                4.9               3.0                1.4
2           2                4.7               3.2                1.3
3           3                4.6               3.1                1.5
4           4                5.0               3.6                1.4

   petal width (cm)
0               0.2
1               0.2
2               0.2
3               0.2
4               0.2
===============================================
info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
Unnamed: 0           150 non-null int64
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4), int64(1)
memory usage: 5.9 KB
Turning this subprocess approach into a module would probably work well.
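A rough sketch of what such a module might look like, assuming the worker script has already been saved as `mp.py` and takes the same command-line arguments as in the script above. The function `build_command` and the `worker` parameter are hypothetical names introduced here, and `check=True` is an addition so that a crash in the worker raises `CalledProcessError` instead of handing empty bytes to `pickle.loads`:

```python
import pickle
import subprocess
import sys


def build_command(pattern, start, end, header, worker='mp.py'):
    # Assumption: the worker accepts <pattern> <start> <end> <header>,
    # matching the argument order of the mp.py script in this post.
    return [sys.executable, worker, pattern, str(start), str(end), str(header)]


def read_csv(pattern, start=0, end=30, header=0, worker='mp.py'):
    # Run the worker in a separate interpreter and capture its stdout.
    result = subprocess.run(
        build_command(pattern, start, end, header, worker),
        stdout=subprocess.PIPE,
        check=True,  # fail loudly if the worker exits with an error
    )
    # The worker writes a pickled list of DataFrames to stdout.
    return pickle.loads(result.stdout)
```

With this saved as an importable file, calling `read_csv('data{}.csv', 0, 30)` should return the same list of DataFrames as the script above, without the caller needing a main guard of its own.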
Then again, if you are going to go to this much trouble, it is arguably better to just write the main guard...