Working around the 'if __name__ == "__main__":' restriction of multiprocessing on Windows to read multiple files in parallel with Pandas

Writing this down as a note to self.

The title is way too long...

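On Windows, multiprocessing does not work unless the script's entry point is protected by a main guard, because each worker process is started as a fresh interpreter that imports the main module again (discussed on stackoverflow.com).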
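For reference, here is a minimal sketch of what the guarded form usually looks like. The names square and parallel_squares are made up for illustration, and the pool size of 2 is an arbitrary assumption.

```python
from multiprocessing import Pool


def square(x):
    # Worker function; it lives at module top level so that
    # child processes can find it when they import this module.
    return x * x


def parallel_squares(values):
    # Pool(2) starts two worker processes (2 is an arbitrary choice).
    with Pool(2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # On Windows each worker imports this file again. The guard keeps
    # that import from trying to start another pool.
    print(parallel_squares([1, 2, 3]))
```

Everything that actually starts processes sits under the guard, while the worker function stays importable.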

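Let's work around that somehow and read files in multiple processes with Pandas. Concretely, as in the code below, a separate Python process is launched and the data is received back from it via pickle.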

# -*- coding: utf-8 -*-
import pickle
import subprocess
import pandas as pd
from sklearn import datasets

mp = \
'''# -*- coding: utf-8 -*-
from multiprocessing import Pool
import pandas as pd
import sys
import pickle

def read_(f, header):
    return pd.read_csv(f, header=header)

def read(x):
    return read_(x[0], x[1])

def read_csv(fs):
    with Pool() as p:
        return p.map(read, fs)

if __name__ == "__main__":
    f = sys.argv[1]
    n1 = int(sys.argv[2])
    n2 = int(sys.argv[3])
    header = int(sys.argv[4])
    fs = []
    for i in range(n1, n2):
        fs.append((f.replace('{}', str(i)), header))
    dfs = read_csv(fs)
    binary = pickle.dumps(dfs)
    sys.stdout.buffer.write(binary)
'''

with open('mp.py', mode='w') as f:
    f.write(mp)

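def read_csv(f_name, start=0, end=30, header=0):
    # Command-line arguments must be strings
    s_str = str(start)
    e_str = str(end)
    h_str = str(header)
    # Run mp.py as a separate process; check=True raises if the child fails
    result = subprocess.run(['python', 'mp.py', f_name, s_str, e_str, h_str], stdout=subprocess.PIPE, check=True)
    # The child wrote the pickled list of DataFrames to its stdout
    return pickle.loads(result.stdout)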


iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
for i in range(30):
    df.to_csv('data'+str(i)+'.csv')

dfs = read_csv('data{}.csv', 0, 30)
print('len:', len(dfs))
print('===============================================')
print('head:')
print(dfs[0].head())
print('===============================================')
print('info:')
dfs[0].info()

Output

len: 30
===============================================
head:
   Unnamed: 0  sepal length (cm)  sepal width (cm)  petal length (cm)  \
0           0                5.1               3.5                1.4   
1           1                4.9               3.0                1.4   
2           2                4.7               3.2                1.3   
3           3                4.6               3.1                1.5   
4           4                5.0               3.6                1.4   

   petal width (cm)  
0               0.2  
1               0.2  
2               0.2  
3               0.2  
4               0.2  
===============================================
info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
Unnamed: 0           150 non-null int64
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4), int64(1)
memory usage: 5.9 KB
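Stripped of the Pandas part, the mechanism is just "the child writes pickled bytes to stdout, the parent unpickles them". Below is a minimal sketch of only that part. CHILD_CODE and fetch_from_child are names invented for this illustration, the payload is dummy data, and the child is started with sys.executable instead of a hard-coded 'python' so that it runs under the same interpreter as the parent.

```python
import pickle
import subprocess
import sys

# Source of the child process (illustrative). It must write nothing
# but the pickled bytes to stdout, or the parent cannot unpickle it.
CHILD_CODE = (
    "import pickle, sys\n"
    "payload = {'rows': [1, 2, 3], 'name': 'demo'}\n"
    "sys.stdout.buffer.write(pickle.dumps(payload))\n"
)


def fetch_from_child():
    # Run the child and capture its stdout as raw bytes.
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the child fails
    )
    # Rebuild the Python object from the captured bytes.
    return pickle.loads(result.stdout)


if __name__ == "__main__":
    print(fetch_from_child())
```

Since pickle.loads can execute arbitrary code, this should only be used with a child process you control.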

Wrapping this up as a module seems like a good idea.

But then again, rather than going to these lengths, it is arguably better to just write the main guard...
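For comparison, the straightforward version with a main guard might look roughly like the sketch below. The names read_one and read_csv_parallel are hypothetical, and the demo writes its own throwaway CSV files into a temporary directory, so it does not depend on the data*.csv files from above.

```python
import os
import tempfile
from multiprocessing import Pool

import pandas as pd


def read_one(args):
    # Pool.map passes a single argument, so path and header travel as a tuple.
    path, header = args
    return pd.read_csv(path, header=header)


def read_csv_parallel(template, start, end, header=0):
    # Expand 'data{}.csv' into data0.csv, data1.csv, ... and read them in a pool.
    jobs = [(template.replace("{}", str(i)), header) for i in range(start, end)]
    with Pool() as pool:
        return pool.map(read_one, jobs)


if __name__ == "__main__":
    # Demo on throwaway files so the sketch runs without existing data.
    with tempfile.TemporaryDirectory() as tmp:
        template = os.path.join(tmp, "data{}.csv")
        for i in range(3):
            frame = pd.DataFrame({"a": [i, i + 1]})
            frame.to_csv(template.replace("{}", str(i)), index=False)
        dfs = read_csv_parallel(template, 0, 3)
        print("len:", len(dfs))
```

This needs no generated mp.py and no pickling through stdout. The price is that the calling script itself has to carry the guard.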

suzuzusu日記
