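A model is trained on customer data such as yearly amount spent, time spent using the service, and length of membership.
When a new customer arrives, the trained model is used to predict that customer's yearly amount spent.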
Loading Modules and Data
In [1]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Load data
data = pd.read_csv('ecommerce.csv')
Inspecting the Data
In [3]:
data
Out[3]:
 | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | lewisjessica@craig-evans.com | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 | 573.847438 |
496 | katrina56@gmail.com | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 | 529.049004 |
497 | dale88@hotmail.com | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 | 551.620145 |
498 | cwilson@hotmail.com | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 | 456.469510 |
499 | hannahwilson@davidson.com | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 | 497.778642 |
500 rows × 8 columns
In [4]:
data.head()
Out[4]:
 | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
In [5]:
data.tail()
Out[5]:
 | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
495 | lewisjessica@craig-evans.com | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 | 573.847438 |
496 | katrina56@gmail.com | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 | 529.049004 |
497 | dale88@hotmail.com | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 | 551.620145 |
498 | cwilson@hotmail.com | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 | 456.469510 |
499 | hannahwilson@davidson.com | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 | 497.778642 |
In [6]:
data.sample()
Out[6]:
 | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
287 | matthewgraves@mills-shaffer.com | 66636 Jason Parkway\nKellyside, NV 67033 | DarkSeaGreen | 33.908565 | 12.914847 | 39.068864 | 1.48236 | 432.472061 |
In [7]:
data.shape
Out[7]:
(500, 8)
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Email                 500 non-null    object
 1   Address               500 non-null    object
 2   Avatar                500 non-null    object
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB
In [9]:
data.describe()
Out[9]:
 | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|
count | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
mean | 33.053194 | 12.052488 | 37.060445 | 3.533462 | 499.314038 |
std | 0.992563 | 0.994216 | 1.010489 | 0.999278 | 79.314782 |
min | 29.532429 | 8.508152 | 33.913847 | 0.269901 | 256.670582 |
25% | 32.341822 | 11.388153 | 36.349257 | 2.930450 | 445.038277 |
50% | 33.082008 | 11.983231 | 37.069367 | 3.533975 | 498.887875 |
75% | 33.711985 | 12.753850 | 37.716432 | 4.126502 | 549.313828 |
max | 36.139662 | 15.126994 | 40.005182 | 6.922689 | 765.518462 |
In [10]:
sns.pairplot(data)
Out[10]:
<seaborn.axisgrid.PairGrid at 0x1a90900e850>
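A pair plot shows relationships visually; a correlation table gives the same information as numbers. The sketch below is a minimal illustration on a hypothetical synthetic frame (`demo`), not the real `ecommerce.csv`. The coefficients used to generate the target are arbitrary assumptions chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data that reuses a few of the dataset's column names.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
# Arbitrary made-up relationship plus noise, for illustration only.
demo["Yearly Amount Spent"] = (
    35 * demo["Time on App"]
    + 60 * demo["Length of Membership"]
    + rng.normal(0, 10, n)
)
demo["Avatar"] = "Violet"  # a text column, to show that it gets excluded

# Correlations are only defined for numeric columns, so select those first.
corr = demo.select_dtypes(include="number").corr()

# Rank the features by their correlation with the target.
target_corr = (
    corr["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real data, the same two lines (`select_dtypes(...).corr()` and the ranking step) could be applied to `data` directly.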
Removing Unnecessary Columns
In [11]:
# Drop columns that are not needed to predict a customer's yearly amount spent
data.drop(['Email', 'Address', 'Avatar'], axis=1, inplace=True)
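Dropping with `inplace=True` mutates `data`, so re-running the cell raises a `KeyError` because the columns are already gone. One possible alternative, sketched below on a hypothetical two-row frame (`demo`, with made-up values that only mirror the column layout), is to select the numeric columns into a new frame.

```python
import pandas as pd

# Hypothetical two-row frame mirroring the dataset's columns (values are made up).
demo = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com"],
    "Address": ["1 Example St", "2 Example Ave"],
    "Avatar": ["Violet", "Teal"],
    "Avg. Session Length": [34.5, 31.9],
    "Time on App": [12.7, 11.1],
    "Time on Website": [39.6, 37.3],
    "Length of Membership": [4.1, 2.7],
    "Yearly Amount Spent": [588.0, 392.2],
})

# Keep numeric columns only; this returns a new frame and leaves `demo` intact,
# so the cell can be re-run safely.
numeric = demo.select_dtypes(include="number")
print(list(numeric.columns))
```

This only works as a substitute for the explicit drop because every column to be removed here is text; a numeric ID column would still need to be dropped by name.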
Train/Test Split
In [12]:
from sklearn.model_selection import train_test_split
In [13]:
X = data.drop('Yearly Amount Spent', axis=1)
In [14]:
y = data['Yearly Amount Spent']
In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
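With 500 rows and `test_size=0.2`, the split should yield 400 training rows and 100 test rows. The sketch below checks this on a hypothetical placeholder frame of the same size (`X_demo` and `y_demo` hold dummy integers, not the real data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows (500) and features (4) as the dataset.
X_demo = pd.DataFrame(
    np.arange(2000).reshape(500, 4),
    columns=["Avg. Session Length", "Time on App",
             "Time on Website", "Length of Membership"],
)
y_demo = pd.Series(np.arange(500), name="Yearly Amount Spent")

# Same arguments as in the cell above: 20% held out, fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

print(X_tr.shape, X_te.shape)  # (400, 4) (100, 4)

# Features and target stay aligned row by row, and the two sets do not overlap.
print(X_tr.index.equals(y_tr.index))            # True
print(set(X_tr.index).isdisjoint(X_te.index))   # True
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the test set on every run.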
Building a Linear Regression Model
In [16]:
import statsmodels.api as sm
In [17]:
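# Fit OLS on the training set. No intercept column is added to X_train,
# so the regression line is forced through the origin.
model = sm.OLS(y_train, X_train).fit()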
In [18]:
model.summary()
Out[18]:
Dep. Variable: | Yearly Amount Spent | R-squared (uncentered): | 0.998 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.998 |
Method: | Least Squares | F-statistic: | 4.798e+04 |
Date: | Sat, 29 Jan 2022 | Prob (F-statistic): | 0.00 |
Time: | 12:29:00 | Log-Likelihood: | -1820.0 |
No. Observations: | 400 | AIC: | 3648. |
Df Residuals: | 396 | BIC: | 3664. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Avg. Session Length | 11.9059 | 0.869 | 13.696 | 0.000 | 10.197 | 13.615 |
Time on App | 34.3257 | 1.121 | 30.610 | 0.000 | 32.121 | 36.530 |
Time on Website | -14.1405 | 0.812 | -17.405 | 0.000 | -15.738 | -12.543 |
Length of Membership | 61.0149 | 1.144 | 53.318 | 0.000 | 58.765 | 63.265 |
Omnibus: | 0.490 | Durbin-Watson: | 1.987 |
---|---|---|---|
Prob(Omnibus): | 0.783 | Jarque-Bera (JB): | 0.606 |
Skew: | -0.022 | Prob(JB): | 0.739 |
Kurtosis: | 2.814 | Cond. No. | 55.4 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Prediction and Evaluation
In [19]:
predictions = model.predict(X_test)
In [20]:
predictions
Out[20]:
69     418.211323
29     567.097473
471    534.706617
344    425.690888
54     474.931682
          ...
460    570.877250
152    564.267305
154    557.093996
56     489.285778
392    550.720695
Length: 100, dtype: float64
In [21]:
y_test
Out[21]:
69     451.575685
29     554.722084
471    541.049831
344    442.722892
54     522.404141
          ...
460    618.845970
152    555.892595
154    595.803819
56     520.898794
392    549.131573
Name: Yearly Amount Spent, Length: 100, dtype: float64
In [22]:
sns.scatterplot(x=y_test, y=predictions)
Out[22]:
<AxesSubplot:xlabel='Yearly Amount Spent'>
In [23]:
from sklearn import metrics
In [24]:
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MSE: 482.2890139088915
RMSE: 21.961079525125616
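MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root, which brings the error back to the units of the target. The sketch below computes both by hand with NumPy, using only the first five actual/predicted pairs shown in the outputs above. It therefore illustrates the formula and will not reproduce the full 100-row scores.

```python
import numpy as np

# First five rows of y_test and predictions, copied from the outputs above.
y_true = np.array([451.575685, 554.722084, 541.049831, 442.722892, 522.404141])
y_pred = np.array([418.211323, 567.097473, 534.706617, 425.690888, 474.931682])

# Residuals: actual minus predicted.
errors = y_true - y_pred

# MSE: average of the squared residuals.
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, in the same units as the target.
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
```

Applied to the same arrays, `metrics.mean_squared_error(y_true, y_pred)` returns the same value as `mse` here.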