Chapter 3 Data transformation
3.1 Import Data
- SP500 & Sector Index
## [1] "S&P500" "Energy" "Information Tech"
## [4] "Financial" "Health Care" "Consumer Discretionary"
## [7] "Utility" "Consumer Staples" "Industrials"
## [10] "Communication Services" "Materials" "Real Estate"
Pick a dataset to take a brief look at:
## # A tibble: 6 x 4
## Date PX_LAST PX_VOLUME Return
## <dttm> <dbl> <dbl> <dbl>
## 1 2021-12-31 00:00:00 4766. 1871893220 1.09
## 2 2021-12-24 00:00:00 4726. 2191128627 1.08
## 3 2021-12-17 00:00:00 4621. 4954862556 1.03
## 4 2021-12-10 00:00:00 4712. 3004269454 1.07
## 5 2021-12-03 00:00:00 4538. 4438623886 0.993
## 6 2021-11-26 00:00:00 4595. 2248016670 1.02
- S&P500 Constituents Financials:
## Symbol Name Sector Price Price.Earnings
## 1 MMM 3M Company Industrials 222.89 24.31
## 2 AOS A.O. Smith Corp Industrials 60.24 27.76
## 3 ABT Abbott Laboratories Health Care 56.27 22.51
## 4 ABBV AbbVie Inc. Health Care 108.48 19.41
## 5 ACN Accenture plc Information Technology 150.51 25.47
## 6 ATVI Activision Blizzard Information Technology 65.83 31.80
## Dividend.Yield Earnings.Share X52.Week.Low X52.Week.High Market.Cap
## 1 2.3328617 7.92 259.770 175.490 138721055226
## 2 1.1479592 1.70 68.390 48.925 10783419933
## 3 1.9089824 0.26 64.600 42.280 102121042306
## 4 2.4995599 3.29 125.860 60.050 181386347059
## 5 1.7144699 5.44 162.600 114.820 98765855553
## 6 0.4319032 1.28 74.945 38.930 52518668144
## EBITDA Price.Sales Price.Book
## 1 9048000000 4.390271 11.34
## 2 601000000 3.575483 6.35
## 3 5744000000 3.740480 3.19
## 4 10310000000 6.291571 26.14
## 5 5643228000 2.604117 10.62
## 6 2704000000 10.595120 5.16
## SEC.Filings
## 1 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM
## 2 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS
## 3 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT
## 4 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV
## 5 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ACN
## 6 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ATVI
## # A tibble: 6 x 25
## `MARKET REPRESENTATION` `2021` `2020` `2021 CHANGE` high election...6 `covid low`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Energy 0.0267 0.0228 0.172 0.0250 0.0201 0.0248
## 2 Materials 0.0256 0.0263 -0.0244 0.0251 0.0272 0.0240
## 3 Industrials 0.0778 0.0840 -0.0734 0.0840 0.0860 0.0784
## 4 Consumer Discretionary 0.125 0.127 -0.0140 0.123 0.115 0.102
## 5 Consumer Staples 0.0588 0.0651 -0.0976 0.0582 0.0702 0.0805
## 6 Health Care 0.133 0.135 -0.0132 0.133 0.141 0.150
## # ... with 18 more variables: pre-covid hi <dbl>, election...9 <dbl>,
## # bear low <dbl>, bull high <dbl>, prior bear low <dbl>,
## # prior bull high <dbl>, 2019 <dbl>, 2018 <dbl>, 2017 <dbl>, 2016 <dbl>,
## # 2015 <dbl>, 2014 <dbl>, 2013 <dbl>, 2012 <dbl>, 2011 <dbl>, 2009 <dbl>,
## # 1999 <dbl>, 1989 <dbl>
3.2 Data Transformation:
3.2.1 Rename column names
First of all, some names of the columns in the data set are not easy for people to read or understand. Therefore, we change some of the names of the columns:
- For each 11 sectors, we name them by their sector name first and then form the data frame by using “P_name of sector”(ie.P_SP500,P_CommuSer….) instead of “PX_LAST” as shown in the original data set.
- Same for the volume columns, we change the name of “PX_VOLUME” into “V_name of sector” for 11 different sectors.
## [1] "Sector" "2021" "2020" "2021_CHANGE"
## [5] "high" "election_1" "covid_low" "pre-covid_hi"
## [9] "election_2" "bear_low" "bull_high" "prior_bear_low"
## [13] "prior_bull_high" "2019" "2018" "2017"
## [17] "2016" "2015" "2014" "2013"
## [21] "2012" "2011" "2009" "1999"
## [25] "1989"
## [1] "Symbol" "Name" "Sector" "Price"
## [5] "Price_Earnings" "Dividend_Yield" "Earnings_Share" "X52_Week_Low"
## [9] "X52_Week_High" "Market_Cap" "EBITDA" "Price_Sales"
## [13] "Price_Book" "SEC_Filings"
3.2.2 Merge into one dataset
## # A tibble: 6 x 37
## Date P_SP500 V_SP500 R_SP500 P_CommuSer V_CommuSer R_CommuSer
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-12-31 00:00:00 4766. 1871893220 1.09 267. 201746084 0.532
## 2 2021-12-24 00:00:00 4726. 2191128627 1.08 270. 218743361 0.544
## 3 2021-12-17 00:00:00 4621. 4954862556 1.03 263. 540307245 0.506
## 4 2021-12-10 00:00:00 4712. 3004269454 1.07 266. 324612410 0.525
## 5 2021-12-03 00:00:00 4538. 4438623886 0.993 259. 467692149 0.482
## 6 2021-11-26 00:00:00 4595. 2248016670 1.02 266. 210905394 0.524
## # ... with 30 more variables: P_ConsumerD <dbl>, V_ConsumerD <dbl>,
## # R_ConsumerD <dbl>, P_ConsumerS <dbl>, V_ConsumerS <dbl>, R_ConsumerS <dbl>,
## # P_Energy <dbl>, V_Energy <dbl>, R_Energy <dbl>, P_Fin <dbl>, V_Fin <dbl>,
## # R_Fin <dbl>, P_Health <dbl>, V_Health <dbl>, R_Health <dbl>,
## # P_Industrial <dbl>, V_Industrial <dbl>, R_Industrial <dbl>,
## # P_Infotech <dbl>, V_Infotech <dbl>, R_Infotech <dbl>, P_Materials <dbl>,
## # V_Materials <dbl>, R_Materials <dbl>, P_RealEstate <dbl>, ...
3.2.3 Seperate Price and Volume Data
## # A tibble: 6 x 13
## Date P_SP500 P_CommuSer P_ConsumerD P_ConsumerS P_Energy P_Fin
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-12-31 00:00:00 4766. 267. 1611. 805. 423. 650.
## 2 2021-12-24 00:00:00 4726. 270. 1604. 785. 418. 646.
## 3 2021-12-17 00:00:00 4621. 263. 1545. 780. 409. 639.
## 4 2021-12-10 00:00:00 4712. 266. 1614. 771. 431. 648.
## 5 2021-12-03 00:00:00 4538. 259. 1575. 745. 415. 631.
## 6 2021-11-26 00:00:00 4595. 266. 1613. 748. 419. 644.
## # ... with 6 more variables: P_Health <dbl>, P_Industrial <dbl>,
## # P_Infotech <dbl>, P_Materials <dbl>, P_RealEstate <dbl>, P_Utility <dbl>
## # A tibble: 6 x 13
## Date V_SP500 V_CommuSer V_ConsumerD V_ConsumerS V_Energy V_Fin
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-12-31 00:00:00 1.87e9 201746084 235410161 121522191 1.32e8 2.06e8
## 2 2021-12-24 00:00:00 2.19e9 218743361 265524409 147001217 1.66e8 2.68e8
## 3 2021-12-17 00:00:00 4.95e9 540307245 557366831 359064442 3.59e8 6.44e8
## 4 2021-12-10 00:00:00 3.00e9 324612410 356340119 201653031 2.01e8 3.44e8
## 5 2021-12-03 00:00:00 4.44e9 467692149 505253793 293469274 3.74e8 5.37e8
## 6 2021-11-26 00:00:00 2.25e9 210905394 286769129 140448143 1.74e8 2.72e8
## # ... with 6 more variables: V_Health <dbl>, V_Industrial <dbl>,
## # V_Infotech <dbl>, V_Materials <dbl>, V_RealEstate <dbl>, V_Utility <dbl>
## # A tibble: 6 x 13
## Date R_SP500 R_CommuSer R_ConsumerD R_ConsumerS R_Energy R_Fin
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-12-31 00:00:00 1.09 0.532 1.43 0.503 -0.242 0.662
## 2 2021-12-24 00:00:00 1.08 0.544 1.42 0.467 -0.250 0.653
## 3 2021-12-17 00:00:00 1.03 0.506 1.33 0.457 -0.267 0.635
## 4 2021-12-10 00:00:00 1.07 0.525 1.44 0.441 -0.228 0.656
## 5 2021-12-03 00:00:00 0.993 0.482 1.38 0.392 -0.255 0.614
## 6 2021-11-26 00:00:00 1.02 0.524 1.43 0.398 -0.249 0.646
## # ... with 6 more variables: R_Health <dbl>, R_Industrial <dbl>,
## # R_Infotech <dbl>, R_Materials <dbl>, R_RealEstate <dbl>, R_Utility <dbl>
Moreover, we have one small add-ons of the data. Although it is not data transformation on the original ones that we directly get from the website. We indeed make some changes. We add one more column for calculating the returns by using the formula: (Pi/P1/6/17)-1 where i is the date price.
## Date Open High Low Close Adj.Close Volume
## 1 27/12/2021 177.09 181.33 177.07 178.20 177.9733 276185800
## 2 20/12/2021 168.28 176.85 167.46 176.28 176.0557 359176900
## 3 13/12/2021 181.12 182.13 169.69 171.14 170.9222 769299200
## 4 06/12/2021 164.29 179.63 164.28 179.45 179.2217 569227700
## 5 29/11/2021 159.37 170.30 157.80 161.84 161.6341 669611100
## 6 22/11/2021 161.68 165.70 156.36 156.81 156.6105 359933200
## Date APPLE_close Apple_return
## 1 27/12/2021 178.20 1.089175
## 2 20/12/2021 176.28 3.003389
## 3 13/12/2021 171.14 -4.630815
## 4 06/12/2021 179.45 10.881118
## 5 29/11/2021 161.84 3.207702
## 6 22/11/2021 156.81 -2.329495
Other than what I mentioned above, that’s all the changes we made on the original data. There are no further transformations on the data.