Chapter 3 Data transformation

3.1 Import Data

  1. SP500 & Sector Index
##  [1] "S&P500"                 "Energy"                 "Information Tech"      
##  [4] "Financial"              "Health Care"            "Consumer Discretionary"
##  [7] "Utility"                "Consumer Staples"       "Industrials"           
## [10] "Communication Services" "Materials"              "Real Estate"

Pick a dataset to take a brief look at:

## # A tibble: 6 x 4
##   Date                PX_LAST  PX_VOLUME Return
##   <dttm>                <dbl>      <dbl>  <dbl>
## 1 2021-12-31 00:00:00   4766. 1871893220  1.09 
## 2 2021-12-24 00:00:00   4726. 2191128627  1.08 
## 3 2021-12-17 00:00:00   4621. 4954862556  1.03 
## 4 2021-12-10 00:00:00   4712. 3004269454  1.07 
## 5 2021-12-03 00:00:00   4538. 4438623886  0.993
## 6 2021-11-26 00:00:00   4595. 2248016670  1.02
  1. S&P500 Constituents Financials:
##   Symbol                Name                 Sector  Price Price.Earnings
## 1    MMM          3M Company            Industrials 222.89          24.31
## 2    AOS     A.O. Smith Corp            Industrials  60.24          27.76
## 3    ABT Abbott Laboratories            Health Care  56.27          22.51
## 4   ABBV         AbbVie Inc.            Health Care 108.48          19.41
## 5    ACN       Accenture plc Information Technology 150.51          25.47
## 6   ATVI Activision Blizzard Information Technology  65.83          31.80
##   Dividend.Yield Earnings.Share X52.Week.Low X52.Week.High   Market.Cap
## 1      2.3328617           7.92      259.770       175.490 138721055226
## 2      1.1479592           1.70       68.390        48.925  10783419933
## 3      1.9089824           0.26       64.600        42.280 102121042306
## 4      2.4995599           3.29      125.860        60.050 181386347059
## 5      1.7144699           5.44      162.600       114.820  98765855553
## 6      0.4319032           1.28       74.945        38.930  52518668144
##        EBITDA Price.Sales Price.Book
## 1  9048000000    4.390271      11.34
## 2   601000000    3.575483       6.35
## 3  5744000000    3.740480       3.19
## 4 10310000000    6.291571      26.14
## 5  5643228000    2.604117      10.62
## 6  2704000000   10.595120       5.16
##                                                          SEC.Filings
## 1  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM
## 2  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS
## 3  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT
## 4 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV
## 5  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ACN
## 6 http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ATVI
## # A tibble: 6 x 25
##   `MARKET REPRESENTATION` `2021` `2020` `2021 CHANGE`   high election...6 `covid low`
##   <chr>                    <dbl>  <dbl>         <dbl>  <dbl>        <dbl>       <dbl>
## 1 Energy                  0.0267 0.0228        0.172  0.0250       0.0201      0.0248
## 2 Materials               0.0256 0.0263       -0.0244 0.0251       0.0272      0.0240
## 3 Industrials             0.0778 0.0840       -0.0734 0.0840       0.0860      0.0784
## 4 Consumer Discretionary  0.125  0.127        -0.0140 0.123        0.115       0.102 
## 5 Consumer Staples        0.0588 0.0651       -0.0976 0.0582       0.0702      0.0805
## 6 Health Care             0.133  0.135        -0.0132 0.133        0.141       0.150 
## # ... with 18 more variables: pre-covid hi <dbl>, election...9 <dbl>,
## #   bear low <dbl>, bull high <dbl>, prior bear low <dbl>,
## #   prior bull high <dbl>, 2019 <dbl>, 2018 <dbl>, 2017 <dbl>, 2016 <dbl>,
## #   2015 <dbl>, 2014 <dbl>, 2013 <dbl>, 2012 <dbl>, 2011 <dbl>, 2009 <dbl>,
## #   1999 <dbl>, 1989 <dbl>

3.2 Data Transformation:

3.2.1 Rename column names

First of all, some names of the columns in the data set are not easy for people to read or understand. Therefore, we change some of the names of the columns:

  • For each 11 sectors, we name them by their sector name first and then form the data frame by using “P_name of sector”(ie.P_SP500,P_CommuSer….) instead of “PX_LAST” as shown in the original data set.
  • Same for the volume columns, we change the name of “PX_VOLUME” into “V_name of sector” for 11 different sectors.
##  [1] "Sector"          "2021"            "2020"            "2021_CHANGE"    
##  [5] "high"            "election_1"      "covid_low"       "pre-covid_hi"   
##  [9] "election_2"      "bear_low"        "bull_high"       "prior_bear_low" 
## [13] "prior_bull_high" "2019"            "2018"            "2017"           
## [17] "2016"            "2015"            "2014"            "2013"           
## [21] "2012"            "2011"            "2009"            "1999"           
## [25] "1989"
##  [1] "Symbol"         "Name"           "Sector"         "Price"         
##  [5] "Price_Earnings" "Dividend_Yield" "Earnings_Share" "X52_Week_Low"  
##  [9] "X52_Week_High"  "Market_Cap"     "EBITDA"         "Price_Sales"   
## [13] "Price_Book"     "SEC_Filings"

3.2.2 Merge into one dataset

## # A tibble: 6 x 37
##   Date                P_SP500    V_SP500 R_SP500 P_CommuSer V_CommuSer R_CommuSer
##   <dttm>                <dbl>      <dbl>   <dbl>      <dbl>      <dbl>      <dbl>
## 1 2021-12-31 00:00:00   4766. 1871893220   1.09        267.  201746084      0.532
## 2 2021-12-24 00:00:00   4726. 2191128627   1.08        270.  218743361      0.544
## 3 2021-12-17 00:00:00   4621. 4954862556   1.03        263.  540307245      0.506
## 4 2021-12-10 00:00:00   4712. 3004269454   1.07        266.  324612410      0.525
## 5 2021-12-03 00:00:00   4538. 4438623886   0.993       259.  467692149      0.482
## 6 2021-11-26 00:00:00   4595. 2248016670   1.02        266.  210905394      0.524
## # ... with 30 more variables: P_ConsumerD <dbl>, V_ConsumerD <dbl>,
## #   R_ConsumerD <dbl>, P_ConsumerS <dbl>, V_ConsumerS <dbl>, R_ConsumerS <dbl>,
## #   P_Energy <dbl>, V_Energy <dbl>, R_Energy <dbl>, P_Fin <dbl>, V_Fin <dbl>,
## #   R_Fin <dbl>, P_Health <dbl>, V_Health <dbl>, R_Health <dbl>,
## #   P_Industrial <dbl>, V_Industrial <dbl>, R_Industrial <dbl>,
## #   P_Infotech <dbl>, V_Infotech <dbl>, R_Infotech <dbl>, P_Materials <dbl>,
## #   V_Materials <dbl>, R_Materials <dbl>, P_RealEstate <dbl>, ...

3.2.3 Seperate Price and Volume Data

## # A tibble: 6 x 13
##   Date                P_SP500 P_CommuSer P_ConsumerD P_ConsumerS P_Energy P_Fin
##   <dttm>                <dbl>      <dbl>       <dbl>       <dbl>    <dbl> <dbl>
## 1 2021-12-31 00:00:00   4766.       267.       1611.        805.     423.  650.
## 2 2021-12-24 00:00:00   4726.       270.       1604.        785.     418.  646.
## 3 2021-12-17 00:00:00   4621.       263.       1545.        780.     409.  639.
## 4 2021-12-10 00:00:00   4712.       266.       1614.        771.     431.  648.
## 5 2021-12-03 00:00:00   4538.       259.       1575.        745.     415.  631.
## 6 2021-11-26 00:00:00   4595.       266.       1613.        748.     419.  644.
## # ... with 6 more variables: P_Health <dbl>, P_Industrial <dbl>,
## #   P_Infotech <dbl>, P_Materials <dbl>, P_RealEstate <dbl>, P_Utility <dbl>
## # A tibble: 6 x 13
##   Date                V_SP500 V_CommuSer V_ConsumerD V_ConsumerS V_Energy  V_Fin
##   <dttm>                <dbl>      <dbl>       <dbl>       <dbl>    <dbl>  <dbl>
## 1 2021-12-31 00:00:00  1.87e9  201746084   235410161   121522191   1.32e8 2.06e8
## 2 2021-12-24 00:00:00  2.19e9  218743361   265524409   147001217   1.66e8 2.68e8
## 3 2021-12-17 00:00:00  4.95e9  540307245   557366831   359064442   3.59e8 6.44e8
## 4 2021-12-10 00:00:00  3.00e9  324612410   356340119   201653031   2.01e8 3.44e8
## 5 2021-12-03 00:00:00  4.44e9  467692149   505253793   293469274   3.74e8 5.37e8
## 6 2021-11-26 00:00:00  2.25e9  210905394   286769129   140448143   1.74e8 2.72e8
## # ... with 6 more variables: V_Health <dbl>, V_Industrial <dbl>,
## #   V_Infotech <dbl>, V_Materials <dbl>, V_RealEstate <dbl>, V_Utility <dbl>
## # A tibble: 6 x 13
##   Date                R_SP500 R_CommuSer R_ConsumerD R_ConsumerS R_Energy R_Fin
##   <dttm>                <dbl>      <dbl>       <dbl>       <dbl>    <dbl> <dbl>
## 1 2021-12-31 00:00:00   1.09       0.532        1.43       0.503   -0.242 0.662
## 2 2021-12-24 00:00:00   1.08       0.544        1.42       0.467   -0.250 0.653
## 3 2021-12-17 00:00:00   1.03       0.506        1.33       0.457   -0.267 0.635
## 4 2021-12-10 00:00:00   1.07       0.525        1.44       0.441   -0.228 0.656
## 5 2021-12-03 00:00:00   0.993      0.482        1.38       0.392   -0.255 0.614
## 6 2021-11-26 00:00:00   1.02       0.524        1.43       0.398   -0.249 0.646
## # ... with 6 more variables: R_Health <dbl>, R_Industrial <dbl>,
## #   R_Infotech <dbl>, R_Materials <dbl>, R_RealEstate <dbl>, R_Utility <dbl>

Moreover, we have one small add-ons of the data. Although it is not data transformation on the original ones that we directly get from the website. We indeed make some changes. We add one more column for calculating the returns by using the formula: (Pi/P1/6/17)-1 where i is the date price.

##         Date   Open   High    Low  Close Adj.Close    Volume
## 1 27/12/2021 177.09 181.33 177.07 178.20  177.9733 276185800
## 2 20/12/2021 168.28 176.85 167.46 176.28  176.0557 359176900
## 3 13/12/2021 181.12 182.13 169.69 171.14  170.9222 769299200
## 4 06/12/2021 164.29 179.63 164.28 179.45  179.2217 569227700
## 5 29/11/2021 159.37 170.30 157.80 161.84  161.6341 669611100
## 6 22/11/2021 161.68 165.70 156.36 156.81  156.6105 359933200
##         Date APPLE_close Apple_return
## 1 27/12/2021      178.20     1.089175
## 2 20/12/2021      176.28     3.003389
## 3 13/12/2021      171.14    -4.630815
## 4 06/12/2021      179.45    10.881118
## 5 29/11/2021      161.84     3.207702
## 6 22/11/2021      156.81    -2.329495

Other than what I mentioned above, that’s all the changes we made on the original data. There are no further transformations on the data.