class: center, middle, inverse, title-slide

# Part 4: Working with Data
### Michael Kane
School of Public Health, Biostatistics Department
Yale University
###
kaneplusplus
---

# <br> Overview of Part 4
--

<br>
## Highlights from Part 3
--

<br>
### Generators
--

<br>
### Classes and Objects

---

# <br> Overview of Part 4
<br>
### Importing Data with Pandas
--

<br>
### Data Cleaning
--

<br>
### Data Exploration
--

<br>
### Pricing Wine
--

<br>
### R's R6 Object System
--

<br>
### Calling Python from R

---

# <br> Importing Data with Pandas
<br>

```python
# Python
import numpy as np
import pandas as pd
import plotnine
from plotnine import *
plotnine.options.figure_size = (4, 2)

wine = pd.read_csv("winemag-data-130k-v2.csv")
wine = wine.drop('Unnamed: 0', axis = 1)
# How many rows?
wine.shape[0]
```

```
## 129971
```

---

# <br> Importing Data with Pandas
<br>
## Accessing columns
--

```python
# Python
# What are the column names?
wine.keys()
```

```
## Index(['country', 'description', 'designation', 'points', 'price', 'province',
##        'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
##        'variety', 'winery'],
##       dtype='object')
```

---

# <br> Data Cleaning
<br>
## Dealing with Missing Data
--

```python
# How many rows have a country recorded?
# (129971 - 129908 = 63 rows are missing one.)
wine[ ~ wine['country'].isnull() ].shape[0]
```

```
## 129908
```

```python
# Drop the rows with a missing country.
wine = wine[ ~ wine['country'].isnull() ]
wine.shape
```

```
## (129908, 13)
```

---

# <br> Data Cleaning
<br>
## How many wineries and countries?
--

```python
# How many unique wineries?
len(wine['winery'].unique())
```

```
## 16745
```

```python
# How many countries?
len(wine['country'].unique())
```

```
## 43
```

---

# <br> Data Exploration
<br>
## Visualizing Country Counts
--

```python
# Make country categorical, ordered by count.
country_list = wine['country'].value_counts().index.tolist()
country_cat = pd.Categorical(wine['country'], categories = country_list)
wine = wine.assign(country_cat = country_cat)
# Where are the wines from?
# country_cat is categorical, so count rows per category with geom_bar().
p = (ggplot(wine, aes(x = "country_cat")) +
  theme(axis_text_x=element_text(rotation=270, hjust=1)) +
  geom_bar())
p.draw()
```

---

# <br> Data Exploration
<br>
## Where do the wines come from?
--

```python
# Make country categorical, ordered by count.
country_list = wine['country'].value_counts().index.tolist()
country_cat = pd.Categorical(wine['country'], categories = country_list)
wine = wine.assign(country_cat = country_cat)
# Where are the wines from?
# country_cat is categorical, so count rows per category with geom_bar().
p = (ggplot(wine, aes(x = "country_cat")) +
  theme(axis_text_x=element_text(rotation=270, hjust=1)) +
  geom_bar() +
  theme(subplots_adjust={'bottom': .3}))
# p.draw()
```

---

# <br> Data Exploration
<br>
## How much do they cost?
--

```python
p = (ggplot(wine, aes(x = "price")) +
  theme(axis_text_x=element_text(rotation=270, hjust=1)) +
  geom_histogram() +
  theme(subplots_adjust={'bottom': .3}))
# p.draw()
```

---

# <br> Data Exploration
<br>
## Switching to the log scale
--

```python
import warnings
warnings.filterwarnings("ignore")
# Let's take logs. Drop the rows with a missing price first.
wine = wine[ ~ wine['price'].isna() ]
wine = wine.assign(log_price = np.log(wine['price']))
p = (ggplot(wine, aes(x = "price")) +
  scale_x_continuous(trans='log2') +
  theme(axis_text_x=element_text(rotation=270, hjust=1)) +
  geom_histogram() +
  xlab("Price") +
  ylab("Count") +
  theme(subplots_adjust={'left': .1, 'bottom': 0.2}))
# p.draw()
```

---

# <br> Data Exploration
<br>
## Wine Varieties
--

```python
# How many types are represented?
variety_count = pd.value_counts(wine.variety.values)
# How many varieties have been rated more than 1000 times?
np.sum(variety_count > 1000)
```

```
## 24
```

```python
# Keep only those varieties.
variety_count = variety_count[variety_count > 1000]
wine = wine[wine.variety.isin(variety_count.keys())]
wine.shape
```

```
## (92852, 15)
```

---

# <br> Data Exploration
## Wine Rating
--

```python
# What is the average number of points per variety?
(wine.points.groupby(wine.variety).describe().sort_values("mean", ascending = False))[:10]
```

```
##                             count       mean       std  ...   50%   75%    max
## variety                                                 ...
## Nebbiolo                   2331.0  90.331188  2.748415  ...  90.0  92.0   99.0
## Grüner Veltliner           1145.0  90.015721  2.279760  ...  90.0  92.0   96.0
## Champagne Blend            1211.0  89.645747  2.982182  ...  90.0  92.0  100.0
## Riesling                   4971.0  89.439147  2.855625  ...  89.0  91.0   98.0
## Pinot Noir                12785.0  89.409230  3.131741  ...  90.0  92.0   99.0
## Syrah                      4086.0  89.290749  3.046863  ...  90.0  92.0  100.0
## Rhône-style Red Blend      1404.0  89.133903  2.828027  ...  89.0  91.0   98.0
## Portuguese Red             2196.0  88.864299  2.997764  ...  88.0  91.0  100.0
## Bordeaux-style Red Blend   5340.0  88.792135  3.075173  ...  89.0  91.0  100.0
## Sangiovese                 2377.0  88.612958  2.835852  ...  88.0  90.0  100.0
##
## [10 rows x 8 columns]
```

---

# <br> Data Exploration
<br>
## Who are the wine tasters?

```python
# How many tastings has each taster done?
taster_counts = pd.value_counts(wine['taster_name'].values)
taster_counts[:10]
```

```
## Roger Voss            14641
## Michael Schachner     10239
## Virginie Boone         8532
## Paul Gregutt           8435
## Kerin O’Keefe          6523
## Matt Kettmann          5206
## Sean P. Sullivan       4258
## Anna Lee C. Iijima     3722
## Joe Czerwinski         3692
## Jim Gordon             3439
## dtype: int64
```

---

# <br> Data Exploration
<br>
## Subsetting / Filtering
--

```python
# Let's keep the tasters with more than 1000 tastings.
keep_tasters = taster_counts[taster_counts > 1000]
wine = wine[wine.taster_name.isin(keep_tasters.keys())]
(wine.points.groupby(wine.taster_name).describe().sort_values("mean", ascending = False))[:5]
```

```
##                    count       mean       std   min   25%   50%   75%    max
## taster_name
## Anne Krebiehl MW  2499.0  90.785914  2.300700  80.0  89.0  91.0  92.0   97.0
## Matt Kettmann     5206.0  90.122359  2.633234  81.0  88.0  90.0  92.0   97.0
## Virginie Boone    8532.0  89.306259  3.026332  80.0  87.0  90.0  91.0   99.0
## Kerin O’Keefe     6523.0  89.200675  2.601344  81.0  87.0  89.0  91.0  100.0
## Paul Gregutt      8435.0  89.133966  2.825336  80.0  87.0  89.0  91.0  100.0
```

---

# <br> Pricing Wine
<br>
## A Linear Regression
--

```python
# What is the association between price and points?
import statsmodels.formula.api as sm
fit = sm.ols(formula = "points ~ price", data = wine).fit()
ols_summary = fit.summary()
# Try this yourself to see the output
```

---

# <br> Pricing Wine
<br>
## Another Linear Regression
--

```python
# Which varieties are left?
np.sort(wine.variety.unique())
```

```
## array(['Bordeaux-style Red Blend', 'Cabernet Franc', 'Cabernet Sauvignon',
##        'Champagne Blend', 'Chardonnay', 'Grüner Veltliner', 'Malbec',
##        'Merlot', 'Nebbiolo', 'Pinot Grigio', 'Pinot Gris', 'Pinot Noir',
##        'Portuguese Red', 'Red Blend', 'Rhône-style Red Blend', 'Riesling',
##        'Rosé', 'Sangiovese', 'Sauvignon Blanc', 'Sparkling Blend',
##        'Syrah', 'Tempranillo', 'White Blend', 'Zinfandel'], dtype=object)
```

```python
# Is there an interaction with variety?
fit = sm.ols(formula = "points ~ log_price + variety + price:variety",
             data = wine).fit()
ols_summary = fit.summary()
# Omitted for slide real estate.
```

---

# <br> Pricing Wine
<br>
--

## Which wines are the best buy?

```python
# Which wine is the most underpriced, i.e. scores furthest above its fitted points?
wine = wine.assign(resid = fit.resid)
wine = wine.assign(fitted = fit.fittedvalues)
under_priced = wine[wine.resid == np.max(wine.resid)]
under_priced.keys()
```

```
## Index(['country', 'description', 'designation', 'points', 'price', 'province',
##        'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
##        'variety', 'winery', 'country_cat', 'log_price', 'resid', 'fitted'],
##       dtype='object')
```

---

# <br> Pricing Wine
<br>
## Which wines are the best buy (cont'd)?

```python
under_priced.values
```

```
## array([['Italy',
##         "This gorgeous, fragrant wine opens with classic Sangiovese scents of violet, rose, perfumed red berry, new leather and a whiff of baking spice. The elegant, radiant palate delivers crushed Marasca cherry, ripe strawberry, cinnamon, black tea and a hint of pipe tobacco. Firm, ultrafine tannins and bright acidity offer an age-worthy structure and impeccable balance. It's already stunning but will evolve for decades. Drink 2020–2050.",
##         'Riserva', 100, 550.0, 'Tuscany', 'Brunello di Montalcino', nan,
##         'Kerin O’Keefe', '@kerinokeefe',
##         'Biondi Santi 2010 Riserva (Brunello di Montalcino)',
##         'Sangiovese', 'Biondi Santi', 'Italy', 6.309918278226516,
##         8.620905854558387, 91.37909414544161]], dtype=object)
```

---

# <br> Pricing Wine
<br>
## Which ones are the worst?

```python
over_priced = wine[wine.resid == np.min(wine.resid)]
over_priced.values
```

```
## array([['Chile',
##         'Jammy berry aromas come in front of green, vegetal notes. 
The palate is round and lacks balancing acidity.',
##         'El Principal Andetelmo', 81, 90.0, 'Maipo Valley', nan, nan,
##         'Michael Schachner', '@wineschach',
##         'Viña el Principal 2013 El Principal Andetelmo Red (Maipo Valley)',
##         'Bordeaux-style Red Blend', 'Viña el Principal', 'Chile',
##         4.499809670330265, -11.19866222294631, 92.19866222294631]],
##       dtype=object)
```

---

# <br> R's R6 Object System
<br>
--

## Python's objects are standard in object-oriented programming
<br>
--

## R has a similar system called R6
<br>
--

## Methods and attributes are associated with an environment
<br>
--

## Instead of `.`, we use `$`.

---

# <br> R's R6 Object System
--

```r
library(R6)

AddOneToList <- R6Class(
  "AddOneToList",
  public = list(
    initialize = function(lst) {
      private$lst <- lst
      self$greet()
    },
    add_one = function() {
      stop("Method is abstract")
    },
    get_lst = function() {
      stop("Method is abstract")
    },
    greet = function() {
      cat(paste0("Abstract add one to list created", ".\n"))
    }
  ),
  private = list(
    lst = NULL
  )
)
```

---

# <br> R's R6 Object System
<br>
## Let's try to instantiate an object
<br>

```r
lst_adder <- AddOneToList$new(as.list(1:5))
```

```
## Abstract add one to list created.
```

```r
lst_adder$add_one()
```

```
## Error in lst_adder$add_one(): Method is abstract
```

---

# <br> R's R6 Object System

```r
AddOneToNumericList <- R6Class("AddOneToNumericList",
  inherit = AddOneToList,
  public = list(
    initialize = function(lst) {
      if (!all(is.numeric(unlist(lst)))) {
        stop("All values should be numeric.")
      }
      super$initialize(lst)
    },
    add_one = function() {
      private$lst <- lapply(private$lst, function(x) x + 1)
    },
    get_lst = function() {
      private$lst
    },
    greet = function() {
      cat(paste0("Concrete add one to list created", ".\n"))
    }
  )
)
```

---

# <br> R's R6 Object System
<br>
## Let's instantiate a proper concrete object

```r
lst_adder <- AddOneToNumericList$new(as.list(letters[1:5]))
```

```
## Error in initialize(...): All values should be numeric. 
```

```r
lst_adder <- AddOneToNumericList$new(as.list(1:5))
```

```
## Concrete add one to list created.
```

```r
lst_adder$add_one()
unlist(lst_adder$get_lst())
```

```
## [1] 2 3 4 5 6
```

---

# <br> Calling Python from R
<br>
--

## R mirrors Python's objects through the R6 class system
<br>
--

## We've created Python code that does something useful
<br>
--

- We'd like to be able to reuse the code and objects created.
--

<br>
- We'd like to use R to augment our analysis.

---

# <br> Calling Python from R
<br>
## Sourcing a Python file

```r
# R
library(reticulate)
source_python("part-4.py", convert = FALSE)
ls()[1:10]
```

```
##  [1] "AddOneToList"        "AddOneToNumericList" "aes"
##  [4] "annotate"            "annotation_logticks" "annotation_stripes"
##  [7] "arrow"               "as_labeller"         "coord_cartesian"
## [10] "coord_equal"
```

---

# <br> Calling Python from R
<br>
## Get the column names of the wine data set into R

```r
library(dplyr)
col_keys <- wine$keys()$to_list() %>%
  py_to_r()
col_keys
```

```
##  [1] "country"               "description"           "designation"
##  [4] "points"                "price"                 "province"
##  [7] "region_1"              "region_2"              "taster_name"
## [10] "taster_twitter_handle" "title"                 "variety"
## [13] "winery"                "country_cat"           "log_price"
## [16] "resid"                 "fitted"
```

---

# <br> Calling Python from R
## Now read the Pandas data frame into R as a `tibble`

```r
# R
library(tibble)
library(DT)

# Flatten a list column and turn NaN into R's NA.
handle_nan_and_unlist <- function(x) {
  x <- unlist(x)
  x[is.nan(x)] <- NA
  x
}

wine_df <- wine[as.list(setdiff(col_keys, "country_cat"))] %>%
  py_to_r() %>%
  mutate_if(is.list, handle_nan_and_unlist) %>%
  as_tibble()
```

---

# <br> Calling Python from R
## Now read the Pandas data frame into R as a `tibble`

```r
wine_df %>%
  head(n = 100) %>%
  datatable()
```
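---

# <br> Calling Python from R
## Checking for `NaN` on the Python side

The `handle_nan_and_unlist()` helper shown earlier is needed because pandas marks missing values as `NaN`. Below is a minimal sketch of how one might check, before converting, which columns will need that treatment. The frame `toy` and its values are made up for illustration, and `cols_with_missing` and `missing_counts` are names introduced here; none of them are part of the original code.

```python
# Python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wine data; the values are
# invented for illustration and are not rows from the real data set.
toy = pd.DataFrame({
    "country": ["Italy", "Chile", "US"],
    "price": [550.0, np.nan, 20.0],
    "region_2": [np.nan, np.nan, "Sonoma"],
})

# Which columns contain at least one missing value?
has_missing = toy.isna().any()
cols_with_missing = has_missing[has_missing].index.tolist()

# How many missing values does each column have?
missing_counts = toy.isna().sum()

print(cols_with_missing)
print(missing_counts.to_dict())
```

Running the same two `isna()` calls on `wine` instead of `toy` would list the columns where `NaN` can show up after `py_to_r()`.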
---

# <br> Calling Python from R
<br>
## What about our numerical routines?

```r
import("numpy", as = "np")
```

```
## Module(numpy)
```

```r
ols_code <- '
def ols(Y, X):
  q, r = np.linalg.qr(X)
  return(np.linalg.inv( r ).dot( q.T ).dot( Y ))
'
writeLines(ols_code, "ols_code.py")
source_python("ols_code.py", convert = FALSE)
```

---

# <br> Calling Python from R
## What about our numerical routines?

```r
form <- points ~ log_price + variety + price:variety
wine_mf <- model.frame(form, wine_df)
X <- model.matrix(form, wine_mf)
Y <- matrix(wine_mf$points, ncol = 1)
# ols() is defined as ols(Y, X): the response comes first, then the design matrix.
fit <- ols(Y, X)$flatten() %>%
  py_to_r()
names(fit) <- colnames(X)
fit[1:10]
# Try this yourself to see the output
```

---

# <br> Wrapping up
<br>
## Python has modeled most of its data science libraries after R
<br>
## Python's programming constructs, although distinct, provide a new twist on familiar methods of programming
<br>
## If you can understand the syntax, you can figure out how to use the libraries
<br>
## If you want, you can do this from R

---

<style type="text/css">
.huge { font-size: 200%; }
</style>

<br>
<br>
<br>
<br>

.center[
.huge[
You made it to the end of the class. Thanks very much!
]
]