class: center, middle, inverse, title-slide # Part 2: Digging Deeper into Python Programming Constructs ### Michael Kane
School of Public Health, Biostatistics Department
Yale University
###
kaneplusplus
---

# <br> Highlights from Part 1

<br>

## 1. Python has a notion of environments, which encapsulate the interpreter
## and a set of packages.

--

<br>

## 2. Python syntax is distinct from, but not unrelated to, R's.

--

<br>

## 3. Most of the time we've called functions with `function(object)`, but
## sometimes it's been `object.function()` (as with copy). We'll talk more
## about this later.

--

<br>

## 4. Zero indexing.

---

# <br> Topics for Part 2

<br>

## 1. Packages

--

<br>

## 2. Numeric computing with Numpy.

--

<br>

## 3. Using Objects

---

# <br> Python Packages

--

<br>

## Python is "batteries not included."

--

<br>

## R includes _a lot_ of computing facilities with the core language.

--

<br>

## 1. Plotting

--

## 2. Vectors, matrices, and arrays

--

## 3. Optimized and vectorized linear algebra routines

--

## 4. Suite of statistical functions and models

--

## 5. `data.frame`s

--

<br>

## None of these are included with Python.

---

# <br> Python Packages

--

<br>

## Python Virtual Environments and Package Management

--

<br>

## This class uses conda to create an environment and add packages.

<br>

--

## [Anaconda](https://www.anaconda.com/) creates environments and adds packages from [Anaconda Cloud](https://anaconda.org/anaconda/repo).
## It is an environment and package manager, not just for Python.

<br>

--

## There is an R installation.

<br>

--

## I have heard only bad things about it.

---

# <br> Python Packages

<br>

## What's with the emphasis on environments?

<br>

--

Development culture is different from R's.

- The R community (CRAN maintainers) places a higher value on package users' time by enforcing compatibility with _downstream dependencies_. This is the reason R's `install.packages()` function "just works". It does a better job of shielding analysts from package development.

--

- The Python community places a higher value on developer time. It is often up to the user to sort out compatibility problems between packages.
This became a big enough issue that companies, like [Continuum Analytics](http://www.continuumanalytics.com/), began creating pre-packaged virtual environments.

--

The result is that R users tend to use environments (with `packrat`, `switchr`, `renv`) more for reproducibility, while Python users tend to use them more for package compatibility.

---

# <br> Numerical Computing

<br>

## Python doesn't have a built-in notion of vectorized operations.

--

```python
# Python
list(range(10)) + 1
```

```
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: can only concatenate list (not "int") to list
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
```

--

We can perform this operation with a list comprehension.

--

```python
# Python
print( [x + 1 for x in list(range(10))] )
```

```
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

--

Or, we could create a method for adding values to lists, but it's already been done.

---

# <br> Numerical Computing

<br>

## Importing the `numpy` Package

--

```python
# Python
import numpy
print( numpy.arange(10) + 1 )
```

```
## [ 1  2  3  4  5  6  7  8  9 10]
```

--

```python
# Python
from numpy import *
print( arange(10) + 1 )
```

```
## [ 1  2  3  4  5  6  7  8  9 10]
```

--

```python
# Python
import numpy as np
print( np.arange(10) + 1 )
```

```
## [ 1  2  3  4  5  6  7  8  9 10]
```

---

# <br> Numerical Computing

<br>

## Numpy doesn't distinguish vectors, matrices, and arrays

```python
# Python
# A vector
vec = np.array(list(range(12)))
print(vec)
```

```
## [ 0  1  2  3  4  5  6  7  8  9 10 11]
```

```python
# A matrix
mat = np.array( [ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12] ] )
print(mat)
```

```
## [[ 1  2  3]
##  [ 4  5  6]
##  [ 7  8  9]
##  [10 11 12]]
```

---

# <br> Numerical Computing

<br>

## Numpy doesn't distinguish vectors, matrices, and arrays (cont'd)

```python
# Python
tensor3 = np.array( [ [[1, 2], [3, 4]], [[5, 6], [7,8]] ])
print(tensor3)
```

```
## [[[1 2]
##   [3 4]]
## 
##  [[5 6]
##   [7 8]]]
```

---

# <br> Numerical Computing

<br>

## What are the dimensions of a numpy array?

<br>

```python
# Python
print(vec.shape)
```

```
## (12,)
```

```python
print(mat.shape)
```

```
## (4, 3)
```

```python
print(tensor3.shape)
```

```
## (2, 2, 2)
```

---

# <br> Numerical Computing

<br>

## What kind of values are stored?

```python
# Python
tensor3.dtype
```

```
## dtype('int64')
```

```python
# np.double is numpy's 64-bit float type; calling it converts the array.
np.double(tensor3).dtype
```

```
## dtype('float64')
```

```python
np.array([str(x) for x in mat.flatten().tolist()], dtype = str).reshape(4, 3)
```

```
## array([['1', '2', '3'],
##        ['4', '5', '6'],
##        ['7', '8', '9'],
##        ['10', '11', '12']], dtype='<U2')
```

---

# <br> Numerical Computing

<br>

## What other information is stored in a numpy array?

```python
# Python
vec_slots = dir(vec)
len(vec_slots)
```

```
## 162
```

```python
print( [vec_slots[-i] for i in range(1, 51)] )
```

```
## ['view', 'var', 'transpose', 'trace', 'tostring', 'tolist', 'tofile', 'tobytes', 'take', 'swapaxes', 'sum', 'strides', 'std', 'squeeze', 'sort', 'size', 'shape', 'setflags', 'setfield', 'searchsorted', 'round', 'resize', 'reshape', 'repeat', 'real', 'ravel', 'put', 'ptp', 'prod', 'partition', 'nonzero', 'newbyteorder', 'ndim', 'nbytes', 'min', 'mean', 'max', 'itemsize', 'itemset', 'item', 'imag', 'getfield', 'flatten', 'flat', 'flags', 'fill', 'dumps', 'dump', 'dtype', 'dot']
```

---

# <br> Numerical Computing

<br>

## What else is in numpy?
```python
# Python
np_objects = dir(np)
len(np_objects)
```

```
## 620
```

```python
print( [np_objects[i] for i in range(70)] )
```

```
## ['ALLOW_THREADS', 'AxisError', 'BUFSIZE', 'CLIP', 'ComplexWarning', 'DataSource', 'ERR_CALL', 'ERR_DEFAULT', 'ERR_IGNORE', 'ERR_LOG', 'ERR_PRINT', 'ERR_RAISE', 'ERR_WARN', 'FLOATING_POINT_SUPPORT', 'FPE_DIVIDEBYZERO', 'FPE_INVALID', 'FPE_OVERFLOW', 'FPE_UNDERFLOW', 'False_', 'Inf', 'Infinity', 'MAXDIMS', 'MAY_SHARE_BOUNDS', 'MAY_SHARE_EXACT', 'MachAr', 'ModuleDeprecationWarning', 'NAN', 'NINF', 'NZERO', 'NaN', 'PINF', 'PZERO', 'RAISE', 'RankWarning', 'SHIFT_DIVIDEBYZERO', 'SHIFT_INVALID', 'SHIFT_OVERFLOW', 'SHIFT_UNDERFLOW', 'ScalarType', 'Tester', 'TooHardError', 'True_', 'UFUNC_BUFSIZE_DEFAULT', 'UFUNC_PYVALS_NAME', 'VisibleDeprecationWarning', 'WRAP', '_NoValue', '_UFUNC_API', '__NUMPY_SETUP__', '__all__', '__builtins__', '__cached__', '__config__', '__doc__', '__file__', '__git_revision__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_add_newdoc_ufunc', '_distributor_init', '_globals', '_mat', '_pytesttester', 'abs', 'absolute', 'add']
```

---

# <br> Numerical Computing

<br>

## Array Indexing

```r
# R
mat <- t(matrix(seq_len(12), ncol = 4))
mat
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
```

```r
mat[1:2, 1:3]
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```

---

# <br> Numerical Computing

<br>

## Array Indexing (cont'd)

```python
# Python
mat[:2, :3]
```

```
## array([[1, 2, 3],
##        [4, 5, 6]])
```

```python
mat[1:3, :3]
```

```
## array([[4, 5, 6],
##        [7, 8, 9]])
```

```python
mat[ [0, 3, 2], :]
```

```
## array([[ 1,  2,  3],
##        [10, 11, 12],
##        [ 7,  8,  9]])
```

---

# <br> Numerical Computing

<br>

## Boolean Indexing

```r
# R
mat > 2
```

```
##       [,1]  [,2] [,3]
## [1,] FALSE FALSE TRUE
## [2,]  TRUE  TRUE TRUE
## [3,]  TRUE  TRUE TRUE
## [4,]  TRUE  TRUE TRUE
```

```r
mat[mat > 2]
```

```
##  [1]  4  7 10  5  8 11  3  6  9 12
```

---

# <br> Numerical Computing

<br>

## Boolean Indexing (cont'd)

```python
# Python
mat > 2
```

```
## array([[False, False,  True],
##        [ True,  True,  True],
##        [ True,  True,  True],
##        [ True,  True,  True]])
```

```python
mat[mat > 2]
```

```
## array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
```

---

# <br> Numerical Computing

<br>

## Fitting Ordinary Least Squares

Recall the formula for fitting the ordinary least squares model:

$$
`\begin{align}
\widehat{\beta} &= (X^T \ X)^{-1} \ X^T \ Y.
\end{align}`
$$

Letting `\(X = QR\)` where `\(Q^TQ = I\)` and `\(R\)` is upper triangular (and invertible when `\(X\)` has full column rank), we can rewrite this as:

$$
`\begin{align}
\widehat{\beta} &= ( (QR)^T \ QR)^{-1} \ (QR)^{T} \ Y \\
 &= (R^T \ Q^T Q \ R)^{-1} \ R^T Q^T \ Y \\
 &= (R^T R)^{-1} \ R^T Q^T \ Y \\
 &= R^{-1} \ (R^T)^{-1} \ R^T Q^T \ Y \\
 &= R^{-1} Q^T Y
\end{align}`
$$

to create a _numerically stable_, if limited, implementation of OLS.

---

# <br> Numerical Computing

<br>

## Our implementation

```python
# Python
import seaborn as sns # for iris

def ols(Y, X):
    # Factor X = QR, then compute R^{-1} Q^T Y.
    q, r = np.linalg.qr(X)
    return np.linalg.inv( r ).dot( q.T ).dot( Y )

iris = sns.load_dataset("iris")
iris_mat = iris[["sepal_width", "petal_length", "petal_width"]].values
print(ols(iris['sepal_length'].values, iris_mat))
```

```
## [ 1.12106169  0.92352887 -0.89567583]
```

---

# <br> Numerical Computing

<br>

## Our implementation (cont'd)

```r
# R
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width - 1, data = iris)
fit$coefficients
```

```
##  Sepal.Width Petal.Length  Petal.Width 
##    1.1210617    0.9235289   -0.8956758
```

---

# <br> Numerical Computing

<br>

## How would you debug this?

<br>

## Python doesn't have error recovery the way R does
## (`options(error = recover)`).

<br>

## It does let you set breakpoints with the `pdb` module, though.

<br>

## [pdb-debug.py](pdb-debug.py)

---

# <br> Using Objects to Visualize Data

<br>

## Object Oriented Programming

You've already been making use of Python's object-oriented functionality.
- `list_vals.copy()`
- `np.linalg.inv( r ).dot( q.T )`

In each case you were accessing data or calling a function (called a _method_) associated with an object using the `.` operator. Packages are themselves objects.

### An _object_ contains data (called attributes or fields) and methods (functions).

An object _has_ attributes and can do things with methods.

An object is an _instance_ of a class.

- `np` is an instance of type `module`.
- `list_vals` is an instance of type `list`.

---

# <br> Using Objects to Visualize Data

## A Primitive Vector Object in R

```r
# R
vec_vals <- list(
  vals = 1:10,
  add_one = function(vec_vals) {
    vec_vals$vals <- vec_vals$vals + 1
    vec_vals
  }
)
print(vec_vals)
```

```
## $vals
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $add_one
## function(vec_vals) {
##     vec_vals$vals <- vec_vals$vals + 1
##     vec_vals
##   }
```

---

# <br> Using Objects to Visualize Data

<br>

## A Primitive Vector Object in R (cont'd)

```r
# R
print(vec_vals$add_one(vec_vals))
```

```
## $vals
##  [1]  2  3  4  5  6  7  8  9 10 11
## 
## $add_one
## function(vec_vals) {
##     vec_vals$vals <- vec_vals$vals + 1
##     vec_vals
##   }
```

Note two differences in Python:

1. Python uses `.` instead of our `$`.
2. The calling object is implicitly passed as the first argument.

---

# <br> Using Objects to Visualize Data

<br>

## Plotting with Objects

```r
# R
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() + theme_minimal()
```

![](part-2_files/figure-html/unnamed-chunk-22-1.png)<!-- -->

---

# <br> Using Objects to Visualize Data

<br>

## Plotting with Objects

```python
# Python
import plotnine
from plotnine import *

plotnine.options.figure_size = (5, 3)

p = ggplot(iris, aes(x = "sepal_length", y = "sepal_width", color = "species")) +\
  geom_point() + theme(subplots_adjust={'right': .7})
p.draw()
```

---

# <br> Using Objects to Visualize Data

<br>

## What's with `+\` ?

<br>

### Python expects something to the left and right of `+`.
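
The claim about `+` can be checked with plain integers instead of a plot. This is only a sketch: the helper `parses` and the two source strings are hypothetical names made up for this slide, and the built-in `compile()` is used to ask whether Python accepts each snippet.

```python
# Python
# Hypothetical source strings, for illustration only.
broken = "total = 1 +\n2"        # line 1 ends with '+', so the right operand is missing
continued = "total = 1 +\\\n2"   # the backslash joins the two physical lines

def parses(source):
    """Return True if the built-in compile() accepts the source text."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

results = {"broken": parses(broken), "continued": parses(continued)}
print(results)
```

```
## {'broken': False, 'continued': True}
```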
<br>

### By default, it doesn't look for the second argument on the next line.

<br>

### So we need to tell the interpreter to look at the next line with `\`.

<br>

### An alternative is to wrap the entire statement in `()`.

```python
(ggplot(iris, aes(x = "sepal_length", y = "sepal_width", color = "species")) +
  geom_point() + theme(subplots_adjust={'right': .7}))
```

---

# <br> Using Objects to Visualize Data

<br>

## Where are the objects?

The `ggplot` function creates an object of type `ggplot`.

```python
# Python
p = ggplot(iris, aes(x = "sepal_length", y = "sepal_width", color = "species"))
type(p)
```

```
## <class 'plotnine.ggplot.ggplot'>
```

---

# <br> Using Objects to Visualize Data

<br>

## Where are the methods?

`+` is an _infix operator_: a function whose first argument is on the left of the operator and whose second is on the right. It is implemented via the `__add__` method.

```python
# Python
dir(p)
```

```
## ['__add__', '__class__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__iadd__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rrshift__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_apply_theme', '_build', '_create_figure', '_draw', '_draw_breaks_and_labels', '_draw_labels', '_draw_layers', '_draw_legend', '_draw_title', '_draw_using_figure', '_draw_watermarks', '_resize_panels', '_save_filename', '_setup_parameters', '_update_labels', 'axs', 'coordinates', 'data', 'draw', 'environment', 'facet', 'figure', 'guides', 'labels', 'layers', 'layout', 'mapping', 'save', 'scales', 'theme', 'watermarks']
```

---

# <br> Using Objects to Visualize Data

<br>

## We can call `__add__` with `.`

```python
# Python
p = ggplot(iris, aes(x = "sepal_length", y = "sepal_width", color = "species"))
geom = geom_point()
type(geom)
```

```
## <class 'plotnine.geoms.geom_point.geom_point'>
```

```python
theme_change = theme(subplots_adjust={'right': .7})

# Note that __add__ creates copies.
p = p.__add__(geom)
p = p.__add__(theme_change)
```

---

# <br> Using Objects to Visualize Data

<br>

## We can call `__add__` with `.` (cont'd)

<br>

```python
# Python
p.draw()
```

---

# <br> Summing the section up

<br>

## Python includes less functionality and relies more on packages than R.

<br>

## Everything in Python is an object.

<br>

## Objects have functions and data that are accessed with `.`

<br>

## Those functions and data can be found with `dir()`

<br>

## Their documentation can be found with `help()`

---

<style type="text/css">
.huge { font-size: 200%; }
</style>

<br>
<br>
<br>
<br>

.center[
.huge[
You made it to the end of part 2.
]
]
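
---

# <br> Appendix: a toy `__add__`

The summary points can be tried on a small class of our own. This is a minimal sketch, not plotnine's implementation: the class `Vec` and its attribute `vals` are hypothetical names invented for this slide, and only built-in Python is assumed.

```python
# Python
class Vec:
    """Hypothetical toy class; not part of numpy or plotnine."""

    def __init__(self, vals):
        self.vals = list(vals)      # data (an attribute)

    def __add__(self, other):       # the method behind the infix `+`
        # Return a new object rather than modifying self.
        return Vec([v + other for v in self.vals])

v = Vec(range(3))
w = v + 1                           # same as v.__add__(1)
print(w.vals)
print(v.vals)                       # unchanged: __add__ returned a copy
print("__add__" in dir(w), "vals" in dir(w))
```

```
## [1, 2, 3]
## [0, 1, 2]
## True True
```

As with the `ggplot` object, `dir()` lists both the method and the data, and `help(Vec)` would show the docstring.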