Morning block

Teaching Data Science

Mine Çetinkaya-Rundel

Success in data science and statistics depends on developing both analytical and computational skills. As statistics educators we are more familiar and comfortable with teaching the former, but the latter is becoming increasingly important. The goal of this workshop is to equip educators with concrete information on content and infrastructure for painlessly introducing modern computation into a data science and/or statistics curriculum. In addition to gaining technical knowledge, participants will engage in discussion around the decisions that go into choosing infrastructure and developing curriculum. Workshop attendees will work through several exercises from existing courses and get first-hand experience with relevant tool-chains and techniques, including R/RStudio, literate programming with R Markdown, and collaboration, version control, and automated feedback with Git/GitHub. We will also discuss best practices for configuring and deploying classroom infrastructures to support these tools. This workshop is aimed at participants who are interested in the role of computing in either a Statistics or Data Science curriculum, including faculty designing new courses/programs and those interested in adding a computational component to an existing course or improving one. A basic knowledge of R is assumed and familiarity with Git is preferred.

Fast and Scalable Machine Learning with H2O

Erin LeDell

This workshop will provide an in-depth, hands-on introduction to the H2O machine learning library in R. H2O is an open source, distributed machine learning platform designed for speed and scalability. The core machine learning algorithms of H2O are implemented in high-performance Java, but fully-featured APIs are available in R, Python, and Scala, and also through a web interface. Because H2O’s algorithm implementations are distributed, the software can scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of Generalized Linear Models, Gradient Boosting Machines, Random Forest, Deep Neural Nets, Stacked Ensembles (aka “Super Learners”), dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others, as well as a fully automated machine learning algorithm (“AutoML”).
Topics covered in the workshop include training and tuning machine learning algorithms (supervised and unsupervised), cross-validation, prediction, model evaluation, and grid search, as well as a special section on deep learning. We will also provide tips for encoding high-cardinality categorical features and text data. The workshop will end with a tutorial on the “AutoML” algorithm in H2O, which provides a fully automated algorithm for supervised learning.

Those attending Erin's tutorial should have the following installed:

  1. Java version 7 to 11 (the latest version of Java, 12, does not work). Linux users should have OpenJDK installed. Java 11 can be downloaded from here. Oracle asks you to register, but this can be avoided by using the community accounts from this link.

  2. h2o from CRAN or from this link.

Other things to keep in mind:

  • Python users can install, in addition to Java, the h2o package using pip. The tutorial will have code for both R and Python.
  • h2o.xgboost is not implemented for Windows and will not work there.
  • If you have any problem installing h2o, it can be used from rstudio.cloud, where all of h2o works regardless of the operating system on each student's machine.
  • More debugging tips and details on installing Java are in this thread.
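
For Python users, the Java requirement above (version 7 to 11) could be checked before the tutorial with a sketch like the following. It is only an illustration: the function names are hypothetical and not part of h2o, and it assumes the usual `java -version` report, which contains a quoted version string such as "1.8.0_212" or "11.0.3".

```python
import re
import shutil
import subprocess

# Supported range taken from the requirement above (Java 7 to 11).
MIN_JAVA, MAX_JAVA = 7, 11


def java_major(version):
    """Return the major Java version from a string such as '1.8.0_212' or '11.0.3'."""
    parts = version.split(".")
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_212).
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    # Keep only the leading digits, so strings like '12-ea' also parse.
    return int(re.match(r"\d+", parts[0]).group())


def parse_version_output(text):
    """Extract the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', text)
    return match.group(1) if match else None


def is_supported(major):
    """True when the major version falls in the range required by the tutorial."""
    return MIN_JAVA <= major <= MAX_JAVA


if __name__ == "__main__":
    if shutil.which("java") is None:
        print("No java executable found on PATH.")
    else:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        version = parse_version_output(result.stderr + result.stdout)
        if version is None:
            print("Could not read the Java version.")
        else:
            major = java_major(version)
            status = "supported" if is_supported(major) else "not supported"
            print(f"Java {major} is {status} for this tutorial.")
```

Running the script prints one line saying whether the Java found on the PATH is in the supported range. It does not check that h2o itself is installed.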

Afternoon block

Data Visualization with Highcharter

Joshua Kunst

In this workshop we will review the importance of visualization and learn about technical elements such as chart types and the correct use of colors, labels, and other elements. We will also review good practices to consider when creating a visualization. Then, in R, we will learn to make simple, interactive, and effective charts using only the hchart function, and afterwards study the functions for configuring each element of the chart, such as the axes, titles, and tooltips. We will learn to plot data from data.frames and other sources of information, such as spatial data. We will finish by learning how highcharter integrates with the Shiny and RMarkdown packages.

Package Development

Hadley Wickham

The key to well-documented, well-tested and easily-distributed R code is the package. This half-day class will teach you how to make package development as easy as possible with devtools and usethis. You’ll also learn roxygen2 to document your code so others (including future you!) can understand what’s going on, and testthat to avoid future breakages. The class will consist of a series of demonstrations and hands-on exercises.

Participants should bring a laptop set up to build R packages. Detailed instructions are available here.