*"Leaders in business, education and government must take action to foster a new generation of talent with the technical expertise and unique ideas to make the most of this tsunami of Big Data."*

-Richard Rodts, Manager of Global Academic Programs, IBM

__Course Offerings__

DSCI 351: Exploratory Data Science for Energy & Manufacturing

**Course Description: Data Sources, Data Assembly and Exploratory Data Analytics**

In this course, we will learn data science and analysis approaches applicable to energy and manufacturing technologies, in order to identify statistically significant relationships and to better model and predict the behavior of these systems. We will assemble and explore real-world datasets, perform clustering and pair-plot analyses to investigate correlations, and use logistic regression to develop associated predictive models. Results will be interpreted, visualized and discussed.

We will introduce the basic elements of data science and analytics using the R Project for Statistical Computing. R is an open-source software project that offers broad access to machine-readable open-data resources, functions for data cleaning and munging, and a rich selection of statistical packages for data analytics, model development and prediction. The introduction will cover R data types, reading and writing data, looping, plotting and regular expressions, so that students can begin performing variable transformations for linear fitting and developing structural equation models while exploring for statistically significant relationships.
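As a rough illustration of what "variable transformations for linear fitting" can mean, the sketch below log-transforms an exponentially decaying quantity so that an ordinary least-squares line recovers the decay rate. It is written in Python purely for illustration (the course itself uses R), and the data, the constants `P0_TRUE` and `K_TRUE`, and the helper `fit_line` are all hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical, noise-free data: output decaying as P = P0 * exp(-k * t)
P0_TRUE, K_TRUE = 250.0, 0.02
years = [0, 1, 2, 3, 4, 5]
power = [P0_TRUE * math.exp(-K_TRUE * t) for t in years]

# Variable transformation: log(P) = log(P0) - k * t is linear in t,
# so a straight-line fit on log(P) yields both parameters.
log_power = [math.log(p) for p in power]
intercept, slope = fit_line(years, log_power)

p0_est = math.exp(intercept)  # back-transform the intercept
k_est = -slope                # decay rate per year
```

With real, noisy measurements the estimates would only approximate the underlying parameters, and the choice of transformation would itself be part of the exploratory analysis.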

R analytics will be applied to energy systems (such as PV power plant degradation and building energy efficiency) over time, by analyzing system responses together with experimental results to identify the fundamental principles that are statistically significant in the observed system performance. It will also be applied to manufacturing systems to understand the principles of statistical process control and to identify critical factors of variability and uniformity.
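One basic idea behind statistical process control is to derive control limits from an in-control baseline and flag later measurements that fall outside them. The sketch below is a minimal illustration in Python (the course itself uses R). It assumes limits at the baseline mean plus or minus three sample standard deviations, which is a simplification; practical control charts often estimate sigma differently. The data and the function names `control_limits` and `out_of_control` are hypothetical.

```python
import statistics

def control_limits(baseline, n_sigma=3.0):
    """Return (lower, center, upper) limits from in-control baseline data."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation (assumed estimator)
    return center - n_sigma * sigma, center, center + n_sigma * sigma

def out_of_control(measurements, lower, upper):
    """Indices of measurements that fall outside the control limits."""
    return [i for i, x in enumerate(measurements) if x < lower or x > upper]

# Hypothetical baseline measurements of a part dimension
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
lower, center, upper = control_limits(baseline)

# Hypothetical new samples; the third one is far from the baseline
new_samples = [10.1, 9.9, 12.5, 10.0]
flagged = out_of_control(new_samples, lower, upper)
```

Here `flagged` holds the positions of the suspect samples, which would then be investigated for an assignable cause of variability.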

**Learning Outcomes:**

Familiarity with R Statistics, scripting, functions, packages and automated data analysis.

Familiarity with exploratory data analysis and statistical model building.

Applications of domain knowledge and statistical analytics to identify important predictors and develop initial predictive models.

Data set characteristics will include:

Variety: Types of information, including both structured and unstructured data.

Volume: Data from human sources (vendors, suppliers, distributors, customers, etc.) and from sensor networks of the energy system or factory, in both small and large data volumes.

Velocity: Energy system and manufacturing supply chain changes will be included.

In this course, we will use an open data science tool chain to develop reproducible data analyses useful for inference, modeling and prediction of the behavior of real energy and manufacturing systems. In addition to the standard data cleaning, assembly and exploratory data analysis steps essential to all data analyses, we will identify statistically significant relationships in datasets derived from population samples, and infer the reliability of these findings. We will use regression methods to model a number of both real-world and lab-based systems, producing predictive models applicable to comparable populations.

We will assemble and explore real-world datasets, use pair-wise plots to explore correlations, perform clustering and self-similarity analyses, and use logistic regression to develop both fixed-effect and mixed-effect predictive models. We will also introduce machine-learning approaches for classification, including tree-based methods. Results will be interpreted, visualized and discussed.

We will introduce the basic elements of data science and analytics using the R Project for Statistical Computing. R analytics will be applied to energy systems (such as PV power plant degradation and building energy efficiency) over time. It will also be applied to manufacturing systems to understand the principles of statistical process control and to identify critical factors of variability and uniformity.

**Learning Outcomes:**

Familiarity with an open-data tool chain including R Statistics, scripting, functions, packages, automated data analysis, Git versioning and R Markdown reproducible data science.

Familiarity with exploratory data analysis to guide data analysis.

Familiarity with inference and the significance of sample results to populations.

Familiarity with regression and linear and non-linear statistical model building, including training, testing and validation dataset strategies.

Applications of domain knowledge and statistical analytics to identify important predictors and develop initial predictive models.

Familiarity with clustering and self-similarity methods for categorization by different distance metrics.

Introduction to machine-learning approaches such as tree-based methods.

Data types include:

Time-series, spectral, image and higher-order data types, and their assembly to produce augmented and derivative datasets.

Data set characteristics will include:

Variety: Types of information, including both structured and unstructured data.

Volume: Data from human sources (vendors, suppliers, distributors, customers, etc.) and from sensor networks of the energy system or factory, in both small and large data volumes.

Velocity: Energy system and manufacturing supply chain changes will be included.