Artificial Intelligence (AI)
Artificial intelligence (AI) refers to simulating human intelligence or replicating human behavior with machines. In scope, this is quite broad, so AI is often separated into two types: weak and strong. Weak AI exhibits intelligence toward a specific, designated task and is often what companies mean when describing their technologies and services. This ranges from autonomous cars trained by machine learning to standard ovens that self-regulate temperature. Strong AI describes machines that exhibit generalized human cognition. Strong AI is still theoretical and has never been achieved in practice; a fictional example would be Skynet from the Terminator films. To avoid any confusion between these types, we choose to describe our services as Data Science Solutions.
Machine learning (ML)
Machine learning is a quantitative process that uses computing systems to learn patterns in data. Rather than defining one specific process, machine learning describes many different quantitative algorithms. Some of these algorithms try to optimize a specific objective, while others attempt to transform data for better interpretation. Use cases for machine learning range from data segmentation & clustering to data visualization to predictive modeling.
Deep learning
Deep learning is a subset of machine learning. Machine learning becomes “deep” when a model gains a high degree of complexity through learning many new representations of input data, which can subsequently be used to learn more new representations, and so on. These layers & representations are what make a machine learning model deep.
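The layering described above can be sketched in a few lines of pure Python. This is only an illustration of "depth" — each layer turns the previous layer's representation into a new one — and the weights below are made-up values, not learned ones.

```python
import math

def layer(inputs, weights, biases):
    """One fully connected layer with a tanh non-linearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

x = [0.5, -1.2]  # raw input features

# Layer 1 produces a first representation of the input...
h1 = layer(x, [[0.8, -0.3], [0.1, 0.9]], [0.0, 0.1])
# ...and layer 2 produces a new representation of that representation.
h2 = layer(h1, [[0.5, 0.5], [-0.7, 0.2]], [0.2, 0.0])

print(h1, h2)
```

A real deep learning model stacks many such layers and learns the weights from data; here they are fixed so the layering itself is visible.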
Supervised learning
In machine learning, there is often an objective we want to optimize. Supervised learning searches for the best way to predict an outcome from input data, using examples where the outcome is already known.
Unsupervised learning
In machine learning, sometimes we do not have an objective we want to optimize. Instead, we might be looking to better interpret and/or visualize our data. Unsupervised learning uses machine learning for this purpose, tackling tasks such as data projection, segmentation, and clustering.
Data science
Data science looks at how to use data to quantitatively answer questions. This process typically involves several steps. First, a question or set of questions to research is proposed. A data scientist then explores what data exists that might answer the questions, formats the data, and then analyzes the data. To analyze data, machine learning algorithms are one set of tools a data scientist can use. The output of this analysis is then used to answer the original questions.
Data analytics
Data analytics refers to using descriptive analytics and visualizations to reach conclusions about data. Data analysts require the ability to form research questions and gather data, but often they do not rely on machine learning methods.
Data engineering
Data engineering focuses on processing data into a form useful for analytics and/or data science.
Extract, Transform, and Load (ETL)
ETL is an acronym for extract, transform, and load. This process describes bringing data from one environment to another, making any changes in data format that are necessary for the destination to import incoming data.
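A minimal ETL sketch in Python might look like the following. The source format, field names, and destination are all illustrative: data is extracted from a CSV source, transformed into the types and column names the destination expects, and loaded into an in-memory list standing in for a destination table.

```python
import csv
import io

# Extract: read rows from the source environment (a CSV string here).
raw = "id,amount_usd\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: rename columns and convert strings to numeric types.
transformed = [
    {"order_id": int(r["id"]), "amount": float(r["amount_usd"])}
    for r in rows
]

# Load: write into the destination (a list standing in for a database table).
destination = []
destination.extend(transformed)

print(destination)
```

In practice the extract and load steps would connect to real systems (files, APIs, databases), but the three-step shape stays the same.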
Descriptive analytics
Descriptive analytics focuses on summarizing data to extract insights. This involves calculation and interpretation of summary statistics like mean, variance, and moving averages on datasets, as well as creating meaningful visualizations from raw data. Additionally, descriptive analytics extends to algorithms used to elucidate data, including many clustering & segmentation analyses. Descriptive analytics methods, however, are not directly used to inform predictions about future events.
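The summary statistics named above can be computed directly with Python's standard library; the data values are illustrative.

```python
import statistics

data = [3.0, 4.0, 7.0, 6.0, 5.0, 9.0]

mean = statistics.mean(data)          # central tendency of the data
variance = statistics.variance(data)  # sample variance: spread around the mean

# Moving average with a window of 3: each value summarizes a recent slice.
window = 3
moving_avg = [
    sum(data[i:i + window]) / window
    for i in range(len(data) - window + 1)
]

print(mean, variance, moving_avg)
```

Each of these numbers describes the data as it is, without predicting anything — which is exactly the boundary between descriptive and predictive analytics.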
Predictive analytics
Predictive analytics focuses on using past data to inform predictions about future events. This type of analysis is often performed by creating predictive models based on gathered data. Currently, the most powerful of these models are created through machine learning methodologies.
Models
Models are tools data scientists use to describe relationships between data. Two of the most common categories of models are known as regression models and classification models.
Regression
Regression analysis estimates the relationship between a set of input variables, often called independent variables, and a set of output variables, often called dependent variables. Typically, regression refers to the estimation of a continuous outcome, rather than a categorical outcome. Many regression models, such as logistic regression, can nevertheless be viewed as classification models by applying a decision rule to the model output.
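As a minimal sketch, a one-variable linear regression can be fit with the ordinary least-squares formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The data points are illustrative.

```python
x = [1.0, 2.0, 3.0, 4.0]   # independent variable
y = [2.1, 3.9, 6.1, 8.0]   # continuous dependent variable, roughly y = 2x

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope = covariance(x, y) / variance(x); intercept fits the means.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

print(slope, intercept)
```

The estimated relationship (slope near 2, intercept near 0) can then be used to predict a continuous outcome for new inputs.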
Classification
A classification model estimates the relationship between a set of input variables, often called independent variables, and a set of categorical output variables. For example, a binary classification model will produce one of two possible output values, often 0 and 1. Another example would be an image recognition model that predicts whether a picture is of a bird, mammal, or reptile.
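A binary classifier can be sketched as a scoring function plus a decision rule that maps the score to a categorical output, 0 or 1. The weights and threshold below are illustrative stand-ins for values a real model would learn.

```python
def predict(features, weights, threshold=0.0):
    """Score the input, then apply a decision rule to get a category."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score >= threshold else 0

weights = [0.6, -0.4]                  # illustrative, not trained
samples = [[2.0, 1.0], [0.5, 3.0]]

labels = [predict(s, weights) for s in samples]
print(labels)  # each output is categorical: 0 or 1
```

This is also the sense in which a regression model like logistic regression becomes a classifier: its continuous score is passed through exactly this kind of decision rule.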
Clustering
Clustering describes a set of modeling techniques that separate input data into one or more groupings, often based on similarities between input data samples. An example would be separating a customer base into segments. Clustering is typically viewed as an unsupervised method, while classification is viewed as a supervised method; however, the two terms are often used interchangeably.
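A tiny k-means implementation shows the idea: one-dimensional values (think of them as customer spend amounts — an illustrative choice) are split into two segments purely by similarity, with no labels provided.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given starting centers."""
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(centers, clusters)
```

The algorithm discovers the two natural segments on its own — the hallmark of an unsupervised method, in contrast to the labeled examples a classifier trains on.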
Model training
In machine learning, models must be trained in order to learn patterns in data. Specifically, it is the coefficients, such as m and b in y = mx + b, that are trained to capture data relationships. These coefficients are often trained through some heuristic that attempts to minimize the difference between observed outputs and predicted outputs.
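One such heuristic is gradient descent, sketched below for the coefficients m and b in y = mx + b: each step nudges the coefficients in the direction that shrinks the mean squared difference between observed and predicted outputs. The data and learning rate are illustrative.

```python
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1

m, b = 0.0, 0.0            # untrained starting coefficients
lr = 0.05                  # learning rate: the size of each training step

for _ in range(2000):
    # Gradients of the mean squared error with respect to m and b.
    grad_m = sum(2 * ((m * xi + b) - yi) * xi for xi, yi in zip(x, y)) / len(x)
    grad_b = sum(2 * ((m * xi + b) - yi) for xi, yi in zip(x, y)) / len(x)
    m -= lr * grad_m
    b -= lr * grad_b

print(round(m, 3), round(b, 3))  # approaches m = 2, b = 1
```

After training, the coefficients have recovered the relationship hidden in the data; the same loop, scaled up, is how far larger models are trained.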
Model evaluation
Model evaluation involves checking whether trained models are accurate and generalizable. Many methods, such as ROC curves, exist to assess how well a trained model estimates an output. Generalizability refers to whether a model maintains accuracy outside of the data used for training. Models that work well on training data but poorly on other data are said to overfit the training data. On the other hand, models that lack the complexity to capture input to output relationships even for training data are said to underfit the data.
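The simplest evaluation is to compare accuracy on training data against accuracy on held-out data the model never saw. The "model" below is a fixed decision rule standing in for something already trained, and the data points are illustrative.

```python
def predict(x):
    """A stand-in for a trained binary classifier."""
    return 1 if x >= 0.5 else 0

train = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]   # data used for training
test = [(0.3, 0), (0.6, 1), (0.4, 1)]              # held-out data

def accuracy(pairs):
    return sum(predict(x) == label for x, label in pairs) / len(pairs)

print(accuracy(train), accuracy(test))
```

A large gap between training and held-out accuracy is the signature of overfitting; low accuracy on both suggests underfitting.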
Deployment
After we are satisfied with a model, post-evaluation, we often want to push it into production. The process of productionalizing a model is known as deployment. Deployment typically involves coordination between data scientists and data engineers.
Iterative development
Once a model goes into production, further improvement is often still possible. Iterative development is the process of improving and deploying models in a cyclic fashion. This workflow allows for results to be quickly productionalized and improvements to be easily measured.