A Beginner’s Guide to Important Topics in AI, Machine Learning, and Deep Learning.


When unlabeled data is abundant but labeling it is difficult, you’ll want to use active learning (AL). AL interactively queries the oracle (human user) to judiciously select particular points in the data from which to learn. This helps achieve greater accuracy using fewer labeled training examples.

A set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service. The term is also commonly used for a publicly available web-based API that returns data, typically in JSON or XML.

AUC (area under the ROC curve) is a metric used when evaluating a binary classifier. It measures how well the classifier’s scores distinguish between the two categories (for example, “healthy” versus “diseased”). In practice, AUC ranges from 1.0, which represents a perfect classifier, down to 0.5, which means that the classifier is no better than flipping a coin.

An autoencoder is an ANN that is designed to encode input data into fewer dimensions and then reconstruct the input from that encoding as its output. The encoding compresses the data, ignoring the “noise”, which allows the model to learn only the important features.

AutoML is when the cumbersome and technical parts of machine learning are performed automatically, thus allowing even non-experts to utilize and enjoy the benefits of machine learning. Using AutoML allows teams to build machine learning pipelines without needing to rely on assistance from professional machine learning personnel.

An iterative, recursive method used to train ANNs by calculating the gradient of the loss function with respect to the weights in each layer of the network. Gradient descent then uses these gradients to update the weights and improve the network.

In the bag-of-words model, a text is represented as a multiset of its words, disregarding grammar and even word order but keeping multiplicity (the number of occurrences of each word). The model is commonly used in methods of document classification where the frequencies of words are used as a feature. The model is also used in natural language processing (NLP), information retrieval (IR) and computer vision.
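A minimal sketch of the bag-of-words idea in Python, assuming simple lowercase whitespace tokenization (real pipelines usually also strip punctuation). The function name `bag_of_words` and the two sample sentences are illustrative only.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) for the text, ignoring word order."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokens = text.lower().split()
    return Counter(tokens)

doc_a = "the cat sat on the mat"
doc_b = "on the mat the cat sat"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a)           # word counts for doc_a
print(bow_a == bow_b)  # word order differs, but the bags are identical
```

Because only counts are kept, two texts with the same words in a different order map to the same representation.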

Bagging (bootstrap aggregating) is an ML ensemble meta-algorithm designed to improve the stability and accuracy of ML algorithms. It draws many random samples from the original dataset with replacement, trains a model on each sample, and combines the models’ predictions, for example by averaging or voting.

The baseline solution is a simplistic approach for solving a problem. While it typically will not provide the best results, it is useful as a point of comparison when evaluating the accuracy of a new model: you compare the new model’s results with the track record of the baseline solution.

In ensemble learning, a collection of models (often of the same type) is generated. Each one as an individual might be fairly inaccurate, or “weak”. But when the results are combined (for example, by averaging), the aggregate score can often achieve greater accuracy than one single “strong” learning model.

A model trains on one batch of data samples at a time until all samples are used up. (For example, for 5000 samples, a batch size of 100 requires running 50 batches.) The model improves after each batch is processed. You must find a balance between a large batch size, which allows more significant improvement after each run, and a large number of batches, which allows for numerous opportunities to improve.

Batch Normalization is a technique for improving the speed, performance, and stability of neural networks. It is used to normalize the input layer by adjusting and scaling the activations in order to provide similar scales for all inputs. Example: if the dataset includes the 2 features – age (0-100) and miles driven (0-100K) we would like to have them both in the same scale (0-1) in order to avoid instability.

Batch prediction is when you use a model to predict an outcome for a set of instances, and not just a single sample. Batch prediction is common when there’s no need for real-time predictions.

In probability theory, Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. Example: if cancer is related to age, then, using Bayes’ theorem, a person’s age can be used to assess the probability that they have cancer. Bayes’ formula looks like so (where A and B are two events and P(B) is not zero): P(A|B) = P(B|A) x P(A) / P(B)
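To make the theorem concrete, here is a small worked sketch in Python. The numbers (1% prevalence, 90% sensitivity, 5% false-positive rate) are hypothetical and chosen only for illustration, and the function name `posterior` is not from any library.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded by total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical inputs: A = "has the disease", B = "test is positive".
p = posterior(prior_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(p, 4))
```

Under these assumed numbers the posterior stays well below the test’s sensitivity, because the condition is rare to begin with.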

A model that can accurately predict features of a data set is said to have low bias. Unfortunately this is sometimes due to overfitting, which is when the model incorporates “noise” and outliers, rather than just the important features. Thus, very low bias will usually result in high variance.

Bias is…

A constant you add to help the model fit the data.

The systematic prejudice in a model’s results.

Caused by erroneous assumptions in the machine learning process.

Used to delay the triggering of the activation function.

Typically a model with very low bias is only able to achieve great accuracy on the training set because it incorporates “noisy” data points. This is called “overfitting”. Such a model will do poorly when introduced to other data sets that don’t possess that same noise. In order to have consistent accuracy for other sample sets (“low variance”), you will need to sacrifice some degree of bias.

A bigram is a sequence of two adjacent elements in a string of tokens, for example a sequence of two words: “please turn”, “turn in”, “in your”, or “your homework”. A bigram is an n-gram where n=2.

Binary variables are variables which only take two values. For example Yes/No, 1/0, Female/Male, Win/Loss.

Binary Classification is the task of classifying elements of a given set into two groups on the basis of a classification rule. Example: Medical Testing to determine if a patient has a certain disease or not.

Data Binning is a data pre-processing technique used to reduce the effects of minor observation errors. It is done by grouping a number of more or less continuous values into a smaller number of “bins”. Example: if there is a dataset of people, we might want to group their ages into a smaller number of age intervals. This technique is also used in Image Processing when combining a cluster of pixels into a single pixel (reducing the number of pixels).

Boosting is an ML meta-algorithm used to reduce bias and variance in supervised learning, and to convert weak learners into strong learners. AdaBoost was an early, influential boosting algorithm and remains a popular one. More recent algorithms include LPBoost, TotalBoost, BrownBoost, XGBoost and more.

Bootstrapping is a re-sampling technique used to estimate statistics on a population by sampling a dataset with replacement. In the context of ML, bootstrapping is sometimes used when the dataset is too small to evaluate the skill of the machine learning model reliably.

A bottleneck is a layer in a neural network that is smaller (has fewer neurons) than the previous layer or the following layer. Having such a layer encourages the network to compress the feature representations to best fit in the available space, in order to get the minimal loss during training.

Bucketing is the act of dividing data according to the most predictive features and then analyzing the resulting subgroups. For example: in the illustration below we can see how the data has been bucketed into subgroups of car prices.

Datasets may have two types of variables: Continuous (Numerical) and Categorical. If you wish to use Regression based algorithms you will need to convert the Categorical variables into numeric variables. This conversion is needed because in most ML libraries you can’t fit categorical variables into a regression equation in their raw form.

Categorical Variable is a variable that can take a fixed number of possible values. Example: If the data is about humans, Categorical Variables may be: 1) Blood Type (A, B, AB, O), 2) Preferred Political Party.

In mathematics and physics, the centroid of a plane figure is the arithmetic mean position of all points in the figure. It is the point at which a cutout of the shape could be perfectly balanced on the tip of a pin. In the context of ML, we treat the centroid as the center of a group of samples in the dataset.

A class is the category with which an example in the dataset is “labeled” or “tagged”. A class label is a discrete attribute having finite values that the classifier wants to predict based on the values of the other attributes (features). Example: in a dataset which describes families of dogs, each family is a class.

The process of classifying examples to categories. In the context of ML, Classification is a supervised learning process where an algorithm learns given data and then uses it in order to classify new observations. (Unlike regression which uses continuous numerical labels.) Examples: an email of text can be classified as belonging to one of two classes: “spam” and “not spam”.

Classification Algorithm is an algorithm that receives labeled data as input, learns the given data, then classifies new instances based on the previous data. Popular classification algorithms are: SVM, XGBoost, LogisticRegression, and more. Some examples of classification problems are: speech recognition, handwriting recognition, biometric identification, document classification, etc.

Classification Threshold (a.k.a. decision threshold) is a value that converts the result of a quantitative test to a simple binary decision. Example: when using logistic regression, values are mapped to a number between 0 and 1. If the classification threshold is set to 0.6, any example that the model scores at 0.6 or more is labeled as 1, and the others are labeled as 0. In the image below, the classification threshold is 0.5.

Classification Model is an algorithm that can classify input values given for training and predict the class labels/categories for the new data. Example: a model which receives images of dogs and cats and classifies each image into one of the two classes.

A Cluster is a subset of a dataset that is made of similar examples. In the below illustration the image to the right depicts each cluster as a green circle.

Clustering is the task of dividing the population or data points into a number of groups so that data points in the same group are more similar to other data points in the same group, and less similar to the data points in other groups. In the illustration below we can see the points in the left image, and then we can see them clustered into three groups in the right image.

A Clustering Algorithm is an algorithm which performs clustering over a given dataset. There are many types of clustering algorithms: some are based on connectivity between examples, some on centroids, some on density models, and more. One example of a well-known clustering algorithm is k-means.

When solving a machine learning classification problem, a confusion matrix (a.k.a error matrix) is a table which allows visualization and understanding of the performance of an algorithm. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. The name comes from the idea that this matrix shows if the model confuses two classes. Illustration:
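A minimal sketch of building such a table in plain Python, assuming the row/column convention described above (rows are predicted classes, columns are actual classes; note that some libraries use the transposed layout). The function name and the cat/dog data are made up for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows index the predicted class, columns index the actual class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix

# Hypothetical labels for six examples.
actual    = ["cat", "cat", "dog", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]

cm = confusion_matrix(actual, predicted, labels=["cat", "dog"])
for row in cm:
    print(row)
```

The diagonal holds the correctly classified examples; every off-diagonal cell counts one kind of confusion between two classes.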

Continuous Variables are numeric variables that have an infinite number of values between any two values. More intuitively, a continuous variable is a numeric value which is not discrete. Examples: height of students in class, time it takes to get to school or distance traveled between classes.

An iterative algorithm is said to converge when as the iterations proceed, the output gets closer and closer to some specific value. More precisely, no matter how small an error range you choose, if you continue long enough the function will eventually stay within that error range around the final value. Important – not every algorithm will converge.

In mathematics, convolution is a mathematical operation on two functions to produce a third function that expresses how the shape of one is modified by the other.

A convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input.

CNN is a class of deep neural networks, most commonly applied in the field of Computer Vision. CNNs are able to take in an input matrix (image), assign importance (learnable weights and biases) to various aspects/objects in the image and differentiate one from the other.

Cosine Similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between the vectors. It is computed as the dot product of the vectors divided by the product of their lengths. In the image below, A and B are two vectors, and their multiplication is their dot product.
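The definition can be sketched in a few lines of standard-library Python. The function name `cosine_similarity` and the sample vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The measure is only defined for non-zero vectors.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters: scaling either vector by a positive constant leaves the similarity unchanged.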

Cost Function is a mechanism utilized in supervised machine learning. The cost function returns the error between predicted outcomes and actual outcomes. The aim of supervised machine learning is to minimize the overall cost, thus optimizing the correlation of the model to the system it attempts to represent.

Cross-Validation (CV) is one of the techniques used to test the effectiveness of a machine learning model. It is also a re-sampling procedure used to evaluate a model if we have limited data. A common technique to perform cross-validation is k-fold cross validation.
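As a rough sketch of how k-fold cross validation splits the data, the hypothetical helper below yields index lists for each fold. It assumes contiguous, unshuffled folds; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample is used for validation exactly once and for training in the other k-1 rounds; the k validation scores are then typically averaged.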

A Dataset is a collection of data. Two well known datasets used in ML are Iris flowers and MNIST.

A Decision Boundary is the surface that separates the classes in a classification problem with two classes. It divides the examples into two classes: all the points on one side of the boundary are classified as class A, and all the points on the other side of the boundary are classified as class B. In the illustration below, the separating line is a decision boundary.

Decision Tree is a predictive modeling approach used in statistics, data mining and machine learning. Tree models where the target variable is a discrete set of values (e.g., ClassA/ClassB) are called Classification Trees; in these trees the leaves are class labels. A Decision Tree in which the target variable can take continuous values is called a Regression Tree. Illustration below:

Dimensionality Reduction is the process of reducing the number of features in a dataset. The process is divided into feature selection (finding a subset of features of the original set which models the dataset well) and feature extraction (reducing the data to a lower dimension). Popular algorithms are: PCA, LDA and GDA. In the image below, in the left image we have a 3-dimensional image which is reduced to two 2-dimensional images.

Dropout is a regularization technique for neural network models proposed by Srivastava et al. In their 2014 paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, they explain Dropout as a technique that randomly selects neurons to be ignored during the training process. The neurons are “dropped-out” randomly, which means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.

A Dummy Variable is a variable that takes the value 0 or 1 indicating the absence or presence of a categorical effect that may be expected to shift the outcome. Dummy Variables are used as devices to sort data into mutually exclusive categories (Example: Smoker/non-Smoker). Usually 1 stands for true and 0 stands for false.

Early Stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method such as gradient descent (GD). If your model learns your training data “too well”, it will be counterproductive. This is because the model will be incorporating too much of the noise features, rather than the important ones. You should stop the training before the generalization error gets too high.

An embedding is a mapping of a discrete categorical variable to a vector of continuous numbers. For example, a 300 dimensional embedding for English words could include:

blue: (0.01359, 0.00075997, 0.24608, …, -0.2524, 1.0048, 0.06259)

blues: (0.01396, 0.11887, -0.48963, …, 0.033483, -0.10007, 0.1158)

Ensemble Learning is a process where multiple models are strategically generated and combined to solve a particular problem. Ensemble learning is primarily used to improve the performance of a model, or to reduce the likelihood of selecting a poor one. A well known ensemble learner is the “Bayes Optimal Classifier,” which ensembles all the hypotheses in the hypothesis space.

Ensemble models combine the decisions of multiple models to improve the overall performance of machine learning. A max voting algorithm is a great example of an ensemble learning algorithm. This algorithm is generally used for classification problems. Let’s say you ask 5 people to rate a restaurant and three of them give it a score of 4 and two give it a score of 3. Since the majority gave a rating of 4, the final rating will be scored as 4.
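The restaurant example can be sketched with a simple majority count. The helper name `max_vote` is an assumption for illustration; in this sketch ties go to the value that was seen first.

```python
from collections import Counter

def max_vote(predictions):
    """Return the most frequent prediction (ties go to the first one seen)."""
    return Counter(predictions).most_common(1)[0][0]

# Five raters: three give a score of 4, two give a score of 3.
ratings = [4, 4, 4, 3, 3]
final_rating = max_vote(ratings)
print(final_rating)
```

In an ensemble classifier, `predictions` would hold the class predicted by each individual model for the same input.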

An epoch is defined as the completion of one learning cycle through your data set during the training process. In a single epoch, all training samples are presented to the model exactly once.

Evaluation Metrics are used to evaluate machine learning algorithms and are an important part of every project. A model may give good results when evaluated with one specific metric but poor results when evaluated against other metrics, so the choice of metric matters.

In the context of ML, an example is a single object in the dataset. For example: if the dataset describes hot dogs, each hot dog is an example.

A labeled example is an example in the dataset which has a label. Example: if the dataset describes hot dogs and their condiments, each hot dog is an example and the condiment is the label.

Unlabeled examples are examples in the dataset that don’t have a label.

An Exploding Gradient Problem is an issue found in training artificial neural networks with gradient based learning methods and backpropagation. Error gradients can accumulate during an update and result in very large gradients. These error gradients create large updates to the network weights, and in turn, result in an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values.

F1 Score is a measure of a test’s accuracy that combines precision and recall. The F-score is often used in information retrieval (IR) and in natural language processing (NLP) to measure performance.

The formula to calculate it is: F1 = 2 / (1/recall + 1/precision) = 2 x (precision x recall) / (precision + recall)
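A short sketch of the calculation from raw counts. The function name and the example counts are hypothetical, and returning 0.0 when precision or recall is undefined is an assumed convention, not part of the formula itself.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2 * precision * recall / (precision + recall)."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # assumed convention when precision or recall is undefined
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
score = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(round(score, 4))
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a model cannot score well by excelling at only one of them.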

When a classifier labels an example as “False”, when it should have been labeled as “True”.

Example: A pregnancy test which returns negative, but should have returned positive.

When a classifier labels an example as “True”, when it should have been labeled as “False”. Sometimes called a “False Alarm”.

Example: If an Antivirus program on your computer incorrectly classifies a program as malicious.

A feature is a measurable property or characteristic of a phenomenon being observed. For example, features of a car might include its make, model, year of production, engine size, etc. One may wish to analyze how these features are related to another feature, such as price.

The process of ranking the features or attributes by their value to the predictive ability of the model. This is usually done by algorithms like decision trees or subset regression.

A feature vector is a set of numeric values which represent an object relative to a given set of features.

A feedforward neural network is a neural network where the nodes and the edges do not form a cycle. Essentially, the information moves in one direction. It moves from the input layer through the hidden layers, all the way to the output layer.

Feedforward neural networks are used mainly for supervised ML tasks. They are typically used when the target function is known. They can also be used in the fields of Computer Vision and NLP.

Few-shot learning is the practice of feeding a learning model with a very small amount of training data. It is a common solution whenever only a small amount of data is available. This technique is mostly utilized in the field of Computer Vision, so an object categorization model will still give appropriate results even with only a few training samples.

A gated recurrent unit (GRU) is a recurrent neural network unit that attempts to solve the vanishing gradient problem. Its approach is similar to that of Long Short-Term Memory (LSTM). The difference is that a GRU has only two gates, an update gate and a reset gate, where an LSTM has three.

Gradient clipping is a technique that prevents exploding gradients in very deep networks such as RNNs. Exploding gradients can occur when the gradient becomes too large and error gradients accumulate, resulting in an unstable network. Gradient Clipping involves forcing the gradient values to a specific minimum or maximum value if the gradient exceeds an expected range.
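A minimal sketch of two common clipping variants on a plain list of gradient values. Both function names are illustrative; clipping by norm is a widely used alternative to the value clipping described above and is included here as an assumption about what a reader may also encounter.

```python
import math

def clip_by_value(gradients, min_value, max_value):
    """Force each gradient component into [min_value, max_value]."""
    return [max(min_value, min(max_value, g)) for g in gradients]

def clip_by_norm(gradients, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

clipped = clip_by_value([0.5, -12.0, 30.0], min_value=-5.0, max_value=5.0)
rescaled = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
print(rescaled)
```

Value clipping can change the direction of the gradient vector, while norm clipping keeps the direction and only shrinks the step size.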

Gradient descent is an optimization algorithm that finds a local minimum of a function. The algorithm takes iterative steps, each proportional to the negative of the gradient of the function at the current point. It usually starts with initial parameters (weights and bias) and improves them gradually: it calculates the gradient to learn how the cost function changes for weights close to the current ones, then moves in the direction that reduces the cost. This step is repeated, often thousands of times, until the cost stops decreasing.
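The update rule can be sketched on a one-dimensional toy problem. The cost function f(x) = (x - 3)^2, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - learning_rate * grad(x)."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Toy cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(x_min)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor; a much larger rate would overshoot and could diverge.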

Grid Search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. Example: if a model takes a,b,c – three hyperparameters as an input, the grid looks like:

a = [ … ] , b = [ … ] , c = [ …] (the dots are numbers) and the algorithm runs all the possible combinations and looks for the best one. Important – it might take a long time to run!
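The procedure can be sketched with `itertools.product`. The grid values and the `toy_score` function below are hypothetical stand-ins: in a real setting the scoring function would train a model with the given hyperparameters and return its validation score.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "train and evaluate a model"; higher is better.
def toy_score(a, b, c):
    return -((a - 2) ** 2 + (b - 0.1) ** 2 + (c - 5) ** 2)

grid = {"a": [1, 2, 3], "b": [0.01, 0.1, 1.0], "c": [5, 10]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The number of runs is the product of the list lengths (here 3 x 3 x 2 = 18), which is why grid search becomes slow as hyperparameters are added.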

Hidden Layers are the layers in the neural network in between the input layer and the output layer where the neurons take in a set of weighted inputs and produce an output through an activation function.

Hierarchical Clustering is an algorithm that groups similar objects into groups called clusters.

The output of the algorithm is a set of clusters, where each cluster is distinct from one another. Within the cluster there are objects which are similar to one another.

Hinge Loss is a loss function used for training classifiers. The hinge loss is used for “maximum-margin” classification, most notably used for a support vector machine (SVM).

A Holdout Set is another name for “test set”, which means that it is a subset of the dataset that is held out, and not used for training algorithms. This set usually provides a final estimate of the machine learning model’s performance after training and validation.

A parameter whose value is set before the learning process begins. Different model training algorithms require different hyperparameters and some don’t even require any. Given the hyperparameters, the training algorithm learns the parameters from the data. The selection of the hyperparameter can dramatically influence the time of learning and the performance of the algorithm.

Hyperparameter Tuning is the act of choosing a set of optimal hyperparameters for a learning algorithm. Hyperparameters are parameters with values that control the learning process.

A hyperplane (H) is a linear subspace of a vector space (V) such that the basis of H has cardinality one less than the cardinality of the basis for V. In other words, if V is an n-dimensional vector space then H is an (n-1)-dimensional subspace. In ML, it may be useful to employ techniques such as SVM to learn hyperplanes to separate the data space for classification.

The very beginning of the workflow for the artificial neural network. It is composed of input neurons that bring the initial data into the system for further processing in the subsequent layers of artificial neurons. The neurons in the input layer are considered “passive” neurons since they are simply used to receive incoming data to the network.

An example or single occurrence of something. In the specific context of ML, it refers to a single unit of data which is given to the algorithm either for training or for testing.

Example: In a problem of creating a classifier which classifies patients to diseases: the instance is the patient.

Also known as Memory Based Algorithm. It is a family of learning algorithms that do not perform explicit generalization, rather, the algorithm compares new problem instances with instances which were seen in training and have been stored in memory. It is called that because the hypothesis created by the algorithm is created directly from the training data instances themselves. This means that the hypothesis complexity grows as the size of the data grows.

An example of an instance-based algorithm is KNN (k-nearest-neighbors).

Illustration of KNN – the red triangles and the blue squares are previously seen instances, the algorithm should classify the green circle with a question mark based on the previous instances.

An iteration is one pass through a set of operations that is executed repeatedly in computer code. For example: one execution of the body of a loop.

K-Means is a clustering algorithm. This algorithm aims to cluster the data to K groups (clusters) such that each example in the data is as close as possible to the nearest centroid. In the illustration below the green dots are the centroids and the colored spheres are the clusters.

A classic algorithm for classification and regression. The algorithm receives a natural number K and a training set as input. At test time, for each sample the algorithm finds the K closest elements in the training set and classifies the sample by a “plurality vote” among them.

Example: In the illustration below, there are 11 samples in the training set (5 of them are classified as red triangles and 6 of them are classified as blue squares).

Now, the algorithm should classify the green circle. If the algorithm receives k = 1 or k = 2 as an input, it will classify the green circle as a red triangle, but if k grows to 11 the algorithm will classify the green circle as a blue square, because 6 of the 11 training samples are blue squares.
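A minimal KNN classifier sketch in Python. The coordinates below are hypothetical and only mimic the setup described here (5 red triangles, 6 blue squares, one query point); they are not taken from the illustration. It assumes Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train is a list of ((x, y), label); classify query by plurality vote."""
    # Sort training points by Euclidean distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: 5 red triangles and 6 blue squares.
train = [
    ((0.5, 0.0), "red triangle"), ((3.0, 3.0), "red triangle"),
    ((-3.0, 3.0), "red triangle"), ((3.0, -3.0), "red triangle"),
    ((-4.0, -4.0), "red triangle"),
    ((1.0, 1.0), "blue square"), ((-1.0, 1.0), "blue square"),
    ((1.0, -1.0), "blue square"), ((2.0, 0.0), "blue square"),
    ((0.0, 2.0), "blue square"), ((-2.0, 0.0), "blue square"),
]
green_circle = (0.0, 0.0)

print(knn_classify(train, green_circle, k=1))
print(knn_classify(train, green_circle, k=11))
```

With k equal to the size of the training set, every sample votes, so the prediction is simply the majority class regardless of where the query point lies.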

Project management consists of various specific measurement tools for indicating how well teams are achieving specific goals.

Kernel methods are a class of algorithms for pattern analysis. In ML, the term usually refers to the “kernel trick”, which is a method of using linear classifiers to solve non-linear problems.

Labels are the final output of the learning algorithm (the output classes are also considered labels). This term is also used in the context of samples that have been tagged.

A Labeled Example is an example in the dataset which has a label. Example: if the dataset contains people’s details, the example might be a single person and the label might be the gender.

In the context of software development, a library is a suite of data and programming code that is used to develop software programs and applications. It is designed to assist both the programmer and the programming language compiler in building and executing software.

Linear Regression is a linear approach to modeling the relationship between examples and their labels. The algorithm seeks coefficients a, b such that the line y = a*x + b fits the data as closely as possible. The red line is the fitted model. The loss function is typically a squared loss: because the line will rarely pass through every point exactly, the loss measures the squared distance between the predictions and the real points.

A Learning Algorithm is an algorithm which performs learning over a given dataset and returns a hypothesis (prediction rule) which is able to predict based on the previous data.

Learning Rate is a hyperparameter which determines to what extent newly acquired information overrides old information. As the goal is to find the minimum error, a learning rate that is too high will make the learning jump over minima, and if it is too low, the learning will either take too long to converge or get stuck in an undesirable local minimum.

Log loss is a classification metric based on probabilities; it is used when the model outputs values between 0 and 1. It measures the performance of a classification model by measuring the uncertainty of the model’s predicted probabilities relative to the actual labels. A perfect model would have a log loss of 0.
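A sketch of binary log loss in standard-library Python. The function name, the sample labels, and the clipping constant `eps` (used to avoid taking the log of 0) are illustrative assumptions.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident predictions that turn out to be wrong are penalized far more heavily than confident predictions that are right are rewarded.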

Logistic Regression is an ML algorithm used for classification problems. It is a predictive analysis algorithm based on the concept of probability. The algorithm classifies with the sigmoid function, which always returns values between 0 (absolutely false) and 1 (absolutely true). An illustration of the sigmoid function is below.

LSTM is an RNN architecture. An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

Loss Function is a method of evaluating how well specific algorithm models perform based on the given data. If the model predictions deviate too much from the actual results, the loss function reports a very large number; conversely, if the model predictions are close to the actual results, the loss function reports smaller numbers.

LSTM is an RNN architecture. LSTM networks are well-suited for classifying, processing and making predictions based on time series data. LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered while training traditional RNNs.

Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task by relying on patterns and inference instead of explicit instructions. ML algorithms are used in a wide variety of applications, such as computer vision (CV), data mining, natural language processing (NLP), etc.

A Machine Learning Algorithm is an algorithm which builds a mathematical model based on sample data (“training data”), in order to make predictions or decisions without being explicitly programmed to perform the task.

Metric Learning (a term close to similarity learning) is the task of learning a distance function over objects to evaluate similarities between them. Each task should have a different distance function. Example: assume the task is to cluster documents, first based on the topics and second based on the author. Each of those tasks requires a different distance function.

Minibatch is a small subset of the training data. When the training data is split into small batches, each batch is called a minibatch. I.e., 1 < size(minibatch) < size(training data).

For example: if we have 1 billion inputs of training data and you set your minibatch size to 512, each training step (iteration) processes 512 inputs, and one epoch consists of 1,953,125 such steps. So the minibatch size is the amount of data you would like to process in each step, not in each epoch.
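A small sketch of splitting data into minibatches, assuming that an epoch is one full pass over the training data (as defined under epoch earlier in this glossary). The helper name and the ten-item dataset are illustrative.

```python
def iterate_minibatches(data, batch_size):
    """Yield consecutive slices of data; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(10))
batches = list(iterate_minibatches(data, batch_size=4))
print(batches)

# Steps needed for one pass over 1 billion inputs with minibatches of 512.
n_inputs, batch_size = 1_000_000_000, 512
steps_per_epoch = (n_inputs + batch_size - 1) // batch_size  # ceiling division
print(steps_per_epoch)
```

One epoch therefore consists of many minibatch steps, with the model’s parameters typically updated after each one.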

A model is a mathematical representation of a real-world process. In the context of ML, generating learning-algorithms requires data so the algorithm can learn and then perform a task based on the input data.

A Classification Model is a model that tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data. Examples of classification models: logistic regression, decision trees, random forests, etc.

A Regression Model is a model which is used to predict a continuous value. Examples of regressions: linear, polynomial, etc. A classic example of a regression model would be predicting the price of a house using features like size, number of rooms, etc.

Multi-Class Classification is a classification task with more than two classes. In multi-class classification each sample is assigned to one and only one target label. Example: an animal can be a cat, a dog or any other animal, but not two of them at the same time.

Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems. It is a model that is used to predict the probabilities of different possible outcomes.
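
Multinomial logistic regression typically turns one real-valued score per class into probabilities with the softmax function. The sketch below shows only that last step; the input scores are made-up values standing in for the output of a fitted model.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes (e.g. cat, dog, other).
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The probabilities sum to 1, and the predicted class is the one with the highest probability.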

In the context of ML, an artificial neuron is a mathematical function. It takes one or more inputs, multiplies each by a value called a “weight”, and adds the results together. This sum is then passed to a non-linear function called the “activation function” to become the neuron’s output.
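
A minimal sketch of a single neuron follows. The sigmoid activation and the optional bias term are assumptions chosen for illustration; the definition above does not prescribe a particular activation function.

```python
import math

def sigmoid(z):
    """A common non-linear activation function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs (plus a bias), passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# 1.0 * 0.5 + 2.0 * (-0.25) = 0.0, and sigmoid(0.0) = 0.5
output = neuron([1.0, 2.0], [0.5, -0.25])
```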

In the context of ML, Artificial Neural Networks are computing systems that “learn” to perform tasks by considering examples, generally without being programmed with task-specific rules. The network is based on a collection of connected nodes called “neurons”. Each connection can transmit a signal (value) from one neuron to another. A neuron that receives a signal processes it and then signals the additional neurons connected to it.

Normalization is a technique often applied as part of data preparation. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Example: assume the dataset includes two features, age (0-100) and income (0-20k). Both can be scaled to the range 0 to 1.
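
One common way to do this is min-max scaling, sketched below. The function name `min_max_scale` and the sample ages and incomes are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 35, 50, 80]
incomes = [5_000, 12_000, 20_000, 8_000]

scaled_ages = min_max_scale(ages)        # [0.0, 0.25, 0.5, 1.0]
scaled_incomes = min_max_scale(incomes)
```

After scaling, both features lie in the same 0 to 1 range, so neither dominates just because its raw numbers are larger.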

One-Hot Encoding is a method of preprocessing categorical data into binary form: each category gets its own position in a vector, where 1 indicates that the category is present and 0 indicates that it is not.
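
A minimal sketch of one-hot encoding a column of category labels is shown here. The function name and the colour values are made up for the example, and ordering the categories alphabetically is just one possible convention.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot"
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
# categories: ['blue', 'green', 'red']
# encoded:    [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```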

One-Shot Learning is an object categorization problem that aims to learn information about object categories from one, or only a few, training samples. This technique is commonly used in computer vision (CV).

An outlier is an object that deviates significantly from the rest of the objects. It can be caused by measurement or execution error. In the illustration below it’s clear that the single isolated point is an outlier.

The last layer in a neural network, which produces the outputs for the program. Though they are made much like other neurons in the network, the output layer neurons may be built or observed in a different way, given that they are the last “actor” nodes on the network.

A model that fits its training data too closely, to the point that it starts learning from the noise and inaccurate entries in the dataset. As a result, the model does not categorize new data correctly, because it has absorbed too much detail and noise.

A method of ML in which data becomes available in sequential order and is used to update the best model for future data at each step (unlike batch learning, which first learns the whole dataset and then develops a hypothesis). This technique is commonly used in situations where it is computationally infeasible to train over the entire dataset, for example: stock price prediction.

In the context of models, a parameter is a configuration variable that is internal to the model and whose value can be estimated from data. In programming, a parameter is a value passed to a function (a function argument), and it can take one of a range of values. In ML, the specific model being used plays the role of the function, and it requires parameters in order to make predictions.

A partitioning clustering algorithm classifies the observations in a data set into multiple groups based on their similarity. When running the algorithm, we need to define the number of clusters to be generated.

Pooling is a type of layer used in CNNs. Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. There are two types of pooling: local and global. Usually, local pooling combines small clusters (2×2 for example), while global pooling acts on all the neurons of the convolutional layer. In the illustration below there are examples of max pooling and average pooling.
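
To make local pooling concrete, here is a small sketch of 2×2 max pooling and average pooling over a toy feature map. The helper names `pool2d` and `mean` and the numbers in the map are assumptions for illustration; the windows do not overlap.

```python
def pool2d(matrix, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window of the matrix."""
    pooled = []
    for r in range(0, len(matrix) - size + 1, size):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, 2, max)   # [[7, 8], [9, 6]]
avg_pooled = pool2d(feature_map, 2, mean)  # [[4.0, 5.0], [4.5, 3.0]]
```

In both cases a 4×4 map shrinks to 2×2: max pooling keeps the strongest value in each window, while average pooling keeps the window's mean.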

In statistics, Population is a set of similar items or events which is of interest for some question or experiment.

Precision is the proximity of two measurements to each other.

In the context of pattern recognition, information retrieval and binary classification, precision is the fraction of relevant instances among the retrieved instances.

Predictive Modeling is the general concept of building a model that is capable of making predictions. Typically, such models include an ML algorithm that learns certain properties from a training dataset in order to make those predictions.

Principal Component Analysis (PCA) is a technique that finds underlying variables (known as principal components) that best differentiate the data points, i.e., the directions along which the data varies the most. Example: if the dataset is a collection of food items, vitamin C content can be a principal component, as it is present in vegetables but absent in meat. The meat items would then be separated by another component, and so on.

Random Search is a technique where random combinations of hyperparameter values are tried in order to find the best configuration for the model being built. Each hyperparameter is sampled from a range of values. To optimise with random search, the objective function is evaluated at some number of random configurations in the parameter space, and the best one is kept.
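
A minimal sketch of random search is given below. Everything named here is hypothetical: `toy_objective` stands in for "train a model and return its validation loss", and the two hyperparameters and their ranges are invented for the example.

```python
import random

def random_search(objective, search_space, n_trials=50, seed=0):
    """Sample n_trials random configurations and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.uniform(lo, hi) for name, (lo, hi) in search_space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_objective(config):
    # Stand-in for a real validation loss; its minimum is at (0.1, 0.3).
    return (config["learning_rate"] - 0.1) ** 2 + (config["dropout"] - 0.3) ** 2

space = {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.9)}
best_config, best_score = random_search(toy_objective, space)
```

In practice the objective is expensive, so the number of trials is the main budget knob.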

In the context of pattern recognition, information retrieval and binary classification, recall is the fraction of relevant instances that have been retrieved over the total amount of relevant instances. In other words, recall measures how many of all the correct hits you found. (Not to be confused with precision, which is how many of the items found were indeed correct.)
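
Both metrics can be computed from counts of true positives, false positives and false negatives. The sketch below uses made-up labels, with 1 as the positive class; the function name is an assumption.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0]

# 3 true positives, 1 false positive, 2 false negatives
precision, recall = precision_recall(y_true, y_pred)
```

Here 3 of the 4 predicted positives are correct (precision 0.75), but only 3 of the 5 actual positives were found (recall 0.6).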

Class of neural networks where connections between nodes form a directed graph along a temporal sequence. Unlike feedforward networks, in this type of network, a cycle can be formed and this behavior allows the creation of “internal memory” in order to process sequences of inputs. This makes them applicable to tasks such as handwriting recognition or speech recognition.

Also known as regression analysis. Regression is a set of statistical processes for estimating the relationships among variables. It includes many modeling and analysis techniques that attempt to determine the strength of the relationship between one dependent variable and a series of other changing variables (known as independent variables).

Example: in linear regression, the algorithm tries to fit a linear hypothesis of the form Y = aX + b, where a and b are real numbers, to the given data. In the terms used above, Y is the variable that the predictor tries to predict (the dependent variable), and X is the variable that the predictor uses to predict Y (the independent variable).
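
For a single input variable, a and b can be estimated in closed form with ordinary least squares. The sketch below assumes that method (the text above does not name one) and uses invented data points that lie exactly on Y = 2X + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x          # slope
    b = mean_y - a * mean_x     # intercept
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)

prediction = a * 5.0 + b  # predict Y for a new X
```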

An algorithm that performs a regression on a dependent variable and one or more independent variables. For example: Linear regression algorithm, logistic regression algorithm.

A Regression Model is used to estimate the relationships among variables. The focus is on the relationship between a dependent variable and one or more independent variables. Most often it is used to estimate the conditional expectation of the dependent variable given the independent variables.

One of three basic machine learning paradigms (alongside supervised learning and unsupervised learning). Reinforcement learning is about taking suitable actions to maximize reward in a particular situation. In this family of algorithms the input is an initial state. There are many possible outputs, as there is a variety of solutions to a particular problem. The training phase is based upon the input: the model returns a state, and the user decides whether to reward or punish the model based on its output. The best solution is decided based on the maximum reward.

Representation Learning (also known as feature learning) is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. These techniques replace manual feature engineering.

Regularization is a method for steering an ML model toward a preferred level of complexity, so that the model generalizes and overfitting or underfitting is avoided. This is done by adding a penalty on the different parameters of the model, thereby reducing the “freedom” of the model.

Statistical risk is a quantification of a situation’s risk using statistical methods. These methods can be used to estimate a probability distribution for the outcome of a specific variable.

Semi-Supervised Learning is a class of ML tasks and techniques that also make use of unlabeled data for training. Semi-Supervised Learning falls between supervised and unsupervised learning. For example, it applies when developing a model for a large bank that is intended to detect fraud, where some frauds are known and other frauds are unknown.

You can use a semi-supervised learning algorithm to label the data and retrain the model with the newly labeled dataset.

A way of measuring how closely two pieces of data resemble each other. This is helpful when dividing your data points into categories, creating clusters, or using the k-nearest neighbors algorithm.
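
Cosine similarity is one widely used measure of this kind; it is chosen here only as an example, and the vectors are made up. The sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])                 # 0
```

Values near 1 mean the two data points point in nearly the same direction; values near 0 mean they have little in common.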

A method of keeping files that are stored in several different physical locations up to date. Cloud and storage vendors often offer software that helps with this process. It is also commonly used for backup and for mobile access to files.

This is a useful tool when normalizing your data. The goal is to make the average zero and the standard deviation one. It takes three steps: First, for each feature determine the mean and standard deviation. Second, subtract the mean from each data point value. Finally, divide by the standard deviation.
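
The three steps can be sketched as follows for a single feature. The sample values are invented, and the sketch assumes the population standard deviation (dividing by n) and a feature that is not constant.

```python
import math

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population std)."""
    n = len(values)
    mean = sum(values) / n                                       # step 1: mean ...
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)    # ... and standard deviation
    return [(v - mean) / std for v in values]                    # steps 2 and 3

# This feature has mean 5 and standard deviation 2.
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The first value, 2.0, becomes (2 - 5) / 2 = -1.5, and the last value, 9.0, becomes 2.0.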

Stochastic Gradient Descent (SGD) is an iterative method for optimizing an objective function with suitable smoothness properties. It is called stochastic because the method uses randomly selected samples to evaluate the gradients. In the image below, the left image illustrates classical gradient descent, and the right image illustrates stochastic gradient descent.
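
As a hedged illustration, the sketch below uses SGD to fit a line y = a * x + b under squared error, updating the two parameters from one randomly ordered sample at a time. The function name, learning rate, epoch count and data are all assumptions chosen so that this toy problem converges.

```python
import random

def sgd_fit_line(xs, ys, learning_rate=0.01, epochs=500, seed=0):
    """Fit y = a * x + b by per-sample gradient steps on the squared error."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            error = (a * xs[i] + b) - ys[i]
            a -= learning_rate * 2 * error * xs[i]  # gradient of error**2 w.r.t. a
            b -= learning_rate * 2 * error          # gradient of error**2 w.r.t. b
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # points on y = 2x + 1
a, b = sgd_fit_line(xs, ys)
```

Classical gradient descent would instead average the gradient over all five samples before taking each step.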

A classifier that can (almost) accurately predict the value of data, as opposed to a weak classifier, which does a poor job of categorizing data. There are ways of combining the power of many weak classifiers to form one strong classifier, for example a random forest.

Supervised Learning is the ML task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

Support Vector Machine (SVM) is a supervised learning model. The SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. In the illustration below, the blue and red points are separated by a margin.

In the context of ML, the target is the output that corresponds to the input variables. It could be the individual classes that the input variables may be mapped to in the case of a classification problem, or the output value range in a regression problem.

A Test Set is a dataset that is independent of the training dataset but that follows the same probability distribution as the training set. The test set is a set of examples used only to assess the performance of a fully specified classifier and is held out during model training.

Tokenization is the act of breaking up a string of text into pieces such as words, keywords, phrases and other elements called tokens. The tokens become the input for another process like parsing and text mining.
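
A very simple word-level tokenizer can be sketched with a regular expression. The rule used here (lowercase, then keep runs of letters, digits and apostrophes) is an assumption for illustration; real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word-like pieces as a list of tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Tokens become the input for parsing, don't they?")
# ['tokens', 'become', 'the', 'input', 'for', 'parsing', "don't", 'they']
```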

In ML and natural language processing (NLP), a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering the hidden semantic structures in a body of text.

A phase in the process of developing a learning algorithm. In this phase we supply the learning algorithm with data which is also known as “training data” and the algorithm seeks patterns in the given data. At the end of the training phase, the developed hypothesis is able to make predictions.

An example in the training set.

The error on the training set of the data. After the phase of training, the model can be tested over the same set (the training set) and then returns a score which reflects “how good” the model is over this set.

Given a dataset, we would like to split it into several portions. Usually, the set is divided into a training set and a test set. The model is initially fit on the training dataset, which is a set of examples used to fit the parameters (weights, for example) of the model. The training set usually includes the larger part of the data set, as we want our model to learn as much as possible.

The task of transferring information from one machine learning task to another. The idea is to use one solved problem in order to solve another different but related problem. It can be used in multiple situations, such as transferring knowledge from the solution of a simpler task to a more complex one, or transferring knowledge from a task where there is more data to one where there is less data. Example: a model which can recognize cars could be applied when trying to recognize trucks.

TP is an outcome where the model correctly predicts the positive class. Example: a model is used to predict cancer, the model returns “cancer” and the patient really has cancer.

TN is an outcome where the model correctly predicts the negative class. Example: a model is used to predict cancer, the model returns “no cancer” and the patient really does not have cancer.

Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model, and this will be obvious because it performs poorly even on the training data.

An unlabeled example is an example which has no tag. Such an example can be labeled by a model after the model has been trained on some labeled data.

Unsupervised Learning is the process of learning from data that consists only of inputs, with no tags (labels). The goal is to model the underlying structure or distribution in the data in order to learn more about the data. It is called unsupervised because there is no “correct answer”. Unsupervised Learning problems can be grouped into Clustering and Association problems.

A learning algorithm which learns from unlabeled data. This family of algorithms is usually divided into 2 groups: 1) Clustering – a problem family where the algorithm aims to discover the inherent groupings in the data. Example: grouping customers by purchasing behavior. A popular algorithm in this family is k-means.

2) Association – a rule learning problem where the algorithm aims to discover rules that describe large portions of the data. Example: people that buy X also tend to buy Y.

The error on a validation set. When the dataset is big, we would sometimes like to divide the training into two phases: training and validation. Therefore, we create another subset of the data. After the training phase, the model can be tested over the validation set, and the error it returns over that set is called the validation loss.

An example in the validation set.

Given a dataset, we would like to split it into several partitions. If the dataset is large enough, we would sometimes like to divide it into training, validation and test sets. The validation set is another set, separate from the others, and its purpose is to test the trained model on data it was not fit on.
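
A minimal sketch of such a three-way split is shown here. The 60/20/20 proportions, the function name and the fixed random seed are assumptions for illustration, not a recommendation.

```python
import random

def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples, then cut them into train, validation and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

Shuffling before cutting keeps the three sets drawn from the same distribution, and no example appears in more than one set.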

In deep learning, the error is reduced using backpropagation which, as a result of the chain rule of derivatives, must multiply the gradients of successive layers together. If one of the layers has already maxed out and reached a gradient of (nearly) zero while the other layers are not yet fully trained, then you are in trouble: gradient descent can no longer improve the whole series of layers, since the product is (nearly) zero.
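
The effect of multiplying many small factors can be seen in a few lines. This is a simplified sketch: it assumes sigmoid activations and ignores the weight terms that also appear in the real chain-rule product.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the sigmoid; its largest possible value is 0.25, at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chained_gradient(pre_activations):
    """Product of one activation derivative per layer, as in the chain rule."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_derivative(z)
    return product

shallow = chained_gradient([0.0] * 2)   # 0.25 ** 2 = 0.0625
deep = chained_gradient([0.0] * 20)     # 0.25 ** 20, below 1e-11
saturated = chained_gradient([0.0, 10.0, 0.0])  # one saturated layer shrinks the product
```

Even in the best case for the sigmoid, 20 layers shrink the gradient by roughly twelve orders of magnitude, and a single saturated layer makes the whole product tiny.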

Variance is the expectation of the squared deviation of a random variable from its mean. Intuitively, it measures how far a set of random numbers is spread out from its average value. In the context of models, it is essentially a measurement of how inconsistent the accuracy of your model is when applied to various different data sets.

A Weak Classifier is defined to be a classifier which is only slightly correlated with the true classification, meaning that it classifies examples just a bit better than random guessing (for a binary problem, its error rate is a bit better than 0.5).

A computer can normally only recognize items if the items resemble what the computer has already seen. Zero shot learning is where the machine can recognize new never-before-seen objects, based on abstracting the features of items it knows about.

Apache Airflow is a workflow management system originally developed by Airbnb. Airflow is a platform to programmatically author, schedule and monitor workflows.

Open source link – https://github.com/apache/airflow

Kubernetes is an open-source container-orchestration system for automating application deployment, scaling and management. It works with a range of container tools, including Docker.

Open source link – https://github.com/kubernetes/kubernetes

Apache Spark is an open source general purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark includes a library called MLlib for machine learning.

Open source link – https://github.com/apache/spark

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet made by Uber. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

Open source link – https://github.com/horovod/horovod

DGX is a line of servers and workstations made by NVIDIA that specializes in GPU acceleration for deep learning applications. In the image below: NVIDIA DGX Station deep learning system.

Scikit-learn is a free Python library. It features various classification, regression and clustering algorithms and is designed to interoperate with the libraries NumPy and SciPy.

Open source link – https://github.com/scikit-learn/scikit-learn

TensorFlow (also known as TF) is an open source library for numerical computation across a range of tasks. It is a popular library for machine learning, and in particular deep learning.

Open source link – https://github.com/tensorflow/tensorflow

NumPy is a Python library which supports multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Open source link – https://github.com/numpy/numpy

ETL (extract, transform, load) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s).

When you want to introduce a new feature, but you are nervous that the performance might be damaged, you deploy the updated version only on a portion of your fleet. This way, you can choose to proceed only after having checked that everything still works smoothly.

Model Management is an essential part of any machine learning project. It refers to the training, maintenance, deployment, monitoring, organization and documentation of machine learning models. Poor model management can lead to poor model performance and can result in high maintenance costs.

A/B Testing is a randomized experiment with two variants, A and B. It involves the application of statistical hypothesis testing. It is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and to determine which of the two is more effective.

Continual Learning (CL) is the ability of a model to learn continually from a stream of data, building on what was learned previously, hence exhibiting positive transfer, as well as being able to remember previously seen tasks.

DAG – Directed Acyclic Graph. In mathematics, particularly graph theory, and computer science, a directed acyclic graph, is a finite directed graph with no directed cycles.

Amazon S3 (or Amazon Simple Storage Service) is a service offered by AWS (Amazon Web Services) that provides object storage through a web service interface.

Link – https://aws.amazon.com/s3/

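HDFS (Hadoop Distributed File System) is the distributed file system of Apache Hadoop, a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. HDFS stores the data across the machines of the cluster.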

Repository – https://git-wip-us.apache.org/repos/asf?p=hadoop.git

A bunch of open source tools that allow you to use the computing power of many machines, which can handle tons of data and fast computation. It works by splitting files into large chunks which it then distributes across nodes of a cluster.

Pandas is a Python library for data manipulation and analysis.

Open source link – https://github.com/pandas-dev/pandas

Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages.

JupyterLab is the next-generation user interface for Project Jupyter offering all the familiar building blocks of the classic Jupyter Notebook.

Open source link – https://github.com/jupyterlab/jupyterlab

RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics.

Open source link – https://github.com/rstudio/rstudio

TensorBoard is a suite of web applications for inspecting and understanding your TensorFlow runs and graphs.

Open source link – https://github.com/tensorflow/tensorboard

Voila converts Jupyter notebooks to standalone web applications and dashboards.

Why do you need it? It’s not ideal to share Jupyter notebooks with non-technical coworkers, for example, as they might not understand how to run a notebook in order to see the needed results. Voila solves this by converting the Jupyter notebook to a standalone web application.

Shiny is an open source R package that provides an elegant and powerful web framework for building web applications using R.

Dash is an open-source framework for building analytic web apps with Python and R.

Web Link – https://plot.ly/dash, Open Source – https://github.com/plotly/dash

Docker is a platform to develop, deploy and run applications inside containers. Docker also makes it possible to run multiple containers simultaneously on a host machine.

Nvidia-Docker is a toolkit which allows users to build and run GPU accelerated Docker containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.

Open source link – https://github.com/NVIDIA/nvidia-docker

Container Registry is a developer tool supported by Google. Container Registry is a single place to manage Docker images, perform vulnerability analysis and decide who can access what.

Distributed Training is the concept of training a model on multiple machines together. When the dataset is big, it might be computationally hard and time consuming to train a model on a single machine, so distributing the training phase across several machines can be more efficient.

Spot instances are spare compute capacity in the cloud, offered at a discount by cloud providers (e.g. Amazon) who want to sell their unused capacity.

Our blog about saving money in ML development – https://cnvrg.io/save-in-cloud-costs

An ML pipeline is the pipeline infrastructure necessary for ML in production. It consists of several components (see the image below). The term usually refers to the process of: Collecting data -> Preprocessing data -> Training the model -> Deploying the model.

An ML pipeline consists of several components, as the diagram shows. We’ll become familiar with these components later. For now, notice that the “Model” (the black box) is a small part of the pipeline infrastructure necessary for production ML.

A dataframe is a table, or two-dimensional array-like structure. It’s a collection of data organized into named columns, which in some frameworks is distributed across machines.

Data Catalog is a metadata management tool designed to help organizations find and manage large amounts of data – including tables, files and databases stored in their ERP, human resources, finance and e-commerce systems and other sources.

Cross Industry Standard Process for Data Mining (CRISP-DM) is an open standard process model that describes common approaches used by data mining experts. It is mostly used for analytics models. The process of data mining is usually broken into six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.

MLOps is a practice for collaboration and communication between data scientists and operations teams to help manage the production ML lifecycle. MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements.

DataOps is an automated process-oriented methodology, used by analytic and data teams to improve the quality and reduce the cycle time of data analytics. It is considered today an independent approach to data analytics. DataOps is not tied to a particular technology, architecture, language or framework.

Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze natural language data.

Computer Vision (CV) is a scientific field that deals with how computers can be made to gain understanding from images and videos. Computer vision tasks include methods for acquiring, processing, analyzing and understanding images and videos and extraction of details from them.

Signal Processing is a subfield of electrical engineering that focuses on analysing, modifying and synthesizing signals such as sound, images and biological measurements.

Recommendation System is a subclass of information filtering systems that seeks to predict the “rating” or “preference” a user would give to an item. They are primarily used in commercial applications. Commonly used by entertainment and sales companies like Netflix, YouTube and Amazon.

Speech Recognition is a subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.

Fraud detection is a set of activities undertaken to prevent money or property from being obtained through false pretenses. It’s applied in many industries such as banking or insurance.

Image Classification is the task of extracting information classes from an image. There are two types of classification: (1) supervised – where the user trains a classifier with data and then the classifier extracts details from the image, (2) unsupervised – the model classifies images to groups.

Text Classification (also known as text categorization) is the process of categorizing text into organized groups. By using NLP, text classifiers can automatically analyze text and then assign a set of pre-defined tags or categories based on content.

Forecasting is the process of making predictions of the future based on past and present data, most commonly by analysis of trends. It uses statistical methods applied to time series, cross-sectional or longitudinal data.

Anomaly Detection is a subfield of data mining. It means the identification of rare items, events or observations which raise suspicion by differing significantly from the majority of the data. Typically the anomalous items will translate to some kind of problem such as bank fraud, medical problems or errors in a text.