Machine Learning Cheat Sheet
A quick tour of helpful pointers to get your machine learning off the ground.
A) Data Preprocessing
Data preprocessing covers the operations that transform raw data into clean data from which algorithms can learn and produce proper results. It involves removing outliers, reducing noise, filling in missing values, and reducing the feature set for optimal model performance.
|Noisy Data||Noise is random error or variance in a measured variable; noisy data may also contain outlier values that deviate from the expected outcome.|
|Missing Value||A variable with no recorded value for some records in the dataset. Missing values can be filled in a) manually, b) with a constant value, or c) with the most probable value, identified using a decision-tree method, etc.|
|Feature||Any variable that describes a data point, also known as an attribute or dimension. For example, in a customer dataset with Customer Name (values = Bob, Sam, Jane) and Customer City (values = NY, DL, IL), both Customer Name and Customer City are features.|
|Scaling||Features in a dataset may span very different ranges of values, and the variable with the largest range can dominate an algorithm or skew the outcome, e.g. height (range 3 feet to 7 feet) versus weight (20 kg to 50 kg). Both variables must be on the right scale for the model to predict the required outcome correctly, and most algorithms expect variables to be in a common range. Two common approaches for bringing features onto a common scale are normalization and standardization.|
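The two scaling approaches named above can be sketched in plain Python; this is a minimal, library-free illustration, and the height values are made up for the example.

```python
def min_max_normalize(values):
    """Normalization: rescale values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: center to mean 0 and scale to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

heights_ft = [3.0, 5.0, 7.0]
print(min_max_normalize(heights_ft))  # [0.0, 0.5, 1.0]
print(standardize(heights_ft))        # mean of the result is 0
```

In practice scikit-learn's `MinMaxScaler` and `StandardScaler` do the same job on whole feature matrices.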
|Feature Engineering||The process of creating the most relevant features from the existing features in the dataset to improve the accuracy and performance of learning algorithms. It involves adding or discarding features, or deriving a new feature space.|
|Feature Selection||Selecting the subset of the original features that is most useful for training, ideally giving lower prediction error than the full model. In this process some variables are retained and others discarded. Common methods for selecting the best features are forward selection, backward elimination, etc.|
|Dimension Reduction||Detecting and removing irrelevant or redundant attributes to reduce model complexity.|
|Tokenization||A word (token) is the minimal unit that a machine can understand and process. Tokenization splits longer strings of text into smaller pieces, or tokens.|
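A minimal tokenizer can be sketched with a regular expression; real toolkits such as NLTK handle many more edge cases, so treat this as a toy version.

```python
import re

def tokenize(text):
    """Split a string into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Tokenization splits longer strings into smaller pieces."))
# ['tokenization', 'splits', 'longer', 'strings', 'into', 'smaller', 'pieces']
```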
|Stemming||The process of stripping affixes (suffixes, prefixes, infixes, circumfixes) from a word to obtain its stem. Stemmers are typically driven by a set of rules, e.g. removing the 'ing' from 'running' to get 'run'.|
|Lemmatization||Related to stemming, but able to capture a word's canonical form (base form) based on its lemma, e.g. better → good.|
|Stop words||Common words, typically articles, pronouns, and adverbs, that are filtered out before further text processing because they contribute little to overall meaning, e.g. the, a, an.|
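Filtering stop words is a one-line list comprehension; the tiny hand-picked list here is for illustration only, while real libraries ship lists of a hundred or more words.

```python
# A toy stop-word list; NLTK's stopwords corpus is far more complete.
STOP_WORDS = {"the", "a", "an", "in", "of", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "sat", "in", "a", "hat"]))
# ['cat', 'sat', 'hat']
```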
|Part-of-speech (POS) Tagging||POS tagging consists of assigning a category tag to the tokenized parts of a sentence. The most common tags identify words as nouns, verbs, adjectives, etc.|
|Bag of words||A piece of text (a sentence or a document) is represented as a bag, or multiset, of words, disregarding grammar and even word order; the frequency of occurrence of each word is used as a feature for training a classifier.|
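A bag-of-words representation is essentially a word-count dictionary, which Python's `collections.Counter` gives for free:

```python
from collections import Counter

def bag_of_words(tokens):
    """Represent text as a multiset: word -> occurrence count, order discarded."""
    return Counter(tokens)

doc = "to be or not to be".split()
print(bag_of_words(doc))  # counts: to=2, be=2, or=1, not=1
```

Note that "to be or not" and "not to be or" produce the same bag, which is exactly the order-discarding property described above.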
|Vector||In text classification, a sentence is first converted into a computer-understandable format, which can be thought of as a vector (array) of 0s and 1s, with each index representing a word in the training data.|
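The 0/1 encoding described above can be sketched directly; the vocabulary here is a made-up example.

```python
def to_binary_vector(tokens, vocabulary):
    """Encode text as a 0/1 vector: index i is 1 if vocabulary[i] occurs in the text."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocab = ["cats", "dogs", "like", "milk"]
print(to_binary_vector("cats like milk".split(), vocab))  # [1, 0, 1, 1]
```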
|NER (Named Entity Recognition)||The process of locating and classifying elements in text into predefined categories such as the names of people, organizations, places, monetary values, percentages, etc.|
|N-grams||Combinations of adjacent words or letters of length n in source text; 'n' refers to the number of words or word parts. For example, the word pairs (bigrams) in 'I work in ValueFirst' are 'I work', 'work in', and 'in ValueFirst'.|
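Generating n-grams is a short sliding-window loop; here it is applied to the document's own example sentence.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I work in ValueFirst".split()
print(ngrams(words, 2))
# [('I', 'work'), ('work', 'in'), ('in', 'ValueFirst')]
```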
|TF-IDF||Term Frequency – Inverse Document Frequency. The TF-IDF weight for a word increases with the number of times the word appears in a document, but decreases with how frequently the word appears across the whole document set.|
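A common form of the weighting is tf × log(N / df), where N is the number of documents and df the number of documents containing the term; this sketch uses that form (libraries such as scikit-learn apply smoothing variants).

```python
import math

def tf_idf(term, doc, corpus):
    """tf = count of term in doc; idf = log(N / number of docs containing term)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "slept"]]
print(tf_idf("cat", docs[0], docs))  # 1 * log(3/2), a positive weight
print(tf_idf("the", docs[0], docs))  # 0.0 -- 'the' appears in every document
```

The word appearing in every document gets weight zero, which is exactly the "decreases with frequency across the document set" behaviour described above.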
B) Machine Learning Algorithms
|Linear Regression||A supervised learning algorithm for finding the linear relationship between two continuous variables. After fitting that relationship, it predicts a real value for previously unseen inputs.|
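For a single feature, the least-squares fit has a closed form; this minimal sketch fits a line to made-up points that lie exactly on y = 2x + 1 and then predicts an unseen value.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)          # 2.0 1.0
print(slope * 10 + intercept)    # 21.0 -- prediction for x = 10
```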
|Logistic Regression||Used mainly for binary classification problems. It predicts the probability of an event from the values of independent variables, which can be categorical or numerical. For example, it can predict the chance that a patient has a particular disease given certain symptoms.|
|Naive Bayes||A classification technique based on Bayes' theorem. Parameter estimation for naive Bayes models uses the method of maximum likelihood. It assumes that the value of a particular feature is independent of the value of any other feature, given the class variable, and it is particularly suited to high-dimensional inputs.|
|SVM (Support Vector Machine)||The fundamental idea behind SVM is to separate the classes with a hyperplane (a straight line in two dimensions) that maximizes the margin, i.e. the distance between the separating hyperplane and the training samples closest to it. Commonly used in text classification problems.|
|Decision Tree||A supervised learning algorithm used for classification tasks. The dataset is repeatedly split on feature values into smaller sub-groups (decision nodes, where sub-nodes exist) that best separate the classes.|
|KNN||'K' is the number of neighbours, chosen by a distance metric, that influence the classification. A point is classified by majority vote of its neighbours and assigned to the class most common among its K nearest neighbours, as measured by a distance function. If K = 1, the point is simply assigned to the class of its nearest neighbour.|
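The majority-vote rule can be sketched in a few lines of plain Python; the two-dimensional points and colour labels are invented for the example, and squared Euclidean distance is used since it preserves the nearest-neighbour ordering.

```python
from collections import Counter

def knn_classify(point, training_data, k=3):
    """training_data: list of ((features...), label) pairs.
    Sort by squared Euclidean distance, then take a majority vote
    among the k nearest neighbours."""
    by_distance = sorted(
        training_data,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(point, item[0])),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

data = [((1, 1), "red"), ((1, 2), "red"),
        ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_classify((5.5, 5.0), data, k=3))  # 'blue'
print(knn_classify((0.0, 0.0), data, k=1))  # 'red' -- the K=1 case
```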
|Neural Network||Similar to linear/logistic regression, but with the addition of an activation function which makes it possible to predict outputs that are not linear combinations of inputs.|
|Gradient Descent / Stochastic Gradient Descent (SGD)||An optimization algorithm for finding the parameter values that give the highest accuracy on validation data. SGD, or stochastic gradient descent, is a form of gradient descent in which the parameters are updated for each training example rather than over the whole dataset at once.|
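The update rule is simply "move the parameter against the gradient of the cost". A minimal sketch on a hand-picked one-parameter cost, J(w) = (w − 3)², whose minimum is at w = 3:

```python
def gradient_descent(w=0.0, learning_rate=0.1, steps=100):
    """Minimize J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3)."""
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w -= learning_rate * gradient  # step against the gradient
    return w

print(round(gradient_descent(), 4))  # 3.0 -- converges to the minimum
```

SGD follows the same rule but estimates the gradient from one example (or a small batch) at a time.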
|Overfitting||When the model performs well on the training data but poorly on the test data. It generally happens when the model is too complex, i.e. has too many parameters.|
|Underfitting||When the model is unable to learn from the training data and performs poorly on both the training and test data.|
|Hyperparameter||A parameter of the learning algorithm that is fixed before the training process starts. Regularization is generally applied or controlled using hyperparameters. Values can be chosen by training models with different candidate settings and comparing them.|
|Sigmoid Function||A type of non-linear activation function used to squash the output into the range between 0 and 1.|
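The sigmoid is a one-line function; note how large positive and negative inputs saturate toward 1 and 0, which is what makes it useful for producing probabilities in logistic regression.

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))   # 0.5
print(sigmoid(6))   # close to 1 -- large inputs saturate
print(sigmoid(-6))  # close to 0
```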
|Bias||Measures the gap between the average prediction and the correct value if the model is rebuilt multiple times on different training datasets. A high-bias model is likely to underfit the training data.|
|Variance||Measures the variability of the model's predictions if the model is retrained multiple times, i.e. how sensitive the model is to small variations in the training data. A high-variance model is likely to overfit the training data.|
|Bias/Variance Tradeoff||Reducing bias tends to increase variance, and vice versa. Generally, increasing model complexity increases variance and reduces bias, while reducing model complexity reduces variance and increases bias. One way of finding a good bias-variance tradeoff is to tune the complexity of the model via regularization.|
|Weight||A type of parameter used in a neural network that determines the relevance of a feature (or input) for the next layer of the network.|
|Cost Function||The cost function, or error function, measures the error a model makes while being trained by a machine learning algorithm. The gradient descent algorithm optimizes the parameters by minimizing the cost function.|
|Confusion Matrix||Used to evaluate the performance of a classifier by comparing predictions with the actual targets: it counts how often each class is classified correctly and how often it is misclassified as each other class. For a binary classifier its cells are:
• True Positives (TP)
• True Negatives (TN)
• False Positives (FP)
• False Negatives (FN)|
|Accuracy||The fraction of samples classified correctly: the sum of correct predictions divided by the total number of predictions.|
|Precision||The fraction of positive predictions that are actually positive (where 'positive' denotes the class label of interest).|
|Recall||The fraction of actual positive instances that are correctly detected by the classifier.|
|F measure||Combines precision and recall as their harmonic mean, so it is always closer to the smaller of the two values. An F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.|
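The four confusion-matrix counts and the metrics above fit in one short function; the spam/ham labels and the example predictions are invented for the sketch.

```python
def classification_metrics(actual, predicted, positive="spam"):
    """Count TP/TN/FP/FN against the chosen positive class,
    then derive accuracy, precision, recall, and F1."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]
print(classification_metrics(actual, predicted))
# TP=2, TN=2, FP=1, FN=1; accuracy 4/6, precision 2/3, recall 2/3, F1 2/3
```

scikit-learn provides the same metrics ready-made in `sklearn.metrics` (`confusion_matrix`, `accuracy_score`, `precision_score`, `recall_score`, `f1_score`).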
|ROC (Receiver Operating Characteristic)||A commonly used graph that summarizes the performance of a binary classifier by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR), where FPR is the fraction of negative instances incorrectly classified as positive.|
E) Python Libraries
|Scikit Learn||Provides a wide range of machine learning algorithms.|
|NLTK||The Natural Language Toolkit provides an entire suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc.|
|Numpy||Package for array processing of numbers, strings, records, and objects.|
|Pandas||Powerful data structures for data analysis, data manipulation, time series, and statistics.|
|Matplotlib||A visualization library for Python, for plotting many different types of graph.|
Keep yourself updated with useful, bite-sized information that will help you with your AI learning initiatives. Reach us at firstname.lastname@example.org for any Martech enquiries.