Data Transformers are Python classes that modify data according to a specific requirement. Data Transformers in scikit-learn include SimpleImputer, StandardScaler, LabelEncoder, and many more. SimpleImputer fills in the missing values in data columns, StandardScaler standardizes column values to zero mean and unit variance, and LabelEncoder maps categorical values (typically target labels) to integers. Every Data Transformer modifies the data according to the purpose it was made for.
Data Transformers usually have methods such as fit and transform, so what are those? Let us have a look at them!
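To make the fit/transform idea concrete, here is a minimal sketch in plain Python. MeanImputer is a hypothetical class written only for illustration; it is not scikit-learn's SimpleImputer, it just follows the same pattern of learning a statistic in fit and applying it in transform.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical transformer that replaces None with the column mean."""

    def fit(self, values):
        # Learn the statistic from the observed (non-missing) values only.
        observed = [v for v in values if v is not None]
        self.mean_ = mean(observed)
        return self  # returning self allows chaining fit(...).transform(...)

    def transform(self, values):
        # Apply the statistic learned in fit; nothing is re-learned here.
        return [self.mean_ if v is None else v for v in values]


train = [1.0, None, 3.0]
test = [None, 10.0]

imputer = MeanImputer().fit(train)
train_filled = imputer.transform(train)  # missing value filled with the train mean
test_filled = imputer.transform(test)    # test data reuses the train mean

print(train_filled)
print(test_filled)
```

Note that the test column is filled with the mean learned from the training column, which is the usual reason fit and transform are kept as separate steps.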
A Confusion matrix (also called an Error matrix) is used to analyze how well Classification Models (like Logistic Regression, Decision Tree Classifier, etc.) perform. Why do we analyze the performance of the models? Analyzing performance helps us find and reduce bias and variance problems, if they exist, and it also helps us fine-tune the model so that it produces more accurate results. The Confusion Matrix is usually applied to Binary classification problems but can be extended to Multi-class classification problems as well.
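As a rough illustration, the four cells of a binary confusion matrix can be counted by hand. The function name confusion_counts and the sample labels below are made up for this sketch; class 1 is assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Made-up labels, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)

print(counts)
print(accuracy)
```

Metrics such as accuracy, precision, and recall are all derived from these four counts.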
Clustering is an unsupervised technique in which similar data points are grouped together to form a cluster. A cluster is considered good if the intra-cluster similarity (between data points within the same cluster) is high and the inter-cluster similarity (between data points in different clusters) is low. Clustering can also be viewed as a Data Compression technique in which the data points of a cluster are treated as a group. Clustering is also called Data Segmentation because it partitions the data such that each group of similar data points forms a cluster.
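The intra-cluster versus inter-cluster idea can be checked numerically. The sketch below uses Euclidean distance as an inverse stand-in for similarity (small distance means high similarity); the two clusters and the helper names are invented for the example.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

# Two made-up clusters of 2-D points, for illustration only.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]


def mean_intra_distance(cluster):
    # Average distance between every pair of points inside one cluster.
    return mean(dist(p, q) for p, q in combinations(cluster, 2))


def mean_inter_distance(first, second):
    # Average distance between points taken from two different clusters.
    return mean(dist(p, q) for p, q in product(first, second))


intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)

print(intra_a, intra_b, inter_ab)
```

For a good clustering, both intra-cluster distances should come out clearly smaller than the inter-cluster distance.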
Machine Learning is an enticing field of study. It is a subset of Artificial Intelligence, and its algorithms revolve around mathematics. Machine Learning has transformed many industries, including healthcare, finance, transportation, manufacturing, advertising, and automotive. Its impact is such that many of these industries now depend heavily on Machine Learning.
The term “Machine Learning” was coined by Arthur Samuel in 1959. Arthur Samuel described Machine Learning as
“Field of study that gives computers the ability to learn without being explicitly programmed”.
Machine Learning Algorithms are very…
A Cost function is used to gauge the performance of a Machine Learning model. Without a Cost function, a model has no measure of how wrong its predictions are and therefore nothing to optimize. A Cost function compares the predicted values with the actual values and summarizes the difference as a single number. An appropriate choice of Cost function contributes to the credibility and reliability of the model.
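As one common example of a Cost function, here is a small sketch of Mean Squared Error written by hand. The numbers are made up, and MSE is only one of many possible choices.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up targets and two sets of predictions, for illustration only.
actual = [3.0, 5.0, 2.0, 7.0]
close_predictions = [2.5, 5.0, 2.0, 8.0]
poor_predictions = [1.0, 7.0, 4.0, 4.0]

close_cost = mean_squared_error(actual, close_predictions)
poor_cost = mean_squared_error(actual, poor_predictions)

print(close_cost)
print(poor_cost)
```

The predictions that sit closer to the actual values get the lower cost, which is exactly the signal a training procedure tries to minimize.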
Apache Spark is an open-source analytics engine and cluster-computing framework for large-scale data processing. The project describes Spark as a lightning-fast unified analytics engine. Spark is written primarily in Scala.
Spark is widely used in Big Data and Machine Learning for analytical purposes. Spark has been adopted by various companies like Amazon, eBay, and Yahoo.
An Artificial Neural Network (ANN) is a deep learning algorithm that emerged and evolved from the idea of the Biological Neural Networks of the human brain. An attempt to simulate the workings of the human brain culminated in the emergence of ANNs. An ANN works similarly to a biological neural network but does not exactly replicate its workings.
A plain ANN accepts only numeric, structured data as input. For unstructured and non-numeric data, Convolutional Neural Networks (CNN) are typically used for images, and Recurrent Neural Networks (RNN) for text and speech. In this post, we concentrate only on Artificial Neural Networks.
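To show what "numeric input" means in practice, here is a minimal sketch of a forward pass through a tiny network with one hidden layer. The weights, biases, and layer sizes are arbitrary values chosen for illustration, and the sigmoid activation is an assumption; nothing here is trained.

```python
from math import exp


def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + exp(-x))


def dense(inputs, weights, biases):
    # One output per weight row: weighted sum of inputs plus bias, then sigmoid.
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Arbitrary, untrained parameters (assumptions for this sketch).
features = [0.5, -1.0]                      # numeric input features
hidden_weights = [[0.1, 0.4], [-0.3, 0.2]]  # 2 inputs -> 2 hidden neurons
hidden_biases = [0.0, 0.1]
output_weights = [[0.7, -0.5]]              # 2 hidden neurons -> 1 output
output_biases = [0.05]

hidden = dense(features, hidden_weights, hidden_biases)
output = dense(hidden, output_weights, output_biases)

print(hidden)
print(output)
```

Every step is plain arithmetic on numbers, which is why non-numeric inputs have to be converted to numeric form before a network like this can use them.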
Categorical feature encoding is an important step in data preprocessing. It is the process of converting categorical features into numeric features. Categorical variables are also called Qualitative variables. The results produced by a model can vary when different encoding techniques are used.
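As a small sketch of two common encodings, the snippet below converts a made-up color column to integers and to one-hot vectors in plain Python. The column values and variable names are invented, and the integer assigned to each category simply follows alphabetical order.

```python
# Made-up categorical column, for illustration only.
colors = ["red", "green", "blue", "green"]

# Build a stable category -> integer mapping (alphabetical order).
categories = sorted(set(colors))
index = {category: i for i, category in enumerate(categories)}

# Integer (label) encoding: one integer per category.
integer_encoded = [index[c] for c in colors]

# One-hot encoding: one 0/1 column per category.
one_hot_encoded = [
    [1 if index[c] == i else 0 for i in range(len(categories))]
    for c in colors
]

print(categories)
print(integer_encoded)
print(one_hot_encoded)
```

The two encodings feed different numbers to the model for the same column, which is one reason results can differ between encoding techniques.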
Two types of categorical features exist,
Dimensionality reduction can help reduce overfitting.
The problem of missing data is prevalent in most research areas. Missing data causes various problems.
Each of these problems may lead to spurious conclusions that reduce the reliability of the model.
Aspiring Machine Learning Practitioner 👨🏻💻 💻