Machine Learning using scikit-learn!

07/05/2019

For those of you familiar with Python, I'm going to discuss how we can use the popular library scikit-learn to build our Machine Learning models.  If you're not familiar with Python, I HIGHLY RECOMMEND that you consider learning it; it's arguably the most popular programming language today, and an incredibly useful language for Data Science/Machine Learning (along with R).

Keep in mind that the format I'm about to demonstrate can be used for most ML models with only a few changes needed (at the end, I'll show how little changes when you swap in a different model):

1. Import all of your necessary libraries. It's good convention to do this at the very beginning to avoid confusion or appearing disorganized.

  • import numpy as np
  • import pandas as pd
  • from sklearn import datasets (gives us access to scikit-learn's built-in toy datasets, such as the iris dataset used in step 2)
  • from sklearn.preprocessing import StandardScaler (used to scale data, which is very important for most models; a small demo of what scaling does follows this list)
  • from sklearn.model_selection import train_test_split (used to split the sample data into "training" and "testing" data)
  • from sklearn.neighbors import KNeighborsClassifier (example of how we would import the model we desire; in this case we chose KNN)
  • from sklearn.metrics import classification_report, confusion_matrix (used to measure model performance)
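
Since scaling comes up again in step 2, here's a minimal sketch (using made-up toy data, not the iris dataset) of what StandardScaler actually does to each column:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical toy data: two feature columns on very different scales
    X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

    scaler = StandardScaler().fit(X)  # learns each column's mean and standard deviation
    X_scaled = scaler.transform(X)    # each column now has mean 0 and unit variance
    print(X_scaled)

After the transform, both columns sit on the same scale, so neither one dominates distance-based models like KNN.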

2. After making the necessary imports, we proceed with building our ML model (a complete, runnable version of these steps follows the list):

  • iris = datasets.load_iris() ----- Load our built-in dataset and assign it to a variable named "iris".
  • X, y = iris.data[:, :2], iris.target ----- Create X and y variables, which hold the feature samples (only the first two columns here) and the target labels, respectively.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) ----- split our data into training and testing sets. Essentially, we give the model training data to learn from, then test its performance on new, never-before-seen testing data (to avoid an overly optimistic estimate).
  • scaler = StandardScaler().fit(X_train) ----- instantiate a StandardScaler, fit it on the training data, and assign it to a variable named "scaler". We usually need to scale our feature columns so that they are on comparable ranges before feeding them to the model.
  • X_train = scaler.transform(X_train) ----- Transform the training data with the fitted scaler.
  • X_test = scaler.transform(X_test) ----- The test data as well...
  • knn = KNeighborsClassifier(n_neighbors=5) ----- Create a "knn" variable to instantiate our K-Nearest Neighbors model (with 5 neighbors in this case).
  • knn.fit(X_train, y_train) ----- fit the model to the training data.
  • y_pred = knn.predict(X_test) ----- make predictions on the unseen testing data.
  • classification_report(y_test, y_pred) ----- lastly, measure our model's performance using two metrics: the classification report and the confusion matrix. Each provides great detail about the model's accuracy using several measurements, including recall, precision, etc.
  • confusion_matrix(y_test, y_pred)
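
Putting step 2 together, here's the whole thing as one runnable script, a minimal sketch assembled from the lines above (with the two metrics wrapped in print() so the results actually show up when you run it; numpy and pandas aren't needed for this particular example, so they're left out):

    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report, confusion_matrix

    # Load the built-in iris dataset and keep only the first two features
    iris = datasets.load_iris()
    X, y = iris.data[:, :2], iris.target

    # Split the samples into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

    # Scale the features using statistics learned from the training data only
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Fit a K-Nearest Neighbors classifier with 5 neighbors
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # Predict on the unseen test data and report performance
    y_pred = knn.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))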

There you have it. That is the simplest form of using a statistical model in scikit-learn for Python.
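
To back up the earlier claim that this format carries over to most models with only a few changes, here's one more sketch. I'm using LogisticRegression purely as an illustration (it isn't part of the original example); notice that compared to the KNN script, only the model import and the two model lines change:

    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression  # the only new import
    from sklearn.metrics import classification_report, confusion_matrix

    iris = datasets.load_iris()
    X, y = iris.data[:, :2], iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Swap in a different classifier; the rest of the workflow is unchanged
    model = LogisticRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))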


Thanks for reading, and feel free to message me with any questions or concerns.

© 2019 Aladin's Data Science Blog. All rights reserved.