
Write your First Machine Learning Model

A crash course in making a classifier model in 13 lines of code!


So you want to build your first machine learning model? In data science, we use data to help us make predictions based on what we already know. We’ll use as few lines of code as possible: 13 lines total, or just 9 if you don’t count imports!

In this step-by-step tutorial you will:

  1. Import the data
  2. Create features and target variables
  3. Train/Test Split the data
  4. Fit a classification model
  5. Evaluate your model

Requirements

  1. Python 3.7
  2. NumPy 1.16
  3. SciPy 1.2
  4. scikit-learn 0.21

Note: I’m sure older versions of these packages will work too; these are just the ones installed on my machine.

Preface — The Data

The iris dataset is one of the most widely used beginner datasets in machine learning. It is the “hello world” of ML datasets and a great starting point for building your first model. The 150-sample dataset covers three different species of iris flower. There are 50 samples of each species, and every sample contains four features: sepal length, sepal width, petal length, and petal width. We will use these features to train a model and, hopefully, predict which species a sample comes from given its four measurements.

Iris setosa, Iris versicolor, Iris virginica

1. Import the Data

Our first step is to import the iris dataset from scikit-learn (sklearn). We will set the dataset as a dictionary variable, which we’ll call iris.

from sklearn import datasets
iris = datasets.load_iris()

Usually, we would explore the data and use our intuition to clean it. It is not necessary to clean the iris dataset (and, to be honest, that would require more lines of code, and I promised we could do it in 13!). For more information on the iris dataset, check out the Wikipedia article.
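If you do want to peek at what load_iris returns before moving on, a few optional print statements will do (these don’t count toward our 13 lines):

print(iris.keys())            # includes 'data', 'target', 'feature_names', 'target_names'
print(iris['feature_names'])  # the four measurements
print(iris['target_names'])   # the three species
print(iris['data'].shape)     # (150, 4) -> 150 samples, 4 features each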

2. Create features and target variables

From the iris variable, let’s get the data we need. In supervised machine learning, we denote our feature matrix as an uppercase X and our target as a lowercase y.

X = iris['data']
y = iris['target']

Note: a caveat of using any model is that every data point has to be numeric. Converting all your data to numbers can be a long and expensive process. Luckily for us, the iris dataset is conveniently stored as numbers in an array, so we can go ahead with the next step (it won’t always be this easy).
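If you’d like to confirm that for yourself, a quick optional check shows both arrays are already numeric:

print(X.shape, X.dtype)  # (150, 4), floating-point measurements
print(y.shape, y.dtype)  # (150,), integer labels 0, 1, 2 for the three species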

3. Train/Test Split the data

With a given set of data, we will randomly split the data into two parts: the training set and the testing set. We will train our model on the training set and then test it on the testing set (genius naming conventions, folks). The purpose of the train/test split is to help evaluate our model: since we already know the target values of the testing set, we can compare them against our model’s predictions on that set. We will go over that later in the tutorial.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.33,
                                                    stratify=y,
                                                    random_state=42)
* (this is only two lines of code, don't @ me)

Let’s take a deeper look at the code. First, we import train_test_split from scikit-learn’s model_selection module. Next, our X variable is split and assigned to X_train and X_test; our y variable is likewise split into y_train and y_test.

Scikit-learn’s train_test_split function can also be configured with additional parameters. By default, scikit-learn splits the data into 75% training and 25% testing. Since we’re working with a dataset of only 150 points, let’s decrease the training split to 67% of the data and increase the test split to 33% with test_size=.33. The stratify parameter ensures that the y splits have the same class ratios as the original target variable. We also set a random_state so we can reproduce the same split later.
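If you’re curious, you can verify what stratify did with an optional check (using NumPy, which you already have installed):

import numpy as np
print(np.bincount(y))        # [50 50 50] -> 50 samples of each species overall
print(np.bincount(y_train))  # roughly two-thirds of each species
print(np.bincount(y_test))   # roughly one-third of each species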

4. Fit a classification model

In supervised learning, there are two types of problems we want to solve: regression and classification. For this problem, we want to classify our predictions. There are many classification models to choose from, but here we’re going to use k-Nearest Neighbors.

A (very) quick rundown of k-Nearest Neighbors: it uses distance to other data points to make a prediction. We pick a number (k) and find that many closest data points (the nearest neighbors) for the point we’re trying to predict. We then take the majority class among those neighbors and classify the point as such. For example, with k=5 neighbors, if most of a point’s five nearest neighbors belong to one species, we would classify that point as that species.
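To make that concrete, here is a minimal from-scratch sketch of the majority-vote idea (illustration only; the helper name knn_predict_one is made up, and in a moment we’ll use scikit-learn’s built-in implementation instead):

import numpy as np

def knn_predict_one(X_train, y_train, x_new, k=5):
    # distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among those neighbors' labels
    return np.bincount(y_train[nearest]).argmax()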

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

Now for the actual machine learning part. While the name machine learning can sound complicated, it can mostly be broken down into three lines of code: importing the model, instantiating the model, and fitting the model.

First, we’ll import the KNeighborsClassifier model from sklearn.neighbors. Next, we’ll instantiate the model and call it knn (with the default of k=5 neighbors). Finally, we’ll fit the model on our features X_train and our target variable y_train. And that’s it! Your model is trained and can now make predictions on data it hasn’t seen!
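As an optional extra (not one of the 13 lines), you can ask the trained model about a single flower and translate the numeric answer back into a species name:

sample = X_test[:1]                      # the four measurements of one flower
prediction = knn.predict(sample)         # a numeric label, e.g. array([1])
print(iris['target_names'][prediction])  # the species name, e.g. ['versicolor']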

5. Evaluate your model

So now that we have a trained model, we need to put it to the test. Since we split our data earlier, we can leverage the fact that we already know what the correct answers should be.

y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)

First, we will create two variables, y_train_pred and y_test_pred. We will use our knn model to predict the target variable from X_train and X_test, respectively.

from sklearn.metrics import accuracy_score
print(f'training score = {accuracy_score(y_train, y_train_pred)} \ntesting score = {accuracy_score(y_test, y_test_pred)}')
* also just 2 lines

From sklearn.metrics we’ll import accuracy_score. What accuracy_score does is take our actual target values and compare them to our predictions: it simply takes the number of correct guesses and divides it by the total number of samples in the set, giving us a percentage. Let’s see what our output is:

training score = 0.96 
testing score = 0.98

Wow! This means our model is 96% accurate on the training set and 98% accurate on the testing set. These numbers are very good. Since we used a very small dataset, I would be wary about how accurate our model is on additional, unseen data points. But the fact of the matter is that we’ve made our first machine learning model!
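If you’d like to double-check those numbers, remember that accuracy is just correct guesses divided by total samples; here’s an optional manual computation using the variables from the previous steps:

import numpy as np
print(np.mean(y_train == y_train_pred))  # fraction of correct training predictions
print(np.mean(y_test == y_test_pred))    # fraction of correct testing predictions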

Is that it?!

No. Far from it! In this tutorial, we’ve glossed over and skipped many important concepts. We should have explored our data and made sure it would work with our model. We could have used Pandas to convert our data to a DataFrame and manipulate it. We should have cross-validated our train/test split. We could have fine-tuned our model or tried several other machine learning models. We could have evaluated our model with an f1-score, a confusion matrix, or an ROC AUC.
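As a small taste of those next steps, here is a sketch of two of the tools mentioned above, cross_val_score and confusion_matrix (both from scikit-learn), applied to the same iris data:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# 5-fold cross-validation: fit and score the model on five different splits
print(cross_val_score(KNeighborsClassifier(), X, y, cv=5))

# confusion matrix: how often each true species was predicted as each species
print(confusion_matrix(y_test, y_test_pred))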

It’s okay if you don’t know what any of that means. But you should now have a decent idea of the very basics of fitting and testing a model. You’ve just taken your first step toward becoming a data scientist! If you want more information on the topics I’ve brought up, check out the Additional Resources links at the end of the post.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.33,
                                                    stratify=y,
                                                    random_state=42)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)
print(accuracy_score(y_train, y_train_pred),
      accuracy_score(y_test, y_test_pred))

Additional Resources

Pandas

Train/Test Split and Cross Validation

Confusion Matrix

AUC — ROC Curve
