Statistical Learning and Data Mining

As a simplified example of character recognition, we will investigate the potential for
using the very simple K Nearest Neighbour (KNN) classifier to predict whether an input
image represents the handwritten digit 9 (nine) or 8 (eight). These digits were selected
because they are quite similar to each other: both have a rounded upper part, and their
lower parts differ only slightly (the nine looks like an almost complete eight with a small
gap on the left side).
Download NumberRecognition.mat from Moodle. Note that the downloaded data is already
divided into training and testing datasets. It also includes data samples for all
handwritten digits 0 to 9, but we will be using only 8 and 9 for this assignment. You can
implement your assignment in either Matlab or Python, with details to follow:
Matlab
You can load this data in Matlab by typing “load path\NumberRecognition.mat” into
your function, where path is the location on your computer where you placed the
NumberRecognition.mat file. All Matlab work is to be completed with functions, not with
scripts (you can look up the difference online). Thus each code file has to begin
with “function []=myFunctionName();”. The data loaded consists of a 3-dimensional array
of either 750 (training) or 250 (testing) 28x28 images of each digit. You can view the
individual images with the imagesc command. Use the reshape command to restructure
the 3D array into a 2D array with one flattened image per row, because KNN (and most
machine learning algorithms) accepts only a 2D array of measurements as input. To train a
KNN classifier, call
fitcknn. To predict with that classifier, call predict.
Python
You may wish to use functions from any of the following packages:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sbn # better plotting and aesthetics
from pathlib import Path # just a utility for better cross-platform file-loading
from scipy.io import loadmat
from sklearn.neighbors import KNeighborsClassifier
Specific functions of interest:
loadmat for loading Matlab files
reshape for restructuring a 3D array of 2D images into a 2D matrix
KNeighborsClassifier for training
predict for prediction
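The reshaping step can be sketched as follows. This is only an illustration under stated assumptions: loadmat returns a dictionary keyed by the variable names stored in the .mat file, and those names are not listed here, so the sketch uses a synthetic array instead of the real data. It also assumes the images are stacked along the last axis (28 x 28 x N); check the shape of the loaded arrays before relying on this. The function name flatten_images is hypothetical.

```python
import numpy as np


def flatten_images(stack):
    """Turn an (H, W, N) stack of images into an (N, H*W) matrix, one row per image."""
    height, width, n_images = stack.shape
    # Move the image index to the front, then unroll each image into a single row.
    return np.moveaxis(stack, -1, 0).reshape(n_images, height * width)


# Synthetic stand-in for one digit's training images (assumed layout: 28 x 28 x 750).
rng = np.random.default_rng(0)
fake_stack = rng.integers(0, 256, size=(28, 28, 750))

flat = flatten_images(fake_stack)
print(flat.shape)
```

Each row of the result can be reshaped back to 28x28 to confirm that the image content survived the restructuring.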
Note that as 4th year and graduate level students, you are expected to read code
documentation online and explore the use of these techniques prior to asking for help
from the professor.
Question 1: Build 20 KNN models with varying K = 1, 2, 3, …, 20 in a loop. Provide a plot
of testing error rate (as a percentage on the y axis) vs. K (x axis). Provide a printout of
your code (Matlab or python). Provide a printout of the plot. Answer the following
questions:
a) Why does testing error rise at high values of K?
b) What is the error rate at the lowest K? Do you expect this to be a reliable
performance estimate? Why?
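The general shape of such a loop might look like the sketch below. It runs on synthetic two-class data rather than the digit images, so the numbers it produces say nothing about the assignment's answers. The function name knn_error_rates and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_error_rates(X_train, y_train, X_test, y_test, k_values):
    """Return the test error rate (in percent) for each K in k_values."""
    errors = []
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        # Fraction of misclassified test samples, expressed as a percentage.
        errors.append(100.0 * np.mean(predictions != y_test))
    return errors


# Synthetic stand-in data: two overlapping Gaussian clouds labelled 8 and 9.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(1.5, 1.0, (100, 5))])
y_train = np.array([8] * 100 + [9] * 100)
X_test = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(1.5, 1.0, (50, 5))])
y_test = np.array([8] * 50 + [9] * 50)

k_values = list(range(1, 21))
errors = knn_error_rates(X_train, y_train, X_test, y_test, k_values)
for k, err in zip(k_values, errors):
    print(f"K={k:2d}  test error = {err:.1f}%")
```

The two lists can then be passed to plt.plot(k_values, errors), with K on the x axis and the error percentage on the y axis.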
It was previously announced on multiple occasions that each student is required to
assemble their own dataset compatible with supervised learning based classification
(i.e. a collection of measurements across many samples/instances/subjects that include
a group of interest distinct from the rest of the samples).
Question 2: Describe the dataset you have collected: total number of samples, total
number of measurements, brief description of the measurements included, nature of the
group of interest and what differentiates it from the other samples, sample counts for
your group of interest and sample count for the group not of interest. Write a program
that analyzes each measurement individually. For each measurement, compute Cohen’s
d statistic (the difference between the average value of the group of interest and the
average value of the group not of interest, divided by the standard deviation of the joint
distribution that includes both groups). Provide a printout of the 10 leading
measurements (d statistic furthest from zero), making it clear what those measurements
represent in your dataset (these are the measurements with the most obvious potential
to inform prediction in any given machine learning algorithm). Provide a printout of this
code.
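A minimal sketch of the d statistic as defined above (difference of group means divided by the standard deviation over all samples) is given below. Note that the conventional Cohen's d divides by the pooled within-group standard deviation; this sketch follows the definition in this assignment instead. The function names, the tiny example matrix, and the use of the sample standard deviation (ddof=1) are assumptions for illustration only.

```python
import numpy as np


def d_per_measurement(X, is_interest):
    """d for each column: group mean difference divided by the SD over all samples."""
    X = np.asarray(X, dtype=float)
    is_interest = np.asarray(is_interest, dtype=bool)
    mean_diff = X[is_interest].mean(axis=0) - X[~is_interest].mean(axis=0)
    joint_sd = X.std(axis=0, ddof=1)  # assumption: sample SD over both groups together
    return mean_diff / joint_sd


def leading_measurements(d_values, names, n=10):
    """Return up to n (name, d) pairs ordered by |d|, largest first."""
    order = np.argsort(-np.abs(d_values))[:n]
    return [(names[i], float(d_values[i])) for i in order]


# Tiny made-up example: 4 samples, 3 measurements, first two samples are the group of interest.
X = np.array([[10.0, 1.0, 3.0],
              [12.0, 2.0, 1.0],
              [0.0, 1.0, 2.0],
              [2.0, 2.0, 4.0]])
is_interest = np.array([True, True, False, False])
names = ["measurement_a", "measurement_b", "measurement_c"]

d_values = d_per_measurement(X, is_interest)
for name, d in leading_measurements(d_values, names):
    print(f"{name}: d = {d:+.3f}")
```

A measurement that is constant across all samples has zero standard deviation and would need to be handled separately before dividing.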
Question 3: Adapt your code from Question 1 to be applied to the dataset that you’ve
organized for yourself. You will need to first randomize your samples into training and
testing subsets, so that you can train your machine learning model as you did in
Question 1 – this only needs to be done once for this question (no repeat validation is
required at this time). Provide a printout of the plot and your code. Answer the following
question: is the profile of K vs. test error rate similar to or quite different from the digit
recognition example of Question 1? Elaborate on those similarities/differences. What
about your dataset may have contributed to what you observe in this plot?
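The one-time randomization into training and testing subsets could be sketched as below. The 25% test fraction is an assumption chosen to mirror the 750/250 split of the digit data, and the function name random_split and the placeholder arrays are hypothetical.

```python
import numpy as np


def random_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the samples once and split them into training and testing subsets."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # random ordering of the sample indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Placeholder dataset: 20 samples with 2 measurements each, alternating labels.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

X_train, y_train, X_test, y_test = random_split(X, y)
print(X_train.shape, X_test.shape)
```

Fixing the seed keeps the split reproducible, so the plot can be regenerated from the same training and testing subsets.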
Deadline: October 1st, 2019, in class