CS 3262 - Final Exam - Fall 2023
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Welcome to your final exam!
Logistics:
Open book/open note/no internet
You are not allowed to discuss the exam with each other
All questions about the exam will come to me, through email. Do not send any public messages to me, or each other about the exam.
If there are any clarifications required, I will post them on brightspace and update this document.
A note on the kinds of answers I expect: As is our style on HW and in class, many of these questions are open ended and are not asking you to repeat what you've read or heard in class. On the contrary, if I read my own words (or a texts) I will mark that down! I expect you to demonstrate your original thoughts. Almost none of these questions require 3-word answers (some do though, those should be clear by the question!).
Having said that, I also don't want you to start just typing out vocabulary words that we've used in class.
Tip: If you feel you can't answer a question, skip it and come back. Sometimes reading the entire thing will help clarify the individual parts. If all else fails, I will award partial credit for effort, and a clear explanation of what you're confused about and why.
Try and explain your confusion!
Changelog
Note: This is version 1, updated on 2023-12-11
Notebook Setup
# imports
import numpy as np
import matplotlib.pyplot as plt
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
import seaborn as sns
import pandas as pd
import sklearn as sk
# styling additions
from IPython.display import HTML
style = '''
'''
HTML(style)
Problem 0 - Decision Trees/Random Forests
This problem will use two extra packages to make some nice visualizations of our trees!
Uncomment and run this cell to install these packages:
#!pip install dtreeviz
Collecting dtreeviz
Downloading dtreeviz-2.2.2-py3-none-any.whl (91 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91.8/91.8 kB 2.2 MB/s eta 0:00:00
Requirement already satisfied: graphviz>=0.9 in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (0.20.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (1.5.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (1.23.5)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (1.2.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (3.7.1)
Requirement already satisfied: colour in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (0.1.5)
Requirement already satisfied: pytest in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (7.4.3)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (4.46.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (1.4.5)
Just as in the homeworks, if you run into errors when running 'from dtreeviz import clfviz', you can replace it with 'from dtreeviz import decision_boundaries':
import dtreeviz
#### approach 1: originally 'from dtreeviz import clfviz'; if that doesn't work, replace it with 'from dtreeviz import decision_boundaries' (replacement already made below)
from dtreeviz import decision_boundaries
#### approach 2:
from dtreeviz import decision_boundaries
Now we're ready. Let's start with the wine dataset we used in class:
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
X.shape
(178, 13)
This dataset has 13 features:
wine.feature_names
['alcohol',
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline']
Let's pick a subset for easy plotting:
X = X[:,[12,6]]  # keep only proline (index 12) and flavanoids (index 6)
y = wine.target
Now we're ready!
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=20, n_jobs=-1)
rf.fit(X, y)
▾ RandomForestClassifier
RandomForestClassifier(min_samples_leaf=20, n_estimators=50, n_jobs=-1)
#### approach 1: using clfviz to visualize the boundary
#### if it's not working, try approach 2
fig,axes = plt.subplots(1,1,dpi=300)
clfviz(rf, X, y, ax=axes,
# show classification regions not probabilities
show=[ 'instances', 'boundaries', 'misclassified'],
feature_names=[ 'proline', 'flavanoid']);
NameError Traceback (most recent call last)
in ()
3
4 fig,axes = plt.subplots(1,1,dpi=300)
----> 5 clfviz(rf, X, y, ax=axes,
6 # show classification regions not probabilities
7 show=[ 'instances', 'boundaries', 'misclassified'],
NameError: name 'clfviz' is not defined
#### approach 2:
fig,axes = plt.subplots(1,1,dpi=300)
decision_boundaries(rf, X, y, ax=axes,
# show classification regions not probabilities
show=[ 'instances', 'boundaries', 'misclassified'],
feature_names=[ 'proline', 'flavanoid'])
Pause-and-Ponder: Below, regenerate the above analysis for different values of:
min_samples_leaf
max_depth
n_estimators
Investigate their effect on the decision boundary!
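Here is a minimal sketch of one way to run this sweep, reusing the decision_boundaries call from approach 2 above; the specific parameter values are my own illustrative choices, not prescribed ones.
# Sketch: vary one hyperparameter at a time (holding the others at their defaults)
# and redraw the decision boundary for each setting. Values are illustrative.
param_settings = {
    'min_samples_leaf': [1, 5, 20, 50],
    'max_depth': [1, 3, 5, None],
    'n_estimators': [1, 10, 50, 200],
}
for name, values in param_settings.items():
    fig, axes = plt.subplots(1, len(values), figsize=(4 * len(values), 4), dpi=150)
    for ax, value in zip(axes, values):
        rf = RandomForestClassifier(n_jobs=-1, **{name: value})
        rf.fit(X, y)
        decision_boundaries(rf, X, y, ax=ax,
                            show=['instances', 'boundaries', 'misclassified'],
                            feature_names=['proline', 'flavanoid'])
        ax.set_title(f'{name}={value}')
    plt.show()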
BONUS B1 - PCA
For this bonus problem, run PCA on the full wine dataset we imported above! (meaning you don't have to split your data into training and test)
from sklearn.decomposition import PCA
Note: I have intentionally not given you a code example for this problem! Try reading the sklearn documentation and use what we currently know to see how to specify a PCA yourself!
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
import pandas as pd
wine = load_wine()
X = wine.data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_df = pd.DataFrame(data=X_pca, columns=[ 'PC1', 'PC2'])
print(pca_df)
            PC1        PC2
0    318.562979  21.492131
1    303.097420  -5.364718
2    438.061133  -6.537309
3    733.240139   0.192729
4    -11.571428  18.489995
..          ...        ...
173   -6.980211  -4.541137
174    3.131605   2.335191
175   88.458074  18.776285
176   93.456242  18.670819
177 -186.943190  -0.213331

[178 rows x 2 columns]
Pause-and-Ponder: Comment below on the quality of the fit! How did PCA do on this dataset? Give a good answer here!
PC1 captures far more of the variance than PC2: the components are ordered by explained variance, and on this unscaled data the first component dominates (driven largely by the large-magnitude proline feature), so the 2D projection is spread out almost entirely along the first axis and PC2 adds comparatively little.
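One quick way to back this up (my own check, not code given in the exam) is to print the explained variance ratio of the PCA fitted above:
# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print('total captured by 2 components:', pca.explained_variance_ratio_.sum())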
Pause-and-Ponder: Explain what exactly PCA is doing to our dataset. How is it different than linear regression? Comment below!
PCA is a method used to condense the information contained in a dataset with many variables into a smaller set of new variables, known as principal components. These components are ranked such that each subsequent component has the highest possible variance under the constraint that it is orthogonal to the preceding components.
In contrast to PCA, linear regression is a predictive technique that requires both input and output variables. It attempts to predict the value of a dependent variable, based on one or more independent variables, by fitting a linear equation to observed data.
Key distinctions between the two methods include:
PCA operates without guidance from an output variable, aiming to simplify the data structure through variance. It is a technique for feature extraction and dimensionality reduction.
Linear Regression works under supervision, employing a target output to shape its predictions. It is a method for understanding the relationship between inputs and outputs within the dataset.
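To make the contrast concrete, here is a small illustrative sketch (my own, with arbitrarily chosen feature indices): PCA finds the direction of maximum variance treating both columns symmetrically, while linear regression minimizes vertical errors when predicting one column from the other.
from sklearn.linear_model import LinearRegression
# Two wine features, chosen arbitrarily for illustration
x = wine.data[:, 6]   # flavanoids
z = wine.data[:, 5]   # total_phenols
pair = np.column_stack([x, z])
# PCA: unsupervised direction of maximum variance in the (x, z) plane
direction = PCA(n_components=1).fit(pair).components_[0]
# Linear regression: supervised fit predicting z from x
reg = LinearRegression().fit(x.reshape(-1, 1), z)
print('PCA first-component direction (x, z):', direction)
print('Regression slope (dz/dx):', reg.coef_[0])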
Bonus B2 - KMeans
For this problem, run K-Means on your PCA dimensionality-reduced data for each of the following:
2 components
5 components
10 components
And give the three plots!
Note: I have intentionally not given you a code example for this problem! Try reading the sklearn documentation and use what we currently know to see how to specify a KMeans yourself!
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import pandas as pd
# 2 components
wine = load_wine()
X = wine.data
pca_2 = PCA(n_components=2)
X_pca_2 = pca_2.fit_transform(X)
kmeans_2 = KMeans(n_clusters=3, random_state=42)
clusters_2 = kmeans_2.fit_predict(X_pca_2)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca_2[:, 0], X_pca_2[:, 1], c=clusters_2)
plt.title( 'K-Means with 2 PCA Components')
plt.xlabel( 'PC1')
plt.ylabel( 'PC2')
plt.show()
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning:
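Aside (my reading, since the warning text is truncated above): this FutureWarning comes from KMeans and, in scikit-learn versions around 1.2-1.3, concerns the default value of n_init changing in a future release. Passing n_init explicitly should silence it, e.g.:
# assumption: the warning is about n_init; keep the long-standing default of 10 restarts explicitly
kmeans_2 = KMeans(n_clusters=3, n_init=10, random_state=42)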
#5 components
pca_5 = PCA(n_components=5)
X_pca_5 = pca_5.fit_transform(X)
kmeans_5 = KMeans(n_clusters=3, random_state=42)
clusters_5 = kmeans_5.fit_predict(X_pca_5)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca_5[:, 0], X_pca_5[:, 1], c=clusters_5)
plt.title( 'K-Means with 5 PCA Components')
plt.xlabel( 'PC1')
plt.ylabel( 'PC2')
plt.show()
#10 components
pca_10 = PCA(n_components=10)
X_pca_10 = pca_10.fit_transform(X)
kmeans_10 = KMeans(n_clusters=3, random_state=42)
clusters_10 = kmeans_10.fit_predict(X_pca_10)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca_10[:, 0], X_pca_10[:, 1], c=clusters_10)
plt.title( 'K-Means with 10 PCA Components')
plt.xlabel( 'PC1')
plt.ylabel( 'PC2')
plt.show()
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning:
Pause-and-Ponder: How are KMeans and KNN different? How are they similar? Explain!
Differences:
1. Purpose and Application:
KMeans is an unsupervised learning algorithm used for clustering. It groups data into a specified number K of clusters based on feature similarity.
KNN is a supervised learning algorithm used for classification or regression. In classification, it predicts the class of a data point by looking at the K nearest labeled data points and taking a majority vote.
2. Learning Method:
KMeans learns by iteratively updating the centroids of clusters until convergence. It does not use labeled data; the algorithm organizes data into clusters based on feature similarity alone.
KNN does not have an explicit training phase. It makes predictions based on the labels of the nearest neighbors in the feature space. Each query involves analyzing the entire training set (or a significant portion of it) to find the K nearest neighbors.
Similarities:
1. Parameter K:
Both algorithms use a parameter K, but its meaning and purpose are different in each. In KMeans, K represents the number of clusters, while in KNN, K represents the number of nearest neighbors to consider for making predictions.
2. Reliance on Distance Metrics:
Both KMeans and KNN rely on distance metrics to measure similarity or proximity. In KMeans, this is used to assign points to the nearest cluster centroid. In KNN, it's used to find the nearest neighbors.
3. Feature Space Analysis:
Both algorithms operate in the feature space and perform some form of grouping based on feature similarity, although the way they use this information differs significantly.
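A small code sketch of the practical difference on the wine data (my own illustration, not part of the required answer): KMeans is fit on the features alone and K is the number of clusters, while KNN needs labels at fit time and K is the number of neighbors consulted at prediction time.
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# KMeans: unsupervised -- only the features go in; K = number of clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
print('cluster assignments for new points:', km.predict(X_test)[:5])
# KNN: supervised -- features AND labels go in; K = number of neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('predicted classes for new points:  ', knn.predict(X_test)[:5])
print('test accuracy:', knn.score(X_test, y_test))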