ML101 QZ#1 2024

Date and Location

  • Date: 29 Oct (Tue) 09:00 - 10:50

  • Location: Classroom

Notice

  • QZ #1 covers the concepts of ML and classification models (not regression models).

  • Quiz will be administered through Google Forms.

  • Please bring your laptop for the quiz.

  • You are allowed to access any information through the Internet.

  • However, communication with others is strictly prohibited.

  • Do not use any messaging apps (e.g., KakaoTalk, TikTok, Line, WeChat, etc.) during the quiz.


Submit your answers

Solve the problems and submit your answers by entering them in the Google Form at the link below.

https://forms.gle/8AjmrYpnnQLo83RH9


Data

We are going to use a dataset named ‘penguins’ from the ‘palmerpenguins’ package. The dataset contains different body measurements for three species of penguins from three islands in the Palmer Archipelago, Antarctica. The penguins dataset is useful for learning R, because it contains multiple kinds of data (both categorical and numeric variables).


QZ

Part I. Data Import & Exploration

Let’s import the “palmerpenguins” data (use the code below).

# Install palmerpenguins package (if required)
if(!require(palmerpenguins)){
  install.packages("palmerpenguins")
}
Loading required package: palmerpenguins
# import libraries
library(palmerpenguins)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(penguins)

Let’s have a glimpse of the data

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

The penguins dataset consists of 344 observations and the following 8 variables:

  1. species: A factor with three levels - Adelie, Chinstrap, and Gentoo. These are the three penguin species under study.

  2. island: A factor with three levels - Biscoe, Dream, and Torgersen. These are the islands in the Palmer Archipelago where the penguins were observed.

  3. bill_length_mm: A numeric variable representing the length of the penguin’s culmen (bill) in millimeters.

  4. bill_depth_mm: A numeric variable representing the depth of the penguin’s culmen (bill) in millimeters.

  5. flipper_length_mm: A numeric variable representing the length of the penguin’s flipper in millimeters.

  6. body_mass_g: A numeric variable representing the penguin’s body mass in grams.

  7. sex: A factor with two levels - male and female.

  8. year: An integer variable representing the year of observation (2007, 2008, or 2009).

There are three species in palmerpenguins: Adelie, Chinstrap, and Gentoo.

The first 10 rows of the data table look like this:

knitr::kable(penguins %>% head(10))
species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen  39.1            18.7           181                3750         male    2007
Adelie   Torgersen  39.5            17.4           186                3800         female  2007
Adelie   Torgersen  40.3            18.0           195                3250         female  2007
Adelie   Torgersen  NA              NA             NA                 NA           NA      2007
Adelie   Torgersen  36.7            19.3           193                3450         female  2007
Adelie   Torgersen  39.3            20.6           190                3650         male    2007
Adelie   Torgersen  38.9            17.8           181                3625         female  2007
Adelie   Torgersen  39.2            19.6           195                4675         male    2007
Adelie   Torgersen  34.1            18.1           193                3475         NA      2007
Adelie   Torgersen  42.0            20.2           190                4250         NA      2007


  1. How many observations are in the penguins data?
  • A) 324

  • B) 334

  • C) 344

  • D) 354


  2. How many variables are in the penguins data?
  • A) 8

  • B) 9

  • C) 10

  • D) 11


  3. When visualizing the distribution of flipper_length_mm among the three species in the penguins dataset, which of the following statements is most accurate?
penguins %>% 
  ggplot(aes(x = species, 
             y = flipper_length_mm,
             fill = species)) + 
  geom_boxplot()
  • A) Adelie penguins have the widest range of flipper_length_mm.

  • B) Gentoo penguins have a significantly higher mean flipper_length_mm compared to the other two species.

  • C) Chinstrap penguins have a non-symmetric distribution for flipper_length_mm.

  • D) Adelie penguins have the lowest values across all variables.


  4. Which species is most distinguishable using flipper_length_mm?
  • A) Adelie

  • B) Chinstrap

  • C) Gentoo

  • D) None of them


  5. If you want to check whether there are any missing values in the penguins dataset, which R code would you use?
  • A) any(is.na(penguins))

  • B) sum(is.na(penguins))

  • C) head(penguins)

  • D) str(penguins)
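
As a hedged illustration of how the missing-value helpers above behave, the sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the penguins data). `is.na()` marks each missing cell, `any()` collapses those marks into a single TRUE/FALSE, and `sum()` or `colSums()` count them.

```r
# Toy data frame with two missing cells (hypothetical values, not penguins)
toy <- data.frame(x = c(1, NA, 3),
                  y = c("a", "b", NA))

has_missing <- any(is.na(toy))      # single logical: is anything missing?
n_missing   <- sum(is.na(toy))      # total number of missing cells
per_column  <- colSums(is.na(toy))  # number of missing cells per variable

print(has_missing)
print(n_missing)
print(per_column)
```

The same calls should work unchanged on `penguins`, since `is.na()` accepts any data frame.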


  6. Based on the output of summary(penguins), which of the following statements is correct?
  • A) There are no missing values in the dataset.

  • B) There are missing values in at least one variable.

  • C) The dataset contains only numeric variables.

  • D) The dataset only contains information about the species variable.


  7. After running table(penguins$year), what can you conclude about the data collection across years?
  • A) Data was collected only in one year.

  • B) Data was collected across multiple years.

  • C) Each year has an equal number of observations.

  • D) Data was collected only in the most recent year.


  8. Based on table(penguins$species, penguins$island), which of the following is true?
  • A) All penguins on Torgersen Island are Gentoo.

  • B) Only Adelie species are found on Torgersen Island.

  • C) All species are distributed equally across the islands.

  • D) Only Chinstrap species are found on Torgersen Island.


  9. According to table(penguins$sex, penguins$island), which statement is correct regarding the distribution of penguin sexes across the islands?
  • A) There are no female penguins on Biscoe Island.

  • B) There are only male penguins on Torgersen Island.

  • C) Each island has both male and female penguins.

  • D) Sex is evenly distributed across the islands.
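
One detail worth keeping in mind when reading cross-tabulations like the ones above: by default `table()` silently drops missing values. The sketch below shows this on made-up vectors (`sex` and `island` here are hypothetical toy data, not the penguins columns); the `useNA = "ifany"` argument adds an explicit NA row or column.

```r
# Toy factors (hypothetical): one observation has a missing sex
sex    <- factor(c("male", "female", NA, "male", "female"))
island <- factor(c("A", "A", "B", "B", "B"))

default_tab <- table(sex, island)                   # NA observations are dropped
full_tab    <- table(sex, island, useNA = "ifany")  # NA gets its own row

print(default_tab)
print(full_tab)
```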


Part II. Data wrangling

#####
# missing values
library(VIM)
aggr(penguins)
# Filter out rows with missing values in 'bill_length_mm'
penguins %>% 
  filter(!is.na(bill_length_mm)) -> penguins_new

# Check the missing values again
aggr(penguins_new)
 10. Using aggr(penguins), you observe the missing value distribution in the penguins dataset. Which variable has the highest number of missing values?
  • A) bill_length_mm

  • B) sex

  • C) body_mass_g

  • D) species


 11. After filtering out rows with missing values in the bill_length_mm variable, which code would give you the updated dataset without those missing values?
  • A) penguins %>% filter(is.na(bill_length_mm))

  • B) penguins %>% filter(!is.na(bill_length_mm))

  • C) penguins %>% filter(is.na(bill_depth_mm))

  • D) penguins %>% filter(!is.na(body_mass_g))


 12. When imputing missing values with the mice package, which method is specified for handling missing data in this code?
# Missing value handling with mice library and check again
library(mice)

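# 'imp' holds the imputation result (a mids object), kept separate from the data frame
imp <- mice(penguins_new, method='rf', seed=1234)
# Extract the first completed (imputed) dataset
penguins_imputed <- complete(imp, 1)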
  • A) Linear regression

  • B) k-Nearest Neighbors

  • C) Random Forest

  • D) Mean substitution


 13. After completing the imputation with mice, you check the missing values again with aggr(penguins_imputed). What would you expect to observe in this visualization?
  • A) All variables now have missing values.

  • B) Some variables still have missing values.

  • C) No variables have missing values anymore.

  • D) Only sex has missing values remaining.


 14. After imputation, you compare the sex variable in the original and imputed datasets with table(penguins$sex) and table(penguins_imputed$sex). Which of the following might you expect?
  • A) The imputed dataset will be perfect.

  • B) The distribution of sex remains the same as in the original dataset.

  • C) The imputed dataset has more balanced entries between “male” and “female.”

  • D) The original dataset has more balanced entries between “male” and “female.”


 15. Using the group_by and summarise functions as shown below, which of the following statements is true about the average bill length (bill_len) by gender?
penguins_imputed %>% 
  group_by(sex) %>% 
  summarise(bill_len=mean(bill_length_mm),
            bill_dep=mean(bill_depth_mm),
            mass=mean(body_mass_g),
            flipper_len=mean(flipper_length_mm)
            )
  • A) Female penguins have a longer average bill length than male penguins.

  • B) Male penguins have a longer average bill length than female penguins.

  • C) Both genders have the same average bill length.

  • D) There is insufficient data to determine the average bill length for each gender.


 16. Based on the result above, which morphological feature shows the smallest average difference between male and female penguins?
  • A) Bill Length

  • B) Bill Depth

  • C) Body Mass

  • D) Flipper Length


 17. Based on the result above, which feature would likely be the most reliable for identifying the gender of a penguin based on averages?
  • A) Bill Depth

  • B) Flipper Length

  • C) Body Mass

  • D) Bill Length


 18. Which island has the smallest average bill_depth_mm (bill_dep)?
penguins_imputed %>% 
  group_by(island) %>% 
  summarise(bill_len=mean(bill_length_mm),
            bill_dep=mean(bill_depth_mm),
            mass=mean(body_mass_g),
            flipper_len=mean(flipper_length_mm)
  )
  • A) Biscoe

  • B) Dream

  • C) Torgersen

  • D) The average bill_depth_mm is the same for all islands.


 19. Based on these measurements, which feature would likely be most useful to differentiate penguins from Torgersen Island from those on other islands?
  • A) Bill Length

  • B) Bill Depth

  • C) Body Mass

  • D) Flipper Length


Part III. Train the model (Modeling)

The code below trains a decision tree model on the penguins_imputed dataset.

# 1. Decision Tree
penguins_imputed %>% select(-c("island", "sex", "year")) -> train

library(rpart)

dt_model <- rpart(species~., data = train, method = "class")

summary(dt_model)

This is to visualize the result:

library(rpart.plot)
rpart.plot(dt_model, type=4, extra=100, box.palette ="-YlGnBl", branch.lty = 2)
 20. In the first node of the Decision Tree, which variable is used for the primary split, and what is its splitting threshold?
  • A) bill_length_mm with a threshold of 42.35

  • B) flipper_length_mm with a threshold of 206.5

  • C) bill_depth_mm with a threshold of 16.45

  • D) body_mass_g with a threshold of 4525


 21. Which of the following statements best describes the purpose of using impurity indices, such as the Gini Index or Entropy, in Decision Trees?
  • A) They measure how pure a node is, helping to decide the best feature to split on.

  • B) They increase the complexity of the tree by adding more nodes.

  • C) They are used to compute the model accuracy after each split.

  • D) They are applied only at the root node of the tree.


 22. Suppose you have a node with the following class distribution: Adelie: 60%, Chinstrap: 20%, and Gentoo: 20%. What is the Gini Index for this node?
  • A) 0.4

  • B) 0.36

  • C) 0.48

  • D) 0.56
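
For reference, the two impurity indices named above can be sketched in a few lines of R. This is a minimal illustration using the standard definitions (Gini = 1 minus the sum of squared class proportions; entropy = minus the sum of p times log2 p); the helper names `gini_index` and `entropy` are made up for this example and are not functions from rpart.

```r
# Gini index of a node, given its class proportions p (must sum to 1)
gini_index <- function(p) {
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  1 - sum(p^2)
}

# Entropy (in bits) of a node; zero proportions contribute nothing
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

pure_gini  <- gini_index(c(1, 0))      # a node with a single class
mixed_gini <- gini_index(c(0.5, 0.5))  # an evenly mixed two-class node
mixed_entr <- entropy(c(0.5, 0.5))

print(c(pure = pure_gini, mixed = mixed_gini, mixed_entropy = mixed_entr))
```

Both indices are smallest for a pure node and largest for an evenly mixed one, which is why a split that lowers them is preferred.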


 23. In a Decision Tree, what is the purpose of the complexity parameter (CP)?
  • A) To control the maximum depth of the tree.

  • B) To limit the number of splits at each node.

  • C) To balance model accuracy and tree size by penalizing each split.

  • D) To adjust the tree’s prediction threshold.
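
To see the complexity parameter in action, the following sketch grows a tree on R's built-in iris data (used here only so the example is self-contained; it is not part of the quiz data) and then prunes it. The specific cp values 0.001 and 0.1 are arbitrary choices for illustration.

```r
library(rpart)

# Grow a deliberately large tree by setting cp close to 0
full_tree <- rpart(Species ~ ., data = iris, method = "class", cp = 0.001)

# The cp table lists, for each candidate tree size, its cp value and errors
print(full_tree$cptable)

# Pruning with a larger cp removes splits that do not improve the fit enough
pruned_tree <- prune(full_tree, cp = 0.1)

n_nodes_full   <- nrow(full_tree$frame)    # one row per node in the tree
n_nodes_pruned <- nrow(pruned_tree$frame)
print(c(full = n_nodes_full, pruned = n_nodes_pruned))
```

The same `printcp()`/`prune()` workflow could be tried on dt_model from the quiz code.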


 24. Using predict(), which species is most likely for penguin number 104?
  • A) Adelie

  • B) Chinstrap

  • C) Gentoo


Now, let’s train a Random Forest model on our data.

# 2. Random Forest
library(randomForest)
rf_model <- randomForest(species~.,
                         data = train, 
                         mtry = 3,
                         ntree = 200)
rf_model

varImpPlot(rf_model)
 25. After running the randomForest function, you observe the output rf_model. Which of the following statements about the mtry and ntree parameters is correct?
  • A) mtry is the number of trees in the forest, and ntree is the number of features considered at each split.

  • B) mtry is the number of features considered at each split, and ntree is the total number of trees in the forest.

  • C) Both mtry and ntree refer to the number of features considered at each split.

  • D) mtry and ntree do not impact the structure of the Random Forest.


 26. After plotting the variable importance using varImpPlot(rf_model), which variable appears to be the most important in predicting penguin species?
  • A) flipper_length_mm

  • B) bill_length_mm

  • C) bill_depth_mm

  • D) body_mass_g


 27. With ntree = 200, what does this parameter value mean in the context of the Random Forest model?
  • A) The forest consists of 200 trees, each built independently.

  • B) The model will select the top 200 features for classification.

  • C) The model stops training once 200 nodes are created.

  • D) Each tree uses only 200 observations.


 28. Why does Random Forest randomly select a subset of features (mtry) at each split?

  • A) To reduce the likelihood of overfitting and ensure diverse trees in the forest.

  • B) To make the model train faster by using fewer features.

  • C) To ensure that every feature is used in every tree.

  • D) To prioritize the features with the highest variance.


 29. What is the purpose of Out-of-Bag (OOB) error estimation in Random Forest?
  • A) It provides a training error estimate.

  • B) It is used to validate the model on unseen data, reducing the need for a separate test set.

  • C) It computes the variance of the model.

  • D) It calculates the importance of each feature.
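
The Random Forest settings and the OOB estimate discussed above can be inspected directly on a fitted object. The sketch below is a hedged example on the built-in iris data (chosen only to keep it self-contained); the object name `rf_toy` and the values mtry = 2 and ntree = 50 are arbitrary illustration choices, not the quiz settings.

```r
library(randomForest)

set.seed(1234)  # bootstrap samples and feature subsets are random
rf_toy <- randomForest(Species ~ ., data = iris, mtry = 2, ntree = 50)

# The fitted object stores the settings it was trained with
print(rf_toy$mtry)   # features tried at each split
print(rf_toy$ntree)  # trees in the forest

# err.rate has one row per tree; the last row is the OOB error of the full forest
oob_error <- rf_toy$err.rate[rf_toy$ntree, "OOB"]
print(oob_error)

# Numeric version of what varImpPlot() draws
print(importance(rf_toy))
```

Because each tree is evaluated on the rows left out of its bootstrap sample, the OOB error behaves like a built-in validation estimate.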


Now, let’s train a Naive Bayes model on our data.

# 3. Naive Bayes

library(naivebayes)
nb_model <- naive_bayes(species ~ ., data=train)
summary(nb_model)
 30. Why is Naive Bayes considered a “naive” algorithm?
  • A) It assumes that each feature contributes equally to the final prediction.

  • B) It assumes that all features are independent of each other, given the class.

  • C) It assumes that the prior probabilities are always equal for each class.

  • D) It does not require any training data to make predictions.


 31. If the prior probability of Adelie species is 0.5, what does this imply about the Adelie class in the training data?
  • A) Adelie is the most common species.

  • B) Half of the observations in the training data belong to the Adelie species.

  • C) All observations belong to the Adelie species.

  • D) The Naive Bayes model is biased towards predicting Adelie.
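
To make priors and the independence assumption concrete, here is a hand-rolled Gaussian Naive Bayes sketch in base R. Everything in it (the `toy` data, the `score_class` helper, the new observation) is hypothetical and only illustrates the idea; it is not how the naivebayes package is implemented internally.

```r
# Toy training data: two classes, two numeric features (made-up values)
toy <- data.frame(
  cls = factor(c("a", "a", "a", "b", "b", "b")),
  x1  = c(1.0, 1.2, 0.8, 3.0, 3.2, 2.8),
  x2  = c(10, 11, 9, 20, 21, 19)
)

# Priors estimated as class proportions in the training data
priors <- prop.table(table(toy$cls))
print(priors)

# Unnormalized posterior: prior times one Gaussian likelihood per feature.
# Multiplying the per-feature likelihoods is the "naive" independence step.
score_class <- function(k, new_x1, new_x2) {
  rows <- toy[toy$cls == k, ]
  as.numeric(priors[k]) *
    dnorm(new_x1, mean(rows$x1), sd(rows$x1)) *
    dnorm(new_x2, mean(rows$x2), sd(rows$x2))
}

scores    <- sapply(levels(toy$cls), score_class, new_x1 = 1.1, new_x2 = 10.5)
posterior <- scores / sum(scores)  # normalize so the class probabilities sum to 1
print(posterior)
```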


Lastly, let’s train a kNN model on our data.

# 4. kNN

# Normalization

nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }

train %>% mutate(bill_length_mm=nor(bill_length_mm),
                 bill_depth_mm=nor(bill_depth_mm),
                 flipper_length_mm=nor(flipper_length_mm),
                 body_mass_g=nor(body_mass_g)) %>% 
  select(-species)-> train_nor

head(train_nor)

library(class)

kn_model <- knn(train_nor, 
                train_nor[101:120,], 
                cl=train$species, k=13)
tab <- table(kn_model,train[101:120,"species"])
tab
 32. Why is data normalization necessary before applying the k-Nearest Neighbors (kNN) algorithm in this code?
  • A) To reduce overfitting by increasing the values of the features

  • B) To ensure that all features have the same scale, as kNN relies on distance calculations

  • C) To reduce the number of dimensions in the dataset

  • D) To assign higher importance to categorical features
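
The effect of the nor() function on distances can be seen on a tiny made-up example. The sketch below reuses nor() as defined in the quiz code; the `toy` data frame and its values are hypothetical.

```r
# Min-max normalization, as defined in the quiz code
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }

# Two features on very different scales (made-up values)
toy <- data.frame(bill_depth_mm = c(15, 18, 21),
                  body_mass_g   = c(3000, 4500, 6000))

# Euclidean distances on the raw data are dominated by body_mass_g
raw_dist <- dist(toy)
print(raw_dist)

# After normalization each feature contributes on a comparable scale
toy_nor  <- as.data.frame(lapply(toy, nor))
nor_dist <- dist(toy_nor)
print(toy_nor)
print(nor_dist)
```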


 33. After running head(train_nor), you examine the first few rows of train_nor. What should you expect the values to be for each feature?
  • A) Between -1 and 1

  • B) Between 0 and 1

  • C) Greater than 1

  • D) They should remain unchanged


 34. In the line kn_model <- knn(train_nor, train_nor[101:120,], cl = train$species, k = 13), what does k = 13 represent?
  • A) The maximum distance allowed for neighbors

  • B) The number of features considered in each prediction

  • C) The number of nearest neighbors considered for classification

  • D) The number of folds in cross-validation


 35. How does increasing the k value generally affect the performance of a kNN model?
  • A) It increases the model’s sensitivity to noise.

  • B) It reduces the complexity and makes the decision boundary smoother.

  • C) It always improves the model’s accuracy.

  • D) It makes the model faster.
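
A quick way to explore the role of k is to run knn() with two different values and compare the results. The sketch below does this on the built-in iris data (used only to keep the example self-contained); k = 1 and k = 25 are arbitrary choices. Note that, as in the quiz code, the rows being predicted are also in the training set, so the accuracies are likely optimistic.

```r
library(class)

features <- iris[, 1:4]   # four numeric predictors
labels   <- iris$Species

set.seed(1234)  # knn() breaks ties at random

# Same data, two neighborhood sizes
pred_k1  <- knn(features, features, cl = labels, k = 1)
pred_k25 <- knn(features, features, cl = labels, k = 25)

acc_k1  <- mean(pred_k1 == labels)
acc_k25 <- mean(pred_k25 == labels)
print(c(k1 = acc_k1, k25 = acc_k25))
```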


Part IV. Model score & Prediction

######################
# Score & Prediction 

# create a test dataset (a subset of the rows in train)
test <- train[seq(1,300,3),]

# Prediction by using trained models
pred_dt <- predict(dt_model, test, type='class')
pred_rf <- predict(rf_model, test, type='class')
pred_nb <- predict(nb_model, test, type='class')
pred_kn <- knn(train_nor, train_nor[seq(1,300,3),], cl=train$species, k=13)


data.frame(truth=train[seq(1,300,3),"species"],
           dt=pred_dt, 
           rf=pred_rf, 
           nb=pred_nb, 
           kn=pred_kn) %>% 
  mutate(dt=ifelse(dt==truth, 1, 0),
         rf=ifelse(rf==truth, 1, 0),
         nb=ifelse(nb==truth, 1, 0),
         kn=ifelse(kn==truth, 1, 0)) -> score

apply(score[-1], 2, sum)
 36. Which of the following best describes the test dataset?
  • A) It includes all rows from train.

  • B) It includes every third row from the train dataset.

  • C) It contains only rows with missing values.

  • D) It randomly selects rows from train.


 37. After running apply(score[-1], 2, sum), what does the output represent?
  • A) The number of test observations.

  • B) The sum of all predictions across models.

  • C) The number of correct predictions for each model.

  • D) The total accuracy of each model.
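
The scoring step can be traced on a small hand-made table. In the sketch below, `score_toy` is a hypothetical stand-in for the score data frame from the quiz code (1 means the model matched the truth, 0 means it did not); dividing the column sums by the number of rows turns counts into accuracies.

```r
# Hypothetical score table: one row per test observation, one column per model
score_toy <- data.frame(truth = c("A", "B", "A", "C"),
                        dt = c(1, 1, 0, 1),
                        rf = c(1, 1, 1, 1),
                        nb = c(0, 1, 0, 1))

# Drop the truth column, then sum each remaining column (MARGIN = 2 means columns)
n_correct <- apply(score_toy[-1], 2, sum)
print(n_correct)

# Accuracy = correct predictions / number of test observations
accuracy <- n_correct / nrow(score_toy)
print(accuracy)
```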


 38. Which of the following is a common strategy in ensemble learning for combining predictions from multiple models?
  • A) Choosing the model with the highest accuracy for each prediction

  • B) Using a weighted average of predictions based on each model’s performance

  • C) Ignoring models that disagree with the majority

  • D) Only using models that agree on the prediction
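
Two simple ways of combining class predictions, plain majority voting and weighted voting, can be sketched in base R as follows. The predictions in `preds`, the `weights`, and the helper names are all made up for illustration; in practice the weights might come from each model's validation accuracy.

```r
# Hypothetical class predictions from three models for three observations
preds <- data.frame(dt = c("A", "B", "A"),
                    rf = c("A", "B", "B"),
                    nb = c("B", "B", "B"))

# Majority vote: the most frequent label in each row wins
majority_vote <- function(r) names(which.max(table(r)))
majority <- apply(preds, 1, majority_vote)

# Weighted vote: sum the model weights per label, then take the largest total
weights <- c(dt = 0.6, rf = 0.25, nb = 0.15)  # assumed weights, for illustration only
weighted_vote <- function(r, w) {
  totals <- tapply(w, r, sum)
  names(which.max(totals))
}
weighted <- apply(preds, 1, weighted_vote, w = weights)

print(data.frame(preds, majority = majority, weighted = weighted))
```

With these made-up weights the two schemes disagree on the third observation, because the single heavily weighted model outvotes the other two.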


Suppose you have the following confusion matrix for a binary classification problem where the positive class is “Adelie”:

                   Predicted Adelie  Predicted Not Adelie
Actual Adelie      40                10
Actual Not Adelie  15                35
 39. What is the precision for the “Adelie” class?
  • A) 0.80

  • B) 0.73

  • C) 0.60

  • D) 0.50


 40. What is the sensitivity (recall) for the “Adelie” class?
  • A) 0.80

  • B) 0.73

  • C) 0.60

  • D) 0.50


 41. Based on the precision and recall calculated above, what is the F1-score for the “Adelie” class?
  • A) 0.76

  • B) 0.79

  • C) 0.83

  • D) 0.85


 42. If a classifier has a low F1-score, which of the following is most likely true?
  • A) The classifier has high accuracy.

  • B) The classifier struggles to balance precision and recall, resulting in poor classification of the positive class.

  • C) The classifier has a strong balance between sensitivity and specificity.

  • D) The classifier has excellent performance on the negative class.
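
As a reference for the metrics used in this part, the sketch below computes precision, recall, and F1 from a confusion matrix in R. The counts are deliberately different from the quiz matrix and entirely made up; the row/column layout (actual in rows, predicted in columns) is an assumption of this example.

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(30, 10,   # predicted "pos": actual pos, actual neg
               20, 40),  # predicted "neg": actual pos, actual neg
             nrow = 2,
             dimnames = list(actual = c("pos", "neg"),
                             predicted = c("pos", "neg")))
print(cm)

tp <- cm["pos", "pos"]  # true positives
fp <- cm["neg", "pos"]  # false positives
fn <- cm["pos", "neg"]  # false negatives

precision <- tp / (tp + fp)  # of everything predicted positive, how much was right
recall    <- tp / (tp + fn)  # of all actual positives, how many were found
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean

print(c(precision = precision, recall = recall, f1 = f1))
```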


THE END