load("data/qz_a.RData")
DS101 QZ 2025
Date and Location
Date: 13 May (Tue) 09:30 - 11:30
Location: Classroom
Notice
- Quiz will be administered through Google Forms.
- Please bring your laptop for the quiz.
- You are allowed to access any information through the Internet.
- However, communication with others is strictly prohibited.
- Do not use any messaging apps (e.g., KakaoTalk, TikTok, Line, WeChat, etc.) during the quiz.
Submit your answers
Solve the problems and submit your answers by entering them in the Google Form at the link below.
https://forms.gle/mnneHybsBPkbNP1y7
QZ content
PART I. Basic Syntax in R
- List
- Vector
- Matrix
- Data.frame
- Data wrangling
PART II. Data Wrangling with tidyverse
PART III. Data Visualization
PART IV. Advanced
PART V. Learning Data Science with ChatGPT
QZ
PART I. Basic Syntax in R
Please download the data below
Locate “qz_a.RData” in your working directory and load the file.
Then you will see the object “KMP_list_qz_a” in the Environment window of RStudio.
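For reference, a minimal sketch of the loading step (assuming the file sits directly in your working directory):
# Load the quiz data into the current session
load("qz_a.RData")
# Inspect the top level of the object shown in the Environment pane
str(KMP_list_qz_a, max.level = 1)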
OK, then let’s begin the quiz.
List
- How many elements are in the list?
- 4
- 5
- 6
- 7
- What is the name of the second element of the list?
- vector
- matrix
- p20
- None
- What is the data type of the first element in the list?
- vector
- matrix
- data.frame (tibble)
- list
- Which of the following statements correctly creates the sixth element of the list as an empty list?
KMP_list_qz_a[[6]] <- list(0)
KMP_list_qz_a[6] <- list()
KMP_list_qz_a[[6]] <- empty_list(0)
KMP_list_qz_a[6] <- list(0)
- Which of the following correctly creates the vector {2,4,6,8,10} using the seq function and places it as the first element of the sixth element of the list?
KMP_list_qz_a[[6]][[1]] <- seq(2,10,2)
KMP_list_qz_a[[6]][1] <- seq(2,10,2)
KMP_list_qz_a[[6]][1] <- c(2,4,6,8,10)
KMP_list_qz_a[6][[1]] <- seq(2,10,2)
- Which of the following options correctly interprets the meaning of the code KMP_list_qz_a[[6]][[1]][4]?
- Access the fourth element of the second element of the sixth element of the list KMP_list_qz_a.
- Assign the value 4 to the fourth element of the first element of the sixth element of the list KMP_list_qz_a.
- Retrieve the fourth element of the first sub-list within the sixth element of the list KMP_list_qz_a.
- Remove the fourth element from the first sub-list within the sixth element of the list KMP_list_qz_a.
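As a reminder of how nested list indexing works, here is a sketch with a toy list (lst is a made-up object, not the quiz data):
# A toy nested list for illustration only
lst <- list(a = 1:3, b = list(x = 10:14))
lst[["b"]][["x"]][4]  # 13: element 4 of vector x inside sub-list b
lst["b"]              # single brackets return a list of length 1, not the element itself
lst[[3]] <- list()    # double-bracket assignment adds a new (here, empty list) element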
Vector
Let a vector X be defined as below:
X <- KMP_list_qz_a[[3]]
- How many missing values are in vector X?
- 2
- 3
- 4
- 5
- What is the position of value 41 in X? (Note: it’s not about the value of X[41])
- 33
- 43
- 53
- 63
- Which of the following expressions correctly represents the number of elements in vector X where the values are not missing and are less than 27?
length(X[X < 27 & !is.na(X)])
length(X[X < 27 | !is.na(X)])
length(X[X >= 27 & !is.na(X)])
length(X[X >= 27 | !is.na(X)])
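The vector questions above all reduce to logical indexing; a sketch with a toy vector (v is made up, not the quiz vector X):
v <- c(5, NA, 12, 30, NA, 8)    # toy data for illustration
sum(is.na(v))                   # number of missing values: 2
which(v == 12)                  # position of a given value: 3
length(v[v < 27 & !is.na(v)])   # non-missing values below 27: 3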
Matrix
Let a matrix M be defined as below:
M <- KMP_list_qz_a[[4]]
- Which of the following statements is true regarding the expressions M[3,3], t(M)[3,3], and M[9]?
- M[3,3] is equivalent to t(M)[3,3] but different from M[9].
- M[3,3] is equivalent to M[9] but different from t(M)[3,3].
- M[3,3] is different from both t(M)[3,3] and M[9].
- M[3,3], t(M)[3,3], and M[9] are all equivalent.
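For intuition on the indexing rules being tested, a sketch with a toy matrix (A is made up, not the quiz matrix M):
A <- matrix(1:6, nrow = 2)  # toy 2x3 matrix, filled column by column
A[2, 2]                     # row 2, column 2
t(A)[2, 2]                  # t(A)[i, j] equals A[j, i]
A[5]                        # a single index counts down the columns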
- Which of the following correctly describes the output of the given code snippets?
# 1
apply(M, 2, mean)
[1] 2 5 8 11 14
# 2
apply(M, 1, mean)
[1] 7 8 9
- The first snippet calculates the average value of each column in matrix M, while the second snippet calculates the average value of each row.
- The first snippet calculates the average value of each row in matrix M, while the second snippet calculates the average value of each column.
- Both snippets calculate the average value of each element in matrix M.
- The snippets produce errors because the apply function does not support averaging operations.
Data.frame
Let DF1 and DF2 be defined as below:
DF1 <- KMP_list_qz_a[[1]]
DF2 <- KMP_list_qz_a[[2]]
- How many variables are there in DF1?
- 8
- 10
- 12
- 14
- How many people (observations) are there in DF2?
- 40
- 50
- 60
- 70
- In DF1, how many people are in the 3000–4000K income category?
- 5
- 6
- 7
- 8
- In DF1, how many people are “Male” & “Never married”, and how many SKT (Telecom) users are among them? (Use the variable: telecom)
- 5, 1
- 5, 3
- 3, 1
- 3, 3
- How many people who used an LG smartphone in 2019 (in DF1) switched to a Samsung smartphone in 2020 (in DF2)?
- 3
- 5
- 7
- 9
- Of those who used an LG smartphone in 2019, how many people still used LG in 2020?
- None of them
- 2
- 4
- 6
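Questions of this kind reduce to filtering and counting. A sketch of the pattern with dplyr — note that the column names sex and marital are assumptions; check the real names with names(DF1):
library(dplyr)
# Hypothetical column names; replace with the actual ones in DF1
DF1 %>%
  filter(sex == "Male", marital == "Never married") %>%
  count(telecom)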
The code below processes a text containing the lyrics of “Let It Be” by The Beatles: it removes newline characters and commas using the gsub function, then creates a word cloud visualization to display the frequency of words in the song.
<- c("When I find myself in times of trouble, Mother Mary comes to me
letitbe Speaking words of wisdom, let it be
And in my hour of darkness she is standing right in front of me
Speaking words of wisdom, let it be
Let it be, let it be, let it be, let it be
Whisper words of wisdom, let it be
And when the broken hearted people living in the world agree
There will be an answer, let it be
For though they may be parted, there is still a chance that they will see
There will be an answer, let it be
Let it be, let it be, let it be, let it be
There will be an answer, let it be
Let it be, let it be, let it be, let it be
Whisper words of wisdom, let it be
Let it be, let it be, let it be, let it be
Whisper words of wisdom, let it be, be
And when the night is cloudy there is still a light that shines on me
Shinin' until tomorrow, let it be
I wake up to the sound of music, Mother Mary comes to me
Speaking words of wisdom, let it be
And let it be, let it be, let it be, let it be
Whisper words of wisdom, let it be
And let it be, let it be, let it be, let it be
Whisper words of wisdom, let it be")
letitbe_rm <- gsub("\n", "", letitbe)
letitbe_rm <- gsub(",", "", letitbe_rm)
word_list <- strsplit(letitbe_rm, split = " ")
word_vec <- unlist(word_list)
word_count <- sort(table(word_vec), decreasing = T)
df_word_count <- data.frame(word_count)
library(wordcloud2)
wordcloud2(df_word_count)
- What is the purpose of using the gsub function in the provided code snippet?
- It converts the text into a word vector for further analysis.
- It prepares the text data by removing unwanted characters like newline characters and commas.
- It calculates the frequency of each word in the text.
- It generates a word cloud visualization based on the frequency of words in the text.
PART II. Data Wrangling with tidyverse
Download the gapminder package to use the gapminder dataset.
# install.packages("gapminder")
Import required libraries
library(gapminder)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The code works as follows:
- filters only the 1962 data from the gapminder data,
- selects only the country and gdpPercap variables,
- renames the gdpPercap variable to gdp_1962,
- and assigns the result to the object ‘gap_1962’, as sketched below.
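A sketch of the blanked code, reconstructed from the description above; the blanks to fill are marked (a), (b), (c):
gap_1962 <- gapminder %>%
  (a)(year == 1962) %>%
  (b)(country, gdpPercap) %>%
  (c)(gdp_1962 = gdpPercap)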
- Choose the one that fits the blanks (a), (b), (c)
A) (a) filter (b) select (c) rename
B) (a) subset (b) keep (c) relabel
C) (a) arrange (b) pick (c) change_name
D) (a) extract (b) choose (c) modify
The code below performs the same procedure as above, but for the year 2007; the blanks (a), (b), (c) take the same answers as above.
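A sketch of that step (same blanks as before; the name gdp_2007 is assumed by analogy):
gap_2007 <- gapminder %>%
  (a)(year == 2007) %>%
  (b)(country, gdpPercap) %>%
  (c)(gdp_2007 = gdpPercap)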
The code below finds the difference in gdpPercap from 1962 to 2007 and sorts the countries in descending order of growth (gdp_gap).
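A sketch of the join-and-sort step described above, with new blanks (a), (b), (c); the object name gdp_gap_1962_2007 is assumed from its later use:
gdp_gap_1962_2007 <- gap_1962 %>%
  (a)(gap_2007, by = "country") %>%
  (b)(gdp_gap = gdp_2007 - gdp_1962) %>%
  arrange((c)(gdp_gap))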
- Choose the one that fits the blanks (a), (b), (c)
A) (a) inner_join (b) transform (c) setNames
B) (a) left_join (b) mutate (c) desc
C) (a) with (b) modify_if (c) -
D) (a) left_join (b) apply (c) recast
- In the result above, what is the rank of “Korea, Rep.” in terms of gdp_gap?
A) 9
B) 11
C) 13
D) 16
- Using the code above, create a table for population growth from 1962 to 2007, and choose the country with the highest population growth during the period.
A) China
B) India
C) Indonesia
D) Brazil
PART III. Data Visualization
The following is the code to draw the graph below using the gdp_gap_1962_2007 table created above.
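A sketch of the blanked plotting code (the overall shape is an assumption; only the blanks (a), (b), (c) are asked about):
gdp_gap_1962_2007 %>%
  ggplot(aes(x = (a), y = (b))) +
  geom_bar(stat = (c))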
- Fill in the blanks.
A) (a) gdp_gap (b) country (c) identity
B) (a) country (b) year (c) “dodge”
C) (a) country (b) gdp_gap (c) “identity”
D) (a) gdp_gap (b) country (c) “identity”
Let’s use the ‘mtcars’ dataset.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
- The value obtained by dividing horsepower (hp) by weight (wt) is called marginal horsepower (hp_by_wt). Choose the car with the largest hp_by_wt among the cars in the mtcars data.
A) Lotus Europa
B) Ford Pantera
C) Camaro Z28
D) Maserati Bora
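One way to approach this, as a base-R sketch (hp_by_wt is the derived quantity named in the question):
# Compute the ratio and look up the car with the largest value
hp_by_wt <- mtcars$hp / mtcars$wt
rownames(mtcars)[which.max(hp_by_wt)]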
- Execute the code below to check the graph, and choose the relationship between the car’s weight (wt) and fuel efficiency (mpg).
mtcars %>%
  ggplot(aes(x=wt, y=mpg)) +
  geom_point()
A) positive relationship
B) negative relationship
C) no correlation
D) We cannot tell with this graph
- Execute the code below to check the correlation between variables, and choose a variable relationship that is not a positive relationship.
mtcars %>%
  select(mpg, disp, hp, drat, wt) %>%
  plot
A) disp ~ hp
B) wt ~ hp
C) drat ~ mpg
D) mpg ~ disp
PART IV. Advanced
The code below expresses the correlation between variables in another way.
- Which of the following is not an advantage compared to the previous graph?
# install.packages("corrplot")
library(corrplot)
corrplot 0.92 loaded
mtcars %>%
  select(mpg, disp, hp, drat, wt) %>%
  cor %>%
  corrplot(method="circle")
A) Easy to find whether the relationship of two variables is positive or negative
B) It can be understood at a glance
C) More colorful
D) It can check each observation in the graph
The code below shows the distribution of the petal length variable in the iris data using geom_density. Run the code to check the graph and answer the questions.
iris %>%
  ggplot(aes(x=Petal.Length, colour=Species, fill=Species)) +
geom_density(alpha=.3) +
geom_vline(aes(xintercept=mean(Petal.Length),
colour=Species),
linetype="dashed", color="grey", size=1)+
xlab("Petal Length (cm)") +
ylab("Density")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
- If by chance you found an iris whose petal length was 1.5 cm, which species is this iris most likely to be?
A) Setosa
B) Versicolor
C) Virginica
The graph below is an interactive scatter plot of the relationship between Petal Length and Petal Width, with a color classification for each species.
- Hover your mouse cursor over a dot on the graph, and choose the answer that is not an advantage of this type of graph.
# install.packages("plotly")
library(plotly)
iris %>%
  plot_ly(
x = ~Petal.Length,
y = ~Petal.Width,
color = ~Species,
type = "scatter",
mode = "markers") %>%
  # For a 2D plot, axis titles go directly in layout(); 'scene' applies only to 3D plots
  layout(xaxis = list(title = 'Petal length'),
         yaxis = list(title = 'Petal width'))
A) Information on the specific observation can be easily obtained
B) You can zoom in by specifying the range
C) You can expand the scale with the mouse
D) You can freely change the X-axis’s and Y-axis’s variables
- Modify the graph code on the left to match the graph on the right.
# Graph in the left side
iris %>%
  ggplot(aes(x=Species, y=Petal.Length)) +
geom_boxplot() +
geom_jitter()+
theme_bw()
- Modify the graph code on the left to match the graph on the right.
# Graph in the left side
add_text <- data.frame(Species="setosa",
                       Petal.Length=6,
                       lab="Wonderful!")
iris %>%
  ggplot(aes(x=Species, y=Petal.Length)) +
geom_boxplot() +
geom_text(data=add_text, aes(label=lab))+
theme_bw()
- Modify the graph code on the top to match the graph on the bottom.
# Graph on the top
gapminder %>%
  filter(country %in% c("Korea, Rep.", "Korea, Dem. Rep.")) %>%
ggplot(aes(x=gdpPercap/1000, y=lifeExp, col=country)) +
geom_point()
PART V. Learning Data Science with ChatGPT
Comparing the Emotional Language of Two Taylor Swift Songs
You are provided with the results of a text mining analysis conducted using R. The analysis compares the lyrics of two Taylor Swift songs: Love Story (Taylor's Version) and Shake It Off (Taylor's Version).
The analysis includes:
The most frequently repeated words in each song
Sentiment word counts using the bing lexicon (positive vs. negative)
Additionally, you will listen to both songs to reflect on their actual emotional tones, and then critically evaluate the analytical approach.
Your Submission Should Include:
- A short written response (400–600 words) (20 points)
- Listen to both songs
- Pay close attention to the lyrics, mood, emotional tone, and structure of each song.
- Review the analysis provided.
Which song repeats certain keywords more frequently?
Which song contains more negative sentiment words?
- Critically evaluate the sentiment analysis: Why might Shake It Off, a seemingly upbeat and empowering song, contain more “negative” words?
- Propose improvements or alternative methods to better capture the emotional content of the lyrics.
Could ChatGPT or other AI models help?
Would contextual or sentence-level analysis yield better insight?
R Code for the Analysis
# Check if the 'taylor' package is installed; if not, install it
if (!requireNamespace("taylor", quietly = TRUE)) {
install.packages("taylor") # Install the 'taylor' package from CRAN
}
# Load required libraries:
# 'taylor' contains the dataset of Taylor Swift's lyrics and song metadata
# 'tidytext' provides tools for unnesting and analyzing text data
library(taylor)
Warning: package 'taylor' was built under R version 4.4.3
library(tidytext)
# Preview the structure of the dataset to understand its variables
taylor_album_songs %>% glimpse
Rows: 240
Columns: 29
$ album_name <chr> "Taylor Swift", "Taylor Swift", "Taylor Swift", "T…
$ ep <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ album_release <date> 2006-10-24, 2006-10-24, 2006-10-24, 2006-10-24, 2…
$ track_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ track_name <chr> "Tim McGraw", "Picture To Burn", "Teardrops On My …
$ artist <chr> "Taylor Swift", "Taylor Swift", "Taylor Swift", "T…
$ featuring <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ bonus_track <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ promotional_release <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ single_release <date> 2006-06-19, 2008-02-03, 2007-02-19, NA, NA, NA, N…
$ track_release <date> 2006-06-19, 2006-10-24, 2006-10-24, 2006-10-24, 2…
$ danceability <dbl> 0.580, 0.658, 0.621, 0.576, 0.418, 0.589, 0.479, 0…
$ energy <dbl> 0.491, 0.877, 0.417, 0.777, 0.482, 0.805, 0.578, 0…
$ key <int> 0, 7, 10, 9, 5, 5, 2, 8, 4, 2, 2, 8, 7, 4, 10, 5, …
$ loudness <dbl> -6.462, -2.098, -6.941, -2.881, -5.769, -4.055, -4…
$ mode <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
$ speechiness <dbl> 0.0251, 0.0323, 0.0231, 0.0324, 0.0266, 0.0293, 0.…
$ acousticness <dbl> 0.57500, 0.17300, 0.28800, 0.05100, 0.21700, 0.004…
$ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, …
$ liveness <dbl> 0.1210, 0.0962, 0.1190, 0.3200, 0.1230, 0.2400, 0.…
$ valence <dbl> 0.425, 0.821, 0.289, 0.428, 0.261, 0.591, 0.192, 0…
$ tempo <dbl> 76.009, 105.586, 99.953, 115.028, 175.558, 112.982…
$ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ duration_ms <int> 232107, 173067, 203040, 199200, 239013, 207107, 24…
$ explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ key_name <chr> "C", "G", "A#", "A", "F", "F", "D", "G#", "E", "D"…
$ mode_name <chr> "major", "major", "major", "major", "major", "majo…
$ key_mode <chr> "C major", "G major", "A# major", "A major", "F ma…
$ lyrics <list> [<tbl_df[55 x 4]>], [<tbl_df[33 x 4]>], [<tbl_df[…
# Define the two songs to compare
# These are Taylor Swift's two well-known tracks, both in their re-recorded versions
<- c("Love Story (Taylor's Version)",
selected_songs "Shake It Off (Taylor's Version)")
# Process the lyrics of the selected songs
lyrics_words <- taylor_album_songs %>%
  # Filter the dataset to include only the selected songs
filter(track_name %in% selected_songs) %>%
# Unnest the list-column 'lyrics' so that each row corresponds to a single line of lyrics
unnest(lyrics) %>%
# Tokenize the lyrics line-by-line into individual words
# Here, 'lyric' is the column containing the actual text of each lyric line
unnest_tokens(word, lyric) %>%
# Keep only alphabetic words (remove numbers, punctuation, etc.)
filter(str_detect(word, "[a-z]"))
# View the resulting tidy text dataset
# Each row represents a single word from the lyrics, along with the song it belongs to
lyrics_words
# A tibble: 944 × 32
album_name ep album_release track_number track_name artist featuring
<chr> <lgl> <date> <int> <chr> <chr> <chr>
1 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
2 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
3 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
4 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
5 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
6 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
7 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
8 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
9 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
10 Fearless (Taylo… FALSE 2021-04-09 3 Love Stor… Taylo… <NA>
# ℹ 934 more rows
# ℹ 25 more variables: bonus_track <lgl>, promotional_release <date>,
# single_release <date>, track_release <date>, danceability <dbl>,
# energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
# key_name <chr>, mode_name <chr>, key_mode <chr>, line <int>, …
lyrics_words %>%
  # Count the frequency of each word for each song
count(track_name, word) %>%
# Group by song title so that we can identify the top 20 words per song
group_by(track_name) %>%
# Select the top 20 most frequent words for each song
slice_max(n, n = 20) %>%
# Begin plotting: x-axis = word (ordered by frequency), y-axis = count
ggplot(aes(x = reorder(word, n), y = n, fill = track_name)) +
geom_col(show.legend = FALSE) +
# Create separate plots for each song using facet_wrap
# 'scales = "free_y"' allows each subplot to use its own y-axis scale
facet_wrap(~ track_name, scales = "free_y") +
# Flip the axes so the words are on the y-axis for better readability
coord_flip() +
# Add title and axis labels
labs(title = "Most Frequently Used Words in Lyrics",
x = "Word",
y = "Frequency") +
# Use a minimal theme for clean visual appearance
theme_minimal()
# Load the 'bing' sentiment lexicon
# This dictionary labels individual words as either "positive" or "negative"
bing <- get_sentiments("bing")
# Join the lyrics data with the sentiment lexicon
lyrics_sentiment <- lyrics_words %>%
  # Perform an inner join to keep only words that are found in the 'bing' sentiment lexicon
inner_join(bing, by = "word") %>%
# Count the number of positive and negative words per song
count(track_name, sentiment)
# Visualize the number of sentiment words per song
lyrics_sentiment %>%
  ggplot(aes(x = sentiment, y = n, fill = track_name)) +
# Create side-by-side bar charts (dodge) to compare songs by sentiment type
geom_col(position = "dodge") +
# Add a title and axis labels to the plot
labs(title = "Comparison of Sentiment Word Counts by Song",
x = "Sentiment Type",
y = "Number of Words") +
theme_minimal()