Load packages and import datasets.
library(ggridges)
library(tidyverse)
library(plotly)
library(gridExtra)
combine_df = read_csv("data/combine.csv")
combine_tc_binary = read_csv("data/combine.csv") |>
mutate(
total_cholesterol = case_when(
total_cholesterol < 200 ~ "desirable",
total_cholesterol >= 200 ~ "above desirable",
TRUE ~ as.character(total_cholesterol)
)
)
TC_distri = ggplot(combine_df, aes(x = total_cholesterol)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(
x = "Total Cholesterol",
y = "Frequency",
title = "Distribution of Total Cholesterol"
) +
theme_minimal()
ggplotly(TC_distri)
The presented data illustrates the distribution of
Total Cholesterol
, spans from a minimum of 71 mg/dL to a
maximum of 446 mg/dL. The interquartile range, which captures the
central 181 of the data, extends from 156 mg/dL (25th percentile) to 210
mg/dL (75th percentile), with the median falling at 181 mg/dL. It
reveals a notable right-skewed graph. This skewness is
indicative of a pronounced prevalence of participants whose cholesterol
levels are concentrated within the range of 130 to 200 mg/dL.
According to CDC, high
cholesterol is defined as total cholesterol level greater than 200mg/dL.
Therefore, we use this value to categorize into a binary variable,
Total Cholesterol Cat
with 2 levels: “desirable” (at most
200mg/dL) and “above desirable” (higher than 200mg/dL) in the following
analysis.
sleep_hour_distri = ggplot(combine_df, aes(x = sleep_hour)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Sleep Hours",
x = "Sleep Hours",
y = "Frequency") +
theme_minimal()
ggplotly(sleep_hour_distri)
The distribution of Sleep Hours
depicted above exhibits
characteristics consistent with a normal distribution,
showcasing a bell-shaped curve. This normal distribution suggests that
the majority of individuals in the sample population tend to obtain
approximately 7 to 8 hours of sleep per night in weekdays. The peak of
the curve is centered around this range, indicating a prevalent and
typical sleep duration among the surveyed participants.
TC_sleep =
plot_ly(combine_tc_binary, x = ~total_cholesterol, y = ~sleep_hour, color = ~total_cholesterol, type = "box", colors = "viridis") |>
layout(title = "Total Cholesterol & Sleep Hours",
xaxis = list(title = "Total Cholesterol"),
yaxis = list(title = "Sleep Hours"))
TC_sleep
The presented boxplot visually represents the
distribution of Sleep Hours
in conjunction with
Total Cholesterol Cat
, revealing a remarkable similarity in
distribution patterns between the group categorized as “desirable” and
those classified as “above desirable”. This observation suggests that
individuals within both categories exhibit comparable trends in sleep
duration.
combine_tc_binary =
combine_tc_binary |>
drop_na(physical_activity)
TC_activity = ggplot(combine_tc_binary, aes(x = physical_activity, fill = total_cholesterol)) +
geom_bar(position = "stack") +
scale_fill_brewer(palette = "lightgray") +
labs(
x = "Physical Activity",
y = "Frequency",
fill = "Total Cholesterol",
title = "Relationship between Physical Activity and Total Cholesterol"
) +
theme_minimal()
ggplotly(TC_activity)
The above bar chart illustrates the association
between Physical Activity
and
Total Cholesterol Cat
. It reveals that the majority of
individuals engage in vigorous physical activity. Surprisingly, the
prevalence of individuals with Total Cholesterol groups above the
desirable range remains consistent across three distinct levels of
Physical Activity.
combine_tc_binary =
combine_tc_binary %>%
drop_na(alcohol_use_cat) %>%
mutate(alcohol_use_cat = factor(alcohol_use_cat, levels=c("Light Drinker", "Moderate Drinker", "Heavy Drinker")))
Al_distri = ggplot(combine_tc_binary, aes(x = alcohol_use_cat)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
labs(
x = "Drinking Habits",
y = "Count",
title = "Distribution of Drinking Habits"
) +
theme_minimal()
ggplotly(Al_distri)
The people who drink less than 11 times last year are defined as ” light drinker”, the people who drink once a month to twice a week are defined as ” moderate drinkers”, the people who drink more than twice a week are defined as ” heavy drinkers”. The distribution of drinking habits is a right skewed one, with more than 3000 people are “light drinker”, around 2500 people are “moderate drinker”, and around 1000 people are heavy drinker.
combine_tc_binary %>%
group_by(alcohol_use_cat,total_cholesterol) %>%
summarise(N = n()) %>%
mutate (Proportion = N/sum(N),
) %>%
plot_ly (x=~alcohol_use_cat, y= ~ Proportion * 100, type = "bar", color = ~total_cholesterol) %>%
layout(barmode = "stack",
title = "Distribution of High Total Cholersterol by Drinking Habits",
xaxis = list(title = "Drinking Habits"),
yaxis = list(title = "Percentage (%)")
)
According to the graph, the percentage of high cholesterol increase as people drink more. 30% of light drinkers have high cholesterol while 40% of heavy drinkers have high cholesterol level.
According to CDC, the cutoff of cotinine for smoker and nonsmoker is 10ng/ml. The serum cotinie level in smoker distribute like this :
combine_tc_binary =
combine_tc_binary %>%
mutate(smoking_status_bi = ifelse(cotinine > 10, "Smoker", "Nonsmoker"),
smoking_status_bi = factor(smoking_status_bi, levels = c("Nonsmoker","Smoker")))
combine_tc_binary_Smoker =
combine_tc_binary %>%
filter (smoking_status_bi == "Smoker") %>%
ggplot (aes(x= cotinine)) +
geom_density() +
labs (title = "Serum Cotinine Value in Smoker",
x="Cotinine")
combine_tc_binary_Nonsmoker =
combine_tc_binary %>%
filter (smoking_status_bi == "Nonsmoker") %>%
ggplot (aes(x= cotinine)) +
geom_density() +
labs (title = "Serum Cotinine Value in Nonsmoker",
x=" Cotinine")
grid.arrange(combine_tc_binary_Smoker, combine_tc_binary_Nonsmoker, ncol = 2)
The cotinine level is a left-skew line in smokers, and it is centered around 0.00 ng/ml in non-smokers.
combine_tc_binary %>%
group_by(smoking_status_bi,total_cholesterol) %>%
summarise(N = n()) %>%
mutate (Proportion = N/sum(N),
) %>%
plot_ly (x=~smoking_status_bi, y= ~ Proportion * 100, type = "bar", color = ~total_cholesterol) %>%
layout(barmode = "stack",
title = "Distribution of High Total Cholersterol by Smoker and Nonsmoker",
xaxis = list(title = "Smoking Status"),
yaxis = list(title = "Percentage (%)")
)
According to the graph, the proportions of total cholesterol level in smoker and non-smoker are almost the same.