Statistical Machine Learning For Data Science
Course Code BAD702 
CIE Marks 50
Teaching Hours/Week (L:T:P:S) 3:0:2:0
SEE Marks 50
Total Hours of Pedagogy 40 hours Theory + 8-10 Lab slots 
Total Marks 100
Credits 04 
Exam Hours 3
Examination nature (SEE) Theory/practical
MODULE-1
Exploratory Data Analysis: estimates of location and variability, exploring data distributions, exploring binary
and categorical data, exploring two or more variables.
Textbook : Chapter 1
MODULE-2
Data and Sampling Distributions: Random sampling and bias, selection bias, sampling distribution of a statistic,
bootstrap, confidence intervals, data distributions: normal, long-tailed, Student's t, binomial, chi-square, F
distribution, Poisson and related distributions.
Textbook : Chapter 2
MODULE-3
Statistical Experiments and Significance Testing: A/B testing, hypothesis testing, resampling, statistical
significance & p-values, t-tests, multiple testing, degrees of freedom.
Textbook : Chapter 3
MODULE-4
Multi-arm bandit algorithm, power and sample size, factor variables in regression, interpreting the regression
equation, Regression diagnostics, Polynomial and Spline Regression.
Textbook : Chapters 3 & 4
MODULE-5
Discriminant Analysis: Covariance Matrix, Fisher’s Linear Discriminant, Generalized Linear Models, Interpreting
the coefficients and odds ratios, Strategies for Imbalanced Data.
Textbook : Chapter 5
PRACTICAL COMPONENT OF IPCC
Experiments
1 A dataset contains the prices of houses in a city. Find the 25th and 75th percentiles and calculate the
interquartile range (IQR). How does the IQR help in understanding the price variability?
2 You are given a dataset with categorical variables about customer satisfaction levels (Low, Medium, High)
and whether customers made repeat purchases (Yes/No). Create visualizations such as bar plots or
stacked bar charts to explore the relationship between satisfaction level and repeat purchases. What can
you infer from the data?
3 A dataset contains information about car models, including the engine size (in liters), fuel efficiency (miles
per gallon), and car price. Use a pair plot or correlation matrix to explore the relationships between these
variables. Which variables seem to have the strongest relationships, and what might be the practical
significance of these findings?
4 You want to estimate the mean salary of software engineers in a country. You take 10 different random
samples, each containing 50 engineers, and calculate the sample mean for each. Plot the distribution of
these sample means. How does the Central Limit Theorem explain the shape of this sampling distribution,
even if the underlying salary distribution is skewed?
5 A researcher conducts an experiment with a sample of 20 participants to determine if a new drug affects
heart rate. The sample has a mean heart rate increase of 8 beats per minute and a standard deviation of 2
beats per minute. Perform a hypothesis test using the t-distribution to determine if the mean heart rate
increase is significantly different from zero at the 5% significance level.
6 A company is testing two versions of a webpage (A and B) to determine which version leads to more sales.
Version A was shown to 1,000 users and resulted in 120 sales. Version B was shown to 1,200 users and
resulted in 150 sales. Perform an A/B test to determine if there is a statistically significant difference in the
conversion rates between the two versions. Use a 5% significance level.
7 You are comparing the average daily sales between two stores. Store A has a mean daily sales value of
$1,000 with a standard deviation of $100 over 30 days, and Store B has a mean daily sales value of $950
with a standard deviation of $120 over 30 days. Conduct a two-sample t-test to determine if there is a
significant difference between the average sales of the two stores at the 5% significance level.
8 A company collects data on employees' salaries and records their education level as a categorical variable
with three levels: "High School", "Bachelor's", and "Master's". Fit a multiple linear regression model to
predict salary using education level (as a factor variable) and years of experience. Interpret the
coefficients for the education levels in the regression model.
9 You have data on housing prices and square footage and notice that the relationship between square
footage and price is nonlinear. Fit a spline regression model to allow the relationship between square
footage and price to change at 2,000 square feet. Explain how spline regression can capture different
behaviours of the relationship before and after 2,000 square feet.
10 A hospital is using a Poisson regression model (a type of GLM) to predict the number of emergency room
visits per week based on patient age and medical history. The model is given by:
log(λ) = 2.5 - 0.03*Age + 0.5*condition
where λ is the expected number of visits per week, Age is the patient's age, and condition is a binary
variable (1 if the patient has a chronic condition, 0 otherwise).
Interpret the coefficients of Age and condition.
What is the expected number of visits per week for a 60-year-old patient with a chronic condition?
How would the expected number of visits change if the patient did not have a chronic condition?
11 A bakery claims that its new cookie recipe is lower in calories compared to the old recipe, which had a
mean calorie count of 200. You sample 40 new cookies and find a mean of 190 calories with a standard
deviation of 15 calories. Perform a one-tailed t-test to determine if the new recipe has significantly fewer
calories at a 5% significance level.
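Experiments 5, 6, 7, 10 and 11 give complete summary statistics, so their arithmetic can be checked directly. The following is a minimal sketch using only the Python standard library, not a prescribed solution. The function names are illustrative. It assumes a pooled (equal-variance) t-test for Experiment 7 and a pooled two-proportion z-test for Experiment 6. The t critical values are copied from a standard t-table rather than computed.

```python
import math
from statistics import NormalDist

# Critical values copied from a standard t-table (assumption: not computed here).
T_CRIT_TWO_SIDED_DF19 = 2.093  # alpha = 0.05, two-tailed, df = 19
T_CRIT_TWO_SIDED_DF58 = 2.002  # alpha = 0.05, two-tailed, df = 58
T_CRIT_ONE_SIDED_DF39 = 1.685  # alpha = 0.05, one-tailed, df = 39


def one_sample_t(mean, sd, n, mu0=0.0):
    """One-sample t statistic computed from summary statistics."""
    return (mean - mu0) / (sd / math.sqrt(n))


def pooled_two_sample_t(m1, s1, n1, m2, s2, n2):
    """Equal-variance (pooled) two-sample t statistic."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and its two-sided p-value."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def poisson_rate(age, condition):
    """Expected weekly visits under the model stated in Experiment 10."""
    return math.exp(2.5 - 0.03 * age + 0.5 * condition)


# Experiment 5: one-sample, two-tailed t-test against a mean of zero (df = 19).
t5 = one_sample_t(mean=8, sd=2, n=20)
reject5 = abs(t5) > T_CRIT_TWO_SIDED_DF19

# Experiment 6: A/B test on conversion rates (120/1000 vs 150/1200).
z6, p6 = two_proportion_z(120, 1000, 150, 1200)
reject6 = p6 < 0.05

# Experiment 7: two-sample, two-tailed t-test (df = 30 + 30 - 2 = 58).
t7 = pooled_two_sample_t(1000, 100, 30, 950, 120, 30)
reject7 = abs(t7) > T_CRIT_TWO_SIDED_DF58

# Experiment 10: exponentiated coefficients are multiplicative rate ratios.
age_rate_ratio = math.exp(-0.03)       # factor per additional year of age
condition_rate_ratio = math.exp(0.5)   # factor for having a chronic condition
lam_with = poisson_rate(age=60, condition=1)
lam_without = poisson_rate(age=60, condition=0)

# Experiment 11: one-tailed (lower) t-test against a mean of 200 (df = 39).
t11 = one_sample_t(mean=190, sd=15, n=40, mu0=200)
reject11 = t11 < -T_CRIT_ONE_SIDED_DF39

print(f"Exp 5:  t = {t5:.3f}, reject H0: {reject5}")
print(f"Exp 6:  z = {z6:.3f}, p = {p6:.3f}, reject H0: {reject6}")
print(f"Exp 7:  t = {t7:.3f}, reject H0: {reject7}")
print(f"Exp 10: lambda = {lam_with:.2f} (condition), {lam_without:.2f} (no condition)")
print(f"Exp 11: t = {t11:.3f}, reject H0: {reject11}")
```

With the figures stated in the experiments, the null hypothesis is rejected in Experiments 5 and 11 and not rejected in Experiments 6 and 7. The Poisson model gives about 3.32 expected visits per week for the 60-year-old patient with a chronic condition and about 2.01 without it.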
Suggested Learning Resources:
Books
1. Peter Bruce, Andrew Bruce and Peter Gedeck, “Practical Statistics for Data Scientists”, 2nd
edition, O’Reilly Media, 2020.
