[Audio] Good afternoon, dear administration, faculty and students of Don Mariano Marcos Memorial State University. Thank you for inviting me to give a lecture on the topic of Logistic Regression.
[Audio] In this lecture, I will be discussing what logistic regression is, the three types of logistic regression, how these models are calculated, and most importantly, how the results of the analysis can be interpreted.
[Audio] When we want to understand the relationship between one or more predictor variables and a continuous response variable, that is, the dependent variable, we often use linear regression. However, when the response variable is categorical, we can instead use logistic regression. So, what exactly is logistic regression? Logistic regression is a type of classification algorithm because it attempts to "classify" observations from a dataset into distinct categories. It is a method that we use to fit a regression model when the response variable is binary or categorical. Logistic regression sounds very similar to linear regression: it is essentially a transformation of linear regression into an equation whose output is bounded between 0 and 1. When to use logistic regression: suppose we want to classify patients into two groups on the basis of blood test results, body mass index, and blood pressure. The classes are diabetic and non-diabetic. This is a typical classification problem. We have a binary dependent variable such that 1 means diabetic and 0 means non-diabetic.
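To make the diabetic/non-diabetic example concrete, here is a minimal sketch in Python using scikit-learn. The patient data are made up for illustration; this code is not part of the lecture itself.

# A minimal sketch (hypothetical data) of classifying patients as
# diabetic (1) or non-diabetic (0) with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical predictors: blood test result, body mass index, blood pressure
X = np.array([
    [95, 22.1, 118],
    [160, 31.4, 140],
    [105, 24.0, 122],
    [150, 29.8, 138],
    [98, 23.5, 119],
    [170, 33.0, 145],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = diabetic, 0 = non-diabetic

model = LogisticRegression(max_iter=1000).fit(X, y)

new_patient = [[130, 27.0, 130]]
print(model.predict_proba(new_patient)[0, 1])  # estimated P(diabetic)
print(model.predict(new_patient))              # predicted class, 0 or 1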
[Audio] Here are a few examples of when we might use logistic regression. Notice that the response variable in each of these examples can only take on one of two values. Contrast this with linear regression, in which the response variable takes on a continuous value, so there are many more possible outcomes for the dependent variable.
[Audio] Logistic regression uses a method known as maximum likelihood estimation to find an equation of the following form: The first formula, a transformation of the log of odds, is also known as the logit function, and it is the basis of logistic regression. The symmetry attained via this transformation improves the interpretability of log odds: a negative value indicates that failure is more likely (probability below 0.5), while a positive value indicates higher chances of success (probability above 0.5). The second equation is the functional form of the logistic regression equation, called the sigmoid. This is the standard logistic regression equation that transforms a linear regression to give the probability of a positive outcome in terms of the various independent variables. Thus, when we fit a logistic regression model, we can use the following equation to calculate the probability that a given observation takes on a value of 1:
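Since the two equations appear only on the slide, here is a standard reconstruction of them in the usual notation (the slide's own symbols may differ):

\[
\text{Logit:}\quad \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
\]
\[
\text{Sigmoid:}\quad p = P(Y=1) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}
\]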
[Audio] In a binary logistic regression, the response variable can only belong to one of two categories; that is, the dependent variable is a dichotomous variable. Dichotomous variables are variables with only two values. For example: whether a customer buys or does not buy a particular product, whether a patient is diseased or not, whether a student passes or fails an examination, whether a candidate wins or loses an election, whether a basketball player in a team is selected to play or not, and so on. Multinomial logistic regression is used when you have a categorical dependent variable with more than two unordered levels with discrete outcomes. Examples include: predicting which major a college student will choose based on their grades and preferences; predicting which blood type a person has based on diagnostic tests; predicting which ice cream flavor a person will choose based on their preferences and characteristics; predicting which candidate a person will vote for based on demographic factors; and predicting which position an employee should be awarded based on his or her performance, length of service, and educational background, among others. Ordinal logistic regression models are a type of logistic regression in which the response variable can belong to one of three or more categories and there is a natural ordering among the categories. For example, suppose an academic registrar wants to use the predictor variables (1) general weighted average, (2) admission test score, and (3) interview score to predict the probability that a student will get into a university that can be categorized as "poor", "fair", "very good", or "excellent." Since there are more than two possible outcomes for the response variable (there are four classifications of school quality) and there is a natural ordering among the outcomes, the academic registrar would use an ordinal logistic regression model.
[Audio] Let us now take a look at what binary logistic regression does… In a binary logistic regression, the dependent variable should be dichotomous (as discussed earlier, dichotomous variables have only two categories, usually coded as 0 and 1). There should be one or more independent variables, or predictors, which can be continuous (interval/ratio) or categorical (ordinal/nominal). If categorical, they should be dummy or indicator coded.
Assumptions of a binary logistic regression.
Assumption #1: The Response Variable is Binary
Assumption #2: The Observations are Independent
Assumption #3: There is No Multicollinearity Among Explanatory Variables
Assumption #4: There are No Extreme Outliers
Assumption #5: There is a Linear Relationship Between Continuous Explanatory Variables and the Logit of the Response Variable
Assumption #6: The Sample Size is Sufficiently Large
Reference: The 6 Assumptions of Logistic Regression (With Examples) (statology.org)
[Audio] In a linear regression plot, the x-axis represents the independent variable and the y-axis represents the dependent variable. The goal of linear regression is to find the best-fitting straight line through the data points. This line is used to predict the value of a dependent variable based on the value of an independent variable. In a logistic regression plot, the x-axis represents the independent variable and the y-axis usually represents the probability of a particular outcome. The curve shows the probability that the dependent variable equals a "success" or "failure", based on the value of the independent variable. The key difference between these two is that linear regression is used when the dependent variable is continuous and numeric, and logistic regression is used when the dependent variable is binary. In the plot of a linear regression, a straight line is fitted to the data points, whereas in a logistic regression, a logistic (S-shaped) curve is fitted to the data points..
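The contrast can also be seen numerically in a short Python sketch (assumed data, not from the lecture): a fitted straight line can produce values outside the 0 to 1 range, while the fitted logistic curve cannot.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# One independent variable and a binary outcome (made-up values)
x = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

line = LinearRegression().fit(x, y)
curve = LogisticRegression().fit(x, y)

grid = np.array([[0.0], [3.0], [6.0], [9.0], [12.0]])
print(line.predict(grid))               # straight line: can go below 0 or above 1
print(curve.predict_proba(grid)[:, 1])  # S-shaped curve: always between 0 and 1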
[Audio] This tutorial explains how to perform logistic regression in Excel. In our research question, the dependent variable, draft, has only two categories: 0 for not drafted and 1 for drafted, where "drafted" is the event in which a player is selected to join a team. There are three predictor or independent variables: points, rebounds, and assists. The dependent variable is nominal while the independent variables are all continuous. Also, we will assume that all assumptions have been satisfied. Use the following steps to perform logistic regression in Excel for a dataset that shows whether or not college basketball players got drafted into the NBA (draft: 0 = no, 1 = yes) based on their average points, rebounds, and assists in the previous season.
[Audio] Since we have three explanatory variables in the model (points, rebounds, and assists), we will create cells for three regression coefficients plus one for the intercept. We will set each of these values to 0.001 initially and optimize them later.
[Audio] Next, we will create a few new columns that we will use to optimize these regression coefficients: the logit, elogit (e raised to the logit), probability, and log likelihood. We will create the logit column by using the formula enclosed in a box on the screen.
[Audio] Next, we will create values for elogit by using the indicated formula in cell G2..
[Audio] Next, we will create values for probability by using the formula in cell H2..
[Audio] Next, we will create values for log likelihood. The log likelihood of an observation is the natural logarithm of its predicted probability: ln(probability) when the observed response is 1, and ln(1 – probability) when it is 0.
[Audio] Lastly, we will find the sum of the log likelihoods, which is the number we will attempt to maximize to solve for the regression coefficients..
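The whole worksheet calculation can be sketched in Python (with hypothetical player data, since the actual spreadsheet values live on the slides). scipy's minimize plays the role Solver plays in Excel; because the likelihood below is written for draft = 1 directly, these coefficients give the probability of being drafted without any later sign reversal.

import numpy as np
from scipy.optimize import minimize

# Hypothetical player data: points, rebounds, assists, and draft (0 = no, 1 = yes)
X = np.array([[14, 5, 3], [22, 7, 5], [9, 4, 2], [25, 9, 6],
              [12, 6, 4], [20, 8, 5], [8, 3, 1], [18, 7, 4]], dtype=float)
y = np.array([0, 1, 0, 1, 1, 0, 0, 1])

def neg_sum_log_likelihood(beta):
    logit = np.clip(beta[0] + X @ beta[1:], -30, 30)  # the "logit" column (clipped for numeric safety)
    elogit = np.exp(logit)                            # the "elogit" column
    p = elogit / (1 + elogit)                         # the "probability" column
    ll = y * np.log(p) + (1 - y) * np.log(1 - p)      # the "log likelihood" column
    return -ll.sum()  # Solver maximizes the sum; here we minimize its negative

start = np.full(4, 0.001)  # the 0.001 starting values from the earlier slide
result = minimize(neg_sum_log_likelihood, start, method="BFGS")
print(result.x)  # intercept, then coefficients for points, rebounds, assists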
[Audio] If you have not already installed the Solver in Excel, use the following steps to do so: Click File. Click Options. Click Add-Ins. Next to Manage, select Excel Add-ins and click Go. In the window that pops up, check the box next to Solver Add-In, then click OK. Once the Solver is installed, go to the Analysis group on the Data tab and click Solver. Enter the following information: Set Objective: choose cell H14, which contains the sum of the log likelihoods. By Changing Variable Cells: choose the cell range B15:B18, which contains the regression coefficients. Make Unconstrained Variables Non-Negative: uncheck this box. Select a Solving Method: choose GRG Nonlinear. Then click Solve. The Solver automatically calculates the regression coefficient estimates.
[Audio] By default, this setup produces regression coefficients for the probability that draft = 0. However, in logistic regression we are typically interested in the probability that the response variable equals 1, so we can simply reverse the signs on each of the regression coefficients. The reversed coefficients can then be used to find the probability of success, that is, the event that the player is drafted (draft = 1).
How to perform a binary logistic regression in Excel.
[Audio] Let us now perform the binary logistic regression analysis using SPSS. We will use the same research question given previously but with different variables and data..
[Audio] Click the Analyze tab, then Regression, then Binary Logistic Regression:.
[Audio] In the new window that pops up, drag the binary response variable draft into the box labelled Dependent. Then drag the two predictor variables points and division into the box labelled Block 1 of 1. Leave the Method set to Enter. Then click OK..
[Audio] Once you click OK, the output of the logistic regression will appear. Here is how to interpret the output: Model Summary: The most useful metric in this table is the Nagelkerke R Square, which tells us the percentage of the variation in the response variable [drafted (1) or not drafted (0)] that can be explained by the predictor variables (points and division). In this case, points and division are able to explain 72.5% of the variability in draft. Classification Table: The most useful metric in this table is the Overall Percentage, which tells us the percentage of observations that the model was able to classify correctly. In this case, the logistic regression model was able to correctly predict the draft result of 85.7% of players. Variables in the Equation: This last table provides us with several useful metrics, including: Wald: The Wald test statistic for each predictor variable, which is used to determine whether or not each predictor variable is statistically significant. Sig: The p-value that corresponds to the Wald test statistic for each predictor variable. We see that the p-value for points is .039 and the p-value for division is .557. Exp(B): The odds ratio for each predictor variable. This tells us the factor by which the odds of a player getting drafted change with a one unit increase in a given predictor variable. For example, the odds of a player in division 2 getting drafted are just .339 of the odds of a player in division 1 getting drafted. Similarly, each additional point per game multiplies the odds of a player getting drafted by 1.319.
How to perform a binary logistic regression in SPSS.
[Audio] Further, we can now use the coefficients to predict the probability that a given player will get drafted, using the following formula: For example, the probability that a player who averages 20 points per game and plays in division 1 gets drafted can be calculated as: Since this probability is greater than 0.5, we would predict that this player would get drafted..
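In general form (the fitted coefficient values themselves are read from the Variables in the Equation table on the slide), the prediction formula is:

\[
P(\text{draft} = 1) = \frac{e^{\,b_0 + b_1(\text{points}) + b_2(\text{division})}}{1 + e^{\,b_0 + b_1(\text{points}) + b_2(\text{division})}}
\]

Substituting points = 20 and division = 1 gives the probability quoted above; any probability greater than 0.5 is predicted as drafted.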
[Audio] Lastly, we want to report the results of our logistic regression. Here is an example of how to do so. These results are all based on the Variables in the Equation table.
[Audio] To assess how well a logistic regression model fits a dataset, we can look at two metrics: sensitivity (the true positive rate) and specificity (the true negative rate). One way to visualize these two metrics is by creating a ROC curve, which stands for "receiver operating characteristic" curve. This is a plot that displays the sensitivity along the y-axis and (1 – specificity) along the x-axis.
[Audio] One way to quantify how well the logistic regression model does at classifying data is to calculate the AUC, which stands for "area under the curve." What is a good AUC score? The answer: there is no specific threshold for what is considered a good AUC score. Obviously, the higher the AUC score, the better the model is able to classify observations into classes. However, there is no magic number that determines whether an AUC score is good or bad. If we must label certain scores as good or bad, we can reference the following rule of thumb from Hosmer and Lemeshow in Applied Logistic Regression (p. 177): 0.5 suggests no discrimination, 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and above 0.9 is considered outstanding. A "good" AUC score also varies by industry. For example, in medical settings researchers often seek AUC scores above 0.95 because the cost of being wrong is so high. If we have a logistic regression model that predicts whether or not a patient will develop cancer, the price of being wrong (incorrectly telling a patient they do not have cancer when they do) is so high that we want a model that is correct nearly every time. Conversely, in other industries like marketing a lower AUC score may be acceptable. For example, if we have a model that predicts whether or not a customer will be a repeat customer, the price of being wrong is not life-altering, so a model with an AUC as low as 0.6 could still be useful.
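As a sketch (with hypothetical labels and predicted probabilities), both the ROC curve coordinates and the AUC can be computed in Python with scikit-learn:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                        # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3]  # model's P(Y = 1)

# fpr is 1 - specificity (the x-axis); tpr is sensitivity (the y-axis)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve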
[Audio] Multinomial logistic regression (often just called "multinomial regression") is used to predict a nominal dependent variable given one or more independent variables. It is sometimes considered an extension of binomial logistic regression that allows for a dependent variable with more than two categories. As with other types of regression, multinomial logistic regression can have nominal and/or continuous independent variables and can have interactions between independent variables to predict the dependent variable. For example, you could use multinomial logistic regression to understand which type of drink consumers prefer based on location and age (i.e., the dependent variable would be "type of drink", with four categories – Coffee, Soft Drink, Tea and Water – and your independent variables would be the nominal variable "location", assessed using three categories – Luzon, Visayas, and Mindanao – and the continuous variable "age", measured in years). Alternatively, you could use multinomial logistic regression to understand whether factors such as employment duration within the firm, total employment duration, qualifications and gender affect a person's job position (i.e., the dependent variable would be "job position", with three categories – junior management, middle management and senior management – and the independent variables would be the continuous variables "employment duration within the firm" and "total employment duration", both measured in years, the nominal variable "qualifications", with four categories – no degree, undergraduate degree, master's degree and PhD – and "gender", with two categories: male and female).
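Here is a sketch of the drink-preference example in Python, with made-up data; note that the nominal predictor "location" is dummy coded, with Luzon as the baseline category.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: age, location_Visayas, location_Mindanao (Luzon is the baseline)
X = np.array([[21, 0, 0], [35, 1, 0], [44, 0, 1], [29, 0, 0],
              [52, 1, 0], [38, 0, 1], [60, 0, 0], [25, 1, 0]], dtype=float)
y = np.array(["Coffee", "Tea", "Water", "Soft Drink",
              "Tea", "Coffee", "Water", "Soft Drink"])

# With the default lbfgs solver, scikit-learn fits a multinomial model
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[30, 0, 1]]))        # most likely drink for this person
print(model.predict_proba([[30, 0, 1]]))  # one probability per drink, ordered as model.classes_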
[Audio] The first equation calculates the probability of the dependent variable Y being equal to j, a particular category. The second equation calculates the probability of Y being equal to 0, which is the base or reference category. Within the framework of the multinomial model, a reference (control) category must be selected; ideally, we choose the category that corresponds to the "basic", "classic", or "normal" situation. The estimated coefficients are interpreted relative to this reference category.
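A standard reconstruction of the two equations, with category 0 as the reference and J other categories (the slide's own notation may differ), is:

\[
P(Y = j \mid x) = \frac{e^{\beta_{j0} + \beta_{j1}x_1 + \cdots + \beta_{jk}x_k}}{1 + \sum_{m=1}^{J} e^{\beta_{m0} + \beta_{m1}x_1 + \cdots + \beta_{mk}x_k}}, \qquad j = 1, \dots, J
\]
\[
P(Y = 0 \mid x) = \frac{1}{1 + \sum_{m=1}^{J} e^{\beta_{m0} + \beta_{m1}x_1 + \cdots + \beta_{mk}x_k}}
\]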
[Audio] Assumption #1: Your dependent variable should be measured at the nominal level. Examples of nominal variables include ethnicity (e.g., with three categories: Caucasian, African American and Hispanic), transport type (e.g., with four categories: bus, car, jeep and train), profession (e.g., with five groups: surgeon, doctor, nurse, dentist, therapist), and so forth. Multinomial logistic regression can also be used for ordinal variables, but you might consider running an ordinal logistic regression instead. Assumption #2: You have one or more independent variables that are continuous, ordinal, or nominal (including dichotomous variables). However, ordinal independent variables must be treated as being either continuous or categorical; they cannot be treated as ordinal variables when running a multinomial logistic regression. Examples of continuous variables include age (measured in years), revision time (measured in hours), income (measured in pesos), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), among other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much", to "It is OK", to "Yes, a lot"). Examples of nominal variables were provided earlier. Assumption #3: You should have independence of observations, and the dependent variable should have mutually exclusive and exhaustive categories. Assumption #4: There should be no multicollinearity. Multicollinearity occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which variable contributes to the explanation of the dependent variable, as well as technical issues in calculating a multinomial logistic regression. Determining whether there is multicollinearity is an important step in multinomial logistic regression. Assumption #5: There needs to be a linear relationship between any continuous independent variables and the logit transformation of the dependent variable. Assumption #6: There should be no outliers, high leverage values, or highly influential points.
[Audio] In the graph, the four categories of the dependent nominal variable are distinct: these are highlighted in blue, red, yellow, and green, in ascending order..
[Audio] We need to check the assumptions to ensure that the model is valid. The assumptions of ordinal logistic regression are as follows and should be tested in order. First, the dependent variable is ordered. Second, one or more of the independent variables are continuous, categorical, or ordinal. Third, there is no multicollinearity. To check whether multicollinearity exists, a Variance Inflation Factor (VIF) test should be performed; the general rule of thumb is that a VIF value greater than 10 indicates multicollinearity. Fourth, proportional odds: this assumption means that the relationship between each pair of outcome groups is the same. If the relationship between all pairs of groups is the same, then there is only one set of coefficients, which means that there is only one model. If this assumption is violated, different models are needed to describe the relationship between each pair of outcome groups. The test of parallel lines (also known as the test of proportional odds) tests whether the relationship between each pair of outcome groups is statistically the same. A p-value greater than 0.05 means we fail to reject the null hypothesis, suggesting that the odds are proportional and the assumption of parallel lines holds, which indicates a good fit for the ordinal logistic regression model. A sketch of the VIF check and of the ordinal model itself follows below.
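Two of these steps can be sketched in Python with statsmodels. The tiny dataset below is hypothetical, echoing the registrar example from earlier; the VIF helper and OrderedModel are real statsmodels tools.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical predictors and an ordered four-category outcome
df = pd.DataFrame({
    "gwa":       [88, 92, 75, 85, 90, 78, 95, 82],
    "admission": [72, 80, 65, 70, 85, 75, 88, 60],
})
quality = pd.Series(pd.Categorical(
    ["fair", "very good", "poor", "fair", "excellent",
     "poor", "excellent", "very good"],
    categories=["poor", "fair", "very good", "excellent"], ordered=True))

# VIF check for multicollinearity: values above 10 are a red flag
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))

# Proportional-odds (ordinal) logistic regression on the illustrative sample
model = OrderedModel(quality, df, distr="logit").fit(method="bfgs", disp=False)
print(model.summary())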
Results of a logistic regression analysis.
How to report and interpret results of a logistic regression analysis.
References.
Thank you….