[Audio] Dr. Nancy L. Mati, Department of Mathematics, College of Science, Tarlac State University. January 9, 2024. Good afternoon, dear administration, faculty, and students of Don Mariano Marcos Memorial State University. Thank you for inviting me to give you a lecture on the topic of logistic regression.
[Audio] Contents: Introduction to Logistic Regression, Binary Logistic Regression, Multinomial Logistic Regression, Ordinal Logistic Regression, References. In this lecture, I will discuss what logistic regression is, the three types of logistic regression, how these are calculated, and, most importantly, how the results of the analysis can be interpreted.
[Audio] Introduction to logistic regression. When we want to understand the relationship between one or more predictor variables and a continuous response variable, that is, the dependent variable, we often use linear regression. However, when the response variable is categorical, we can instead use logistic regression. So what exactly is logistic regression? Logistic regression is a type of classification algorithm because it attempts to "classify" observations from a dataset into distinct categories. It is a method that we use to fit a regression model when the response variable is binary or categorical. Logistic regression sounds very similar to linear regression; it is essentially a transformation of the linear regression into an equation with limiting values of 0 and 1 (binary). When to use logistic regression: suppose we want to classify patients into two groups on the basis of blood test results, body mass index, and blood pressure. The classes are diabetic and non-diabetic. This is a typical classification problem: we have a binary dependent variable such that 1 means diabetic and 0 means non-diabetic. [Slide diagram: blood test results, body mass index, and blood pressure feed into an estimated probability; the classification threshold of 0.50 separates Diabetic (> 0.50 = 1) from Non-Diabetic (< 0.50 = 0).]
[Audio] Here are a few examples of when we might use logistic regression: • We want to use credit score and bank balance to predict whether or not a given customer will default on a loan. (Response variable = "Default" or "No default") • We want to use average rebounds per game and average points per game to predict whether or not a given basketball player will get drafted into the NBA. (Response variable = "Drafted" or "Not drafted") • We want to use square footage and number of bathrooms to predict whether or not a house in a certain city will be listed at a selling price of P5M or more. (Response variable = "Yes" or "No") Notice that the response variable in each of these examples can only take on one of two values. Contrast this with linear regression, in which the response variable takes on some continuous value, so there are many more possible outcomes for the dependent variable.
[Audio] The logistic regression formula. Logistic regression uses a method known as maximum likelihood estimation to find an equation relating the predictors to the probability of the outcome. The first formula, a transformation of the log of odds, is known as the logit function, and it is the basis of logistic regression. The symmetry attained via this transformation improves the interpretability of the log odds: a negative value indicates higher odds of failure, and a positive value indicates higher odds of success. The second equation is the functional form of the logistic regression equation, called the sigmoid. This is the standard logistic regression equation that transforms a linear regression so that it gives the probability of a positive outcome in terms of the various predictor variables. Thus, when we fit a logistic regression model, we can use the sigmoid to calculate the probability that a given observation takes on a value of 1.
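Both equations appear only on the slide; here is a standard reconstruction of what is being described, written in LaTeX, with p the probability that the observation takes on a value of 1 and x_1, ..., x_k the predictors:

```latex
% Logit (log-odds) form -- the basis of logistic regression:
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k

% Sigmoid (functional) form -- solving the logit equation for p:
p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}
                    {1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}
```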
[Audio] 3 types of logistic regression. Binary logistic regression: the response variable can only belong to one of two categories. Multinomial logistic regression: the response variable can belong to one of three or more categories and there is no natural ordering among the categories. Ordinal logistic regression: the response variable can belong to one of three or more categories and there is a natural ordering among the categories. In a binary logistic regression, the response variable can only belong to one of two categories; that is, the dependent variable is a dichotomous variable. Dichotomous variables are variables with only two values. For example: whether a customer buys or does not buy a particular product, whether a patient is diseased or not, whether a student passes or fails an examination, whether a candidate wins or loses an election, whether a basketball player in a team is selected to play or not, and so on. Multinomial logistic regression is used when you have a categorical dependent variable with more than two unordered levels with discrete outcomes. Examples include: predicting which major a college student will choose based on their grades and preferences; predicting which blood type a person has based on diagnostic tests; predicting which ice cream flavor a person will choose based on their preferences and characteristics; predicting which candidate a person will vote for based on demographic factors; and predicting which position an employee should be awarded based on his or her performance, length of service, and educational background, among other factors. Ordinal logistic regression models are a type of logistic regression in which the response variable can belong to one of three or more categories and there is a natural ordering among the categories. For example, an academic registrar wants to use the predictor variables (1) general weighted average, (2) admission test score, and (3) score on the interview to predict the probability that a student will get into a university that can be categorized as "poor," "fair," "very good," or "excellent." Since there are more than two possible outcomes for the response variable (there are four classifications of school quality) and there is a natural ordering among the outcomes, the academic registrar would use an ordinal logistic regression model.
[Audio] Binary logistic regression. The dependent variable, or outcome, should be dichotomous. The independent variables, or predictors, can be interval level or categorical. Let us now take a look at what binary logistic regression does. In a binary logistic regression, the dependent variable should be dichotomous (as discussed earlier, dichotomous variables have only two categories, usually coded as 0 and 1). There must be one or more independent variables, and these can be continuous (interval/ratio) or categorical (ordinal/nominal). If categorical, they should be dummy or indicator coded.
[Audio] Assumptions of a binary logistic regression Assumption #1: The Response Variable is Binary Assumption #2: The Observations are Independent Assumption #3: There is No Multicollinearity Among Explanatory Variables Assumption #4: There are No Extreme Outliers Assumption #5: There is a Linear Relationship Between Continuous Explanatory Variables and the Logit of the Response Variable Assumption #6: The Sample Size is Sufficiently Large Reference: The 6 Assumptions of Logistic Regression (With Examples) (statology.org).
[Audio] Assumptions of a logistic regression vs. linear regression In contrast to linear regression, logistic regression does not require: • A linear relationship between the explanatory variable(s) and the response variable. • The residuals of the model to be normally distributed. • The residuals to have constant variance, also known as homoscedasticity. In a linear regression plot, the x-axis represents the independent variable and the y-axis represents the dependent variable. The goal of linear regression is to find the best-fitting straight line through the data points. This line is used to predict the value of a dependent variable based on the value of an independent variable. In a logistic regression plot, the x-axis represents the independent variable and the y-axis usually represents the probability of a particular outcome. The curve shows the probability that the dependent variable equals a "success" or "failure", based on the value of the independent variable. The key difference between these two is that linear regression is used when the dependent variable is continuous and numeric, and logistic regression is used when the dependent variable is binary. In the plot of a linear regression, a straight line is fitted to the data points, whereas in a logistic regression, a logistic (S-shaped) curve is fitted to the data points..
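To illustrate the contrast just described, here is a small matplotlib sketch on made-up data; the variables and coefficients are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data only: x is the predictor, y_cont a continuous response,
# y_bin a binary (0/1) response.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_cont = 2 * x + rng.normal(0, 2, size=x.size)
y_bin = (x + rng.normal(0, 1.5, size=x.size) > 5).astype(int)

slope, intercept = np.polyfit(x, y_cont, 1)   # best-fitting straight line
s_curve = 1 / (1 + np.exp(-(x - 5)))          # hand-picked sigmoid for illustration

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y_cont); ax1.plot(x, slope * x + intercept, "r")
ax1.set_title("Linear regression: straight-line fit")
ax2.scatter(x, y_bin); ax2.plot(x, s_curve, "r")
ax2.set_title("Logistic regression: S-shaped curve")
plt.show()
```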
[Audio] How to perform a binary logistic regression in Excel. Research Question: Will a college basketball player get drafted or not based on his average points, rebounds, and assists in the previous season? Step 1: Input the data. This tutorial explains how to perform logistic regression in Excel for a dataset that shows whether or not college basketball players got drafted into the NBA (draft: 0 = no, 1 = yes) based on their average points, rebounds, and assists in the previous season. The dependent variable, draft, has only two categories: 0 for not drafted and 1 for drafted; being drafted means the player is selected by a team. There are three predictor, or independent, variables: points, rebounds, and assists. The dependent variable is nominal while the independent variables are all continuous. We will also assume that all assumptions have been satisfied.
[Audio] Step 2: Enter cells for regression coefficients. Since we have three explanatory variables in the model (points, rebounds, and assists), we will create cells for three regression coefficients plus one for the intercept. We will set the value of each of these to 0.001 as a starting point; we will optimize them later.
[Audio] Step 3: Create values for the logit. Next, we will create a few new columns that we will use to optimize for these regression coefficients: the logit, elogit, probability, and log likelihood. We will create the logit column by using the formula enclosed in a box on the screen.
[Audio] Step 4: Create values for elogit. Next, we will create values for elogit by using the formula indicated in cell G2.
[Audio] Step 5: Create values for probability. Next, we will create values for probability by using the formula in cell H2..
[Audio] Step 6: Create values for log likelihood. Next, we will create values for log likelihood. The log likelihood for each observation is the natural logarithm of the predicted probability of that observation's actual outcome (ln p when draft = 1, ln(1 − p) when draft = 0).
[Audio] Step 7: Find the sum of the log likelihoods. Lastly, we will find the sum of the log likelihoods, which is the number we will attempt to maximize to solve for the regression coefficients..
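The exact cell formulas appear only on the slides, but the standard formulas behind these columns (Steps 3 through 7), written in LaTeX with y_i as the draft indicator, are:

```latex
% Columns built in Steps 3-7:
\text{logit}_i = b_0 + b_1\,\text{points}_i + b_2\,\text{rebounds}_i + b_3\,\text{assists}_i
\qquad \text{elogit}_i = e^{\text{logit}_i}

p_i = \frac{\text{elogit}_i}{1 + \text{elogit}_i}
\qquad \ell_i = y_i \ln p_i + (1 - y_i)\ln(1 - p_i)
\qquad \ell = \sum_i \ell_i
```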
[Audio] Step 8: Use the Solver to solve for the regression coefficients. If you have not already installed the Solver in Excel, use the following steps to do so: Click File, then Options, then Add-Ins. Next to Manage, select Excel Add-ins and click Go. In the window that pops up, check the box next to Solver Add-In, then click OK. Once the Solver is installed, go to the Analysis group on the Data tab and click Solver. Enter the following information: Set Objective: choose cell H14, which contains the sum of the log likelihoods. By Changing Variable Cells: choose the cell range B15:B18, which contains the regression coefficients. Make Unconstrained Variables Non-Negative: uncheck this box. Select a Solving Method: choose GRG Nonlinear. Then click Solve. The Solver automatically calculates the regression coefficient estimates.
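What the Solver does here is ordinary maximum likelihood estimation. Below is a minimal Python sketch of the same computation using scipy; the data are simulated stand-ins, since the actual worksheet values appear only on the slides:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated stand-in data for the worksheet (the real points, rebounds,
# assists, and draft values appear only on the slides).
rng = np.random.default_rng(1)
n = 40
X = np.column_stack([
    rng.normal(15, 5, n),    # average points per game
    rng.normal(5, 2, n),     # average rebounds per game
    rng.normal(3, 1.5, n),   # average assists per game
])
true_logit = -4 + 0.25 * X[:, 0] + 0.1 * X[:, 1] + 0.2 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)  # draft: 0/1

def neg_sum_log_likelihood(beta):
    """Negative of Step 7's sum of log likelihoods, so minimizing maximizes it."""
    logit = beta[0] + X @ beta[1:]                 # Step 3: logit
    p = 1 / (1 + np.exp(-logit))                   # Steps 4-5: elogit -> probability
    p = np.clip(p, 1e-12, 1 - 1e-12)               # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # Step 6: log likelihood

# Step 2's starting values of 0.001, with BFGS playing the Solver's role
result = minimize(neg_sum_log_likelihood, x0=np.full(4, 0.001), method="BFGS")
print("Intercept, points, rebounds, assists:", result.x)
```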
[Audio] In the research question, we are interested in the probability that the response variable = 1 (the event that the player is drafted). By default, the regression coefficients can be used to find the probability that draft = 0. However, in logistic regression we are typically interested in the probability that the response variable = 1, so we can simply reverse the signs on each of the regression coefficients. These sign-reversed coefficients can now be used to find the probability of success, that is, the probability that draft = 1.
[Audio] For example, suppose a player averages 14 points per game, 4 rebounds per game, and 5 assists per game. The probability that this player will get drafted into the NBA can be calculated as: P(draft = 1) = e^(3.681193 + 0.11283(14) − 0.395671(4) − 0.679537(5)) / (1 + e^(3.681193 + 0.11283(14) − 0.395671(4) − 0.679537(5))) = 0.57. Result: Since this probability (0.57) is greater than the 0.50 cut-off value, we predict that this player would get drafted into the NBA.
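As a quick arithmetic check of that calculation (a sketch only, using the sign-reversed coefficients shown above):

```python
import math

# The sign-reversed coefficients read off the Solver output above
b0, b_points, b_rebounds, b_assists = 3.681193, 0.11283, -0.395671, -0.679537

logit = b0 + b_points * 14 + b_rebounds * 4 + b_assists * 5
p = math.exp(logit) / (1 + math.exp(logit))
print(round(p, 2))  # 0.57, which is above the 0.50 cut-off
```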
[Audio] How to perform a binary logistic regression in SPSS Research Question: Will a college basketball player get drafted (1) or not (0) based on his average points per game and division level? Step 1: Input the data. Let us now perform the binary logistic regression analysis using SPSS. We will use the same research question given previously but with different variables and data..
[Audio] Step 2: Perform logistic regression. Click the Analyze tab, then Regression, then Binary Logistic Regression.
[Audio] In the new window that pops up, drag the binary response variable draft into the box labelled Dependent. Then drag the two predictor variables points and division into the box labelled Block 1 of 1. Leave the Method set to Enter. Then click OK..
[Audio] Step 3. Interpret the output. Once you click OK, the output of the logistic regression will appear. Here is how to interpret it: Model Summary: The most useful metric in this table is the Nagelkerke R Square, which tells us the percentage of the variation in the response variable [drafted (1) or not drafted (0)] that can be explained by the predictor variables (points and division). In this case, points and division are able to explain 72.5% of the variability in draft. Classification Table: The most useful metric in this table is the Overall Percentage, which tells us the percentage of observations that the model was able to classify correctly. In this case, the logistic regression model was able to correctly predict the draft result of 85.7% of players. Variables in the Equation: This last table provides us with several useful metrics, including: Wald: The Wald test statistic for each predictor variable, which is used to determine whether or not each predictor variable is statistically significant. Sig: The p-value that corresponds to the Wald test statistic for each predictor variable. We see that the p-value for points is .039 and the p-value for division is .557. Exp(B): The odds ratio for each predictor variable. This tells us the change in the odds of a player getting drafted associated with a one-unit increase in a given predictor variable. For example, the odds of a player in division 2 getting drafted are just .339 of the odds of a player in division 1 getting drafted. Similarly, each additional point per game multiplies the odds of a player getting drafted by 1.319.
[Audio] Step 3. Interpret the output: Variables in the Equation. Points: The coefficient (B) is 0.277, which means that for each additional point, the log-odds of getting drafted increase by 0.277. The p-value (Sig.) is 0.039, which is less than 0.05, indicating that the effect of points is statistically significant at the 5% level. The odds ratio (Exp(B)) is 1.319, suggesting that for each additional point, the odds of getting drafted increase by about 32%. Division: The coefficient (B) is -1.082, implying that a one-unit increase in division (moving from division 1 to division 2) decreases the log-odds of getting drafted by 1.082. However, the p-value (Sig.) is 0.557, which is greater than 0.05, indicating that the effect of division is not statistically significant at the 5% level. The odds ratio (Exp(B)) is 0.339, suggesting that moving from division 1 to division 2 decreases the odds of getting drafted by about 66%, although this effect is not statistically significant. Constant: The constant (B) is -3.152, which is the log-odds of getting drafted when all predictors are zero. However, the p-value (Sig.) is 0.355, which is greater than 0.05, indicating that the constant is not statistically significant at the 5% level.
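SPSS is the tool demonstrated here, but the same model can be reproduced in code. Below is a minimal sketch using Python's statsmodels, assuming the data sit in a hypothetical players.csv with columns draft (0/1), points, and division; the file and column names are illustrative, not from the lecture:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Assumed layout: one row per player with columns draft (0/1), points, division
df = pd.read_csv("players.csv")  # hypothetical file name

X = sm.add_constant(df[["points", "division"]])  # adds the intercept term
model = sm.Logit(df["draft"], X).fit()

print(model.summary())       # coefficients (B), z statistics, p-values (Sig.)
print(np.exp(model.params))  # Exp(B): the odds ratios SPSS reports
```

Note that statsmodels reports z statistics rather than SPSS's Wald chi-square; the Wald statistic is the square of z, so the significance conclusions match.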
[Audio] Probability (P) of an outcome using the coefficients. We can now use the coefficients to predict the probability that a given player will get drafted, using the following formula: P = e^(−3.152 + 0.277(points) − 1.082(division)) / (1 + e^(−3.152 + 0.277(points) − 1.082(division))). For example, if points = 20 and division = 1, then P = e^(−3.152 + 0.277(20) − 1.082(1)) / (1 + e^(−3.152 + 0.277(20) − 1.082(1))) = 0.787. Since this probability is greater than 0.5, we would predict that this player would get drafted.
[Audio] Step 4. Report the results. Lastly, we want to report the results of our logistic regression. Here is an example of how to do so: Logistic regression was performed to determine how points per game and division level affect a basketball player's probability of getting drafted. A total of 14 players were used in the analysis. The model explained 72.5% of the variation in the draft result and correctly classified 85.7% of cases. The odds of a player in division 2 getting drafted were just .339 of the odds of a player in division 1 getting drafted. Each additional point per game was associated with a 1.319-fold increase in the odds of a player getting drafted. These results were all based on the Variables in the Equation table.
[Audio] ROC curve. To assess how well a logistic regression model classifies a dataset, we can look at two measures of accuracy: Sensitivity: the probability that the model predicts a positive outcome for an observation when indeed the outcome is positive. This is also called the "true positive rate." Specificity: the probability that the model predicts a negative outcome for an observation when indeed the outcome is negative. This is also called the "true negative rate." One way to visualize these two metrics is by creating a receiver operating characteristic curve, or ROC curve: a graphical plot that illustrates the performance of a binary classifier model (it can be used for multi-class classification as well) at varying threshold values. The ROC curve displays the sensitivity along the y-axis and (1 − specificity) along the x-axis.
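To make this construction concrete, here is a small sketch using scikit-learn; the labels and predicted probabilities are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative true labels and model-predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # fpr = 1 - specificity
print("AUC:", roc_auc_score(y_true, y_prob))

plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], "k--", label="random guessing (AUC = 0.5)")
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend(); plt.show()
```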
[Audio] Area under the curve. AUC stands for "Area Under the ROC Curve." It measures the entire two-dimensional area underneath the ROC curve, which plots the true positive rate against the false positive rate at different classification thresholds, so AUC provides an aggregate measure of performance across all possible classification thresholds. The value of AUC ranges from 0 to 1. A model with an AUC of 1 is able to perfectly classify observations into classes, while a model with an AUC of 0.5 does no better than random guessing. What is a good AUC score? There is no specific threshold for what is considered a good AUC score; obviously, the higher the AUC, the better the model classifies observations. If we must label certain scores as good or bad, we can reference the following rule of thumb from Hosmer and Lemeshow in Applied Logistic Regression (p. 177): • 0.5 = No discrimination • 0.5 to 0.7 = Poor discrimination • 0.7 to 0.8 = Acceptable discrimination • 0.8 to 0.9 = Excellent discrimination • More than 0.9 = Outstanding discrimination. By these standards, a model with an AUC score below 0.7 would be considered poor and anything higher acceptable or better. It is important to keep in mind, however, that what counts as a "good" AUC score varies by industry. In medical settings, researchers often seek AUC scores above 0.95 because the cost of being wrong is so high: if a logistic regression model predicts whether or not a patient will develop cancer, the price of incorrectly telling a patient they do not have cancer when they do is so high that we want a model that is correct nearly every time. Conversely, in other industries like marketing, a lower AUC score may be acceptable: if a model predicts whether or not a customer will be a repeat customer, the price of being wrong is not life-altering, so a model with an AUC as low as 0.6 could still be useful.
[Audio] Multinomial logistic regression (often just called "multinomial regression") is used to predict a nominal dependent variable given one or more independent variables. It is sometimes considered an extension of binomial logistic regression to allow for a dependent variable with more than two categories. As with other types of regression, multinomial logistic regression can have nominal and/or continuous independent variables and can have interactions between independent variables to predict the dependent variable. For example, you could use multinomial logistic regression to understand which type of drink consumers prefer based on location and age (the dependent variable would be "type of drink," with four categories: Coffee, Soft Drink, Tea, and Water; the independent variables would be the nominal variable "location," assessed using three categories: Luzon, Visayas, and Mindanao, and the continuous variable "age," measured in years). Alternately, you could use multinomial logistic regression to understand whether factors such as employment duration within the firm, total employment duration, qualifications, and gender affect a person's job position (the dependent variable would be "job position," with three categories: junior management, middle management, and senior management; the independent variables would be the continuous variables "employment duration within the firm" and "total employment duration," both measured in years, the nominal variable "qualifications," with four categories: no degree, undergraduate degree, master's degree, and PhD, and the nominal variable "gender," with two categories: male and female).
[Audio] Multinomial logistic regression model. The first equation calculates the probability of the dependent variable Y being equal to j, a particular category. The second equation calculates the probability of Y being equal to 0, the base or reference category. Within the framework of the multinomial model, a control category must be selected; ideally, we choose the one that corresponds to the "basic," "classic," or "normal" situation. The estimated coefficients are interpreted relative to this control category.
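The two equations described here appear only on the slide; a standard reconstruction in LaTeX, writing category 0 as the base (control) category and J for the number of non-base categories, is:

```latex
% Probability of a non-base category j (j = 1, ..., J):
P(Y = j) = \frac{e^{\beta_{j0} + \beta_{j1} x_1 + \dots + \beta_{jk} x_k}}
                {1 + \sum_{m=1}^{J} e^{\beta_{m0} + \beta_{m1} x_1 + \dots + \beta_{mk} x_k}}

% Probability of the base (reference) category 0:
P(Y = 0) = \frac{1}{1 + \sum_{m=1}^{J} e^{\beta_{m0} + \beta_{m1} x_1 + \dots + \beta_{mk} x_k}}
```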
[Audio] Assumptions of a multinomial logistic regression. Assumption #1: Your dependent variable should be measured at the nominal level. Examples of nominal variables include ethnicity (e.g., with three categories: Caucasian, African American, and Hispanic), transport type (e.g., with four categories: bus, car, jeep, and train), profession (e.g., with five groups: surgeon, doctor, nurse, dentist, therapist), and so forth. Multinomial logistic regression can also be used for ordinal variables, but you might consider running an ordinal logistic regression instead. Assumption #2: You have one or more independent variables that are continuous, ordinal, or nominal (including dichotomous variables). However, ordinal independent variables must be treated as being either continuous or categorical; they cannot be treated as ordinal variables when running a multinomial logistic regression. Examples of continuous variables include age (measured in years), revision time (measured in hours), income (measured in pesos), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" to "strongly disagree"), among other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "It is OK" to "Yes, a lot"). Assumption #3: You should have independence of observations, and the dependent variable should have mutually exclusive and exhaustive categories. Assumption #4: There should be no multicollinearity. Multicollinearity occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which variable contributes to the explanation of the dependent variable, as well as technical issues in calculating a multinomial logistic regression, so determining whether there is multicollinearity is an important step. Assumption #5: There needs to be a linear relationship between any continuous independent variables and the logit transformation of the dependent variable. Assumption #6: There should be no outliers, high leverage values, or highly influential points.
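As an illustration of the drink-preference example given earlier, here is a hedged statsmodels sketch of a multinomial fit; the file name drinks.csv and the column names are assumptions, not the lecture's data:

```python
import pandas as pd
import statsmodels.api as sm

# Assumed layout for the drink-preference example: columns drink, location, age
df = pd.read_csv("drinks.csv")  # hypothetical file name

# Dummy-code the nominal predictor (Assumption #2 allows nominal IVs)
df = pd.get_dummies(df, columns=["location"], drop_first=True, dtype=float)

predictors = [c for c in df.columns if c.startswith("location_")] + ["age"]
X = sm.add_constant(df[predictors])
y = pd.Categorical(df["drink"]).codes  # integer-coded categories; first = base

# MNLogit estimates one set of coefficients per non-base category
model = sm.MNLogit(y, X).fit()
print(model.summary())
```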
[Audio] Ordinal logistic regression. Ordinal logistic regression is a statistical analysis method that can be used to model the relationship between an ordinal response variable and one or more explanatory variables. An ordinal variable is a categorical variable for which there is a clear ordering of the category levels. Examples of suitable variables include: opinion polls (agree, neutral, disagree), socioeconomic status (low, medium, high), scores on a test (excellent, average, poor), and product sizes ordered (large, medium, small). In the graph, the four categories of the ordinal dependent variable are distinct: these are highlighted in blue, red, yellow, and green, in ascending order.
[Audio] Assumptions of an ordinal logistic regression. We need to check the assumptions to ensure that the model is valid. The assumptions of ordinal logistic regression are as follows and should be tested in order. First, the dependent variable is ordered. Second, one or more of the independent variables are continuous, categorical, or ordinal. Third, there is no multicollinearity. To check whether multicollinearity exists, a Variance Inflation Factor (VIF) test should be performed; the general rule of thumb is that if a VIF value is greater than 10, then there is multicollinearity. Fourth, proportional odds. This assumption means that the relationship between each pair of outcome groups has to be the same. If the relationship between all pairs of groups is the same, then there is only one set of coefficients, which means there is only one model. If this assumption is violated, different models are needed to describe the relationship between each pair of outcome groups. The test of parallel lines (also known as the test of proportional odds) tests the assumption that the relationship between each pair of outcome groups is statistically the same. A p-value greater than 0.05 means we fail to reject the null hypothesis, suggesting that the odds are proportional and the assumption of parallel lines holds, indicating a good fit for the ordinal logistic regression model.
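For readers working in code rather than SPSS, here is a hedged sketch of an ordinal (proportional-odds) fit using statsmodels' OrderedModel, available in statsmodels 0.12 and later; the dataset and column names are assumptions based on the registrar example given earlier:

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Assumed layout for the registrar example: rating plus two continuous predictors
df = pd.read_csv("admissions.csv")  # hypothetical file name
df["rating"] = pd.Categorical(
    df["rating"],
    categories=["poor", "fair", "very good", "excellent"],
    ordered=True,
)

# Proportional-odds (cumulative logit) model: one set of slopes, ordered cut-points
model = OrderedModel(df["rating"], df[["gwa", "admission_score"]], distr="logit")
result = model.fit(method="bfgs")
print(result.summary())
```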
[Audio] Results of a logistic regression analysis. In a logistic regression analysis in SPSS, you will find several key results: Model Summary: This includes the Nagelkerke R Square, which tells us the percentage of the variation in the response variable that can be explained by the predictor variables. Classification Table: This table provides the Overall Percentage, which tells us the percentage of observations that the model was able to classify correctly. Variables in the Equation: This table provides several useful metrics: Wald: The Wald test statistic for each predictor variable, which is used to determine whether or not each predictor variable is statistically significant. Sig: The p-value that corresponds to the Wald test statistic for each predictor variable. Exp(B): The odds ratio for each predictor variable. This tells us the change in the odds of an event occurring associated with a one-unit increase in a given predictor variable. Coefficients (B): These can be used to predict the probability that a given event will occur, using the logistic regression formula.
[Audio] How to report and interpret results of a logistic regression analysis. [Slide recap of the items above: Model Summary; Classification Table; Variables in the Equation: Wald, Sig, Exp(B), Coefficients (B).]
[Audio] References. Dr. Nancy L. Mati, Tarlac State University, College of Science, Department of Mathematics. January 9, 2024.
[Audio] Thank you.