[Audio] Thank you, Sir Christian. To continue: Statistics.
[Audio] I'll be reporting on Analyzing Distributions, which consists of Percentiles, Quartiles, the Empirical Rule, and Identifying Outliers.
[Audio] Distributions are very useful for interpreting and analyzing data. A distribution describes the overall variability of the observed values of a variable.
[Audio] Percentile - A value such that approximately p% of the observations have values less than the pth percentile; hence, approximately (100 - p)% of the observations have values greater than the pth percentile. The 50th percentile is the median.
[Audio] Quartiles - The 25th, 50th, and 75th percentiles, referred to as the first quartile, second quartile (median), and third quartile, respectively. The quartiles can be used to divide a data set into four parts, with each part containing approximately 25% of the data. Interquartile Range, or IQR - The difference between the third and first quartiles.
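As a rough illustration of the quartile and IQR definitions above, here is a minimal Python sketch. The data values are made up for the example, and `statistics.quantiles` with its default "exclusive" method is only one of several percentile conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical sample of 12 observations (invented for illustration).
data = [12, 15, 17, 19, 21, 22, 24, 27, 30, 33, 38, 45]

# With n=4, quantiles() returns the three cut points that split the
# data into four parts: the first, second, and third quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3, "IQR:", iqr)
```

The second quartile returned here matches `statistics.median(data)`, which agrees with the statement that the 50th percentile is the median.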
[Audio] z-Scores - A value computed by dividing the deviation about the mean (x_i - x̄) by the standard deviation s. A z-score is referred to as a standardized value and denotes the number of standard deviations that x_i is from the mean.
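The z-score definition can be sketched in a few lines of Python. The sample values are hypothetical, and the sketch assumes s is the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# Hypothetical sample (invented for illustration).
data = [12, 15, 17, 19, 21, 22, 24, 27, 30, 33, 38, 45]

mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)

# z-score of each observation: deviation about the mean divided by s.
z_scores = [(x - mean) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"x = {x:>3}  z = {z:+.2f}")
```

A positive z-score means the observation lies above the mean and a negative one means it lies below; the z-scores of a data set always average to zero.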
[Audio] Empirical Rule - A rule that gives the approximate percentage of data values that fall within 1, 2, or 3 standard deviations of the mean for data that exhibit a bell-shaped distribution: about 68%, 95%, and 99.7%, respectively.
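One way to see the empirical rule in action is to simulate bell-shaped data and count how much of it falls within 1, 2, and 3 standard deviations of the mean. This is only a sketch: the mean of 100, standard deviation of 15, sample size, and seed are arbitrary choices for the illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated bell-shaped data (parameters are arbitrary assumptions).
data = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(data)
s = statistics.stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * s for x in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares should land close to 68%, 95%, and 99.7%; for data that are not bell-shaped the rule does not apply.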
[Audio] Identifying Outliers. Outliers - Unusually large or unusually small data values.
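A common convention, assumed here rather than taken from the slides, is to flag an observation as a possible outlier when its z-score is below -3 or above +3. The data below are invented, with one deliberately extreme value.

```python
import statistics

# Hypothetical sample: 19 typical values and one extreme value.
data = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
        50, 52, 48, 50, 51, 49, 50, 52, 48, 200]

mean = statistics.mean(data)
s = statistics.stdev(data)

# Assumed rule of thumb: |z| > 3 marks a possible outlier.
THRESHOLD = 3
outliers = [x for x in data if abs((x - mean) / s) > THRESHOLD]

print("possible outliers:", outliers)
```

A flagged value is a candidate for closer inspection, not automatically an error. In very small samples no z-score can exceed 3 in absolute value, so this rule only works with enough observations.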
[Audio] Boxplots - A graphical summary of data based on the quartiles of a distribution.
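The numbers behind a boxplot can be computed without drawing anything. The sketch below assumes the widely used 1.5 x IQR convention for the limits beyond which points are drawn as outliers; the slides do not state which convention they use, and the data are hypothetical.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [12, 15, 17, 19, 21, 22, 24, 27, 30, 33, 38, 95]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: limits sit 1.5 IQRs beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are plotted individually as outliers.
outliers = [x for x in data if x < lower_limit or x > upper_limit]

# Whiskers extend to the most extreme points still inside the limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```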
[Audio] Measures of association between two variables.
[Audio] Scatter Charts - A graphical presentation of the relationship between two quantitative variables, also called a scatter plot. One variable is shown on the horizontal axis and the other on the vertical axis.
[Audio] Covariance - A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
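The sign behavior of covariance can be checked with a small calculation. This sketch uses the usual sample covariance formula (sum of products of deviations, divided by n - 1), which the slides do not spell out, and two made-up variables.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of (x_i - mean_x)(y_i - mean_y) over n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Hypothetical paired observations (invented for illustration).
x = [2, 4, 6, 8, 10]
y_rising = [30, 45, 50, 70, 80]   # tends to increase with x
y_falling = [80, 70, 50, 45, 30]  # tends to decrease with x

cov_pos = sample_covariance(x, y_rising)
cov_neg = sample_covariance(x, y_falling)

print("covariance, rising y:", cov_pos)
print("covariance, falling y:", cov_neg)
```

The size of a covariance depends on the units of the two variables, so it is hard to judge strength of association from it alone.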
[Audio] Correlation Coefficient - A standardized measure of linear association between two variables that takes on values between -1 and +1. Values near -1 indicate a strong negative linear relationship, values near +1 indicate a strong positive linear relationship, and values near zero indicate the lack of a linear relationship.
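A minimal sketch of the correlation coefficient, assuming the standard sample formula: the sample covariance divided by the product of the two sample standard deviations. The data and function name are invented for the example.

```python
import math

def sample_correlation(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical paired observations (invented for illustration).
x = [2, 4, 6, 8, 10]
r_pos = sample_correlation(x, [30, 45, 50, 70, 80])   # rising y
r_neg = sample_correlation(x, [80, 70, 50, 45, 30])   # falling y
r_line = sample_correlation(x, [2 * v + 1 for v in x])  # exact straight line

print(f"rising y: r = {r_pos:.3f}")
print(f"falling y: r = {r_neg:.3f}")
print(f"exact line: r = {r_line:.3f}")
```

Unlike covariance, the result does not depend on the units of measurement, which is what makes it a standardized measure.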
[Audio] Data Cleansing.
[Audio] Data Cleansing. The data in a data set are often said to be "dirty" and "raw" before they have been put into a form that is best suited for investigation, analysis, and modeling. Data preparation makes heavy use of descriptive statistics and data-visualization methods to gain an understanding of the data. Common tasks in data preparation include treating missing data, identifying erroneous data and outliers, and defining the appropriate way to represent variables.
[Audio] Missing Data. Legitimately Missing Data - Missing data that occur naturally. Illegitimately Missing Data - Missing data that do not occur naturally.
[Audio] Missing Completely at Random (MCAR) - The tendency for an observation to be missing a value of some variable is entirely random. Missing at Random (MAR) - The tendency for an observation to be missing a value of some variable is related to the value of some other variable(s) in the data. Missing Not at Random (MNAR) - The tendency for an observation to be missing a value of some variable is related to the missing value.
[Audio] Imputation - Systematic replacement of missing values with values that seem reasonable.
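As one simple example of imputation, missing values can be replaced with the median of the observed values. This is only a sketch: median imputation is one choice among many, the `ages` list is hypothetical, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical variable with two missing values, marked as None.
ages = [34, None, 29, 41, None, 38, 45]

# Compute the fill value from the observed (non-missing) entries only.
observed = [a for a in ages if a is not None]
fill_value = statistics.median(observed)

# Replace each missing entry with the fill value.
imputed = [fill_value if a is None else a for a in ages]

print("fill value:", fill_value)
print("imputed:", imputed)
```

Whether a single fill value is reasonable depends on why the data are missing; it is easiest to defend when values are missing completely at random.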
[Audio] Identification of Erroneous Outliers and Other Erroneous Values. Examining the variables in the data set by use of summary statistics, frequency distributions, bar charts and histograms, z-scores, scatter charts, correlation coefficients, and other tools can uncover data-quality issues and outliers. Closer examination of outliers and potential erroneous values may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis. A conservative approach is to create two data sets, one with and one without outliers and potentially erroneous values, and then construct a model on both data sets. If a model's implications depend on the inclusion or exclusion of outliers and erroneous values, then you should spend additional time to track down the cause of the outliers.
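The two-data-set approach can be sketched with a very small model. Here the "model" is assumed to be a least-squares straight line, the points are invented, and the last point is treated as the suspect observation purely for illustration.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

# Hypothetical data; the final point is the suspected outlier.
all_points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 40)]
clean_points = all_points[:-1]  # same data without the suspect point

slope_all, intercept_all = fit_line(all_points)
slope_clean, intercept_clean = fit_line(clean_points)

print(f"with suspect point:    slope = {slope_all:.2f}")
print(f"without suspect point: slope = {slope_clean:.2f}")
```

When the two fits disagree this much, the model's conclusions hinge on a single observation, which is the situation where tracking down the cause of the outlier is worth the extra time.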
[Audio] Variable Representation. Dimension Reduction - The process of removing variables from the analysis without losing crucial information.
[Audio] To summarize.
[Audio] Descriptive statistics is the branch of statistics that involves the collection, organization, presentation, and interpretation of data. It focuses on summarizing and describing the main characteristics and patterns within a dataset. It aims to provide a clear and concise understanding of the data, allowing researchers or analysts to gain insights, identify trends, and make informed decisions.
[Audio] This is Edwin Villa, together with Mr. Christian Bumatay. Thanks for listening!