1.1 What is Pattern Recognition? In pattern recognition, we assign labels to patterns. In Figure 1.1, there are patterns of Class X and Class O. Pattern P is a new sample which has to be assigned either to Class X or to Class O. In a system built to classify humans into tall, medium and short, the abstractions learnt from examples facilitate assigning one of these class labels ("tall", "medium" or "short") to a newly encountered human. Here, the class labels are semantic; they convey some meaning. In the case of clustering, we group a collection of unlabelled patterns; here, the labels assigned to each group of patterns are syntactic, or simply the cluster identity. It is possible to directly use a classification rule without generating any abstraction. In such a case, the notion of proximity/similarity (or distance) is used to classify patterns. Such a similarity function is computed based on the representation of the patterns.
[Audio] Humans are represented as vectors of feature values, and the features used to represent them are crucial. For instance, considering the data in Table 1.1, humans are categorized into two groups, "tall" and "short", using the feature "weight". However, this approach is limited by the weak correlation between weight and the class labels; a more appropriate feature, such as "height", is needed.
[Audio] Each element of the vector corresponds to an attribute of the pattern, allowing us to capture the essence of the pattern in a concise manner. Taken together, these elements summarize each pattern numerically, enabling efficient comparison and classification. This vector-based approach distils complex patterns down to their essential characteristics, facilitating meaningful analysis and decision-making.
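As a minimal sketch of this idea (with made-up values), the snippet below represents a few people as (height, weight) feature vectors and compares a new pattern with the stored ones using Euclidean distance.

```python
import numpy as np

# Each pattern is a vector of feature values.
# Here the (hypothetical) features are height in cm and weight in kg.
patterns = np.array([
    [170.0, 65.0],   # person 1
    [152.0, 50.0],   # person 2
    [185.0, 80.0],   # person 3
])
labels = ["medium", "short", "tall"]

# A new pattern is compared with the stored ones using Euclidean distance.
new_pattern = np.array([180.0, 78.0])
distances = np.linalg.norm(patterns - new_pattern, axis=1)
print("closest pattern:", labels[int(np.argmin(distances))])  # -> "tall"
```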
[Audio] Patterns can be viewed as strings, similar to sentences in a language. For instance, a DNA sequence can be seen as a string of nucleotides, such as GAAGTCCAG... Genes can be defined as regions of chromosomal DNA constructed from these nucleotides, and the gene data is arranged in a sequence like GAAGTCCAG..., illustrating how patterns can be represented as strings. Additionally, patterns can be described logically, using a format like (x1 = a1..a2) ∧ (x2 = b1..b2) ∧ ... ∧ (xd = v1..v2), where x1, x2, ..., xd are attributes and ai, bi, vi are their corresponding values. Such a logical description consists of a conjunction of logical disjunctions, exemplified by (colour = red ∨ white) ∧ (make = leather) ∧ (shape = sphere) to describe a cricket ball.
[Audio] Patterns can be represented in various forms, including vectors, strings, logical descriptions, fuzzy sets, rough sets, trees, and graphs. These representations enable the use of different techniques for processing and analyzing the patterns, allowing for more effective and efficient decision-making.
[Audio] Patterns can be represented in various ways, including hierarchical structures like trees and graphs. The R-tree is one such structure; it splits space into hierarchically nested and possibly overlapping minimum bounding rectangles, or bounding boxes, which allows efficient insertion, deletion, and search operations. A tree can also be used to represent a set of patterns as a single structure, where following a path in the tree gives one of the patterns in the set. The minimum spanning tree is another example: it connects all the nodes of a graph so that the sum of the distances between the linked nodes is minimized.
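As a small sketch of the minimum spanning tree idea, the snippet below runs Prim's algorithm over a handful of made-up points; the coordinates are purely illustrative.

```python
import numpy as np

def minimum_spanning_tree(points):
    """Prim's algorithm over the complete graph of pairwise Euclidean distances."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # cheapest edge joining a node already in the tree to one outside it
        d, i, j = min((dist[i, j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((i, j, d))
        in_tree.add(j)
    return edges

# made-up points purely for illustration
pts = np.array([[0.0, 0.0], [1.0, 0.1], [0.2, 1.0], [2.0, 2.0]])
for i, j, d in minimum_spanning_tree(pts):
    print(f"edge {i}-{j}, length {d:.2f}")
```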
[Audio] This tree structure is particularly useful in identifying associations between items within a transaction database. By examining the presence of certain items in a transaction, we can infer the likelihood of other items being present in the same transaction. The FP-growth algorithm leverages this data structure to efficiently mine frequent itemsets in large databases. To construct the tree, we first calculate the frequency of each item in the database and sort the items in descending order of frequency. Each transaction is then reordered to follow this frequency ordering. We start by creating the root node, labelling it null, and adding the first transaction. Subsequent transactions are added according to their ordering: the prefix that a transaction shares with an existing path follows that path, incrementing each count by one, and new nodes are created for the remaining part of the transaction. The result is a compressed tree containing the information about the frequent items in the original database.
[Audio] In the table, we see the frequency of every item, obtained by scanning the transaction database. Sorted from largest to smallest, the frequencies are (l : 5), (a : 4), (d : 4), (p : 4), (h : 3), (i : 3), (c : 3), (e : 3). Only items with a frequency of three and above are listed here, with ties resolved arbitrarily.
[Audio] In this step, we create the FP-tree, also known as a prefix tree, which compactly represents the frequent items in our transaction database. We take the most frequent item, 'a', and create a header entry for it. Next, we take the next most frequent item, 'i', and create a link from 'a' to 'i', with a count of one indicating how often 'i' follows 'a' in the transactions seen so far. We repeat this process for 'd', 'p', and 'h', and the header entries point to all nodes containing these items. This structure enables efficient mining of frequent patterns in our data.
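The construction just described can be sketched in a few lines of Python. The transaction database below is made up for illustration (it is not the one in the table), and the `min_support` threshold of three mirrors the narration.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support=3):
    # 1. count the frequency of every item by scanning the database once
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # 2. keep only frequent items; they define the ordering of each transaction
    order = {item: f for item, f in freq.items() if f >= min_support}
    root = FPNode()        # root labelled null
    # 3. insert each reordered transaction; shared prefixes follow the same path
    for t in transactions:
        items = sorted((i for i in t if i in order), key=lambda i: -order[i])
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# a small made-up transaction database (not the one in the table above)
db = [list("adph"), list("aip"), list("ldph"), list("lice"),
      list("ladp"), list("lhce"), list("ldie")]
show(build_fp_tree(db))
```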
[Audio] In feature extraction, we detect and isolate the desired features of patterns. This process involves extracting features from data in order to identify or interpret meaningful information. In the case of image data, feature extraction enables automatic recognition of various features. As an essential pre-processing step in pattern recognition, feature extraction sets the stage for successful identification and interpretation of patterns.
[Audio] Principal Component Analysis, or PCA, is a mathematical technique that transforms a set of correlated variables into a smaller set of uncorrelated variables, known as principal components. These components account for as much of the variability in the data as possible, with the first component capturing the greatest amount of variation and each subsequent component accounting for as much of the remaining variation as it can. By projecting the data onto these components, PCA finds the best representation of the data, in a least-squares sense, in a lower-dimensional space. The resulting representation is based on the eigenvectors and eigenvalues of the covariance matrix, which describe the directions of maximum variance in the data. By selecting only the most significant eigenvectors, we can simplify the representation while preserving the essential characteristics of the data.
[Audio] In this example, we project the original data onto a K-dimensional subspace spanned by the selected eigenvectors. By choosing the eigenvectors with the largest eigenvalues, we minimize the loss of information while reducing the dimension of the representation. When we transform the patterns onto the eigenvectors, some information is lost, but the most significant eigenvectors capture the majority of it. This process enables us to reduce the dimensionality of the data while preserving the essential characteristics of the patterns.
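A minimal sketch of PCA as described above, assuming nothing beyond NumPy: the data are randomly generated and correlated purely for illustration, and the projection keeps only the K most significant eigenvectors of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# illustrative 2-D data with strong correlation between the features
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# centre the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigenvectors of the covariance matrix give the principal components
eigvals, eigvecs = np.linalg.eigh(cov)        # returned in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# keep only the K most significant eigenvectors and project the data
K = 1
X_reduced = Xc @ eigvecs[:, :K]
print("fraction of variance captured:", eigvals[:K].sum() / eigvals.sum())
```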
[Audio] The features used for classification may not always be meaningful. In fact, removing some of them may lead to better classification accuracy. By identifying and discarding features that are useless for classification, we can optimize the process while reducing costs. Feature selection achieves this by simplifying both the representation of patterns and the complexity of classifiers, making them faster and more efficient. It can also improve classification accuracy by minimizing the impact of irrelevant features on the outcome. When we examine the relationship between sample size, number of features, and classifier complexity, we find that beyond a certain point, including more features actually degrades performance because the number of training samples is finite; this phenomenon is known as the peaking effect. Furthermore, as dimensionality increases, the distance to the nearest data point approaches the distance to the furthest data point, which undermines nearest neighbour methods. By reducing dimensionality, we can achieve better outcomes. Feature selection algorithms involve searching through different feature subsets, but when the number of features is large, this search becomes impractically expensive. Optimal algorithms such as exhaustive search and branch and bound require significant computational resources, so we often resort to faster, albeit suboptimal, procedures.
[Audio] Exhaustive search is the most straightforward approach to feature selection: we search through all possible subsets of features and select the one with the smallest classification error. This involves evaluating every possible combination of features, which can be computationally expensive. A table illustrates the different subsets that can be formed from a dataset with five features; each row represents a subset, and the columns indicate whether each feature is included (1) or excluded (0) in that subset. This approach serves as a baseline against which other feature selection methods can be compared.
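A sketch of this exhaustive search is given below; the `error_of` function standing in for a real classification-error estimate is hypothetical, as is the toy error function used to exercise it.

```python
from itertools import combinations

def exhaustive_feature_selection(features, error_of):
    """Try every non-empty subset and keep the one with the smallest error.

    `error_of` is assumed to map a feature subset to the classification
    error estimated on some validation data.
    """
    best_subset, best_error = None, float("inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            err = error_of(subset)
            if err < best_error:
                best_subset, best_error = subset, err
    return best_subset, best_error

# toy stand-in for a real error estimate: pretend features 1 and 3 suffice
toy_error = lambda s: 0.05 if set(s) == {1, 3} else 0.10 + 0.01 * len(s)
print(exhaustive_feature_selection([1, 2, 3, 4, 5], toy_error))
```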
[Audio] The branch and bound scheme provides an efficient way to avoid exhaustive search by using intermediate results to establish bounds on the final criterion value. The method forms an m-feature subset by discarding features from the full set, and it examines only distinct sequences of discarded features. The feature selection criterion is required to be monotonic, so that the criterion value of a subset of features never exceeds that of any larger set containing it. A lower bound on the optimum value of the criterion is maintained, permitting the algorithm to prune subtrees that cannot lead to the optimal solution. Whenever a suboptimal partial sequence or node is recognized, the subtree under that node is implicitly rejected, and the search moves on to alternative partial sequences.
[Audio] The branch and bound algorithm is designed to explore the space of possible solutions efficiently by pruning branches that are guaranteed to lead to inferior results. By maintaining a bound on the quality of the solution, the algorithm can abandon a branch early once it becomes clear that further exploration of it would not improve the result. This approach is particularly useful for complex optimization problems where the number of possible solutions is extremely large.
[Audio] In this branch and bound algorithm, the representation of patterns plays a crucial role. The algorithm uses the representation to prune branches and reduce the search space. By working through the examples, we can see how the algorithm functions and how it employs the representation to make decisions.
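A compact recursive sketch of branch and bound feature selection is shown below, assuming a monotonic criterion J (removing a feature never increases J); the per-feature merit scores are hypothetical and only illustrate the pruning behaviour.

```python
def branch_and_bound(features, m, J):
    """Select m features, assuming the criterion J is monotone:
    removing a feature can never increase J. J maps a tuple of
    features to a criterion value (larger is better)."""
    best = {"subset": None, "value": float("-inf")}

    def search(current, start):
        value = J(tuple(current))
        if value <= best["value"]:
            return                     # bound: no further reduction can do better
        if len(current) == m:
            best["subset"], best["value"] = tuple(current), value
            return
        # branch: try discarding each remaining candidate feature
        for i in range(start, len(current)):
            search(current[:i] + current[i + 1:], i)

    search(list(features), 0)
    return best["subset"], best["value"]

# toy monotone criterion: sum of (hypothetical) per-feature merit scores
merit = {1: 0.9, 2: 0.2, 3: 0.5, 4: 0.7, 5: 0.1}
print(branch_and_bound([1, 2, 3, 4, 5], m=3, J=lambda s: sum(merit[f] for f in s)))
```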
[Audio] We evaluate the performance of each feature individually and select the best one first. Then, at every subsequent step, we add the feature that performs best in combination with the features already selected, until we reach the desired number of features. Although this method is computationally simple, it neglects interactions between features and can therefore miss the best subset.
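A sketch of this sequential (greedy) forward selection follows; as above, `error_of` is a stand-in for a real error estimate, and the toy version used to exercise it is hypothetical.

```python
def sequential_forward_selection(features, k, error_of):
    """Greedy selection: at each step add the single feature that gives
    the lowest error in combination with those already chosen."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        best_f = min(remaining, key=lambda f: error_of(selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# toy error estimate: features 2 and 4 are individually strong
toy_error = lambda s: 0.5 - 0.1 * len(set(s) & {2, 4}) - 0.01 * len(s)
print(sequential_forward_selection([1, 2, 3, 4, 5], 3, toy_error))
```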
[Audio] When considering the classification of a new pattern, the k-Nearest Neighbour algorithm takes into account not only the single nearest neighbour, but rather a group of k nearest neighbours. This approach reduces the impact of noise in the training patterns, allowing for more accurate classifications. The choice of k is critical, as it determines the effectiveness of the algorithm. By selecting the value of k that yields the lowest error rate, we can optimize the performance of the k-NN algorithm.
[Audio] A pattern like P in the figure can be classified using the kNN algorithm. Considering the five nearest neighbours of P, we see that X17 and X16 belong to Class 3, while X8, X7, and X11 belong to Class 2. Applying the majority class rule, we classify P as belonging to Class 2.
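A minimal kNN sketch in the spirit of this example is shown below; the training points and labels are made up rather than taken from the figure.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=5):
    """Assign x to the majority class among its k nearest training patterns."""
    distances = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# illustrative 2-D training data (not the figure's coordinates)
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class 2
                    [3.0, 3.0], [3.2, 2.8]])               # class 3
train_y = np.array([2, 2, 2, 3, 3])
print(knn_classify(train_X, train_y, np.array([1.5, 1.4]), k=5))  # -> 2
```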
[Audio] The Bayes classifier is a powerful tool in pattern recognition, allowing us to make informed decisions even when faced with uncertain data. By employing Bayes theorem, we can calculate the posterior probability of a pattern belonging to a particular class, taking into account both the prior probabilities and the likelihood of the observed features. This approach minimizes the average probability of error, making it an optimal classification method. With the Bayes classifier, we can assign a class label to a pattern knowing that the decision is the best possible given the available data.
[Audio] The Bayes theorem is a powerful tool for updating our beliefs based on new evidence. The posterior probability, P(H|X), is conditioned on X and on other information such as background knowledge, whereas the prior probability, P(H), is the probability of H regardless of X. P(X) itself can be written as a weighted combination of P(X|H) and P(X|~H), with weights P(H) and P(~H). The Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H): it states that P(H|X) = P(H) * P(X|H) / P(X).
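A tiny numerical illustration of the theorem, with made-up probabilities, is given below; P(X) is expanded using the weighted combination mentioned above.

```python
# Bayes theorem with P(X) expanded by the law of total probability.
# The numbers are made up purely for illustration.
p_h = 0.3                       # prior P(H)
p_x_given_h = 0.8               # likelihood P(X|H)
p_x_given_not_h = 0.2           # P(X|~H)

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # P(X)
p_h_given_x = p_h * p_x_given_h / p_x                   # posterior P(H|X)
print(round(p_h_given_x, 3))    # 0.632
```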
[Audio] In pattern recognition, we often seek to minimize the average probability of error, which is a weighted sum of the error probabilities of the classes, weighted by their prior probabilities. Using the training patterns, we can estimate the posterior probability of each class for a given pattern. If a test pattern X is assigned to class wi, the probability of error is one minus the posterior probability of that class, P(error | X) = 1 - P(wi | X). It is therefore evident that the test pattern should be assigned to the class for which the posterior probability is maximum. A minimum error rate classifier can be characterized as follows: when patterns assume discrete values, the expected probability of error is P(error) = sum over X of P(error | X) P(X), and for each fixed test pattern X this is minimized when the class with the maximum posterior probability is chosen.
[Audio] Using Bayes' theorem, we can calculate the posterior probabilities for each class label. Consider a collection of pencils, pens, and papers that occur with equal probabilities; the Bayes classifier decides the corresponding class labels. The posterior probability for "green" is calculated as the product of the prior probability and the likelihood (normalized by the evidence). The posterior probabilities for the "blue" and "red" classes are computed in the same way. By comparing these probabilities, the most likely class label for each object is determined. For instance, for a pencil, the posterior probability for "green" is higher than those for "blue" and "red", indicating that the pencil is assigned to class "green". The class labels for pen and paper are determined similarly, and the corresponding probabilities of error are also calculated.
[Audio] The average probability of error is calculated by considering the probability of error for each of the three objects: paper, pencil, and pen. It is a weighted sum of the individual probabilities of error, where the weights are the probabilities of observing each object, so the average probability of error depends on the relative frequencies of the three objects. With the numerical values used in this example, it works out to approximately 4.17 per cent.
[Audio] The naive Bayes classifier is a simple probabilistic classifier that assumes every feature is class-conditionally independent of the others. This simplification makes the computation easier, but the assumption is strong and may not always hold. Despite this, studies have shown that the naive Bayes classifier performs comparably to other classification algorithms, such as classification trees and neural networks, especially on large datasets. Using Bayes theorem, the joint probability model can be written as p(C, F1, ..., FD) = p(C) p(F1, ..., FD | C). This allows us to calculate the posterior probability as p(C | F1, ..., FD) = p(C) p(F1, ..., FD | C) / p(F1, ..., FD).
[Audio] The joint probability model can be rewritten by repeatedly applying the definition of conditional probability. Since each feature is assumed to be conditionally independent of every other feature given the class, the joint model can be expressed as a product of individual probability distributions: p(C | F1, ..., FD) is proportional to p(C) multiplied by the product of p(Fi | C) over all features. This simplification allows us to model the relationship between features and classes using only the prior probabilities and the per-feature conditional distributions. As a result, naive Bayes models require far fewer parameters and become more manageable and practical for real-world applications.
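A minimal naive Bayes sketch along these lines is given below, for categorical features; the tiny fruit dataset and the small smoothing constant used for unseen values are purely illustrative.

```python
import numpy as np
from collections import defaultdict

def train_naive_bayes(X, y):
    """Estimate p(C) and p(Fi | C) from categorical training data."""
    classes = sorted(set(y))
    priors = {c: np.mean([label == c for label in y]) for c in classes}
    cond = defaultdict(lambda: defaultdict(float))
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(len(X[0])):
            values = [r[i] for r in rows]
            for v in set(values):
                cond[c][(i, v)] = values.count(v) / len(values)
    return priors, cond

def predict(priors, cond, x):
    """Pick the class maximizing p(C) * prod_i p(Fi | C)."""
    scores = {c: priors[c] * np.prod([cond[c].get((i, v), 1e-6)
                                      for i, v in enumerate(x)])
              for c in priors}
    return max(scores, key=scores.get)

# tiny illustrative dataset: features are (colour, shape)
X = [("red", "round"), ("red", "round"), ("green", "long"), ("green", "round")]
y = ["apple", "apple", "banana", "apple"]
priors, cond = train_naive_bayes(X, y)
print(predict(priors, cond, ("green", "long")))   # -> banana
```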
[Audio] Bayesian networks can represent and solve decision problems under uncertainty, where influence between variables plays a crucial role. In a Bayesian network, if there is an arc from node A to node B, A is referred to as a parent of B and B is considered an effect of A. The set of parent nodes of a node Xi is denoted by parents(Xi). A directed acyclic graph is a Bayesian network relative to a set of variables if the joint distribution can be written as the product of the local distributions of the nodes, each conditioned on its parents: P(X1, ..., Xn) is the product over i of P(Xi | parents(Xi)). If the value of a node is observed, the node is said to be an evidence node. Given the conditional probability tables, the probability of any particular assignment of values to the variables can be easily computed.
[Audio] In this belief network, variables S and T are influenced by G. The conditional probability tables provide the information needed to calculate the probability of various events. For example, if we want to know the probability that it rains, Ram does not have money, and Ram does not watch television, we can read the required values from the conditional probability tables. Multiplying these values together gives a probability of 0.028. This demonstrates that the probability of any combination of the variables can be calculated using the belief network and its conditional probability tables.
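A sketch of how such a joint probability is read off from the conditional probability tables follows; the network structure (G with children S and T) matches the description above, but all the numbers are hypothetical and are not the tables that yield 0.028.

```python
# Computing a joint probability from CPTs in a small belief network
# with arcs G -> S and G -> T. All numbers are made up for illustration.
p_g = {True: 0.3, False: 0.7}                      # P(G)
p_s_given_g = {True: {True: 0.9, False: 0.1},      # P(S | G)
               False: {True: 0.2, False: 0.8}}
p_t_given_g = {True: {True: 0.6, False: 0.4},      # P(T | G)
               False: {True: 0.1, False: 0.9}}

# For any assignment, the joint factorizes over the nodes given their parents:
# P(G=g, S=s, T=t) = P(G=g) * P(S=s | G=g) * P(T=t | G=g)
g, s, t = True, False, True
joint = p_g[g] * p_s_given_g[g][s] * p_t_given_g[g][t]
print(joint)   # 0.3 * 0.1 * 0.6 = 0.018
```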