Chapter 2: Data Mining Issues, Evaluation and Terminologies.
•Data Mining
•Mining Methodology and User Interaction Issues
•Performance Issues
•Diverse Data Types Issues
Data Mining. Data mining is not an easy task: the algorithms used can become very complex, and the data is not always available in one place. It often needs to be integrated from various heterogeneous data sources.
Data Mining Issues.
Mining Methodology and User Interaction Issues.
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. It is therefore necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive, because interactivity lets users focus the search for patterns by providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − A data mining query language, which allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning, the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered should be interesting. Many discovered patterns are not, because they either represent common knowledge or lack novelty, so they need to be evaluated.
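The data cleaning step mentioned above can be illustrated with a minimal sketch. It assumes a simple strategy that is not prescribed by these notes: missing values (`None`) and out-of-range values are replaced with the median of the valid values. The function name `clean_readings` and the sample ages are hypothetical.

```python
from statistics import median

def clean_readings(readings, low, high):
    """Replace missing (None) or out-of-range values with the median of the valid ones.

    This is one simple cleaning strategy chosen for illustration; real systems
    may drop records, use domain rules, or apply other imputation methods.
    """
    def is_valid(r):
        return r is not None and low <= r <= high

    valid = [r for r in readings if is_valid(r)]
    if not valid:
        raise ValueError("no valid readings to impute from")
    fill = median(valid)
    return [r if is_valid(r) else fill for r in readings]

# Hypothetical data: None is a missing value, 410 is noise (an impossible age).
raw_ages = [34, None, 29, 410, 41, 38]
cleaned = clean_readings(raw_ages, low=0, high=120)
print(cleaned)
```

Patterns mined from `cleaned` are not distorted by the missing entry or by the impossible value, which is the point of cleaning before mining.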
Performance Issues.
Efficiency and scalability of data mining algorithms − To extract information effectively from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel, and the results from the partitions are merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
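The partition, process, and merge idea, together with an incremental update, can be sketched as follows. This is an illustrative sketch only: the mining task (counting how many transactions contain each item), the helper names, and the sample transactions are assumptions. A thread pool is used for brevity; for CPU-bound work in Python a process pool or a distributed framework would be the more realistic choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count the transactions in which each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count an item once per transaction
    return counts

def split(transactions, n_parts):
    """Divide the data into n_parts roughly equal partitions."""
    return [transactions[i::n_parts] for i in range(n_parts)]

def parallel_count(transactions, n_parts=2):
    """Process the partitions concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, split(transactions, n_parts)))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged

transactions = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
counts = parallel_count(transactions)

# Incremental step: only the new transactions are mined and folded in;
# the original data is not scanned again.
new_transactions = [["milk", "bread"]]
counts.update(count_items(new_transactions))
print(counts)
```

Merging works here because item counts from disjoint partitions simply add up. Mining tasks whose partial results do not combine this easily need a more careful merge step.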
Diverse Data Types Issues.
Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, and so on. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.
2.2 Data Warehouse Evaluation.
•Data Warehouse
•Data Warehousing
•Query Driven Approach
•Update Driven Approach
•From Data Warehousing (OLAP) to Data Mining (OLAM)
•Importance of OLAM
Data Warehouse. A data warehouse is a type of data management system designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are intended solely for queries and analysis, and they often contain large amounts of historical data.
Data Warehousing. Data warehousing is the secure electronic storage of information by a business or other organization. The goal of data warehousing is to create a trove of historical data that can be retrieved and analyzed to provide useful insight into the organization's operations.
Query Driven Approach. This is the traditional approach to integrating heterogeneous databases. In this approach, wrappers and integrators are built on top of multiple heterogeneous databases. The integrators are also known as mediators.
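A minimal sketch of the wrapper and mediator arrangement follows. All class names, field names, and sample records are hypothetical and chosen only for illustration: each wrapper translates a common query into the form its own source understands, and the mediator forwards the query to every wrapper at query time and combines the answers.

```python
class CsvWrapper:
    """Wrapper for a hypothetical source that stores rows as 'customer,amount' text."""

    def __init__(self, lines):
        self.lines = lines

    def query(self, min_amount):
        rows = []
        for line in self.lines:
            customer, amount = line.split(",")
            if float(amount) >= min_amount:
                rows.append({"customer": customer, "amount": float(amount)})
        return rows


class DictWrapper:
    """Wrapper for a hypothetical source that uses different field names."""

    def __init__(self, records):
        self.records = records

    def query(self, min_amount):
        return [
            {"customer": r["name"], "amount": float(r["total"])}
            for r in self.records
            if r["total"] >= min_amount
        ]


class Mediator:
    """Integrator: sends each query to all wrappers and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, min_amount):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(min_amount))
        return results


mediator = Mediator([
    CsvWrapper(["ana,120.0", "bo,15.5"]),
    DictWrapper([{"name": "cy", "total": 300}, {"name": "di", "total": 20}]),
])
large_sales = mediator.query(min_amount=100)
print(large_sales)
```

Nothing is stored centrally in this sketch: every call to `mediator.query` goes back to the underlying sources, which is what distinguishes the query-driven approach from a warehouse.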
Update Driven Approach. Today's data warehouse systems follow the update-driven approach rather than the traditional query-driven approach. In the update-driven approach, information from multiple heterogeneous sources is integrated in advance and stored in a warehouse, where it is available for direct querying and analysis.
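The update-driven approach can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. The two sources, the `sales` table, and its columns are assumptions made for this illustration: data from both sources is converted to one schema and loaded ahead of time, and analysis queries then run against the warehouse alone.

```python
import sqlite3

# Two hypothetical heterogeneous sources with different record layouts.
store_a = [("ana", 120.0), ("bo", 15.5)]
store_b = [{"name": "cy", "total": 300}, {"name": "ana", "total": 30}]

# The in-memory SQLite database plays the role of the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")

# Integration happens in advance: both sources are mapped to one schema.
warehouse.executemany("INSERT INTO sales VALUES (?, ?, 'store_a')", store_a)
warehouse.executemany("INSERT INTO sales VALUES (:name, :total, 'store_b')", store_b)
warehouse.commit()

# Direct querying and analysis on the integrated data; the sources are not contacted.
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The cost of integration is paid once at load time, so repeated analytical queries are answered from the warehouse without touching the original sources.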
From Data Warehousing (OLAP) to Data Mining (OLAM).
Importance of OLAM. OLAM (online analytical mining) is important for reasons that include the following.
High quality of data in data warehouses − Data mining tools need to work on integrated, consistent, and cleaned data. These preprocessing steps are very costly. The data warehouses constructed by such preprocessing are therefore valuable sources of high-quality data for both OLAP and data mining.
2.3 Data Mining Terminologies.
•Data Mining Engine
•Graphical User Interface
•Knowledge Base
Data Mining Architecture. The main components of a data mining system are a data source, a data mining engine, a data warehouse server, a pattern evaluation module, a graphical user interface, and a knowledge base.
Data Mining Engine. The data mining engine is essential to the data mining system. It consists of a set of functional modules that perform the following functions: •Characterization •Association and correlation analysis •Classification •Prediction •Cluster analysis •Outlier analysis •Evolution analysis.
Graphical User Interface. The graphical user interface (GUI) module communicates between the data mining system and the user. It allows the user to: •Interact with the system by specifying a data mining query or task. •Provide information to help focus the search. •Mine further based on intermediate data mining results. •Evaluate mined patterns. •Visualize the patterns in different forms.
Knowledge Base. A knowledge base stores complex structured and unstructured information used by a computer system. In a data mining system, this is the domain knowledge, which is used to guide the search.
•Knowledge discovery •Cleaning of data •Data integration •Selection of data •Transformation of data.