KDD Overview Notes

KDD - Knowledge Discovery in Databases

This is a summary of three papers by Fayyad, Piatetsky-Shapiro and Smyth:

From Data Mining to Knowledge Discovery in Databases; AI Magazine, Fall 96, pp 37-54
From Data Mining to Knowledge Discovery: an Overview; Advances in Knowledge Discovery and Data Mining, pp 1-30
Knowledge Discovery and Data Mining: Towards a Unifying Framework; Proceedings KDD-96, pp 82-88

I have intentionally not gone into detail about the data mining step. That is the subject of a separate, related summary.

What is KDD?

computational theories and tools to assist humans in extracting useful information (i.e. knowledge) from digital data
development of methods and techniques for making sense of data
maps low-level data into other forms that are:

- more compact (e.g. short reports)

- more abstract (e.g. a model of the process generating the data)

- more useful (e.g. a predictive model of future cases)

core of KDD process employs "data-mining"

Why KDD?

datasets are growing extremely large: billions of records, hundreds to thousands of fields
analysis of data must be automated
computers enable us to generate amounts of data too large for humans to digest, thus we should use computers to discover meaningful patterns and structures from the data.

Current KDD Applications

A few examples from many fields

Science - SKYCAT: used to aid astronomers by classifying faint sky objects
Marketing - AMEX: used customer group identification and forecasting. Claims 10%-15% increase in card usage.
Investment - Many use. Few tell. - LBS Capital Management: uses an expert system/neural network to manage a $600 million portfolio. Results outperform the market.
Fraud Detection - HNC Falcon, Nestor Prism: credit card fraud detection - FAIS: US Treasury money-laundering detection system
Manufacturing - CASSIOPEE: a trouble-shooting system used in Europe to diagnose Boeing 737 problems by clustering faults into families
Telecommunications - TASA (Telecommunications Alarm-Sequence Analyzer): locates patterns of frequently occurring alarm episodes and represents the patterns as rules
Data Cleaning - MERGE-PURGE: used by Washington State to locate and remove duplicate welfare claims
Sports - ADVANCED SCOUT: helps NBA coaches organize and interpret game data ==> player selection and team management
Information Retrieval - Intelligent Agents have been designed to navigate the internet and return information pertinent to some non-trivial query

KDD vs Data-Mining

They are not the same thing.

KDD: The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data [Fayyad, et al]

KDD is the overall process of discovering useful knowledge from data.

Data mining: An application of specific algorithms for extracting patterns from data.

Data mining is a step in the KDD process.

Interdisciplinary Nature of KDD

Merges machine learning, pattern recognition, statistics, databases and high-performance computing with the unified goal of extracting high-level knowledge from low-level data in the context of large datasets.
Differs from much of ML, etc. in that it places special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge.
Fundamentally a statistical endeavor. Statistics provide a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population.
Because data-mining algorithms typically assume data are in main memory, KDD relies on database techniques for gaining efficient data access to large datasets.
A set of principles from the database field for dealing with large datasets is OLAP, Online Analytical Processing. OLAP tools focus on simplifying and supporting interactive data analysis; the goal of KDD tools is to automate as much of the process as possible.
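
As a rough illustration of the last two points, the sketch below uses Python's standard sqlite3 module with a made-up "sales" table (the table name, columns and values are assumptions for illustration and do not come from the papers). It shows an OLAP-style roll-up that is pushed into the database, and a single-pass scan that keeps only running totals in memory instead of loading every row first.

```python
import sqlite3

# Hypothetical toy table; in practice this would be a large on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "card", 120.0),
        ("east", "loan", 80.0),
        ("west", "card", 200.0),
        ("west", "card", 50.0),
    ],
)

# OLAP-style roll-up: the database aggregates and returns only a summary.
rollup = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# Single-pass scan: iterate over the cursor row by row, keeping only
# running totals in memory rather than the whole table.
count, total = 0, 0.0
for (amount,) in conn.execute("SELECT amount FROM sales"):
    count += 1
    total += amount
mean_amount = total / count

print(rollup)
print(mean_amount)
```

The roll-up is the kind of query an analyst would issue interactively; the scan is closer to what an automated mining step needs when the data do not fit in main memory.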

KDD focuses on the overall process including how data is stored and accessed, how algorithms can be scaled to massive datasets, interpretation and visualization of results and how the overall man-machine interaction can be usefully modeled and supported.

KDD Process

Develop an understanding of the application domain and identify the goal.
Create a target dataset

selecting a dataset or focusing on a subset of samples or variables on which to make discoveries

Data cleaning and preprocessing

removal of noise and outliers
collecting necessary information to model or account for noise
handling of missing data
accounting for time sequence info

Data reduction and projection

finding useful features to represent the data relative to the goal
dimensionality reduction/transformation ==> reduce number of variables
identification of invariant representations

Selection of appropriate data-mining task

summarization, classification, regression, clustering, etc.

Selection of data-mining algorithm(s)

methods to search for patterns
decision of which models and parameters may be appropriate
match method to goal of KDD process

Data-mining

searching for patterns of interest in one or more representational forms

Interpretation and visualization

interpretation of mined patterns
visualization of extracted patterns and models
visualization of the data given the extracted models

Consolidating discovered knowledge

incorporating the discovered knowledge into another system
documenting and reporting knowledge to interested parties
checking for inconsistencies with other prior extracted or believed knowledge

( Iterate the above steps as needed )
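
The steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the validity limit, the banding rule and the mean-split "mining" step are all invented here, and the mean split merely stands in for a real clustering algorithm.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend); None marks missing data.
raw = [
    (1, 25, 120.0),
    (2, 41, None),
    (3, 37, 310.0),
    (4, 52, 295.0),
    (5, 29, 110.0),
    (6, 33, 5000.0),  # implausible value, treated as an outlier below
]

def select_target(records):
    """Create a target dataset: keep only the variable of interest (spend)."""
    return [spend for _, _, spend in records]

def clean(values, max_valid=1000.0):
    """Cleaning: drop missing values and values above an assumed validity limit."""
    return [v for v in values if v is not None and v <= max_valid]

def reduce_and_project(values, band_width=100.0):
    """Reduction: replace each amount by a coarse spending band (an integer)."""
    return [int(v // band_width) for v in values]

def mine(values):
    """Data mining: split the values around their mean into two groups."""
    centre = mean(values)
    return {
        "low": [v for v in values if v <= centre],
        "high": [v for v in values if v > centre],
    }

def interpret(groups):
    """Interpretation: report the size and mean band of each group."""
    return {name: (len(vs), mean(vs)) for name, vs in groups.items()}

summary = interpret(mine(reduce_and_project(clean(select_target(raw)))))
print(summary)
```

In a real project each function would be revisited several times, which is what the "iterate" note refers to: a surprising result at the interpretation stage usually sends the analyst back to cleaning or selection.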

Data-Mining Step

Goals defined by intended uses of the system

Verification: verify the user's hypotheses about the data
Discovery: autonomously find new patterns in the data. Discovered patterns are used for:

prediction: making predictions about future events
description: presenting information to the user in a human-understandable form

Common data mining tasks:

classification
clustering

probability density estimation

regression
summarization
dependency modeling
change and deviation detection

Data mining involves fitting models to or determining patterns from observed data.
Two classes of model are available

logical models: purely deterministic
statistical models: non-deterministic, most widely used due to uncertainty in real world data
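
To make the distinction concrete, here is a minimal sketch on invented data (the transactions, the fraud labels and the threshold of 800 are assumptions for illustration only). The logical model gives a hard yes/no answer, while the statistical model reports how often the rule actually held in the observed sample.

```python
# Hypothetical observations: (transaction_amount, was_fraud).
observations = [
    (20, False), (35, False), (50, False), (900, True),
    (950, False), (1200, True), (1500, True), (40, False),
]

THRESHOLD = 800  # assumed cut-off, chosen by hand for illustration

def logical_model(amount):
    """Deterministic rule: every transaction above the threshold is labelled fraud."""
    return amount > THRESHOLD

def statistical_model(amount):
    """Estimate P(fraud | same side of the threshold) from observed frequencies."""
    side = amount > THRESHOLD
    matching = [fraud for a, fraud in observations if (a > THRESHOLD) == side]
    return sum(matching) / len(matching)

print(logical_model(950))      # hard decision
print(statistical_model(950))  # estimated probability for large transactions
print(statistical_model(30))   # estimated probability for small transactions
```

The transaction of 950 shows why the statistical form is usually preferred on real data: the rule labels it as fraud, yet the sample contains a legitimate transaction of exactly that size.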

Data mining algorithms have 3 primary components

model representation
model evaluation criteria
search
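
A minimal sketch of how the three components fit together, assuming a toy one-variable classification problem (the data and function names are invented for illustration). The representation is a threshold function, the evaluation criterion is classification accuracy, and the search is an exhaustive scan over candidate thresholds.

```python
# Hypothetical labelled data: (feature value, class label).
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 0), (7.0, 1)]

def predict(threshold, x):
    """Model representation: a threshold function on a single variable."""
    return 1 if x > threshold else 0

def accuracy(threshold, examples):
    """Model evaluation criterion: fraction of examples classified correctly."""
    hits = sum(predict(threshold, x) == label for x, label in examples)
    return hits / len(examples)

def search(examples):
    """Search: try every midpoint between adjacent sorted feature values."""
    xs = sorted(x for x, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, examples))

best = search(data)
print(best, accuracy(best, data))
```

Swapping any one component gives a different algorithm: the same representation could be scored by a cost-weighted criterion, or the same criterion could be optimized over a richer representation with a heuristic search.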

Common model representations

polynomials
splines
kernel and basis functions
threshold-Boolean functions

Model evaluation criteria (fit functions)

prediction tasks: empirical predictive accuracy
description tasks: predictive accuracy, novelty, utility, understandability

Search methods -- an optimization task

parameter search: find parameters that optimize the model evaluation criteria for some fixed model representation
model search: consider alternative models from the same family of model representations
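
The two kinds of search can be sketched as nested loops. In this illustrative example (data, names and the held-out split are assumptions, not from the papers), parameter search is a closed-form least-squares fit for a fixed representation, and model search compares two members of the polynomial family, a constant and a straight line, on held-out data.

```python
from statistics import mean

# Hypothetical data, split into a fitting set and a held-out set.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
held_out = [(4.0, 9.0), (5.0, 11.0)]

def fit_constant(points):
    """Parameter search for y = c: least squares gives c = mean(y)."""
    c = mean(y for _, y in points)
    return lambda x: c

def fit_line(points):
    """Parameter search for y = a + b*x, using closed-form least squares."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, points):
    """Evaluation criterion: mean squared error."""
    return mean((model(x) - y) ** 2 for x, y in points)

def model_search(fitters, fit_points, eval_points):
    """Model search: fit each candidate, keep the one with lowest held-out error."""
    fitted = {name: fit(fit_points) for name, fit in fitters.items()}
    scores = {name: mse(model, eval_points) for name, model in fitted.items()}
    return min(scores, key=scores.get), scores

best_name, scores = model_search(
    {"constant": fit_constant, "line": fit_line}, train, held_out
)
print(best_name, scores)
```

Scoring the candidates on held-out points rather than on the fitting set is a design choice made here to guard against overfitting; the notes themselves do not prescribe how the comparison should be done.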

Primary Research and Application Challenges for KDD

Larger Databases
High Dimensionality
Overfitting
Assessment of Statistical Significance
Nonstationarity of Data and Knowledge
Missing and Noisy Data
Complex Relationships Between Fields
Understandability of Patterns
User Interaction and Prior Knowledge
Integration with Other Systems

Role of AI in KDD

Natural Language Processing (NLP)

free-form text mining
interface design
queries
knowledge explanation

Planning

data access
data transformation
constraint satisfaction

Intelligent Agents

data collection
remote operation
parallel execution and communication

Uncertainty in AI

managing uncertainty
inference
reasoning about causality

Knowledge Representation

ontologies
use of prior human knowledge

First Draft, Steven Templeton Oct '96