KDD - Knowledge Discovery in Databases
This is a summary of three papers by Fayyad, Piatetsky-Shapiro, and Smyth:
- From Data Mining to Knowledge Discovery in Databases; AI Magazine, Fall
96, pp 37-54
- From Data Mining to Knowledge Discovery: An Overview; Advances
in Knowledge Discovery and Data Mining, pp 1-30
- Knowledge Discovery and Data Mining: Towards a Unifying Framework;
Proceedings KDD-96, pp 82-88
I have intentionally not gone into detail about the data mining step.
This is the subject of a separate related summary.
What is KDD?
- computational theories and tools to assist humans in extracting useful
information (i.e., knowledge) from digital data
- development of methods and techniques for making sense of data
- maps low-level data into other forms that are:
- - more compact (e.g., short reports)
- - more abstract (e.g., a model of the process generating the data)
- - more useful (e.g., a predictive model of future cases)
- the core of the KDD process employs data mining
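As a minimal sketch of the three target forms listed above, the snippet below turns a handful of made-up (x, y) records into a short report, a fitted straight-line model, and a prediction for an unseen case. The data, the function names, and the choice of a least-squares line are assumptions for illustration only; they do not come from the papers.

```python
from statistics import mean

# Hypothetical low-level records: (advertising spend, units sold).
records = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

def summarize(rows):
    """More compact: a short report of the data."""
    ys = [y for _, y in rows]
    return {"n": len(rows), "min_y": min(ys), "max_y": max(ys), "mean_y": mean(ys)}

def fit_line(rows):
    """More abstract: a (slope, intercept) model of the process behind the data."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict(model, x):
    """More useful: apply the model to a future case."""
    slope, intercept = model
    return slope * x + intercept

report = summarize(records)
model = fit_line(records)
forecast = predict(model, 5.0)
print(report, model, forecast)
```

The same raw records feed all three functions; only the form of the output changes, from a summary to a model to a forecast.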
Why KDD?
- datasets are growing extremely large
- - billions of records
- - hundreds to thousands of fields
- analysis of data must be automated
- computers enable us to generate amounts of data too large for humans
to digest, thus we should use computers to discover meaningful patterns
and structures from the data.
Current KDD Applications
A few examples from many fields
- Science - SKYCAT: used to aid astronomers by classifying faint sky
objects
- Marketing - AMEX: used for customer group identification and forecasting.
Claims a 10%-15% increase in card usage.
- Investment - Many use; few tell. LBS Capital Management: uses an
expert system/neural network to manage a $600 million portfolio. Results
outperform the market.
- Fraud Detection
- - HNC Falcon, Nestor Prism: credit card fraud detection
- - FAIS: US Treasury money-laundering detection system
- Manufacturing - CASSIOPEE: a troubleshooting system used in Europe
to diagnose Boeing 737 problems by using clustering to derive families of faults
- Telecommunications - TASA (Telecommunications Alarm-Sequence Analyzer):
locates patterns of frequently occurring alarm episodes and represents
the patterns as rules
- Data Cleaning - MERGE-PURGE: used by Washington State to locate and
remove duplicate welfare claims
- Sports - ADVANCED SCOUT: helps NBA coaches organize and interpret
game data ==> player selection and team management
- Information Retrieval - Intelligent Agents have been designed to navigate
the internet and return information pertinent to some non-trivial query
KDD vs Data-Mining
They are not the same thing.
- KDD: the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data [Fayyad et al.].
KDD is the overall process of discovering useful knowledge from data.
- Data mining: the application of specific algorithms for extracting
patterns from data. Data mining is a step in the KDD process.
Interdisciplinary Nature of KDD
- Merges machine learning, pattern recognition, statistics, databases, and
high-performance computing with the unified goal of extracting high-level knowledge
from low-level data in the context of large datasets.
- Differs from much of ML, etc. in that it places special emphasis on
finding understandable patterns that can be interpreted as useful or interesting
knowledge.
- Fundamentally a statistical endeavor. Statistics provides a language
and framework for quantifying the uncertainty that results when one tries
to infer general patterns from a particular sample of an overall population.
- Because data-mining algorithms typically assume data are in main memory,
KDD relies on database techniques for gaining efficient data access to
large datasets.
- One set of principles from the database field for dealing with large
datasets is OLAP (Online Analytical Processing). OLAP tools focus on simplifying
and supporting interactive data analysis; the goal of KDD tools is to automate
as much of the process as possible.
KDD focuses on the overall process including how data is stored and
accessed, how algorithms can be scaled to massive datasets, interpretation
and visualization of results and how the overall man-machine interaction
can be usefully modeled and supported.
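The database point above can be pictured by letting the database compute an aggregate, so that the raw records never have to sit in main memory all at once. The sketch below uses Python's built-in sqlite3 module with an in-memory table; the table name, columns, and values are hypothetical, and a real KDD system would work against a far larger store.

```python
import sqlite3

# Hypothetical transaction table; in practice this would be too large for RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0), ("b", 15.0), ("b", 40.0)],
)

# The database does the grouping, so only one small row per customer
# ever reaches the analysis code.
query = """
    SELECT customer, COUNT(*), AVG(amount)
    FROM transactions
    GROUP BY customer
    ORDER BY customer
"""
per_customer = conn.execute(query).fetchall()

def iter_batches(connection, batch_size=2):
    """When raw rows are needed, fetch them in fixed-size batches."""
    cursor = connection.execute("SELECT customer, amount FROM transactions")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

batch_sizes = [len(batch) for batch in iter_batches(conn)]
print(per_customer, batch_sizes)
conn.close()
```

Pushing the aggregation into the query and streaming the remainder in batches are two common ways to keep an algorithm's memory footprint independent of the dataset size.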
KDD Process
- Develop an understanding of the application domain and identify
the goal.
- Create a target dataset
- - selecting a dataset, or focusing on a subset of samples or variables,
on which to make discoveries
- Data cleaning and preprocessing
- - removal of noise and outliers
- - collecting the information necessary to model or account for noise
- - handling of missing data
- - accounting for time-sequence information
- Data reduction and projection
- - finding useful features to represent the data relative to the goal
- - dimensionality reduction/transformation ==> reduce the number of
variables
- - identification of invariant representations
- Selection of the appropriate data-mining task
- - summarization, classification, regression, clustering, etc.
- Selection of data-mining algorithm(s)
- - methods to search for patterns
- - decision of which models and parameters may be appropriate
- - match the method to the goal of the KDD process
- Data mining
- - searching for patterns of interest in one or more representational
forms
- Interpretation and visualization
- - interpretation of mined patterns
- - visualization of extracted patterns and models
- - visualization of the data given the extracted models
- Consolidating discovered knowledge
- - incorporating the discovered knowledge into another system
- - documenting and reporting knowledge to interested parties
- - checking for inconsistencies with previously extracted or believed
knowledge
(Iterate the above steps as needed.)
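The steps above can be sketched as a chain of small functions, one per stage. This is only an illustration under simplifying assumptions: the records, the field names, the outlier cutoff, and the choice of a per-group mean as the "mining" step are all hypothetical and stand in for whatever the application domain actually requires.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"region": "north", "age": 34, "spend": 120.0},
    {"region": "north", "age": 41, "spend": None},
    {"region": "south", "age": 29, "spend": 80.0},
    {"region": "south", "age": 52, "spend": 9999.0},  # suspicious outlier
    {"region": "south", "age": 47, "spend": 100.0},
    {"region": "west", "age": 38, "spend": 60.0},
]

def select_target(rows, regions):
    """Create a target dataset: focus on a subset of samples."""
    return [r for r in rows if r["region"] in regions]

def clean(rows, max_spend=1000.0):
    """Cleaning: drop rows with missing values or crude outliers."""
    return [r for r in rows if r["spend"] is not None and r["spend"] <= max_spend]

def reduce_fields(rows, fields):
    """Reduction/projection: keep only the variables relevant to the goal."""
    return [{f: r[f] for f in fields} for r in rows]

def mine(rows):
    """Data mining (summarization task): mean spend per region."""
    groups = {}
    for r in rows:
        groups.setdefault(r["region"], []).append(r["spend"])
    return {region: mean(values) for region, values in groups.items()}

def interpret(patterns):
    """Interpretation: render the patterns as readable statements."""
    return [f"{region}: average spend {avg:.2f}" for region, avg in sorted(patterns.items())]

target = select_target(raw, {"north", "south"})
cleaned = clean(target)
reduced = reduce_fields(cleaned, ["region", "spend"])
patterns = mine(reduced)
for line in interpret(patterns):
    print(line)
```

Iterating the process then amounts to re-running the chain with a different subset, cutoff, or mining function once the first results have been inspected.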
Data-Mining Step
- Goals are defined by the intended uses of the system
- - Verification: verify the user's hypotheses about the data
- - Discovery: autonomously find new patterns in the data; discovered patterns
are used for
- - - prediction: make predictions about future events
- - - description: present information to the user in a human-understandable
form
- Common data mining tasks:
- - classification
- - clustering
- - probability density estimation
- - regression
- - summarization
- - dependency modeling
- - change and deviation detection
- Data mining involves fitting models to, or determining patterns from,
observed data.
- Two classes of model are available
- - logical models: purely deterministic
- - statistical models: non-deterministic, most widely used due to uncertainty
in real-world data
- Data mining algorithms have 3 primary components
- - model representation
- - model evaluation criteria
- - search
- Common model representations
- - polynomials
- - splines
- - kernel and basis functions
- - threshold-Boolean functions
- Model evaluation criteria (fit functions)
- - prediction tasks: empirical predictive accuracy
- - description tasks: predictive accuracy, novelty, utility, understandability
- Search methods -- an optimization task
- - parameter search: find parameters that optimize the model evaluation
criteria for some fixed model representation
- - model search: consider alternative models from the same family of model
representations
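The three components can be made concrete with a deliberately tiny learner. In the sketch below the model representation is a threshold-Boolean function (feature > threshold), the evaluation criterion is empirical accuracy on the data, parameter search tries candidate thresholds for one fixed feature, and model search compares the best such model for each feature. The dataset and all function names are invented for illustration, and a realistic evaluation would score on held-out data, not on the training rows.

```python
# Hypothetical labelled data: (feature vector, class label).
data = [
    ((1.0, 7.0), 0),
    ((2.0, 3.0), 0),
    ((3.0, 8.0), 0),
    ((4.0, 2.0), 1),
    ((5.0, 9.0), 1),
    ((6.0, 4.0), 1),
]

def predict(feature, threshold, x):
    """Model representation: a threshold-Boolean function."""
    return 1 if x[feature] > threshold else 0

def accuracy(feature, threshold, rows):
    """Model evaluation criterion: fraction of rows classified correctly."""
    hits = sum(predict(feature, threshold, x) == y for x, y in rows)
    return hits / len(rows)

def parameter_search(feature, rows):
    """Best threshold for one fixed feature (midpoints between sorted values)."""
    values = sorted({x[feature] for x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(feature, t, rows))

def model_search(rows):
    """Compare the best model for each feature and keep the winner."""
    n_features = len(rows[0][0])
    fitted = [(f, parameter_search(f, rows)) for f in range(n_features)]
    return max(fitted, key=lambda m: accuracy(m[0], m[1], rows))

best_feature, best_threshold = model_search(data)
print(best_feature, best_threshold, accuracy(best_feature, best_threshold, data))
```

Both searches here are exhaustive because the candidate sets are tiny; with richer representations the search step usually becomes a heuristic optimization.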
Primary Research and Application Challenges for KDD
- Larger Databases
- High Dimensionality
- Overfitting
- Assessment of Statistical Significance
- Nonstationarity of Data and Knowledge
- Missing and Noisy Data
- Complex Relationships Between Fields
- Understandability of Patterns
- User Interaction and Prior Knowledge
- Integration with Other Systems
Role of AI in KDD
- Natural Language Processing (NLP)
- - free-form text mining
- - interface design
- - queries
- - knowledge explanation
- Planning
- - data access
- - data transformation
- - constraint satisfaction
- Intelligent Agents
- - data collection
- - remote operation
- - parallel execution and communication
- Uncertainty in AI
- - managing uncertainty
- - inference
- - reasoning about causality
- Knowledge Representation
- - ontologies
- - use of prior human knowledge
First Draft, Steven Templeton Oct '96