The vision of the Semantic Web is a medium for exchanging, sharing, and retrieving information in a meaningful and effective way. Before that vision is realized, however, we have to deal with large volumes of "legacy" documents that are still marked up in HTML.
Transforming all the HTML documents on the Web into a representation carrying more semantic information is an intractable task. Therefore, we focus our attention on topic-specific HTML documents, i.e. documents pertaining to a specific topic, authored by different people on the Web. These documents follow a certain typical style, which makes the extraction of structures feasible. An example would be resumes gathered by a focussed crawler such as RCentral from IBM's Grand Central Station.
The goal of Quixote is to facilitate the processing of topic-specific HTML documents through the data exchange standard eXtensible Markup Language (XML), which provides the means to enrich documents with more structural and semantic information.
We address the problems of (1) converting HTML documents into XML documents marked up in topic-specific tags that carry semantic information about the documents, (2) discovering a schema that describes the common underlying structures of the documents, and (3) integrating the documents into a unifying view based on the schema.
For the user community, Quixote is an automated tool for mining and integrating topic-specific HTML documents, built on a coherent and integrated framework. For the research community, Quixote proposes several novel approaches.
With respect to converting HTML documents to XML documents, Quixote is fully automatic and is not tied to any specific data source. It employs a broad spectrum of general heuristics based on HTML markup tags, format clues and tree structures of HTML documents, and the semantics of the XML tags. It uses a simple, extensible and explicit constraint mechanism to incorporate domain knowledge into the extraction process.
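To make the idea of markup-based heuristics concrete, here is a minimal sketch, not Quixote's actual procedure. It assumes a single heuristic (a heading whose text matches a topic keyword opens an XML element for that concept) and a hypothetical keyword table `CONCEPTS` standing in for domain knowledge. All names in it are illustrative.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical domain knowledge: topic concepts and keywords that signal them.
CONCEPTS = {
    "education": ("education", "degree"),
    "experience": ("experience", "employment"),
}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def match_concept(text):
    """Return the first concept whose keyword occurs in the text, else None."""
    lowered = text.lower()
    for concept, keywords in CONCEPTS.items():
        if any(keyword in lowered for keyword in keywords):
            return concept
    return None


class HeadingToXml(HTMLParser):
    """Open a new XML element whenever a heading names a known concept."""

    def __init__(self, root_tag):
        super().__init__()
        self.root = ET.Element(root_tag)
        self.current = self.root
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_heading:
            concept = match_concept(text)
            if concept is not None:
                self.current = ET.SubElement(self.root, concept)
                return
        # Any other text is attached to the most recently opened element.
        self.current.text = ((self.current.text or "") + " " + text).strip()


def to_xml(html, root_tag="resume"):
    parser = HeadingToXml(root_tag)
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


page = ("<h2>Education</h2><p>B.Sc. in CS</p>"
        "<h2>Work Experience</h2><ul><li>IBM</li><li>UC Davis</li></ul>")
print(to_xml(page))
```

A real converter would combine many such clues (font changes, list nesting, tree position) rather than rely on headings alone; the keyword table is only one simple way to express domain constraints explicitly.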
With respect to schema discovery, Quixote proposes the notion of a majority schema, an approximate schema of low coverage, to describe structures shared by the 'majority' of the documents. Existing schema discovery approaches infer either exact schemas (that describe structures found in any of the documents) or approximate schemas of low relevance (that cover structures not found in any of the documents). Documents gathered over the Web are too heterogeneous for either type of schema: the result may be an exact schema that is too large or an approximate schema that carries too little information.
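One simple way to picture a schema shared by the majority of documents is sketched below. This is an illustration under stated assumptions, not the dissertation's algorithm: it treats a document's structure as its set of root-to-element tag paths and keeps the paths found in more than a chosen fraction of the documents. The function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the set of root-to-element tag paths in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def majority_paths(documents, threshold=0.5):
    """Keep tag paths that occur in more than `threshold` of the documents."""
    counts = Counter()
    for doc in documents:
        counts.update(element_paths(doc))  # each path counted once per document
    cutoff = threshold * len(documents)
    return sorted(path for path, n in counts.items() if n > cutoff)


docs = [
    "<resume><contact><email/></contact><education/></resume>",
    "<resume><contact/><education/><hobbies/></resume>",
    "<resume><contact><email/></contact><experience/></resume>",
]
print(majority_paths(docs))
# ['resume', 'resume/contact', 'resume/contact/email', 'resume/education']
```

Here `hobbies` and `experience` each appear in only one of three documents and are dropped, which is why such a schema does not cover every structure in the collection.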
With respect to the integration of documents, Quixote presents an approach that automates the integration process by making use of domain knowledge about topic-specific documents. Existing schema integration approaches developed for relational and object-oriented data are not directly applicable to XML data.
Quixote adapts the integration techniques of these approaches to XML data. It also addresses the unique challenge of preserving the semantics of the documents during integration, since a majority schema does not cover all structures found in the documents.
Christina Yip Chung, Neel Sundaresan, "Method and System for Discovering a Majority Schema in Semi-Structured Data". IBM Patent Application ARC9-2000-0117-US1, Filed
Christina Yip Chung, Neel Sundaresan, "System and Method for Discovering Schematic Structure in Hypertext Documents". IBM Patent Application AM9-99-0173, Filed
Despite the necessity of misuse detection in database systems (DBS), means to guard the information stored in a database system against misuse are seldom used by security officers, because the security policies of the organization are either imprecise or not known at all. With the increasing complexity and dynamics of information systems, security policies may also be invalidated by changing requirements. We propose a misuse detection system called DEMIDS, which is tailored to relational database systems. DEMIDS uses audit logs to discover profiles that describe the typical behavior of users working with the DBS. The computed profiles can serve as a valuable tool for the security re-engineering of an organization by helping security officers to define or refine security policies and to verify existing security policies, if there are any.
Essential to the presented approach is that the access patterns of users typically form working scopes, captured as interesting itemsets, which comprise sets of feature/value pairs that describe users' behavior. DEMIDS considers domain knowledge about the data structures and semantics encoded in a given database schema through the notion of an interestingness measure. The interestingness measure is used to guide the search for interesting itemsets. Unlike existing data mining approaches, we tightly integrate the computation of interesting itemsets with the database system's data management and query processing features.
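As a rough illustration of itemsets over feature/value pairs, the sketch below counts how many audit records support each small itemset and keeps the well-supported ones. It is a simplification under stated assumptions: it runs in memory rather than inside the database system, it uses plain support counts instead of the DEMIDS interestingness measure, and the record fields (`user`, `table`, `op`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(records, min_support, max_size=2):
    """Count itemsets of feature/value pairs and keep those with enough support."""
    counts = Counter()
    for record in records:
        items = sorted(record.items())  # canonical order so equal itemsets collide
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical audit log: one dict of feature/value pairs per logged query.
audit_log = [
    {"user": "alice", "table": "orders", "op": "select"},
    {"user": "alice", "table": "orders", "op": "select"},
    {"user": "alice", "table": "customers", "op": "select"},
    {"user": "bob", "table": "payroll", "op": "update"},
]

profile = frequent_itemsets(audit_log, min_support=2)
for itemset, support in sorted(profile.items()):
    print(dict(itemset), support)
```

In this toy log, the itemset pairing `user=alice` with `op=select` is supported by three records, so it would be part of the profile, while bob's single `payroll` update would not.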
Components of the interestingness measure include the semantic 'closeness' of features, the amount of information conveyed by a set of features, and the number of records in the audit log supporting the profile. Existing data mining techniques often discover many rules that are too fine-grained, which makes it difficult for an administrator to implement appropriate security enforcing mechanisms. We propose a novel framework of concept hierarchies to capture the idea that concepts may be organized into different layers of abstraction. By incorporating concept hierarchies in the interestingness measure, DEMIDS is capable of computing concise profiles at the right level of granularity and detail.
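The effect of a concept hierarchy on granularity can be sketched as follows. This is an assumed, simplified model rather than the DEMIDS framework itself: each concept has at most one parent (column, then table, then database), and the hierarchy contents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy: each concept maps to its more abstract parent.
PARENT = {
    "orders.price": "orders",
    "orders.quantity": "orders",
    "customers.name": "customers",
    "orders": "sales_db",
    "customers": "sales_db",
}


def generalize(concept, levels):
    """Climb `levels` steps up the hierarchy, stopping at the top."""
    for _ in range(levels):
        concept = PARENT.get(concept, concept)
    return concept


def support_by_level(accesses, levels):
    """Count accesses after generalizing each accessed concept."""
    return Counter(generalize(access, levels) for access in accesses)


accesses = ["orders.price", "orders.quantity", "orders.price", "customers.name"]

for levels in range(3):
    print(levels, dict(support_by_level(accesses, levels)))
```

At level 0 the support is split across individual columns; at level 1 the three `orders` accesses merge into one concept, giving a more concise description; at level 2 everything collapses to `sales_db`, which is concise but conveys little. A measure that weighs support against information content would pick a level between these extremes.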
Christina Yip Chung, Michael Gertz, Karl Levitt, "Discovery of Multi-Level Security Policies". The Fourteenth Annual IFIP WG 11.3 Working Conference on Database Security, 2000.
Christina Yip Chung, Michael Gertz, Karl Levitt, "DEMIDS: A Misuse Detection System for Database Systems". The Third Annual IFIP TC-11 WG 11.5 Working Conference on Integrity and Internal Control in Information Systems, 1999.
Christina Yip Chung, Michael Gertz, Karl Levitt, "Misuse Detection in Database Systems Through User Profiling". The Second International Workshop on the Recent Advances in Intrusion Detection, 1999.
A Survey On Misuse Detection Systems