Date of Award

Winter 1997

Project Type

Dissertation

Program or Major

Computer Science

Degree Name

Doctor of Philosophy

First Advisor

Director: R. Daniel Bergeron

Abstract

Scientific data, which includes multimedia data (e.g., images, audio, and video) and non-standard data (e.g., fingerprints and DNA sequences), is characterized by rich and complex inter-instance relationships in addition to the inter-entity relationships found in traditional data. Conventional data models are insufficient for modeling such inter-instance relationships. This thesis proposes a metric-based scientific data model built on the notions of data-as-functions and pseudo-quasimetrics, which model inter-entity and inter-instance relationships, respectively. Compared to other scientific data models, the metric-based conceptual model can be applied to many data sets for which geometric views might not otherwise be available.
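
For readers unfamiliar with the terminology, the standard textbook definitions can be stated as follows. This is a sketch of the usual conventions; the symbols $X$ and $d$ are introduced here for illustration, and the thesis itself may state the axioms slightly differently.

```latex
% Standard definitions (assumed conventions, not quoted from the thesis).
Let $X$ be a set and $d : X \times X \to [0, \infty)$.
\begin{itemize}
  \item $d$ is a \emph{pseudo-quasimetric} if, for all $x, y, z \in X$,
        \[ d(x, x) = 0 \quad\text{and}\quad d(x, z) \le d(x, y) + d(y, z). \]
  \item $d$ is a \emph{pseudo-metric} if, in addition, it is symmetric:
        \[ d(x, y) = d(y, x). \]
  \item $d$ is a \emph{metric} if, in addition, distinct points are separated:
        \[ d(x, y) = 0 \implies x = y. \]
\end{itemize}
```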

A detailed approach is outlined for exploring and deriving pseudo-quasimetrics to represent inter-instance relationships in a wide variety of data. In particular, we introduce the notion of observable properties and show how it can be applied with ideas from point set topology to systematically derive metrics from nonmetric data components such as categorical data. We also demonstrate the use of continuity as a mathematically precise tool to validate metrics derived through the proposed approach.

To support the metric-based model at the physical level, we developed two simple mechanisms: the multipolar mapping, which transforms a pseudo-metric space into a multidimensional space, and the median transformation, which derives a pseudo-metric from a pseudo-quasimetric. After applying the multipolar mapping and (possibly) the median transformation, existing point spatial data structures such as quadtrees or octrees can readily be used for metric data storage and access. The results of our performance analysis demonstrate that the multipolar approach is robust and stable over a wide range of data parameters for data sets with intrinsic dimensionality of 10 or less. While it is still unclear whether the multipolar approach offers a significant performance advantage on proximity queries for data sets of very high dimensionality, preliminary results for 100-dimensional data still show excellent performance on nearest neighbor queries.
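
The abstract does not define the multipolar mapping, so the following is only a minimal sketch of one plausible reading: each object is mapped to the vector of its distances to a small set of reference objects ("poles"). The function names `multipolar_map`, `chebyshev`, and `hamming`, the choice of poles, and the use of Hamming distance on short DNA-like strings are all assumptions made for illustration and are not taken from the thesis.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def multipolar_map(
    x: T, poles: Sequence[T], dist: Callable[[T, T], float]
) -> Tuple[float, ...]:
    """Map x to the vector of its distances to each pole (assumed construction)."""
    return tuple(dist(x, p) for p in poles)


def chebyshev(u: Sequence[float], v: Sequence[float]) -> float:
    """L-infinity distance between two mapped points."""
    return max(abs(a - b) for a, b in zip(u, v))


def hamming(a: str, b: str) -> int:
    """Number of differing positions; a metric on equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


poles = ["AAAA", "CCCC"]  # hypothetical poles chosen only for illustration
u = multipolar_map("ACGT", poles, hamming)
v = multipolar_map("AACC", poles, hamming)

# By the triangle inequality, the distance between mapped points never
# exceeds the distance between the original objects.
assert chebyshev(u, v) <= hamming("ACGT", "AACC")
```

Under this assumed construction, the mapped points are ordinary coordinate vectors, so they can be stored in a point spatial data structure, and the lower-bound property lets a range query discard any candidate whose mapped distance from the query already exceeds the search radius.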
