A Framework for Provenance Management in eScience

Satya S. Sahoo
Assistant Professor, Center for Clinical Investigations
Case Western Reserve University
White 411
11:30am - 12:30pm
October 7, 2010

Provenance metadata, which describes the history or lineage of an entity, is essential for ensuring data quality, verifying the correctness of process execution, and computing trust values. Existing provenance systems do not adequately address the requirements of an emerging set of applications in the new eScience or Cyberinfrastructure paradigm and the Semantic Web. Provenance in these applications incorporates complex domain semantics on a large scale and serves a variety of uses, including accurate interpretation by software agents, trustworthy data integration, reproducibility, attribution for commercial or legal applications, and trust computation. In this talk, we will introduce the notion of “semantic provenance” to address these requirements for eScience and Semantic Web applications.

We will describe a framework for managing semantic provenance that addresses three issues: (a) provenance representation, (b) query and analysis, and (c) scalable implementation. First, we will introduce a foundational model of provenance called Provenir, which serves as an upper-level reference ontology to facilitate provenance interoperability. Second, we will define a classification scheme for provenance queries based on query characteristics and use this scheme to define a set of specialized provenance query operators. Third, we will describe the implementation of a highly scalable query engine that supports the provenance query operators. For query optimization, the engine uses a new class of materialized views based on the Provenir ontology, called Materialized Provenance Views (MPV).

We will also define a novel provenance tracking approach called Provenance Context Entity (PaCE) for the Resource Description Framework (RDF) model used in Semantic Web applications. PaCE, defined in terms of the Provenir ontology, is a more effective and scalable approach to RDF provenance tracking than the currently used RDF reification vocabulary.
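
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of what tracking the source of a single RDF statement costs when the standard RDF reification vocabulary is used. It illustrates only the reification baseline, not PaCE itself, whose construction is not detailed in this abstract. The example.org URIs, the derivedFrom property, and the reify helper are hypothetical names chosen for illustration.

```python
# Namespace of the standard RDF vocabulary (rdf:Statement, rdf:subject, ...).
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_id, triple, source):
    """Return the triples needed to attach a source to one RDF statement
    using the RDF reification vocabulary."""
    s, p, o = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
        # Hypothetical provenance property; not taken from the talk.
        (statement_id, "http://example.org/derivedFrom", source),
    ]


# One domain statement (hypothetical URIs).
triple = (
    "http://example.org/geneA",
    "http://example.org/interactsWith",
    "http://example.org/geneB",
)

reified = reify(
    "http://example.org/stmt1",
    triple,
    "http://example.org/source/article42",
)

# Four reification triples plus one provenance triple per tracked statement.
print(len(reified))
```

Under these assumptions, each statement whose provenance is tracked requires five additional triples on top of the original one, which suggests why an alternative to reification may be attractive for large RDF datasets.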

Satya Sahoo is an assistant professor in the Center for Clinical Investigations at the Case Western Reserve University School of Medicine. He has been collaborating with biomedical and life science researchers for the past six years to create Semantic Web informatics solutions for ontology-driven data integration, services-based scientific workflows, provenance metadata management, and knowledge discovery. His research has led to the creation of three publicly available biomedical ontologies listed at the National Center for Biomedical Ontology; a data exchange standard, called GLYDE, for the glycoproteomics community; and a framework for provenance management with applications in biomedicine, sensor networks, and scientific workflow engines developed by the University of Manchester, UK (Taverna) and Microsoft Research (Trident).