Describing Data Format Instances using the Bitstream Segment Graph (BSG)
Introduction
Storage and exchange of digital information depend strongly on the data format that defines the syntax and semantics of format-compliant data. Losing applicable data format knowledge can render digital information inaccessible:
- The current practice of human-readable format specification depends on human labour to both understand the specification and apply it to a specific problem correctly - a significant problem, given the tremendous complexity of some state-of-the-art data formats.
- An application can automate the format-compliant processing of data, but it is typically procedural and tightly bound to a specific technological environment and thus slowly becomes obsolete due to rapid technological change.
If we could delegate to machines the problem of identifying the composition of data according to an arbitrary data format, aspects of these problems would be solved. Yet how can a data format be described in a declarative, machine-processable manner? This requires formal means to describe the syntax and semantics of both a data format instance (such as a bit sequence) and a data format, which is a (potentially infinite) set of such instances.
Contributions
I present two contributions for describing data formats: the Bitstream Segment Graph (BSG) model for describing the composition of data, and the BSG Reasoning approach for describing the composition of (potentially infinite) sets of data.
- The BSG model describes a bijective mapping between a bit sequence and its contained information as primitive elements; a BSG instance is a causality graph.
- The BSG Reasoning approach describes a (potentially infinite) set of BSG instances through logic rules that govern the composition of each element of the set.
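The two ideas above can be illustrated with a minimal sketch. It is not the BSG model or the BSG Reasoning approach as defined in the publications below: the `Segment` class, the `split`, `primitives`, `reassemble` and `conforms` functions, and the 16-bit "record" layout are all hypothetical names chosen for illustration. The sketch covers only the structural splitting of a bit sequence into primitive elements and omits any other kind of relationship a causality graph may carry. A plain Python predicate stands in for logic rules.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical graph node: a named run of bits, optionally split into children."""
    name: str
    bits: str  # bit sequence written as a string of '0' and '1'
    children: List["Segment"] = field(default_factory=list)


def split(parent: Segment, layout: List[Tuple[str, int]]) -> Segment:
    """Split a segment into consecutive children given (name, bit length) pairs."""
    if sum(length for _, length in layout) != len(parent.bits):
        raise ValueError("layout must cover the parent segment exactly")
    offset = 0
    for name, length in layout:
        parent.children.append(Segment(name, parent.bits[offset:offset + length]))
        offset += length
    return parent


def primitives(segment: Segment) -> List[Segment]:
    """Return the leaf segments (primitive elements) in bit order."""
    if not segment.children:
        return [segment]
    leaves: List[Segment] = []
    for child in segment.children:
        leaves.extend(primitives(child))
    return leaves


def reassemble(segment: Segment) -> str:
    """Rebuild the bit sequence from the primitive elements alone."""
    return "".join(leaf.bits for leaf in primitives(segment))


def conforms(segment: Segment) -> bool:
    """Hypothetical rule describing a set of instances: a 4-bit version equal
    to 1, 4 bits of flags, and a payload made of one or more whole bytes."""
    leaves = primitives(segment)
    if [leaf.name for leaf in leaves] != ["version", "flags", "payload"]:
        return False
    version, flags, payload = leaves
    return (
        len(version.bits) == 4 and int(version.bits, 2) == 1
        and len(flags.bits) == 4
        and len(payload.bits) > 0 and len(payload.bits) % 8 == 0
    )


# Toy instance (assumed layout): 4-bit version, 4-bit flags, 8-bit payload.
source = Segment("record", "0001" "0000" "01000001")
split(source, [("version", 4), ("flags", 4), ("payload", 8)])
decoded = {leaf.name: int(leaf.bits, 2) for leaf in primitives(source)}
```

In this toy setting, `reassemble(source)` returns the original bit sequence, so no information is lost when moving between the bits and the primitive elements. The `conforms` rule accepts payloads of any whole number of bytes, which shows how a single rule can describe an unbounded set of instances.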
Applications
These contributions have applications in different domains such as Digital Preservation and IT Security:
- IT Security is concerned, among other things, with the security of applications that process data. Errors in the implementation of applications can lead to vulnerabilities that may be exploited, resulting in unauthorized access, data theft or even data loss. Documenting the delivery mechanisms of known exploits and analysing the security-related implications of data format design are two examples where data format-related knowledge comes into play.
- The domain of Digital Preservation focuses on how to preserve access to born-digital objects for future generations. This includes items such as digital photographs, songs, videos, text documents and more. When knowledge of the data format of such items is available solely as implemented software, its applicability is constantly threatened by ongoing technological change. Describing data formats in a declarative, standardized manner requires only a single format-agnostic software implementation to be maintained rather than a large number of format-specific ones.
Related Publications
Dissertation
- Michael Hartle: A Formal, Declarative Approach to Data Format Description. Dissertation at the Telecooperation research group, Department of Computer Science, Technische Universität Darmstadt, 2010.
PDF
Journal Article
- Michael Hartle, Andreas Fuchs, Markus Ständer, Daniel Schumann, Max Mühlhäuser: Data Format Description and its Application in IT Security.
In: International Journal on Advances in Security (IJAS), Vol. 2, No. 1, 2009.
Conference Papers
- Michael Hartle, Arsene Botchak, Daniel Schumann, Max Mühlhäuser: A Logic-based Approach to the Formal Description of Data Formats.
In: Proceedings of The Fifth International Conference on Preservation of Digital Objects (iPRES), London, United Kingdom. pp. 292-299, The British Library, 2008.
PDF, Bibtex
- Michael Hartle, Daniel Schumann, Arsene Botchak, Erik Tews, Max Mühlhäuser: Describing Data Format Exploits Using Bitstream Segment Graphs.
In: Proceedings of the Third International Multi-Conference on Computing in the Global Information Technology (ICCGI), Athens, Greece. pp. 119-124, IEEE Press, New York, NY, 2008.
Note: This publication received a Best Paper Award.
PDF, Bibtex
- Michael Hartle, Friedrich-Daniel Möller, Slaven Travar, Benno Kröger, Max Mühlhäuser: Using Bitstream Segment Graphs for Complete Description of Data Format Instances.
Jose Cordeiro, Boris Shishkov, Alpesh Kumar Ranchordas and Markus Helfert (Editors). In: Proceedings of the Third International Conference on Software and Data Technologies (ICSOFT), Porto, Portugal. pp. 198-205, Institute for Systems and Technologies of Information, Control and Communication (ISTICC), 2008, ISBN 978-989-8111-53-1.
PDF, Bibtex