Summary

In recent years the amount of biological data has exploded to the point where much useful information can only be extracted by complex computational analyses. Such analyses are greatly facilitated by metadata standards, both in terms of the ability to compare data originating from different sources, and in terms of exchanging data in standard forms, e.g. when running processes on a distributed computing infrastructure. However, standards thrive on stability whereas science is constantly moving, with new methods being developed and old ones modified. Maintaining both the metadata standards and all the code required to make them useful is therefore a non-trivial problem. Memops is a framework that uses an abstract definition of the metadata (described in UML) to generate internal data structures and subroutine libraries for data access (application programming interfaces, or APIs, currently in Python, C and Java) and data storage (in XML files or databases). For the individual project these libraries obviate the need for writing code for input parsing, validity checking or output. Memops also ensures that the code is always internally consistent, massively reducing the need for code reorganisation. Across a scientific domain, a Memops-supported data model makes it easier to maintain complex standards that can capture all the data produced in that domain, share them among all programs in a complex software pipeline, and carry them forward to deposition in an archive. The principles behind the Memops generation code will be presented, along with example applications in Nuclear Magnetic Resonance (NMR) spectroscopy and structural biology.
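To give a flavour of the approach, the sketch below shows the kind of data-access code a Memops-style generator could emit from a UML class description. All names here (the Molecule class, its attributes and methods) are hypothetical illustrations rather than the actual generated API; the point is that typed attributes, validity checks and standard serialisation come from the model definition, so individual projects never hand-write parsing, checking or output code.

```python
import xml.etree.ElementTree as ET


class Molecule:
    """Illustrative data-access class for a hypothetical 'Molecule' UML class.

    In a Memops-style framework, a class like this would be generated
    automatically from the abstract model, not written by hand.
    """

    def __init__(self, parent, name, molType="protein"):
        self._parent = parent      # link to the owning (parent) object
        self.name = name           # assignment goes through the validated setter
        self.molType = molType

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Validity check derived from a (hypothetical) constraint in the model:
        # names must be non-empty strings.
        if not isinstance(value, str) or not value:
            raise ValueError("Molecule.name must be a non-empty string")
        self._name = value

    def toXml(self):
        # Standard XML serialisation generated for every class in the model,
        # so storage code stays consistent across the whole API.
        elem = ET.Element("Molecule", name=self.name, molType=self.molType)
        return ET.tostring(elem, encoding="unicode")
```

Because every class is generated from the same model description, a change to the model propagates consistently to the Python, C and Java APIs and to the XML or database storage layer, which is what keeps the code internally consistent without manual reorganisation.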
Introduction

In recent times, the combination of digitization, high-throughput approaches and modern computing techniques has revolutionized the relationship between scientists and data, in terms of both size and access. These advances present great opportunities but also create considerable problems. Most data now exists in electronic form at some point in its life, and it is therefore extremely important that data can be passed seamlessly between the many different programs that might be used to process and analyse it. If all scientific software were always written to some common data standard, this would not be difficult. In practice, however, it is a non-trivial problem. Science is primarily driven by the need to generate results rather than to conform to standards, even where such standards exist and can be agreed upon in constantly evolving fields. The need for standards remains, however. As high-throughput methodologies have proliferated, and networks have made it increasingly simple to move data to wherever it is needed, there has been increased interest in defining data standards across a large number of fields where immense amounts of data need to be organised and exploited. Recent reviews by Brazma et al. [1] on data standards and by Swertz and Jansen [2] on software infrastructure give a good account of both current efforts and the underlying...