SSDBM 2010 in Heidelberg, 30th June-2nd July

The Scientific and Statistical Database Management (SSDBM) conference has a tradition going back over 20 years. This year it started with an excellent keynote by Daniel Abadi from Yale, who talked about HadoopDB. Hadoop is a data management and query solution for very large data sets, based on the MapReduce (MR) paradigm. It is excellent at splitting a query over many processing nodes and then assembling the results. It is free and scalable. However, for some types of search it is slower than commercial relational databases. HadoopDB tries to get the best of both worlds: a fast storage layer with indexing and compression from relational platforms, and parallelisation for free from Hadoop. The future of scientific data processing may well lie in hybrid solutions such as HadoopDB or openBIS, which is being developed by the CISD and SyBIT in Basel. MR is a paradigm that might be adapted to the type of data processing we support in our scientific workflows.
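To make the "split the query, then assemble the results" idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python. It is not Hadoop or HadoopDB code: the record fields (`instrument`, `reading`) and all function names are made up for illustration, and a thread pool stands in for the cluster of processing nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(partition):
    """Map step: turn each record of one partition into a (key, value) pair."""
    # Hypothetical record layout: {"instrument": str, "reading": float}
    return [(record["instrument"], record["reading"]) for record in partition]


def shuffle(mapped_parts):
    """Group all values emitted by the mappers by their key."""
    groups = defaultdict(list)
    for part in mapped_parts:
        for key, value in part:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values) / len(values)


def run_job(partitions):
    """Run the map step on each partition in parallel, then assemble results."""
    # The thread pool is only a stand-in for separate processing nodes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, partitions))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Calling `run_job` with a list of partitions, each a list of records, returns the mean reading per instrument. In a real deployment the partitions would live on different nodes and the framework would handle the shuffle; the sketch only shows the shape of the computation.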

Another important topic in scientific databases is the development and management of workflows. Two conference sessions were devoted to this problem, presenting both research issues and existing solutions and innovations. I particularly liked the talk about Taverna given by Paolo Missier, who debunked a lot of the myths surrounding Taverna. In particular, Taverna now supports workflow provenance via its ProvenanceDB. The structure of Taverna is similar to the division of roles in the systems we are building, openBIS and iBRAIN2. One database is normally responsible for managing experimental results data, both raw and derived datasets, and another database manages provenance, i.e. the workflow history for a dataset.
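The division of roles described above can be sketched with two small SQLite databases. This is only an illustration under my own assumptions: the table names, columns and helper functions are hypothetical and do not reflect the actual schemas of Taverna's ProvenanceDB, openBIS or iBRAIN2. The sketch also assumes the derivation graph has no cycles.

```python
import sqlite3

# Database 1: experimental results (raw and derived datasets).
# Hypothetical schema, for illustration only.
data_db = sqlite3.connect(":memory:")
data_db.execute(
    "CREATE TABLE dataset (id TEXT PRIMARY KEY, kind TEXT NOT NULL, location TEXT NOT NULL)"
)

# Database 2: provenance, i.e. which workflow step derived what from what.
prov_db = sqlite3.connect(":memory:")
prov_db.execute(
    "CREATE TABLE derivation (output_id TEXT NOT NULL, input_id TEXT NOT NULL, workflow_step TEXT NOT NULL)"
)


def register_dataset(dataset_id, kind, location):
    """Store a dataset record in the results database."""
    data_db.execute(
        "INSERT INTO dataset VALUES (?, ?, ?)", (dataset_id, kind, location)
    )


def record_derivation(output_id, input_ids, workflow_step):
    """Store one provenance row per input used to produce output_id."""
    prov_db.executemany(
        "INSERT INTO derivation VALUES (?, ?, ?)",
        [(output_id, input_id, workflow_step) for input_id in input_ids],
    )


def lineage(dataset_id):
    """Walk the provenance database back from dataset_id towards raw data."""
    history = []
    frontier = [dataset_id]
    while frontier:
        current = frontier.pop()
        rows = prov_db.execute(
            "SELECT workflow_step, input_id FROM derivation"
            " WHERE output_id = ? ORDER BY input_id",
            (current,),
        ).fetchall()
        for step, input_id in rows:
            history.append((current, step, input_id))
            frontier.append(input_id)
    return history
```

The only link between the two databases is the dataset identifier, so the results store can be queried and backed up without touching the workflow history, and vice versa.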

SysMO2 meeting, near Noordwijkerhout, Netherlands, June 7-9th 2010

As a member of the SysMO2 scientific advisory board, I attended the presentations given by the consortia funded by SysMO (Systems Biology of Model Organisms). The aim of SysMO is the development of system models. In most cases, the consortia reported on new metabolomics datasets and the resulting models.

From the data management perspective, this translates into two main deliverables supported by the SysMODB project. First, the data are deposited in spreadsheets based on shared templates, designed in collaboration between the SysMODB team and each project, and then either submitted to a central repository called SysMO-SEEK or linked to that database. Second, models are deposited in a model database called JWS, a website that also supports simulation.

Projects interact with the SysMODB team via PALs (Project Area Liaisons), who attend consortium-wide meetings where they jointly discuss their data and processing requirements. Current work focuses on making SysMO-SEEK available as a toolbox for local installation and on extending JWS to cater for a wider variety of models.