BioIT World 2013

Last week I attended BioIT World in Boston (http://www.bio-itworldexpo.com/) again, for the second time since we started running SyBIT. This is the main U.S. ‘edition’ of the BioIT World conference series; there are spin-offs in Europe and Asia, but they are much less well attended.

There were 12 (!) parallel tracks, covering topics from IT infrastructure and bioinformatics through clinical omics and visualization – see the conference site for details. For me, the highlights were definitely the talks by the BioTeam people, Chris Dagdigian and Chris Dwan. Dag gave some insight into their project with Sanofi, and Chris Dwan into the setup of the new New York Genome Center. Their key take-home messages were:

  • If you are an infrastructure resource provider and you do not have a cloud strategy yet, it is almost too late – researchers simply use their credit cards on Amazon behind your back. Your pricing needs to be competitive with that.
  • Private clouds make sense in certain environments, especially for flexibility and manageability, but always in conjunction with a local cluster and data infrastructure. Every cluster should have a virtualization component now. This is great when users want to do their own processing close to the data – just give them an IaaS to run their own VM, but with very close ties to the storage and cluster, delivering the necessary punch. It also makes cloudbursting into the public clouds more controllable.
  • Cloud APIs need to be carefully evaluated – beware of vendor lock-in.
  • DevOps is the future – ‘Infrastructure is Code’. IT infrastructure needs to be programmable through APIs, otherwise scalability cannot be achieved. This means that the classical sysadmin job disappears and IT managers need to be very good scripters in Chef, Puppet and other tools (see the sketch after this list). A lot of tooling for this comes out of the cloud communities. Multi-vendor and hybrid cloud usage will be the rule, not the exception.
  • Small local clusters disappear, superseded by fat nodes with a lot of memory. This makes sense: a single fat node now has more CPU and memory than an entire small cluster used to.
  • Storage is still the biggest expense in the life sciences. Consequence: ‘data flows downhill’, i.e. to wherever the cheapest storage can be found, and is left in ‘puddles’ there.
  • Storage arrays are getting smarter now, able to run applications close to the storage – one to watch.
  • Hadoop/HDFS needs a different hardware design! A classical cluster with lightweight local nodes and a parallel FS is NOT for Hadoop – it needs relatively large local disks, going back to the pizza-box model and away from blades again. But real Hadoop use cases are rare; do not just do it without a real need!
  • Software-defined networking is still ‘absolutely useless hype’ at this stage.
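
To make the ‘Infrastructure is Code’ point above a bit more concrete, here is a minimal sketch of the declarative model that tools like Chef and Puppet implement – plain Python rather than their actual DSLs, and with entirely hypothetical package, mount and service names: you declare the desired state of a node as data, and a converge step works out (and, in a real tool, applies) only the missing changes.

```python
# Toy illustration of declarative configuration management ("infrastructure
# is code"); not Chef or Puppet syntax, and all resource names are made up.

desired_state = {
    "packages": {"openmpi", "python-numpy", "nfs-common"},
    "mounts": {"/data": "nas01:/export/research"},
    "services": {"munge", "slurmd"},
}

def converge(node, desired):
    """Compare a node's actual state with the desired state and return the
    actions needed to reconcile them (a real tool would execute these)."""
    actions = []
    for pkg in desired["packages"] - node.get("packages", set()):
        actions.append("install package " + pkg)
    for path, source in desired["mounts"].items():
        if node.get("mounts", {}).get(path) != source:
            actions.append("mount %s at %s" % (source, path))
    for svc in desired["services"] - node.get("services", set()):
        actions.append("enable and start service " + svc)
    return actions

if __name__ == "__main__":
    # Actual state as reported by some inventory or monitoring system.
    node = {"packages": {"openmpi"}, "mounts": {}, "services": {"munge"}}
    for action in converge(node, desired_state):
        print(action)
```

The same description can then be applied to ten or to ten thousand nodes, which is what makes the approach scale across multi-vendor and hybrid cloud setups.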

But there were a lot of other interesting talks, for example the keynote by Atul Butte from Stanford, who showcased how treatments for various diseases can be found just by mining the known properties of existing drugs and finding new uses for them. He had some spectacular results. What also struck me was that he gave several examples of U.S. companies providing not only human tissue samples for diseases online (in a shopping cart!), but also analysis and even lab-rat experiments on demand. He could basically outsource all the clinical trials, testing the impact of the ‘old’ drug on the ‘new’ disease, in record time, and is now working on FDA approval.

On the various panels there were several interesting people. One worth mentioning was Andrea Norris, CIO of the NIH, who gave very interesting insights into how the NIH is thinking about data sharing and the publication thereof in the near future. They intend to start funding centers of excellence that are able to sustain data sharing, working on turning big data into knowledge (BD2K) and implementing the policies set by the funding agencies.

 


BioIT World Europe

I attended the BioIT World Europe conference in Hannover, Germany. It was co-hosted with Biotechnica, one of the biggest biotech trade fairs in Europe. Although the topics were interesting, the conference was not very well attended, and it was relatively expensive for the content. Also, most speakers came only for their talks and mostly disappeared right afterwards, so there was only a short time window to network with them – though because of the low attendance this was not a problem.

In more detail:

  • Andrew Lyall: ELIXIR. The ELIXIR project is a very large endeavor, with the aim of providing comprehensive access to public research data for all of Europe, federating all data providers. In Switzerland the SIB is a member (of course, as the producers of Swiss-Prot) and also aims to become one of the major data providers, but I did not see on Andrew Lyall’s slides that Switzerland has signed the initial agreements yet. Maybe Ioannis can tell us the status of this later.
  • Etzard Stolte, formerly Roche, now HP ‘CTO Life Science’: He gave a state-of-the-art-in-technology talk and voiced the opinion that in the future we can just ask high-level questions of a dataset in natural language and the computer will give a set of logical answers, like the Watson Jeopardy project or the new Wolfram Alpha platform from the makers of Mathematica. I personally think that science questions cannot be properly expressed in natural language and that this will not work in that context. I will not be able to tell it to write my Nobel Prize publication for me…
  • Chris Taylor, EBI: He discussed standards in general. There are many Minimal Information metadata standards now, almost too many, so he set up the MIBBI project to give information on all of them. I asked him whether I should use it already, and he advised waiting a couple more years for things to settle down.
  • HP Cloud: The first commercial sponsored talk was by HP. They offer both private cloud solutions and their own HP public cloud into which people can scale. The public cloud is still in beta and will remain so until at least next April, the speaker said.
  • Folker Meyer, Argonne. He gave a reasonable overview of bioinformatics using cloud resources. They built their own Argonne Workflow to manage their jobs.
  • DDN: The second vendor presentation. Their Web Object Scaler is a very nice technology; if it works, it would save us copying data back and forth over the WAN.
  • Arrysisbio: Vendor presentation of a genomics platform for data analysis, nicely done. The webpage is a good example of eye candy.
  • Carole Goble, University of Manchester, on using clouds for research. They ran a simple ‘naive’ job of putting an application on the cloud, but they had lots of technology they could simply reuse, like a ‘Smart Data Upload’. The web interface is built in Ruby on Rails. They also reuse a self-made orchestrator that decides when to ramp up or tear down instances. She says they built it in 4 days: Taverna was put on the cloud in that time, including tests etc. Development cost around $600, and one run of this Taverna workflow costs around $5. BUT: their large data was not shipped! A lot of preprocessing was done locally and only compressed, necessary data was shipped, which kept costs down. Pre- and post-processing happen on the local data server, which then submits jobs to the cloud. The reference dataset (Ensembl) needs to be available on the cloud; it was in the US Amazon region but not the EU region, so they had to run jobs there. She did experience failures and performance variations. She also says that you need to scale to many CPUs if you really want to profit from a time boost by using the cloud, i.e. getting your job done more quickly. Interesting comment on the side: you cannot just prototype at home using Eucalyptus and expect it to work on Amazon! The mechanisms are very different, even if the public API is identical.
  • Misha Kapushesky, EBI: “Mapping identifiers is bioinformatics’ biggest problem.” He presented the Expression Atlas for transcriptomics data (ebi.ac.uk/gxa); you can use any ID to look things up. Anyone can use the R Cloud at EBI: http://www.ebi.ac.uk/tools/rcloud. They also have the ability to roll anyone else’s analysis into their internal cloud.
  • Ola Spjuth, Uppsala University: UPPNEX – A Solution for Next Generation Sequencing Data Management and Analysis. The community webpages look nice.
  • Reinhard Schneider, Head of the Bioinformatics Core Facility, Luxembourg Centre for Systems Biomedicine, University of Luxembourg. He talked about ‘exaflops’ in biology, or what the applications could be in the future. His point is that today’s supercomputers are not really suitable for day-to-day analysis of bench data, because bench scientists and bioinformaticians have no time to port their codes to the complex supercomputing platforms. High throughput is needed rather than high capacity. He argues that bioinformatics will not become a supercomputing community. He also gave an interesting reference to a high-throughput I/O system, ParaMEDIC. His points are:
    • Problems are data-driven, with high I/O requirements
    • Problems do not scale beyond a few hundred cores
    • Applications are constantly under development
    • Codes are scripts more often than not
    • Bioinformaticians are not software engineers
  • Hermann Lederer: European Exascale Software Initiative. Life-science demands on supercomputing: he basically argues, against the previous speaker, that there are a lot of life science use cases on supercomputers. He showed the results of the DEISA project and further plans with PRACE – mostly structure prediction, but also tissue simulation etc. The previous speaker pointed out that these are not data-intensive applications, and that his point had been that for I/O-bound processes supercomputers do not add much value.
  • BlueArc commercial presentation. BlueArc makes use of FPGAs to scale their metadata head servers. He makes the point that NFS (NAS) and pNFS (cluster) usage put very different strains and requirements on the system, and that it is important to balance the setup between these two usages, as the end user will not care.
  • Daniel McLean, bioinformatics team at The Sainsbury Laboratory, Norwich. They have a small ‘embedded bioinformatician’ team model that works pretty well for them locally. He makes the point that this is an important service to the scientists. He had a nice slide on the know-how required for certain tasks that I think is reasonable:
    • Understanding Biology: Scientist 100%, Bioinformatician 50%
    • Understanding Analysis: Scientist 50%, Bioinformatician 50%
    • Understanding Management tools: Scientist 20%, Bioinformatician 100%
  • His point is to enable the scientists and make their lives easy. They mount their local NetApp onto the users’ desktops using SSHFS. People can change things, but there are monitoring scripts that send people email if they do not adhere to certain rules and conventions, like annotating data in a README file in data directories, whose structure is also predefined (a minimal sketch of such a check follows after this list). They make heavy use of Galaxy and scripting (Ruby, Perl, etc.). They have built a nice education site and activities around it, training the scientists in the use of their tools.
  • Daniel James: Fiji and ImageJ2 project. See http://fiji.sc . ImageJ is very successful and many people use it, but it is ill-engineered. ImageJ2 is an effort to redo it and this time do it cleanly. It is now done, and it is possible to extend it with tools and plugins; this is the Fiji project. It is a collaborative development approach that works very well. They communicate through IRC, have dedicated mailing lists, a YouTube channel, and Git as the code repository. Sharing code is easy: just write a plugin and people will receive it through the automatic updater. Very nice.
  • Urban Liebel, Karlsruhe Institute of Technology (KIT). They built a dedicated local ‘HPC server’ that processes the data output immediately at the instrument, in this case a microscope. This gives them real-time processing and feedback to the researcher. The analysis engine needs to be sized such that the real-time experience is not lost. Interestingly, they have developed some auxiliary services as well:
    • Harvester: can query online resources through one portal
    • Image finder: can search for images in publications instead of text – nice if you are interested in the diagrams in a paper
    • Sciencenet: a social-media-like science publication network – vote on the papers you read
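
To give a flavour of the ‘monitoring scripts that send email’ mentioned above, here is a minimal sketch of such a convention check – the directory layout, the rules and the notification step are hypothetical, not the Norwich team’s actual setup:

```python
# Sketch of a data-directory convention check: every project directory on the
# shared mount must contain a README. Paths and rules are hypothetical.
import os
import sys

DATA_ROOT = "/mnt/groupdata"   # e.g. the SSHFS-mounted NetApp share
REQUIRED_FILES = {"README"}    # conventions every project directory must follow

def find_violations(root):
    """Return (directory, missing file) pairs for top-level project dirs."""
    violations = []
    for entry in sorted(os.listdir(root)):
        project_dir = os.path.join(root, entry)
        if not os.path.isdir(project_dir):
            continue
        present = set(os.listdir(project_dir))
        for required in sorted(REQUIRED_FILES - present):
            violations.append((project_dir, required))
    return violations

if __name__ == "__main__":
    problems = find_violations(DATA_ROOT)
    for directory, missing in problems:
        # A real script would look up the directory owner and email them.
        print("%s: missing %s" % (directory, missing), file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Run from cron, a script like this is enough to keep a predefined directory structure honest without getting in the scientists’ way.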

International Conference on Systems Biology, Edinburgh, 10-14th Oct 2010

ICSB is a bioinformatics conference with a focus on systems models. On Monday I attended a talk on SBML model merging by François Fages, Inria. Fages presented a method for model merging and hierarchy building that is based on constraint programming. The source data were from biomodels.org, and the method used four merging primitives: species deletion, reaction deletion, species merging, and reaction merging. For more information see http://gprolog.inria.fr and http://contraintes.inria.fr. The approach is purely structural and does not work for larger models (it times out).
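
The four primitives are easy to picture on a toy model representation – the sketch below is my own illustration in plain Python, not SBML and not Fages’ constraint-programming formulation:

```python
# Toy model: a set of species plus named reactions mapping reactant sets to
# product sets. Only meant to illustrate the four merging primitives.
model = {
    "species": {"A", "B", "C"},
    "reactions": {"r1": ({"A"}, {"B"}), "r2": ({"B"}, {"C"})},
}

def delete_species(m, s):
    m["species"].discard(s)
    # Also drop any reaction that mentions the deleted species.
    m["reactions"] = {n: (i, o) for n, (i, o) in m["reactions"].items()
                      if s not in i and s not in o}

def delete_reaction(m, name):
    m["reactions"].pop(name, None)

def merge_species(m, keep, drop):
    """Treat two species as one, rewriting every reaction accordingly."""
    m["species"].discard(drop)
    m["reactions"] = {n: ({keep if x == drop else x for x in i},
                          {keep if x == drop else x for x in o})
                      for n, (i, o) in m["reactions"].items()}

def merge_reactions(m, keep, drop):
    """Treat two reactions as one, keeping only one of the names."""
    m["reactions"].pop(drop, None)

# Reducing one model towards another with such steps is what lets you decide
# whether it can be seen as a sub-model, and hence build a model hierarchy.
merge_species(model, "B", "C")
delete_reaction(model, "r2")
print(model)
```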

I presented a SyBIT poster on Monday. It gives an overview of our work, with emphasis on the workflows driven by P-GRADE and iBRAIN2, and on the openBIS database. Several UK and German groups expressed interest in the software and its portability.

On Tuesday Tom Freeman from the Roslin Institute spoke on BioLayout Express, http://www.biolayout.org. The software supports 3D visualisation and detailed analysis of expression data. It allows one to cluster one or two datasets and examine the results of the clustering, which provides a method for data integration.

On Wednesday Richard Baldock (MRC HGU Edinburgh) gave a very interesting talk on bio-atlases. He gave examples from the mouse brain atlas, where 3D images are annotated manually with reference to an ontology of anatomical terms. Christian von Mering gave a talk about protein interactions – the relationships between protein abundance measured by MS, expression, and protein abundance in various species (stored in PaxDb).


Broad Institute Visit, 28-29th July 2010, Anne Carpenter’s lab

Pauli Raemoe from Lucas Pelkmans’ lab at ETH and I visited Anne and her team. Anne’s team is developing Cell Profiler and Cell Profiler Analyst, tools used to process large batches of cell images. Pauli presented iBRAIN and Cell Classifier, and I gave a talk on SyBIT, openBIS and iBRAIN, which included demos. We were given demos of Cell Profiler and Cell Profiler Analyst, and discussed the software architecture in detail. The current challenge for Cell Profiler seems to be cell tracking, for which ImageJ may be used.
We will collaborate with the Broad Institute in developing our tools further, to support data flow between all the relevant applications. The modular nature of the systems and the use of web data protocols make this feasible. There was significant interest in openBIS, and the question to be decided is how much image viewing is supported in Cell Profiler Analyst and how much in openBIS. Similarly, integration with graph drawing software can be done in either Cell Profiler (or Analyst) or in openBIS. Further discussions will decide on how best to integrate the toolsets. We envisage sharing wiki links and lists of to-do items, to maximize benefits to both sides.


SSDBM 2010 in Heidelberg, 30th June-2nd July

The Scientific and Statistical Database Management conference has a long tradition of over 20 years. The conference started with an excellent keynote by Daniel Abadi from Yale, who talked about HadoopDB. Hadoop is a data management and query solution for very large data sets based on the MapReduce (MR) paradigm. It is excellent at splitting a query over many processing nodes and then assembling the results, and it is free and scalable. However, in comparison with relational databases, it is not as fast as the commercial products for some types of searching. HadoopDB tries to get the best of both worlds: a fast storage layer with indexing and compression from relational platforms, and parallelisation for free from Hadoop. The future of scientific data processing may well lie in hybrid solutions, such as HadoopDB or openBIS, which is being developed by CISD and SyBIT in Basel. MR is a paradigm that might be adapted to the type of data processing we support in our scientific workflows.
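
As a reminder of how simple the MR paradigm itself is, here is a minimal, self-contained sketch in plain Python – not Hadoop or HadoopDB code, and the records are invented – computing per-sample statistics the map/shuffle/reduce way:

```python
# Minimal illustration of the MapReduce paradigm: map records to (key, value)
# pairs, group by key, then reduce each group. Data and keys are made up.
from collections import defaultdict

records = [
    {"sample": "wildtype", "value": 1.2},
    {"sample": "wildtype", "value": 0.9},
    {"sample": "mutant", "value": 2.4},
]

def map_phase(record):
    # Emit (key, value) pairs; a real mapper may emit many per record.
    yield record["sample"], record["value"]

def reduce_phase(key, values):
    # Aggregate all values observed for one key.
    return key, {"n": len(values), "mean": sum(values) / len(values)}

# The framework's job: run mappers, shuffle/group by key, run reducers -
# each phase embarrassingly parallel across nodes.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(key, values) for key, values in groups.items()])
```

The attraction for our workflows is that the map and reduce functions are all a user has to write; partitioning, distribution and fault tolerance come from the framework.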

Another important development in scientific databases is the development and management of workflows. Two conference sessions were devoted to this problem, presenting both research issues and existing solutions and innovations. I particularly liked the talk about Taverna given by Paolo Missier, who debunked a lot of the myths surrounding Taverna. In particular, Taverna now supports workflow provenance via its ProvenanceDB. The structure of Taverna is similar to the division of roles in the systems we are building, openBIS and iBRAIN2: one database is normally responsible for the management of experimental results data, both raw and derived datasets, and another database manages provenance, i.e. the workflow history for a dataset.
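
To picture that division of roles, here is a minimal sketch of what a per-dataset provenance record could contain – my own illustration, not Taverna’s ProvenanceDB schema and not the openBIS/iBRAIN2 data model:

```python
# Sketch of a per-dataset provenance record: which workflow produced a derived
# dataset, from which inputs and with which parameters. Field names invented.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WorkflowStep:
    tool: str                   # e.g. "segmentation"
    parameters: Dict[str, str]  # settings used for this run
    inputs: List[str]           # dataset IDs consumed
    outputs: List[str]          # dataset IDs produced

@dataclass
class ProvenanceRecord:
    dataset_id: str             # the derived dataset this record explains
    workflow: str               # workflow name and version
    steps: List[WorkflowStep] = field(default_factory=list)

record = ProvenanceRecord(
    dataset_id="derived-0042",
    workflow="screening-analysis-v1",
    steps=[WorkflowStep("segmentation", {"threshold": "0.5"},
                        inputs=["raw-0001"], outputs=["derived-0042"])],
)
print(record.dataset_id, "was produced by", record.workflow)
```

The results database only needs to store the datasets and their IDs; everything about how a derived dataset came to exist lives in records like this one.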

OME Users Meeting, Paris 15-16 June 2010

I attended the European Open Microscopy Environment (OME) Users Meeting in Paris. There were around 50 attendees showcasing their tools, giving demos and gathering new user requirements. OME produces OMERO, a database and visualisation software suite for microscopy, and the new release of this software was presented in detail. OME software is being developed by a team in Dundee, led by Jason Swedlow and funded by the Wellcome Trust. The two main areas of application for OME software are high-content screening and large 3D imaging of organs, tissues and organisms. The current focus is on image storage and processing, but this is slowly shifting towards providing support for workflows which can automate data analysis, and towards results storage in OMERO. More information is available at the OME website.


SysMO2 meeting, near Noordwijkerhout, Netherlands, June 7-9th 2010

As a member of the SysMO2 scientific advisory board (see http://www.sysmo.net), I attended the presentations given by the consortia funded by SysMO (Systems Biology of Microorganisms). The aim of SysMO is the development of systems models. In most cases the consortia reported on new metabolomics datasets and the resulting models.

From the data management perspective this translates into two main deliverables supported by the SysMO-DB project (http://www.sysmo-db.org). First, the data are deposited in spreadsheets based on shared templates designed in collaboration between the SysMO-DB team and each project, and then either submitted to a central repository called SysMO-SEEK or linked to that database. Second, models are deposited in a model database called JWS (http://jjj.biochem.sun.ac.za/database/index.html), which also supports simulation.
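
A trivial sketch of what checking a submission against a shared template could look like – column names are invented and real SysMO-DB templates are Excel-based, so plain CSV is used here purely for illustration:

```python
# Hypothetical template-conformance check run before submitting a data
# spreadsheet; required column names are made up for this example.
import csv
import sys

REQUIRED_COLUMNS = {"strain", "condition", "metabolite", "concentration", "unit"}

def missing_columns(path):
    """Return the required columns absent from the file's header row."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return REQUIRED_COLUMNS - {column.strip().lower() for column in header}

if __name__ == "__main__":
    missing = missing_columns(sys.argv[1])
    if missing:
        sys.exit("Not conformant with the shared template, missing: %s"
                 % ", ".join(sorted(missing)))
    print("Header matches the shared template.")
```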

Projects interact with the SysMO-DB team via PALs (Project Area Liaisons), who attend consortium-wide meetings where they jointly discuss their data and processing requirements. Current work focuses on making SysMO-SEEK available as a toolbox for local installation and on extending JWS to cater for a wider variety of models.

EDBT/ICDT 2010

I attended these two conferences in Lausanne, 22-26 March 2010. ICDT is a database theory conference, while EDBT focuses on practical database topics, with less theory and more papers involving the benchmarking of new algorithms and solutions which at some later stage get incorporated into software products. The conference featured keynotes, regular papers, demos, and several workshops. The most interesting keynotes in my opinion were by Ian Horrocks on scalable semantic systems (those systems are not currently scalable at all) and by Pierre Fraigniaud on the importance of succinct labeling for XML data storage in databases. The demo session featured a B-Fabric demo by Fuat Akal et al., ‘B-Fabric: The Swiss Army Knife for Life Sciences’ – a very comprehensive presentation of the rich set of databases supporting biological data analysis at the FGCZ in Zurich. I also attended some talks on the efficient querying of workflows, data provenance and data integration. New trends in databases also include the use of new architectures; I found a tutorial by the ETH Systems group on the use of FPGAs in databases to be of particular interest, as well as a paper on the use of multicores in suffix tree construction. The PhD workshop had an interesting talk on web page archiving (by French Television/Radio) and also a talk on research funding by Moira Norrie of ETH. Via informal discussions I also learned about some interesting databases and tools which could be used for prototyping in SystemsX.ch work; in particular I liked BaseX from Konstanz.


SyBIT Retreat

We had a very productive retreat in the first week of February in Rigi-Kaltbad. It was productive on all levels – we got to know each other and each other’s work, but we also discussed basics like the mission of SyBIT and the next steps to be taken. The results of those discussions are already reflected in the public SyBIT wiki pages.

We will regularly organize more focused, technical meetings for all SyBIT members to ensure that we keep track of each other’s work.

The full program and information is only available on the internal wiki pages.


Alliance for Permanent Access Conference

Today I am at the annual conference of the Alliance for Permanent Access. I hope they will make the slides available, as some reflect my own findings very well.

In summary, everyone is having the same issues, and we in Switzerland are no less advanced than anyone else. Quite the opposite: SyBIT is a unique project in many respects, in that we work on a very specific solution for a very specific community, where take-up of what we do is there by construction.

People here struggle with:

  • Policies coming from authorities to enforce preservation and data reusability
  • Funding for such a new infrastructure seems impossible to get
  • Changing the culture of publishing, i.e. good data is also worth publishing
  • Tools exist, but almost all are proofs of concept and go unmaintained after the project finishes
  • What does it mean to hand over data to a data archive or library? Does it make sense?
  • There are a lot of ‘recommendations’ but actually nobody DOES much

At the same time, people realize that NOT addressing data preservation issues will have very serious repercussions. As usual, I have again learned a lot about what is going on in Switzerland by talking to our European friends.


Talk Highlights

David Giaretta – PARSE.Insight plenary: he says the focus should be on services. Infrastructure is essential. Chokepoints need to be identified early and addressed, since they prevent progress.

Salvatore Mele – CERN plenary: he says that NOT doing data curation, archiving and preservation would be more expensive than building the LHC. First, the data can be reused many times. And the scientists asked know that the data will be reused for things that the builders of the system did not think of, which makes the data of even greater value than the LHC itself. But he also says that researchers will not deal with data curation unless the funding agencies make them; incentives are missing.

Veronica Guidetti – ESA plenary: she also shows her community’s need for data preservation and access to historic data. She stresses the need for end-to-end solutions, as users don’t care about details. Support and education for the tools are mandatory for them to be useful. She also identifies and explicitly spells out many barriers to take-up (see slides).

Andreas Rauber – Univ. of Vienna, digital preservation: he gave an overview of tools for archives. There are many such tools, but most are just proofs of concept. The tools are not sustainable: many prototypes are built and not followed up, research stops at the POC, and the rest is ‘engineering’ and not publishable. Everyone does a metadata proof-of-concept for common image file formats (EXIF readers), and all stop there. Training is needed to involve more people, and awareness needs to be raised. But experts also have to be trained for the challenges in their respective domains, and expertise on the internals of preservation systems and methods has to be developed everywhere. He actually recommends what SyBIT tries to do: identify existing tools, take over support, and develop the specific additions needed. Only then can tools be put out to the community and actually be used and trusted, rather than remaining ‘toy examples’. But that is not enough: he says we also need to research tomorrow’s problems and prepare for them, since current solutions will not last long. The domains should focus on their core competences and not build generic do-it-alls.

Chris Rusbridge – DCC, Blue Ribbon Task Force: lots of interesting points on his slides. He also highlights that the value of data is different for everyone (for example Brian Laurence’s ‘fishy’ example). Currently data is used only within a community, i.e. by the same people who deposit it. For research data, IPR is not an issue (at least for the communities they have dealt with so far).

Barbara Sierman – Koninklijke Bibliotheek, Manager for Digital Preservation: she talks about the policies needed to preserve data. How do you measure (audit) and certify preservation entities? She says that an audit is needed for the organization, the digital object management, and the repository itself. There are many projects discussing how to set up digital repositories, what policies are needed, how they should operate, etc.; DRAMBORA is a site that helps with such audits. She also elaborates on the importance of establishing trust among all stakeholders: the digital repository’s users, its funders, and the organization running it. Many projects claim to establish ‘best practices’, but what are those really, and based on what authority? This also needs to be put into policies, according to her (if I understood correctly). Standards: common file formats may be easily addressed, but obscure formats need to be addressed by the organizations that need them.

Discussion: R&D and technical tools

  • Common formats for future data
  • Data management is important to preserve data; education of young researchers is still possible
  • Weakness of library tools is their focus on published data. Datasets do not have a publishing model.
  • Setting up templates to publish is important
  • Referring to a point inside a dataset needs to be possible, as a structured reference
  • A thin metadata layer necessary to publish data should be provided, but every domain should be allowed to specify its own data publishing schema
  • Standards will only work up to a point
  • Flexibility of tools is needed; people should be able to come up with their own tools
  • Libraries need to come up with ‘container standards’, just like AVI for video: inside there can be anything. However, only living standards can converge; a standard that exists just to archive is not good. So libraries need to update constantly.
  • Longevity: there needs to be a migration to professionally maintained services.