BioIT World 2013

Last week I attended BioIT World in Boston (http://www.bio-itworldexpo.com/) again, for the second time since we have been running SyBIT. This is the main U.S. ‘edition’ of the BioIT World conference series; there are spin-offs in Europe and Asia, but they are much less well attended.

There were no fewer than 12 parallel tracks, focusing on topics from IT infrastructure and bioinformatics through clinical omics and visualization – see the conference site for details. For me, the highlights were definitely the talks by the BioTeam people, Chris Dagdigian and Chris Dwan. Dag gave some insight into their project with Sanofi, and Chris Dwan into the setup of the new New York Genome Center. Their key take-home messages were:

  • If you are an infrastructure resource provider and you do not have a cloud strategy yet, it is almost too late – researchers simply use their credit cards on Amazon behind your back. Your pricing needs to be competitive with that.
  • Private clouds make sense in certain environments, especially for flexibility and manageability, but always in conjunction with a local cluster and data infrastructure. Every cluster should have a virtualization component now. This is great when users want to do their own processing close to the data – just give them an IaaS to run their own VMs, but with very close ties to the storage and cluster, delivering the necessary punch. It also makes cloudbursting into the public clouds easier to control.
  • Cloud APIs need to be carefully evaluated – beware of vendor lock-in.
  • DevOps is the future – ‘Infrastructure as Code’. IT infrastructure needs to be programmable through APIs, otherwise scalability cannot be achieved (see the sketch after this list). This means that the classical sysadmin job disappears and IT managers need to be very good scripters in Chef, Puppet and other tools. A lot of the tooling for this comes out of the cloud communities. Multi-vendor and hybrid cloud usage will be the rule, not the exception.
  • Small local clusters are disappearing, superseded by fat nodes with a lot of memory. This makes sense: a single fat node now has more CPU and memory than a whole small cluster did previously.
  • Storage is still the biggest expense in the life sciences. The consequence: ‘data flows downhill’, i.e. to wherever the cheapest storage can be found, and is left in ‘puddles’ there.
  • Storage arrays are getting smarter, becoming able to run applications close to the storage – one to watch.
  • Hadoop/HDFS needs a different hardware design! A classical cluster with lightweight local nodes and a parallel file system is NOT right for Hadoop – it needs relatively large local disks, going back to the pizza-box model and away from blades again. But real Hadoop use cases are rare; do not adopt it without a real need!
  • Software defined networking is still ‘absolutely useless hype’ at this stage.
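
A sketch of the ‘Infrastructure as Code’ idea from the DevOps point above: booting and tearing down a compute node entirely through an API. This is my own minimal example using boto, the Python library for EC2 – the region, AMI id and key name are placeholders, not anything shown in the talks:

    import boto.ec2

    # Credentials are read from the environment (AWS_ACCESS_KEY_ID etc.).
    conn = boto.ec2.connect_to_region("us-east-1")

    # Provision a compute node programmatically instead of racking hardware.
    reservation = conn.run_instances(
        "ami-12345678",            # placeholder AMI id
        instance_type="m1.large",  # placeholder instance size
        key_name="sybit-key",      # placeholder SSH key pair name
    )
    node = reservation.instances[0]

    # ... run jobs on the node ...

    # The same API tears the node down again when the work is done.
    conn.terminate_instances(instance_ids=[node.id])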

But there were a lot of other interesting talks, for example the keynote by Atul Butte from Stanford, who showcased how treatments for various diseases can be found just by mining the known properties of existing drugs and finding new uses for them. He had some spectacular results. What also struck me was that he gave several examples of companies in the U.S. providing not only human tissue samples for diseases online (in a shopping cart!), but also analysis and even lab-rat experiments on demand. He could basically outsource entire clinical trials, testing the impact of the ‘old’ drug on the ‘new’ disease, in record time, and is now working on FDA approval.

On the various panels there were several interesting people. One worth mentioning was Andrea Norris, CIO of the NIH, who gave very interesting insights into how the NIH is thinking about data sharing and the publication of shared data in the near future. They intend to start funding centers of excellence that can sustain data sharing, working on turning big data into knowledge (BD2K) and implementing the policies set by the funding agencies.


Posted in Trip Reports. No Comments »

Cloud Proof of Concept Project

SyBIT is engaged in a proof-of-concept project in academic cloud computing together with ETH, UZH and SWITCH, and we provide several of the project’s use cases. The motivation has to do with SyBIT’s sustainability effort – all the services and tools that we provide for SystemsX.ch could be hosted in an academic cloud for further use beyond SystemsX in the future. In the coming years we would also greatly profit from an academic cloud infrastructure for servers and collaboration tools for our projects.

The project is called Academic Cloud Computing and Provisioning and is funded by a SWITCH-AAA grant. Three testbeds are being set up as part of the project: a commercial ‘private cloud appliance’ from HP called ‘HP CloudSystem Matrix’, which is already operational at ETH; a self-built private cloud, also at ETH; and a self-built private cloud at UZH. In addition, SWITCH is building its own cloud testbed. All of these are built so that we can run complementary experiments on them, for both the infrastructure and the application components.

The choice between the many cloud stacks is not easy, but after a long and detailed evaluation it was decided to go with OpenStack for all testbeds. We have also hooked up with the cloud team at the ICCLab of the Zurich University of Applied Sciences.
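
As a flavor of what driving the testbeds looks like, here is a minimal sketch of booting a VM through the OpenStack API with python-novaclient; the credentials, endpoint, image and flavor names are placeholders of mine, not our actual setup:

    from novaclient.v1_1 import client

    # Placeholder credentials and Keystone endpoint for a testbed.
    nova = client.Client("demo-user", "secret", "demo-tenant",
                         "http://cloud.example.org:5000/v2.0/")

    # Look up an image and a flavor by name, then boot a server.
    image = nova.images.find(name="Ubuntu 12.04")   # placeholder image name
    flavor = nova.flavors.find(name="m1.small")
    server = nova.servers.create("poc-test-vm", image, flavor)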

Posted in Infrastructure. No Comments »

Tech Day on Proteomics

The SyBIT Tech Day on Proteomics was held yesterday in Bern. I’d like to thank everyone who contributed and attended, especially our friends from Geneva, who had the longest trip!

We covered a very broad spectrum of topics. It was especially interesting to hear about the latest status of the standardization efforts at Swissprot and the SIB. We also had a good share of technical talks and more detailed presentations on proteomics bioinformatics.

Posted in Updates. No Comments »

SystemsX.ch Autumn School

We are organizing the SystemsX.ch autumn school this year, focusing on HCS and proteomics data analysis. We have also invited KNIME to teach their tool to the students; it is potentially a very useful integrator for any of our tools and lowers the barrier to their use by non-expert bioinformaticians.

For details see http://de.amiando.com/systemsx-autumnschool2012

Posted in Updates. No Comments »

IEEE eScience Workshop Stockholm

This week I attended the annual IEEE eScience workshop, this time in Stockholm. The program was shared with the Microsoft Research eScience workshop, and there were also several side workshops; I attended one on computing advances in the life sciences.

The idea was to get a current snapshot of the international eScience landscape, which is really what SyBIT is doing too. There were a lot of references to Jim Gray, who advocated the ’embedded’ model very early on. Many people referred to his slides, which he had already shown at ETH in 2006 when I invited him to give a talk. The ideas and problems he referred to are still the same, and not much has changed in the past five years. But at least now the whole community repeats his statements; I remember that at the time people were much more skeptical. Still, there are voices saying that all this support given to other sciences is not what computer scientists should do (since it is not computer science anymore). True, but I personally view computer science like mathematics – super useful where applicable, but most of the research is done for the pure beauty of it and stays in the ivory tower. Luckily, most people want to do things that are applicable. Real examples of working with researchers are still rare, though: everyone was quoting the SDSS work of Alex Szalay and Jim Gray, which I had the luck of being part of as well – but that was 1999-2001, followed by refinement until 2004 and some crowdsourcing (Galaxy Zoo). So not much has really happened in the past five years in terms of collaborations. The UK eScience program was of course quite successful in the early 2000s, and Tony Hey gave a nice talk on that.

It was, however, a bit disheartening that many of the papers presented in the research tracks of the conference were just about yet another workflow, yet another automation, yet another data mining effort. When asked why they did not use existing tools, the answer was always ‘interesting, I did not know about that’. Which tells me that at the very least the review process is broken.

Still, I collected a few pointers that can be useful in sustainability discussions when talking about SyBIT’s future beyond SystemsX.ch (in addition to Jim’s slides).

There were also interesting discussions about clouds, grids and e-Infrastructures in general. Basically, what we are trying with SyBIT is very much in line with what people are thinking of doing, so we are already ahead of the crowd. But almost everyone is hitting the same problem of sustained funding, like the EGI: there is money to develop and build something, but there are no business models available to fund sustained support and operation of the developed tools. My personal conclusion is that trying to unify all domains at the e-Infrastructure level is a misguided effort (well, maybe some – see the MIT Convergence text). It is much more efficient to let the scientific domains sort out their problems individually, and if they then want to converge further, let them sort that out too. Imposing it from the outside does not work. So we need a life-science-specific effort to support data-intensive life science research.

Posted in Updates. No Comments »

Retreat with IMSB

This week we had a small, focused retreat or ‘code camp’ with Lars Malmström’s group, which works very closely with SyBIT on providing the tools and services for proteomics analysis in PhosphoNetX in the Aebersold lab.

We were 11 people: 5 from Lars’ group, 5 from SyBIT, and myself. The idea was to get everyone up to speed on the status of all available components of the toolbox at IMSB and to write usable workflows that can be put to use immediately.

We were able to go through all components of the toolbox, and we documented a ‘test’ workflow – a simple genomics search using bowtie (a minimal sketch of such a step is shown below). We identified the weak points in the integration of the components and improved them on the spot. Work on other workflows has also started and will now continue back in the office.
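
The bowtie step at the heart of that test workflow is essentially a single command line; here is a minimal sketch of wrapping such a call in Python – the index prefix and file names are placeholders, not our actual data:

    import subprocess

    # Align reads against a pre-built bowtie index, writing SAM output.
    subprocess.check_call([
        "bowtie",
        "-S",            # emit alignments in SAM format
        "genome_index",  # prefix of the pre-built index (placeholder)
        "reads.fastq",   # input reads (placeholder)
        "aligned.sam",   # output file (placeholder)
    ])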

The retreat surpassed my expectations of what can be achieved – one can make a lot of progress this way – but we were also at it from morning till late. Of course we also made a couple of excursions into the beautiful landscape around the Vierwaldstätter See.

Posted in Updates. No Comments »

SCI-BUS Project Kickoff

SCI-BUS is an EU FP7 framework project that started on Oct. 1st, 2011. ETH is a partner and has signed up to further develop and enhance the proteomics portal solution that is already in place. Since this is work we need to do in SyBIT anyway, it is a perfect match: we get an extra person funded by the EU to perform a technology upgrade of our installation. Most of the work will be done in the first 6-8 months; after that it is about extending and operating the portal and helping others set it up for their own purposes. The project will produce 11 different gateways, and ours is just one of them; the others come from different scientific domains, from physics to citizen web gateways.

See http://www.sci-bus.eu/ for more information on the project.

Posted in Updates. No Comments »

BioIT World Europe

I attended the BioIT World Europe conference in Hannover, Germany. It was co-hosted with Biotechnica, one of the biggest biotech trade fairs in Europe. Although the topics were interesting, the conference was not very well attended and was relatively expensive for the content. Also, the speakers mostly just came for their talks and disappeared right afterwards, so there was only a short window to network with them – though given the low attendance, this was not a problem.

In more detail:

  • Andrew Lyall: ELIXIR. The ELIXIR project is a very large endeavor with the aim of providing comprehensive access to public research data for all of Europe, federating all data providers. In Switzerland the SIB is a member (of course, as the producers of Swissprot) and also aims to become one of the major data providers, but I did not see on Andrew Lyall’s slides that CH had signed the initial agreements yet. Maybe Ioannis can tell us the status of this later.
  • Etzard Stolte, formerly Roche, now ‘CTO Life Science’ at HP: He gave a state-of-the-art technology talk and voiced the opinion that in the future we will just ask high-level questions of a dataset in natural language and the computer will give a set of logical answers, like the Watson Jeopardy project or Wolfram’s new Alpha platform. I personally think that science questions cannot be properly posed in natural language and that this will not work in that context. I will not be able to tell it to write my Nobel-prize publication for me.
  • Chris Taylor, EBI: He discussed standards in general. There are many Minimal Information metadata standards now – almost too many – so he set up the MIBBI project to provide information on all of them. I asked him whether I should use it already, and he advised waiting a couple more years for things to settle down.
  • HP Cloud: The first commercial sponsored talk was by HP. They offer private cloud solutions and also have their own HP public cloud into which people can scale. The public cloud is still in beta and will remain so until at least next April, the speaker said.
  • Folker Meyer, Argonne: He gave a reasonable overview of bioinformatics using cloud resources. They built their own Argonne workflow system to manage their jobs.
  • DDN: The second vendor presentation. Their Web Object Scaler is a very nice technology; if it works, it would save us copying stuff back and forth over the WAN.
  • Arrysisbio: Vendor presentation of a genomics platform for data analysis, nicely done. The webpage is a good example of eye candy.
  • Carole Goble, UManchester, on using clouds for research: They did a simple ‘naive’ job of putting Taverna on the cloud, but they had lots of technology they could just reuse, like a ‘Smart Data Upload’. The web interface is built with Ruby on Rails. They also reuse a self-made orchestrator that decides when to ramp up or tear down instances. She says they built it in 4 days – Taverna was put on the cloud in that time, including tests etc. The development cost was around $600, and one run of this Taverna workflow costs around $5. BUT: their large data was not shipped! A lot of preprocessing was done locally, and only compressed, necessary data was shipped, which kept costs down. Pre- and postprocessing happen locally on the data server, which then submits jobs to the cloud. The reference dataset (Ensembl) needs to be available on the cloud; it was in the US Amazon region but not the EU region, so they had to run jobs there. She did experience failures and performance variations. She also says that you need to scale to many CPUs if you really want to profit from a time boost by using the cloud, i.e. getting your job done more quickly. Interesting comment on the side: you cannot just prototype at home using Eucalyptus and expect it to work on Amazon! The mechanisms are very different, even if the public API is identical.
  • Misha Kapushesky, EBI: ‘Mapping identifiers is bioinformatics’ biggest problem.’ Expression Atlas: transcriptomics data at http://www.ebi.ac.uk/gxa – you can use any ID to look things up. Anyone can use the R Cloud at EBI: http://www.ebi.ac.uk/tools/rcloud. They also have the ability to roll in anyone else’s analysis on their internal cloud.
  • Ola Spjuth, Uppsala University: UPPNEX – a solution for next-generation sequencing data management and analysis. The community webpages look nice.
  • Reinhard Schneider, Head of the Bioinformatics Core Facility, Luxembourg Centre for Systems Biomedicine, University of Luxembourg: He talked about ‘exaflops’ in biology and what the applications could be in the future. His point is that today’s supercomputers are not really suitable for day-to-day analysis of bench data, because bench scientists and bioinformaticians have no time to port their codes to the complex supercomputing platforms. High throughput is needed rather than high capacity; he argues that bioinformatics will not become a supercomputing community. He also gave an interesting reference to a high-throughput I/O system, ParaMedic. His points:
    • problems are data-driven, with high I/O requirements
    • problems do not scale beyond a few hundred cores
    • applications are constantly under development
    • codes are scripts more often than not
    • bioinformaticians are not software engineers
  • Hermann Lederer: European Exascale Software Initiative – life science demands on supercomputing. He basically argued against the previous speaker, saying there are a lot of life science use cases on supercomputers. He showed the results of the DEISA project and further plans with PRACE: mostly structure prediction, but also tissue simulation etc. The previous speaker pointed out that these are not data-intensive applications, and that his point had been that for I/O-bound processes supercomputers do not provide much added value.
  • BlueArc commercial presentation: BlueArc uses FPGAs to scale their metadata head servers. He made the point that NFS (NAS) and pNFS (cluster) usage put very different strains and requirements on the system, and that it is important to balance the setup between these two usages, as the end user will not care.
  • Daniel McLean, Sainsbury Norwich bioinformatics team: They have a small ‘embedded bioinformatician’ team model that works pretty well for them locally. He made the point that this is an important service to the scientists. He had a nice slide on the know-how required for certain tasks, which I think is reasonable:
    • Understanding Biology: Scientist 100%, Bioinformatician 50%
    • Understanding Analysis: Scientist 50%, Bioinformatician 50%
    • Understanding Management tools: Scientist 20%, Bioinformatician 100%
  • His point is to enable the scientists and make their lives easy. They mount their local NetApp on the users’ desktops using SSHFS. People can change things, but monitoring scripts send people email if they do not adhere to certain rules and conventions, such as annotating data in a README file in each data directory, whose structure is also predefined (a sketch of such a check follows after this list). They make heavy use of Galaxy and scripting (Ruby, Perl, etc.). They have built a nice education site and activities around it, training the scientists in the usage of their tools.
  • Daniel James: Fiji and ImageJ2 project. See http://fiji.sc. ImageJ is very successful and many people use it, but it is ill-engineered. ImageJ2 is an effort to redo it, and this time do it cleanly. That is now done, and it can be extended with tools and plugins – this is the Fiji project. It is a collaborative development approach that works very well. They communicate through IRC, have dedicated mailing lists, a YouTube channel, and Git as the code repository. Sharing code is easy: just write a plugin and people will receive it through the automatic updater. Very nice.
  • Urban Liebel, Karlsruhe Institute of Technology (KIT): They built a dedicated local ‘HPC server’ that processes the data output immediately where the instrument is, in this case a microscope. This gives them real-time processing and feedback for the researcher. The analysis engine needs to be sized so that the real-time experience is not lost. Interestingly, they have also developed some auxiliary services:
    • Harvester, which can query many online resources through one portal
    • Image finder, which searches publications for images instead of text – nice if you are interested in the figures of a paper
    • Sciencenet, a social-media-like science publication network – vote on the papers you read
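
The convention checking mentioned in the Sainsbury item above is easy to picture. Here is a minimal sketch of such a monitoring script – my own reconstruction with placeholder paths and addresses, not their actual code:

    import os
    import smtplib
    from email.mime.text import MIMEText

    DATA_ROOT = "/data"          # placeholder mount point
    ADMIN = "admin@example.org"  # placeholder address

    # Collect data directories that violate the README convention.
    offenders = [dirpath
                 for dirpath, dirnames, filenames in os.walk(DATA_ROOT)
                 if "README" not in filenames]

    if offenders:
        msg = MIMEText("Directories missing a README:\n" + "\n".join(offenders))
        msg["Subject"] = "Data annotation convention violated"
        msg["From"] = ADMIN
        msg["To"] = ADMIN
        server = smtplib.SMTP("localhost")
        server.sendmail(ADMIN, [ADMIN], msg.as_string())
        server.quit()
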
Posted in Trip Reports. 1 Comment »

International Conference on Systems Biology, Edinburgh, 10-14th Oct 2010

ICSB is a bioinformatics conference with a focus on systems models. On Monday I attended a talk on SBML model merging by Francois Fages, Inria. Fages presented a method for model merging and hierarchy building that is based on constraint programming. The source data came from biomodels.org, and the method uses four merging primitives: species deletion, reaction deletion, species merging, and reaction merging (a toy illustration follows below). For more information see http://gprolog.inria.fr and http://contraintes.inria.fr. The approach is purely structural and does not work for larger models (it times out).
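
To make the four merging primitives concrete, here is a toy illustration of my own, representing a model as plain sets of species and reactions – not Fages’ actual constraint-programming implementation:

    # Toy model: a set of species plus reactions mapping a name to a
    # (reactants, products) pair of species sets.
    class Model(object):
        def __init__(self, species, reactions):
            self.species = set(species)
            self.reactions = dict(reactions)

        def delete_species(self, s):
            # Species deletion: remove s from the model and all reactions.
            self.species.discard(s)
            for name, (ins, outs) in self.reactions.items():
                self.reactions[name] = (ins - {s}, outs - {s})

        def delete_reaction(self, name):
            # Reaction deletion.
            self.reactions.pop(name, None)

        def merge_species(self, keep, drop):
            # Species merging: every occurrence of drop becomes keep.
            self.species.discard(drop)
            for name, (ins, outs) in self.reactions.items():
                self.reactions[name] = (
                    {keep if s == drop else s for s in ins},
                    {keep if s == drop else s for s in outs},
                )

        def merge_reactions(self, keep, drop):
            # Reaction merging: collapse drop into keep once they are identical.
            if self.reactions.get(keep) == self.reactions.get(drop):
                self.reactions.pop(drop, None)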

I presented a SyBIT poster on Monday. It gives an overview of our work, with emphasis on workflows driven by PGRADE and iBRAIN2, and on the openBIS database. Several UK and German groups expressed interest in the software and its portability.

On Tuesday Tom Freeman from the Roslin Institute spoke on BioLayoutExpress, http://www.biolayout.org. The software supports 3D visualisation and detailed analysis of expression data. It allows one to cluster a dataset, or two datasets together, and to examine the results of the clustering; this provides a method for data integration.

On Wednesday Richard Baldock (MRC HGU Edinburgh) gave a very interesting talk on bio-atlases. He gave examples from the mouse brain atlas, where 3D images are annotated manually with reference to an ontology of anatomical terms. Christian von Mering gave a talk about protein interactions – the relationship between MS-measured protein abundance and expression, and protein abundance in various species (stored in pax-db).

Posted in Trip Reports. No Comments »

Broad Institute Visit, 28-29th July 2010, Anne Carpenter’s lab

Pauli Raemoe from Lucas Pelkmans’ lab at the ETH and I visited Anne and her team. Anne’s team is developing Cell Profiler and Cell Profiler Analyst, tools used to process large batches of cell images. Pauli presented iBRAIN and Cell Classifier, and I gave a talk on SyBIT, openBIS and iBRAIN, which included demos. We were given demos of Cell Profiler and Cell Profiler Analyst and discussed the software architecture in detail. The current challenge for Cell Profiler seems to be cell tracking, for which ImageJ may be used.
We will collaborate with the Broad Institute on developing our tools further, to support data flow between all the relevant applications. The modular nature of the systems and the use of web data protocols make this feasible. There was significant interest in openBIS, and the question to be decided is how much image viewing should be supported in Cell Profiler Analyst and how much in openBIS. Similarly, integration with graph drawing software can be done either in Cell Profiler (or Analyst) or in openBIS. Further discussions will decide how best to integrate the toolsets. We envisage sharing wiki links and to-do lists, to maximize the benefit to both sides.

Posted in Trip Reports. No Comments »