
Genomic Experimental Cycle and relevant software requirements.

Data import
Experiment design
Experimental procedures
Data collection & analysis
Publications & presentations

Data import, including annotations import
Analysis & user annotation capabilities
Information storage & intermediate presentations
Experimental data import
Experimental data analysis
Export into presentation & publications software

1. Database search, data import, annotation

An enormous amount of information is now stored in multiple biological databases. The ability to find and import existing data is essential for proper analysis and experiment design.

Importing data may not be as simple as it looks. "Cut and paste" efforts may be acceptable for some very small projects, but are unacceptable in the majority of cases. Having only the nucleotide sequence, for example, deprives the user of all the analysis and annotations available in the databases, such as coding sequence, exon/intron structure, polyadenylation sites, etc. As the goal of a project is usually to analyze or use particular part(s) of the sequence, not to study something at random, keeping the available annotations is essential.

Data import may be complicated by the different formats used by different databases and by the multiplicity of platforms, operating systems and programs, all with multiple versions. As the IT world evolves quickly, so do the individual databases, so whatever was compatible yesterday may not be so tomorrow. This places a substantial and permanent burden on software development teams.

Search capabilities incorporated into desktop applications remain optional, as advanced databases provide sufficient search tools of their own.

After a specified record is found, the file must be transferred to the desktop (on-line analysis and storage systems designed for individual users are out of the scope of this article). There are two ways to do so:

to open the record from the desktop application and save it inside the application
to save the file somewhere on the hard drive using the options provided by the database interface and then transfer it into the desktop application

Although both methods are acceptable, the first is much more convenient, as you do not have to navigate directory listings first to save the file and then to open it or drag it into the program. If you transfer one file a year, it probably does not matter much, but if you have to do it several times per hour, it becomes time-consuming and annoying.
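The first route, opening a record from inside the application, could be sketched roughly as follows. This is only an illustration: the article does not name a particular service, so the use of the NCBI E-utilities efetch endpoint, the function names and the accession used in the usage note are all assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: NCBI E-utilities efetch (not named in the article).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str) -> str:
    """Build a URL requesting a nucleotide record as a GenBank flat file."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_record(accession: str) -> str:
    """Download the annotated record as text (needs network access)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_record("U49845"), for example, would return the full annotated record in one step, with no intermediate file to save and reopen.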

Transferred files must retain, in a useful and easily accessible form, all the original information available in the database. If they do not, you may have to add annotations yourself, and that can get really unpleasant. Imagine that you have a gene with 50 exons. Just to annotate them, you will have to specify 100 coordinates (a start and an end for each exon) and type in 50 annotations, specifying for each one that it is, indeed, an exon. And after you get tired, think how many errors you could make along the way...
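To show why retained annotations save so much work, here is a minimal sketch that reads exon coordinates from a GenBank-style location string instead of having them typed in by hand. The function name and the sample location are made up for illustration, and only plain "a..b" spans are handled.

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive, from a location string.

    Only plain 'a..b' spans, optionally wrapped in join(...), are supported.
    complement(), partial-end markers and remote references are outside the
    scope of this sketch.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


# Hypothetical three-exon feature; a 50-exon gene would work the same way.
exons = parse_location("join(12..78,134..202,310..450)")
for number, (start, end) in enumerate(exons, start=1):
    print(f"exon {number}: {start}..{end}")
```

With 50 exons, the same call yields all 100 coordinates at once, with no retyping and no transcription errors.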

After the file has been transferred with all available information, you may start annotating and labeling the regions and features you are particularly interested in. For example, if you are interested in a particular exon, you may want to make it easily recognizable in the sequence or graphic view by applying a different color. This additional annotation process must be simple, easy and intuitive, as you will need it a lot.

From a historical perspective, online database import capabilities were not originally essential parts of biological software, for obvious reasons. Two decades ago there was nothing close to today's Internet capabilities, and the amount of information stored in the databases was incomparably smaller (GenBank's original release was on CD; it is terabytes now).

Today, software that has failed to incorporate import capabilities into its original design or has lost compatibility over time will have very limited use.

2. Experiment Design

During this stage, areas of interest in the imported sequence are chosen and specific tools are developed. For example, the standard protocol for DNA amplification for the purpose of mutation search, protein expression, etc. includes PCR primer design and sequencing primer design.

Features like ORF (open reading frame) search are essential for imported sequences without annotations. Similar features will be necessary for the experimental data analysis.
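As a rough illustration of what an ORF search does, the sketch below scans the three forward reading frames for an ATG followed by an in-frame stop codon. The function name, the min_codons parameter and the test sequence are illustrative assumptions. A real tool would also scan the reverse strand and support alternative genetic codes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) coordinates, 1-based inclusive, of forward-strand ORFs.

    An ORF here runs from the first ATG in a frame through the next in-frame
    stop codon. min_codons counts the codons before the stop, ATG included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


print(find_orfs("CCATGAAATTTTAGGG"))  # [(3, 14)]
```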

For both pre-experimental and post-experimental analysis, software must be able to perform multiple functions, including, but not limited to:

Sequence editing and conversion
Primer design
Motif search
ORF analysis
Translation
Exon splicing
Restriction sites maps
Sequence alignment
Dot plot
Annotating capabilities.
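To make one item from this list concrete, a restriction site map can be sketched as a simple search for recognition sequences. The function name and the test sequence are illustrative. The three enzymes are a small sample, and the positions reported are the first base of each recognition site, not the cut position.

```python
# Recognition sequences of three common enzymes. All are palindromic,
# so searching one strand is enough to find every site.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def restriction_map(seq: str, sites: dict[str, str] = SITES) -> dict[str, list[int]]:
    """Return, for each enzyme, the 1-based start positions of its sites."""
    seq = seq.upper()
    result = {}
    for enzyme, site in sites.items():
        positions = []
        pos = seq.find(site)
        while pos != -1:
            positions.append(pos + 1)
            pos = seq.find(site, pos + 1)  # continue, allowing overlapping hits
        result[enzyme] = positions
    return result


print(restriction_map("AAGAATTCGGGGATCCTTGAATTC"))
```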

Sequence editing is an obvious function and all programs include it, but the ways the operations are performed differ dramatically among the programs, and some may really surprise you, not necessarily in a good way.

For example, you would like to edit an imported sequence by deleting or inserting a particular region or nucleotide (because your splice variant is different, or because of allelic variations, or because you would like to make a mutation in an expressed protein). Some programs will do it easily and correctly, but some... One particular high-profile program will destroy the link between the annotations and their corresponding sequences if you do so. Apparently, annotations in this software are assigned to nucleotide numbers, not to the nucleotide sequence itself. For example, if you cut the second half of an annotated sequence, the annotations will remain, even where the sequence has been removed.
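The behavior one would expect instead is that annotations follow the sequence they describe. The sketch below shows one possible way to do that for a deletion. The Feature class and the delete_region function are hypothetical names, and this is not a description of how any particular program works.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    label: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def delete_region(seq, features, del_start, del_end):
    """Delete positions del_start..del_end and keep the features in sync."""
    removed = del_end - del_start + 1
    new_seq = seq[:del_start - 1] + seq[del_end:]
    updated = []
    for f in features:
        if f.end < del_start:        # entirely upstream: unchanged
            updated.append(Feature(f.label, f.start, f.end))
        elif f.start > del_end:      # entirely downstream: shifted left
            updated.append(Feature(f.label, f.start - removed, f.end - removed))
        else:                        # overlaps the deletion: keep what survives
            left = max(0, del_start - f.start)   # surviving bases before the cut
            right = max(0, f.end - del_end)      # surviving bases after the cut
            if left + right == 0:
                continue             # feature removed together with its sequence
            new_start = f.start if left else del_start
            updated.append(Feature(f.label, new_start, new_start + left + right - 1))
    return new_seq, updated
```

Under this scheme, an annotation whose sequence has been cut out disappears with it, and downstream annotations keep pointing at the same nucleotides as before.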

Sequence conversion is another essential option that must be performed in the simplest way possible. If you need a complementary strand sequence, you just click "reverse & complement". Again, it sounds simple, but you may be surprised by what some programs do...
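For reference, the operation behind a "reverse & complement" button is small. This is a minimal sketch for plain A/C/G/T sequences; the function name is illustrative, and IUPAC ambiguity codes are left untranslated here.

```python
# Translation table for the four standard bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

A quick sanity check for any implementation: applying the operation twice must return the original sequence.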

Primer design is another essential function that is expected to be very simple and intuitive, but the reality is quite different. To give you some idea of the problems with the scientific basis used for primer design: several programs list "salt" concentration as a condition for determining the primer annealing temperature. A search through the supporting literature, as well as multiple requests to technical support, did not clarify the exact chemical formula of this "salt." Apparently, the software was designed at the dinner table, where only one "salt" was available; in the lab, there is a large variety of "salts."

Primer design in general is not as simple as it could be, given decades of oligonucleotide synthesis and use. There are still very different approaches, ranging from complete denial of any benefit of primer design rules to overanalyzing the issue. Differences between the algorithms currently present in software applications, online and desktop alike, do not help much either. Some current applications predict melting temperatures for the same primer that differ by as much as 40 degrees Celsius. The nearest neighbor method, combined with the proper set of conditions, helps to provide reasonable estimates. The best approach is probably to choose a program with reasonable primer design features and to test several designed primers by PCR and sequencing.
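To see how two formulas can disagree on the same primer, the sketch below compares the simple "2 + 4" (Wallace) rule with one commonly quoted salt-adjusted formula. Neither is the nearest neighbor method, and neither is claimed to be what any specific program uses. The constants of the salt-adjusted formula vary between sources, and the primer and the 50 mM Na+ value are assumptions chosen for illustration. Note that the second formula is explicit about which "salt" it means: the molar concentration of monovalent cations, usually written [Na+].

```python
import math


def tm_wallace(primer: str) -> float:
    """Wallace rule: 2 degrees per A/T plus 4 degrees per G/C."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2.0 * at + 4.0 * gc


def tm_salt_adjusted(primer: str, na_molar: float = 0.05) -> float:
    """One common variant: 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/N."""
    p = primer.upper()
    gc_percent = 100.0 * (p.count("G") + p.count("C")) / len(p)
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / len(p)


primer = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer, 50% GC
print(tm_wallace(primer), round(tm_salt_adjusted(primer), 1))  # 60.0 46.7
```

Even these two textbook estimates are about 13 degrees apart for the same primer, which gives some idea of how larger discrepancies between programs can arise.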

One of the most user-friendly and scientifically sound programs that can be used for primer analysis is pDRAW32. It is also free, supporting the old saying that the best things in life are free.

3. Experimental Procedures

Experimental equipment these days is saturated with computers, processors and different types of software, but their description is outside the scope of the current topic.

It is important to remember that experimental errors and failures are often the result of improper use of original data, failures in design and, in many cases, ambiguous directions given to the operator. A misunderstood task is a much more likely source of failure than a reasonably experienced operator's pipetting errors. Reasonable efforts to present the experiment design in the clearest possible form are essential to ensure that the experiment is conducted as designed.

4. Data collection & analysis

The ability to import data from the applications running on experimental equipment, such as sequencing machines, is important. There are relatively light programs that specialize in importing sequence information, but they are rather incapable of doing anything else well. There are also attempts to incorporate experimental data import into larger "universal" programs, but the jury is still out on whether this makes the user's life any easier.

Analysis & annotation tools for experimental data are similar to the ones described for experiment design under "Analysis & user annotation capabilities."

5. Publications & Presentations

Publications & presentations functionality has often been neglected or misguided in software design. On one hand, some programs had no reasonable way to export or even print the necessary materials. On the other hand, other programs made serious attempts to build complex graphic editors inside the bio-software itself. Both approaches had obvious problems.

Users would like to have their major results/conclusions available in a standard format independent from the original application in which the results or graphics were generated. Why is an export in a universal format necessary?

Bio-software costs hundreds, if not thousands, of dollars per license. You may not have it on all computers in your lab, and your collaborators or conference hosts may not have it either. They should be able to view your conclusions without having to purchase and learn each particular program.
If your division/institution has a "site license," it is often difficult to use it at a distant location (e.g. conference, home, plane). Exported data in a universal format could be viewed on most computers.
Incompatibilities between platforms and operating systems are always a concern, even if the software is available for different platforms and installed on different computers.
During a presentation, it is more efficient to use a single, standard presentation program (such as PowerPoint), rather than being forced to switch between multiple applications.
Journals have quite strict standards for publication formats. Nothing exotic, please!

An attempt to create serious graphics software as a part of biological applications is problematic as well. Anything reasonably good will take resources (money, time, expertise) which could be used for more focused activities. Even if good graphic design were achievable inside bio-software, it would be quite complex. Who has the time to learn it? In general, it is very unlikely that a small bioinformatics team could create something comparable to Adobe Illustrator or Microsoft PowerPoint, industry standards for years, and it is probably not worth trying.

The best approach is to ensure that the program creates reasonable graphics based on the original biological data and has options to export them into mainstream formats, including editable (vector) ones (Illustrator- and PowerPoint-compatible in particular). This greatly simplifies and speeds up the preparation of presentations and publication materials. Several products currently allow exactly that, and they should be given preference.
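As an illustration of export into an editable vector format, the sketch below writes a simple feature map as SVG, a plain-text vector format that editors such as Adobe Illustrator can open. The choice of SVG, the function name, the layout numbers and the sample features are assumptions made for this example, not a description of any existing product.

```python
from xml.sax.saxutils import escape


def features_to_svg(seq_length, features, width=600, height=60):
    """Draw each (label, start, end, color) feature as a box on a sequence line."""
    scale = width / seq_length  # pixels per nucleotide
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<line x1="0" y1="30" x2="{width}" y2="30" stroke="black"/>',
    ]
    for label, start, end, color in features:
        x = (start - 1) * scale
        w = (end - start + 1) * scale
        parts.append(f'<rect x="{x:.1f}" y="20" width="{w:.1f}" height="20" fill="{color}"/>')
        parts.append(f'<text x="{x:.1f}" y="15" font-size="10">{escape(label)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical 1000-nucleotide record with two exons.
svg = features_to_svg(1000, [("exon 1", 1, 100, "steelblue"), ("exon 2", 401, 600, "orange")])
with open("feature_map.svg", "w", encoding="utf-8") as handle:
    handle.write(svg)
```

Because every box and label remains a separate object in the exported file, colors, fonts and positions can be adjusted later in a general-purpose graphics program rather than inside the bio-software.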

Business & financial considerations

To prevent serious disappointments, it helps to remember that software development companies are real-life private entities that live and die in the Business Universe. If you spent thousands of dollars and a lot of time learning and adjusting to a particular program, would you be disappointed if the company went out of business? Actually, there are plenty of possible events with similar consequences for the user. The product may disappear for a variety of reasons, some rational and predictable, some not. Events like ownership changes, "refocusing", reorganizations, and management or board changes may have unfortunate consequences; leading developers could also be fired or could leave to start their own company, etc.

It pays to know at least the basic facts about the companies that make the products you are choosing from. Small emerging companies are exposed to much higher financial risks than their larger competitors, but they are usually the ones that bring the most innovative technologies into the BioIT field. They are dedicated, not only because they are enthusiasts of their field, but also because the company's life, and the futures of its management and employees, depend upon the product's success. Small companies are a lot faster and a lot more responsive, and they do not require you to fill out forms and obtain user names and passwords in order to access technical support.

Some established companies, on the other hand, have been in the field for more than two decades, have earned their reputation and have established a pool of users, many of whom are reluctant to replace the software they are used to. These companies are a lot more predictable, have much deeper pockets and can afford to support development through rough times. They have established marketing channels and field work forces and use them to increase their chances of establishing their software as an industry standard.

The IT world evolves very fast, and so do the applications and databases. Incorporating emerging technologies and maintaining compatibility with the general environment (OS, antivirus, antispyware, browsers) and with the specific systems used for data import/export (databases, graphics) is going to be a permanent challenge. The appearance of new tools or services, or any changes made for good or bad reasons in software or hardware, will require frequent updates.

To assess a company's financial health, management capabilities, and team expertise, one may apply standard techniques developed by the investment community. This will be discussed separately.
