r18 - 24 Apr 2006 - 21:29:45 - EricPuryear

Bio-Grid Project Ideas

Folding

IBG Data Model - Entry level

Contribute to the IBG data model for proteins, including the read/write functions for outside formats (PDB, FASTA, Tinker XYZ/SEQ, Gromacs GRO/TOP).
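As a starting point for the read/write functions, a FASTA reader and writer could look roughly like the sketch below. The names `read_fasta` and `write_fasta` are made up for illustration and are not part of the IBG data model; the sketch covers only FASTA, not PDB, Tinker, or Gromacs files.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs, wrapping sequences at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Working on file-like objects rather than file names keeps the same functions usable for files, sockets, and in-memory strings (via `io.StringIO`).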

IBG User Interface – Advanced

Design and write a graphical interface, as either a web application or a stand-alone application, that lets users design proteins and that uses the IBG protein model.

IBG Data Model Manipulation – Intermediate

Write functions for the IBG data model, including manipulation of dihedral angles, bond angles, bond lengths, etc.
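To give an idea of the kind of function meant here, the sketch below measures the dihedral angle defined by four atom positions. The function name and the use of plain 3-tuples are assumptions; the real IBG data model will have its own atom and coordinate types, and the sign convention should be checked against whatever the model adopts.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = _dot(b2, _cross(n1, n2)) / b2_len
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Using `atan2` rather than `acos` keeps the sign of the angle and avoids precision loss near 0 and 180 degrees.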

IBG Sampling Algorithms – Intermediate

Write modules for sampling large search terrains: Monte-Carlo, Genetic Algorithms, Branch and Bound, Dead End Elimination.
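For the Monte Carlo case, a generic Metropolis sampler could be sketched as follows. The `energy` and `propose` callables are placeholders for whatever the IBG modules end up providing, and the temperature is in the same (arbitrary) units as the energy.

```python
import math
import random


def metropolis(energy, propose, state, steps, temperature, rng=None):
    """Run Metropolis Monte Carlo and return the lowest-energy state seen.

    energy(state) -> float; propose(state, rng) -> new candidate state.
    Both are hypothetical hooks supplied by the caller.
    """
    rng = rng or random.Random()
    current_e = energy(state)
    best, best_e = state, current_e
    for _ in range(steps):
        candidate = propose(state, rng)
        delta = energy(candidate) - current_e
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, current_e = candidate, current_e + delta
            if current_e < best_e:
                best, best_e = state, current_e
    return best, best_e
```

Passing the random generator in explicitly makes runs reproducible, which matters when comparing sampling strategies.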

IBG Artificial Intelligence – Advanced

Write modules for sampling large search terrains: Neural Networks, Bayesian Networks, K-nearest Neighbor, Decision Trees, etc.

Energy Functions - Intermediate / Advanced

Port existing/published energy functions, and design new energy functions with the help of molecular biologists, to score protein models. Both physical and statistical energy functions are of interest.

PDB mining

Mine the PDB for statistical propensities useful in protein folding and inverse protein folding.

Structure Comparison – Intermediate

Write functions to compare multiple protein structures by multiple means.

Structure Analysis – Intermediate / Advanced

Write tools to analyze protein structures: hydrogen bonding, solvation, statistical analysis, contact map, etc.
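Of the analyses listed, the contact map is the simplest to sketch. The version below assumes one representative coordinate per residue (for example the C-alpha atom) and an 8.0 Angstrom cutoff; both are illustrative choices, not project decisions.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix: True where two residues are in contact.

    coords: one (x, y, z) tuple per residue (assumed representation).
    cutoff: distance threshold in the same units as coords (assumed 8.0).
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap
```

The double loop is quadratic in the number of residues, which is fine for single proteins; a cell list or k-d tree would be worth considering for bulk analysis.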

Performance Monitoring – Intermediate / Advanced

Develop a tool to monitor IBG performance across the cluster and, ultimately, the grid.

HDX in Folding

Use deuterium exchange data of proteins in Monte Carlo protein folding experiments.

NMR in Folding

Use NMR data of protein structures to decrease the search space in Monte Carlo protein folding experiments.

NMR in secondary structure prediction

Design and implement an algorithm to more accurately predict secondary structure using data from NMR.

Approximating the square root function for folding

Speed versus precision, an analysis of the tradeoffs generated by approximating the square root function in OPS, an open source protein folding simulator.
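One way to start on the precision side of this trade-off is to compare a fixed-iteration Newton square root against the library routine. The sketch below is only an illustration of the measurement, not the approximation used in the simulator, and timings in Python say nothing about speed in the simulator's own language.

```python
import math


def approx_sqrt(x, iterations=3):
    """Approximate sqrt(x) with a fixed number of Newton (Babylonian) steps."""
    if x < 0.0:
        raise ValueError("x must be non-negative")
    if x == 0.0:
        return 0.0
    # Cheap initial guess from the binary exponent: x = m * 2**e, 0.5 <= m < 1.
    _, e = math.frexp(x)
    guess = math.ldexp(1.0, e // 2)
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def max_relative_error(values, iterations):
    """Worst relative error of approx_sqrt against math.sqrt over `values`."""
    return max(
        abs(approx_sqrt(v, iterations) - math.sqrt(v)) / math.sqrt(v)
        for v in values
    )
```

Each extra Newton step roughly squares the relative error, so the number of iterations is the knob that trades speed for precision. Where only comparisons against a cutoff are needed, comparing squared distances avoids the square root altogether.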

Parallelizing folding

Analysis and comparison of strategies to parallelize OPS, an open source protein folding simulator, across a desktop grid platform.

Interpolating continuous energy surface for folding

To construct our statistical potential (energy function), we have a large set of data points (obtained from PDB mining) from which we want to interpolate a continuous energy surface. For our particular problem, these points live in 5 dimensions and, in principle, the surface doesn't need to have continuous derivatives. An efficient algorithm is required to construct the interpolated surface given the data points.
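As a baseline to compare candidate algorithms against, inverse distance weighting (Shepard interpolation) gives a continuous surface through scattered points in any number of dimensions. This is one simple option, not necessarily the efficient algorithm the project calls for; the function name and the `power` parameter are illustrative.

```python
import math


def idw_interpolate(points, values, query, power=2.0, eps=1e-12):
    """Inverse-distance-weighted estimate at `query`.

    points: list of coordinate tuples (5-D here, but any dimension works).
    values: energy value at each point.
    """
    num = 0.0
    den = 0.0
    for p, v in zip(points, values):
        d = math.dist(p, query)
        if d < eps:
            return v  # query coincides with a data point
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den
```

Every query visits every data point, so the cost is linear in the size of the data set. For a large mined data set, restricting the sum to the nearest neighbours found with a spatial index would be the obvious next step.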

Cleaning up the Protein Library

  • Improving the programming conventions, documentation, and clarity of the code
  • Finding and implementing any optimizations in the core routines. In particular, examine the impact of precision/speed trade-offs for double versus float and in the distance calculations.

Protein Design

Adding a protein design module:
  • Protein design requires mapping a three-dimensional protein structure to a sequence whose probabilistic/optimal folding gives that structure. I understand the core routines are sufficiently modular to facilitate this.

Genetic Algorithms

Genetic Algorithms module:
  • For use in particular with protein design, but hopefully versatile enough to apply to other problems in the OOPS domain. The interface to this module may need to be designed to correspond with common parallel GA packages, such as the mythical Cactus GA thorn, so that the GA implementation can be substituted.
  • Please note that we have such a function in SVN in Cactus (written by Nigel)

Gradient descent

Improvement in Gradient descent implementation:
  • Rewrite the gradient descent implementation following published work whose implementation runs significantly faster than the one currently in OOPS.

Porting OOPS

  • The code needs to be altered so that it gives reproducible results independently of the compilation platform. It may or may not contain platform-dependent code, but variable numerical precision could be a source of difficulties.

Allowing for substitution of Energy functions:

  • Presumably involving adding a module and altering several existing ones, give OOPS the capacity to apply different energy functions to assess the quality of intermediate/final predictions. This should be done in such a way as to incur minimal time penalty, since these tend to be calculated in inner loops. This requires a survey of the existing research to find commonly used energy functions and their parameter sets.

Optimizing the cooling schedule:

  • For a given problem domain, the best cooling schedule for simulated annealing is dependent on the energy landscape for that class of problems. OOPS uses simulated annealing by default to make sequence -> structure predictions. Improving the cooling schedule should give comparable results in less time (or better results in more time, if the current schedule gives insufficient time at a given temperature). In order to improve the schedule, a research survey is necessary.
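To make the comparison concrete, two textbook schedules are sketched below: geometric and linear cooling. These are common candidates from the simulated annealing literature, not the schedule OOPS currently uses, and the function names are invented for this example.

```python
def geometric_schedule(t_start, t_end, steps):
    """Yield `steps` temperatures decaying geometrically from t_start to t_end."""
    if steps < 2 or t_start <= 0 or t_end <= 0:
        raise ValueError("need steps >= 2 and positive temperatures")
    ratio = (t_end / t_start) ** (1.0 / (steps - 1))
    t = t_start
    for _ in range(steps):
        yield t
        t *= ratio


def linear_schedule(t_start, t_end, steps):
    """Yield `steps` temperatures decreasing linearly from t_start to t_end."""
    if steps < 2:
        raise ValueError("need steps >= 2")
    step = (t_end - t_start) / (steps - 1)
    for i in range(steps):
        yield t_start + i * step
```

Writing schedules as generators lets the annealing loop stay unchanged while different schedules are swapped in for benchmarking.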

Constraint enforcement:

  • Correct me if I'm wrong, but OOPS currently enforces dihedral constraints (limits on the torsional angles between atoms). This is sensible since OOPS steps through state-space by varying these angles.
  • Add constraints for pairwise distance between atoms/residues. This is likely to be implemented with a penalty function, though any other existing methods of enforcing this would be interesting.
  • Add constraints to the topology of the produced structure, in particular which residues appear on the surface of the conformation.
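For the pairwise distance constraints, a penalty function could take the flat-bottomed quadratic form sketched below: zero inside the allowed range and growing with the square of the violation outside it. The restraint tuple layout and the `weight` parameter are assumptions made for this example.

```python
import math


def distance_penalty(coords, restraints, weight=1.0):
    """Sum of squared violations of pairwise distance restraints.

    coords: list of (x, y, z) tuples, one per atom or residue.
    restraints: list of (i, j, lower, upper) tuples (hypothetical layout).
    """
    total = 0.0
    for i, j, lower, upper in restraints:
        d = math.dist(coords[i], coords[j])
        if d < lower:
            violation = lower - d
        elif d > upper:
            violation = d - upper
        else:
            violation = 0.0  # inside the allowed range: no penalty
        total += weight * violation ** 2
    return total
```

The result would simply be added to the energy being minimized, so the `weight` controls how strictly the restraints are enforced relative to the energy function.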

YapView (Yet Another Protein VIEWer):

  • Various topics pertaining to this. A lighter-weight protein viewer than most currently available, this could be integrated into an interface for visualization and manipulation of proteins for various purposes.

MASS SPECTROMETRY PROJECTS

(Urgent, small starter project) Data Management

Install a PostgreSQL-based proteomic mass spectra management system. A lot is written; a lot still needs to be done.

(small starter project) SBEAMS

Install SBEAMS

(High priority) Correlator

Correlate results from several sources (some of which are in XML) and place the original results and the correlations back into the Kah management system.

Database Download

Combine data from a SQL result set from the public database into a single XML document for downloading. Other sub-projects of the database project can be found here.
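The basic shape of this task is sketched below using Python's built-in `sqlite3` as a stand-in for the real database. The element names `results` and `row` are invented; the actual download format would follow whatever schema the project settles on.

```python
import sqlite3
import xml.etree.ElementTree as ET


def result_set_to_xml(cursor, root_tag="results", row_tag="row"):
    """Serialize all rows of an executed DB-API cursor into one XML string.

    Column names become element names, so they must be valid XML names.
    """
    columns = [d[0] for d in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        row_el = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            cell = ET.SubElement(row_el, name)
            cell.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

Because the function relies only on `cursor.description` and iteration, it should carry over to other DB-API drivers with little change, though that is untested here.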

IBG Navigation

(Very small starter project) The navigation needs to be cleaned up. There are some pages from where you cannot get back but have to use the browser history.

Delete records from private area

(Small starter project) For testing it would be very handy to have a means to delete accessions, mass specs, or peak annotations from the database. As soon as an accession is submitted for curation, only the administrator should be allowed to delete anything. As long as the accession is in private mode, the user should be able to delete it. If a user deletes an accession, all of its mass specs should be deleted. If a user deletes a mass spec (possibly by deleting an accession), all of its annotations and the peak lists should be deleted. These features should be implemented in the BeanHandlers and use recursive calls.
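The deletion rules above can be sketched with an in-memory model. Everything here is hypothetical: the `Store` class, the field names, and the ownership check stand in for the real database tables and handlers, and the real implementation would live in the Java handlers rather than in Python.

```python
class Store:
    """Toy in-memory stand-in for the database (hypothetical layout)."""

    def __init__(self):
        self.accessions = {}   # id -> {"status", "owner", "specs": [spec ids]}
        self.mass_specs = {}   # id -> {"annotations": [...], "peak_lists": [...]}
        self.annotations = set()
        self.peak_lists = set()


def may_delete(accession, user, is_admin):
    """Admins may always delete; users only while the accession is private."""
    if is_admin:
        return True
    # Assumption: only the owner of a private accession may delete it.
    return accession["status"] == "private" and accession["owner"] == user


def delete_mass_spec(store, spec_id):
    """Delete a mass spec together with its annotations and peak lists."""
    spec = store.mass_specs.pop(spec_id)
    for ann in spec["annotations"]:
        store.annotations.discard(ann)
    for peaks in spec["peak_lists"]:
        store.peak_lists.discard(peaks)


def delete_accession(store, acc_id, user, is_admin=False):
    """Delete an accession and cascade to everything hanging off it."""
    accession = store.accessions[acc_id]
    if not may_delete(accession, user, is_admin):
        raise PermissionError("accession is under curation; admin only")
    for spec_id in list(accession["specs"]):
        delete_mass_spec(store, spec_id)
    del store.accessions[acc_id]
```

Each level deletes its own children before removing itself, which is the cascading structure the recursive calls are meant to provide.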

MS spectra similarity search

I would like to get someone rolling on the next phase of the spectral similarity project. I feel this is a good "starter" project for someone fairly new.

What we have: a spectral similarity searching engine that searches the empirical database for similar spectra.

What is in the public domain: several spectral similarity searching engines that search theoretical spectra from genomic or proteomic sequence databases. These programs are Mascot, SEQUEST, and OMSSA.

What the problem with our engine is: these engines will ALWAYS return the best match. Thus, the confidence level in the result increases with the size of the database. Our empirical database is quite small compared to the sequence databases.

What we want: a search engine that searches both our empirical database and the sequence databases and returns the best hit. In this way, our empirical database searching will always do as well or better than the sequence searching alone.

What has to be done: rip out the sequence searching code from open source (only OMSSA now) and adapt it to be a function with the interface that we've designed. Then fit it into the current searching web app. You'd also have to do management of the sequence databases (most of which already exists, it would just require setting it up).

Mass Spectrometry Desktop

  • Our goal is to make our desktop a central location for biologists to use the analysis tools we create. This project includes creating user interfaces on our standalone Java desktop or developing portlets to be used on our OGCE web portal. One subproject is to create the ability to run OMSSA searches from our desktop/portal. Another is to connect to the Kah data manager via our desktop/portal.

Lutefisk Integration

  • This project involves taking an existing piece of software, Lutefisk, and making it accessible to our internal objects. This involves integrating the input and output of Lutefisk and storing them in our specified objects.

CIDentify Enhancement

  • This project builds on what has been done by the authors of Lutefisk. They revamped FASTA into a program called CIDentify, which uses the output of Lutefisk to search genomic/proteomic databases. We want to take what has been done in CIDentify and use it for searching a single amino acid sequence. It should return the matching subsequences.

Lutefisk Enhancement

  • This project involves taking Lutefisk and developing it for high performance grid environments. This also entails making Lutefisk more accurate in its output and scoring.

Internal Object Maintenance

  • This project involves bug fixes and enhancements on our internal objects. Creating a more robust Annotation Object is also necessary.

OMSSA Grid Project

  • This project involves getting a database searching tool for mass spectrometry data running and enhancing it for multiple processor use.

Spectral Database Searching Tools

  • This project involves developing search algorithms for our spectral database, which may include use of "fuzzy searching" and artificial intelligence. Another part of the project may include developing data mining tools to search for patterns in the database that are not easily seen.

Spectral Comparison Tools

  • Background: I put up four articles on the reading list about spectra comparisons that anyone interested in comparing spectra should read. They treat spectra as vectors and take the dot product to measure similarity (the spectral dot product, SDP). The final paper (KSDP) extends the technique to peptides, and its authors say they outperform SEQUEST.
    • We are going to want to implement at least the KSDP algorithm, and probably a simpler SDP and an older Similarity Index for comparison.
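For orientation, the simpler SDP can be sketched as a normalized dot product over binned peaks. This is the plain vector similarity only, not the KSDP algorithm from the reading list; the 1.0 m/z bin width and the `(m/z, intensity)` peak layout are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs (assumed layout).
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def spectral_dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two binned spectra."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

With non-negative intensities the score runs from 0 (no shared bins) to 1 (identical binned spectra). Real comparisons usually also scale intensities first, which is left out here.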

MS Database Search Results Comparison Tool

  • This tool would allow researchers to view, side by side, the results that the major MS database search programs (SEQUEST, Mascot, OMSSA, etc.) produce for a single sample protein, in order to compare the results and determine the proper database entry match. If all of the search programs are in agreement, then that hit should be reported as the correct one; if not, the user should be notified of the differing entries.
  • There are commercial packages available that try to do this but can definitely be improved upon. A google search should produce these packages.
  • Another interesting part of this project is the storage and retrieval of results obtained from the MS database search programs. There are XML schemas available from the Institute for Systems Biology (I think in their TPP software suite) which try to create a general output format for MS database searches that would be appropriate as input to this type of tool. Extensions of these schemas may be necessary, as well as an extension to include results from de novo programs.
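The agreement rule described in the first point of this list can be sketched in a few lines. The input layout (one top-ranked entry identifier per search program) and the returned dictionary are assumptions made for the example; real results would come from parsed output files.

```python
def consensus(top_hits):
    """Check whether all search programs report the same top database entry.

    top_hits: dict mapping program name -> top-ranked entry id (assumed layout).
    Returns a dict with the agreed entry, or the differing entries by program.
    """
    if not top_hits:
        raise ValueError("no search results supplied")
    by_entry = {}
    for program, entry in top_hits.items():
        by_entry.setdefault(entry, []).append(program)
    if len(by_entry) == 1:
        (entry,) = by_entry
        return {"agreed": True, "entry": entry, "disagreements": {}}
    # Not unanimous: report which programs picked which entry.
    return {"agreed": False, "entry": None, "disagreements": by_entry}
```

For example, `consensus({"Sequest": "E1", "Mascot": "E1", "OMSSA": "E2"})` reports no agreement and lists `E1` under Sequest and Mascot and `E2` under OMSSA, which is the information the user would need to see side by side.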

Documentation, Test Suites, Examples, Tutorials

  • Documentation may be less thrilling but it is essential for our user community to benefit from our work. Writing tutorials, examples and documentation gives an excellent opportunity to learn how to use our software by creating public documentation for new users. This is an excellent project for beginners who like to get their feet wet immediately.
    • A fun way to document is to write a Flash program showing how to use the desktop.
  • Creating test suites for our software would also benefit beginners who want to get involved immediately. Creating test cases for the core functionality of our software and then building up to the application level will not only help us create more stable bug-free code but will allow you to understand every level of our software's functionality.

Automated Chromatography data analysis quantification using calibration curves

  • This data quantification tool would be based on the area under the curve (AUC) and on calibration curves such as Average/Response Amount, linear, quadratic, logarithmic, exponential, etc. The data analysis would be independent of the data file format and would use mzXML.
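A minimal sketch of the linear case: integrate a peak with the trapezoid rule, fit a straight calibration line through standards of known amount, and invert it for an unknown. The function names are invented, the input is assumed to be already extracted from the data files, and the other curve types (quadratic, logarithmic, exponential) are not covered.

```python
def trapezoid_auc(times, intensities):
    """Area under a chromatographic peak by the trapezoid rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (intensities[i] + intensities[i - 1]) / 2.0
    return area


def fit_linear_calibration(amounts, areas):
    """Least-squares fit of area = slope * amount + intercept."""
    n = len(amounts)
    mean_x = sum(amounts) / n
    mean_y = sum(areas) / n
    sxx = sum((x - mean_x) ** 2 for x in amounts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(amounts, areas))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantify(area, slope, intercept):
    """Invert the calibration line to estimate the amount for a measured area."""
    return (area - intercept) / slope
```

Keeping integration, fitting, and inversion as separate functions would let other calibration models be added without touching the peak integration.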

-- FrantzGabeau - 03 Mar 2006
-- DaveAngulo - 01 Mar 2006
-- KevinDrew - 18 Jan 2005
-- GilKwak - 12 Mar 2006
