1. tests / benchmarks
1.1. benchmarks
1.1.1. done by
1.1.1.1. Altenhoff
1.1.1.2. Dutilh
1.1.1.3. Pryszcz
1.1.2. hampered by
1.1.2.1. availability of results
1.1.2.2. heterogeneous datasets
1.1.2.3. taxonomic biases
1.1.2.4. difference in underlying methodology
1.1.2.5. sparse documentation
1.1.3. did what
1.1.3.1. compared orthology methods, such as
1.1.3.1.1. phylogenetic
1.1.3.1.2. phylogenomic
1.1.3.2. synteny-based dataset
1.1.3.3. latent class analysis (meta-analysis)
1.2. test data
1.2.1. functional data
1.2.1.1. e.g.
1.2.1.1.1. GO terms
1.2.1.1.2. gene expression data
1.2.1.1.3. enzyme numbers
1.2.1.1.4. gene neighborhood conservation
1.2.1.1.5. phylogenetic congruence
1.2.1.1.6. KEGG orthology numbers
1.2.1.1.7. HAMAP family accession numbers
1.2.1.2. problematic b/o
1.2.1.2.1. limited availability of annotations
1.2.1.2.2. poor annotations
1.2.1.2.3. assumptions
1.2.2. phylogenetic data
1.2.2.1. problematic b/o
1.2.2.1.1. Only gene families
1.2.2.1.2. assumptions
1.2.2.2. existing databases
1.2.2.2.1. TreeFam
1.2.2.2.2. COG
1.2.2.2.3. MetaPhOrs
1.2.2.2.4. HOGENOM
1.2.2.2.5. PhylomeDB
1.2.2.2.6. Ensembl Compara
1.2.3. simulated data
1.2.3.1. Arvestad's software (aladen)
1.2.3.1.1. simulates gene trees given a species tree, then MSA
1.2.4. reference benchmarks
1.2.4.1. TreeFam
1.2.4.2. Human-mouse orthologs
1.2.4.3. multi-domain proteins
1.2.5. domain fusion/fission
1.3. findings
1.3.1. similarity scores
1.3.1.1. raw score vs. bit score vs. e-value vs. identity
1.3.1.2. different configurations
1.3.1.2.1. BLAST vs. Smith-Waterman
1.3.1.2.2. hard vs soft masking
1.3.2. use of external information
1.3.2.1. e.g.
1.3.2.1.1. gene neighborhood
1.3.3. latent class analysis
1.3.3.1. overlap of existing methods
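
The similarity scores compared under 1.3.1 are related by the standard Karlin-Altschul statistics. The following is a minimal sketch of how they are derived from one another, not the scoring code of any particular tool. The function names are hypothetical, and the parameters lam and k (the scoring-system constants lambda and K) are placeholders that a real search tool reports for its scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns holding the same residue.

    Assumption for this sketch: columns containing a gap ('-') count as mismatches.
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)
```

The raw score depends on the scoring matrix, the bit score removes that dependence, and the e-value additionally depends on the search space (query length times database length). This is one reason why hit rankings can differ depending on which of the scores is used.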
1.4. studies
1.4.1. human vs. 7 others
1.4.1.1. as done by
1.4.1.1.1. Dolinski
1.4.2. yeast orthologs
1.4.2.1. as done by
1.5. Methods
1.5.1. Species overlap
1.5.1.1. described in
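
The species-overlap method named under 1.5.1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any cited study: the gene tree is assumed to be rooted and binary, given as nested tuples, with leaves written as "species|gene" strings. All function names are hypothetical. An internal node is treated as a duplication if its two subtrees share at least one species, and as a speciation otherwise. Genes separated by a speciation node are reported as orthologs.

```python
from itertools import product


def leaves(node):
    """Return all leaf labels below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def species_of(leaf):
    """Extract the species tag from a 'species|gene' label."""
    return leaf.split("|")[0]


def orthologs_by_species_overlap(node, pairs=None):
    """Collect ortholog pairs from a rooted binary gene tree."""
    if pairs is None:
        pairs = set()
    if isinstance(node, str):
        return pairs
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species_of(l) for l in left_leaves} & {species_of(l) for l in right_leaves}
    if not shared:
        # No species overlap: speciation node, so every cross pair is orthologous.
        for a, b in product(left_leaves, right_leaves):
            pairs.add(frozenset((a, b)))
    # Species overlap: duplication node, so cross pairs are paralogs and are skipped.
    orthologs_by_species_overlap(left, pairs)
    orthologs_by_species_overlap(right, pairs)
    return pairs


# Toy tree: a duplication in the mouse lineage after the human/mouse split.
tree = (("human|A1", ("mouse|A1a", "mouse|A1b")), "yeast|A")
result = orthologs_by_species_overlap(tree)
```

In the toy tree, human|A1 is orthologous to both mouse|A1a and mouse|A1b, while the two mouse genes are paralogs of each other. The relation is therefore not transitive.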
2. Goals
2.1. Identification of orthologs + functional genomics
2.2. Gold standard
2.3. automated biological function interpretation from gene phylogeny
2.4. Accurate function prediction
3. challenges
3.1. Tree reconciliation
3.1.1. use counterexamples of assumptions
3.1.2. selection of genes to build tree
3.1.2.1. What kinds of topologies make tree difficult to partition
3.1.3. Accuracy of tree reconciliation methods
3.1.4. Identification of functionally divergent nodes
3.1.4.1. History of gene much more complex than duplications
3.1.4.2. Map functional genomics data onto tree
3.2. BBH linkage
3.2.1. testing assumptions
3.2.1.1. How many BBH pairs are not functionally identical?
3.2.1.2. Does number of BBH vary between closely/distantly-related species?
3.2.1.2.1. if so
3.2.2. Which genes always ambiguous in OG construction?
3.2.3. improvements
3.2.3.1. mainly concerns
3.2.3.1.1. reduce false positives
3.2.3.2. include
3.2.3.2.1. persistent genes
3.3. General
3.3.1. Benchmarks
3.3.1.1. Clade-specific genes
3.3.1.1.1. do they have unique features?
3.3.1.1.2. do they allow inferences about interactions with environmental factors?
3.3.1.2. Simulation studies
3.3.1.2.1. assess influence of
3.3.2. biological
3.3.2.1. Insufficient masking of low-complexity regions
3.3.2.2. protein mosaics/protein subfamilies
3.3.2.2.1. recent duplications
3.3.2.3. Definition of terms
3.3.2.3.1. orthologs/paralogs
3.3.2.3.2. co-orthologs/in-paralogs/out-paralogs/super-orthologs/ultra-paralogs
3.3.2.3.3. protein function
3.3.2.4. alternative splicing
3.3.2.5. function prediction
3.3.2.5.1. sub-families similar in sequence, but different in domain architecture
3.3.2.6. Horizontal gene transfer
3.3.3. technical
3.3.3.1. Standards
3.3.3.1.1. for
3.3.3.2. Computation
3.3.3.2.1. scalability
3.3.3.3. Database
3.3.3.3.1. stay up-to-date
3.3.3.4. expand representation of orthologs
3.3.3.4.1. to
4. Orthology prediction methods
4.1. ab initio - building groups of similar genes
4.1.1. formation of groups
4.1.1.1. Similarity search
4.1.1.1.1. Approaches
4.1.1.1.2. Similarity scores
4.1.1.1.3. Tools
4.1.1.2. Biases
4.1.1.2.1. induced by
4.1.1.2.2. lead to
4.1.2. expanding groups
4.1.2.1. adding in-paralog
4.1.2.1.1. idea
4.1.2.1.2. approaches
4.1.2.2. improve orthology detection
4.1.2.2.1. External knowledge
4.1.2.2.2. clustering
4.2. post-processing - building ortholog groups
4.2.1. are based on
4.2.2. and use
4.2.2.1. phylogenetic trees
4.2.2.1.1. required pre-steps are
4.2.2.1.2. as done by
4.2.2.1.3. approaches
4.2.2.1.4. goal
4.2.2.1.5. issues
4.3. hybrids
4.3.1. combination of existing dbs
4.3.1.1. as done by
4.3.1.1.1. YOGY
4.3.1.1.2. MetaPhOrs
4.3.2. combination of existing methods
4.3.2.1. advantage
4.3.2.1.1. scalable as
4.3.2.1.2. use phylogenetic information as
4.3.2.2. as done by
4.3.2.2.1. Ensembl Compara
4.3.2.2.2. HomoloGene
4.3.2.2.3. OrthoParaMap
4.3.2.2.4. PhIGs
4.3.2.2.5. PHOG
4.3.2.2.6. PhyOP
4.3.2.2.7. TreeFam
4.3.2.2.8. eggNOG
4.3.2.2.9. P-POD
5. Background
5.1. Biological
5.1.1. Reasons for bias
5.1.1.1. Gene loss
5.1.1.1.1. single gene loss
5.1.1.1.2. reciprocal gene loss
5.1.1.2. Gene gain
5.1.1.3. Horizontal gene transfer
5.1.1.3.1. leads to
5.1.1.3.2. occurs within
5.1.1.4. Incomplete lineage sorting
5.1.1.5. Mosaics of proteins
5.1.1.5.1. Outcome
5.1.1.5.2. Processes
5.1.1.6. Alternative Splicing
5.2. Definitions
5.2.1. issues
5.2.1.1. orthology
5.2.1.1.1. constraint
5.2.1.2. General problem about definitions
5.2.1.2.1. different definitions -> no quality assessment
5.2.1.3. ortholog group
5.2.1.3.1. defined with respect to
5.2.1.4. Synteny
5.2.1.4.1. wrongly used b/o
5.2.2. definitions
5.2.2.1. ortholog group
5.2.2.1.1. defined as
5.2.2.2. gene function
5.2.2.2.1. defined as
5.2.2.2.2. problem with
5.2.2.2.3. proven by
5.2.2.3. homology
5.2.2.3.1. orthology
5.2.2.3.2. paralogy
5.2.2.3.3. xenologs
5.2.2.3.4. subtree-neighbors
5.2.2.4. Basic units of orthology
5.2.2.4.1. domain
5.2.2.4.2. gene sequence / proteins
5.2.2.4.3. original definition
5.2.2.5. Horizontal gene transfer
5.2.2.5.1. defined as
5.2.2.5.2. found in
5.2.2.6. Conserved gene neighborhood
5.2.2.6.1. defined as
5.2.2.7. non-transitivity of phylogenetic relationships
5.2.2.7.1. defined as
5.2.2.7.2. examples
5.3. Assumptions
5.3.1. Best bidirectional hit
5.3.1.1. true for
5.3.1.1.1. two genes with same function
5.3.1.2. implied assumptions
5.3.1.2.1. function by single gene
5.3.1.2.2. present in both species
5.3.1.2.3. Transitivity of orthologs
5.3.2. smallest reciprocal distance
5.3.2.1. implied assumptions
5.3.2.1.1. most similar = most likely orthologous
5.3.2.2. true for
5.3.2.2.1. two genes with same function
5.3.3. gene evolution = species evolution
5.3.3.1. implied assumption
5.3.3.1.1. duplication leads to
5.3.3.1.2. same evolutionary pattern
5.3.4. orthologs = similar/same function
5.3.4.1. implied assumption
5.3.4.1.1. Gene neighborhood implies orthology
5.3.4.1.2. similar/same
5.3.4.2. fails for
5.3.4.2.1. Genes that lost/changed function
5.3.5. General
5.3.5.1. graph-based vs. tree-based
5.3.5.1.1. approaching a problem
5.3.6. Addition of inparalogs
5.3.6.1. allowed if
5.3.6.1.1. genes are closer to ortholog of same species than to any gene of others
5.3.7. transitivity of orthologous relationship
5.3.7.1. violated by
5.3.8. Xenologs
5.3.8.1. implied assumption
5.3.8.1.1. They often appear as true orthologs in genome comparisons and might exhibit variable functions
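
Two of the assumptions above, the best bidirectional hit (5.3.1) and the rule for adding in-paralogs (5.3.6), can be illustrated with a small sketch. It is an illustration under stated assumptions rather than the procedure of any specific tool: similarity scores are assumed to be given as a dictionary keyed by (query, subject), genes are labelled "species|gene", higher scores mean greater similarity, and ties are not handled. All names and the toy scores are hypothetical.

```python
def species_of(gene):
    """Extract the species tag from a 'species|gene' label."""
    return gene.split("|")[0]


def best_hits(scores, query_species, target_species):
    """Map each query gene to its (best subject, score) in the target species."""
    best = {}
    for (query, subject), score in scores.items():
        if species_of(query) != query_species or species_of(subject) != target_species:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def bidirectional_best_hits(scores, sp_a, sp_b):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_to_b = best_hits(scores, sp_a, sp_b)
    b_to_a = best_hits(scores, sp_b, sp_a)
    return {
        (a, b): score
        for a, (b, score) in a_to_b.items()
        if b in b_to_a and b_to_a[b][0] == a
    }


def inparalogs(scores, seed_gene):
    """Same-species genes closer to the seed ortholog than to any gene of another species."""
    sp = species_of(seed_gene)
    to_seed, best_foreign = {}, {}
    for (query, subject), score in scores.items():
        if species_of(query) != sp or query == seed_gene:
            continue
        if subject == seed_gene:
            to_seed[query] = score
        elif species_of(subject) != sp:
            best_foreign[query] = max(score, best_foreign.get(query, float("-inf")))
    return sorted(g for g, s in to_seed.items() if s > best_foreign.get(g, float("-inf")))


# Hypothetical toy scores (query, subject) -> similarity score.
scores = {
    ("human|H1", "mouse|M1"): 200, ("human|H1", "mouse|M2"): 90,
    ("mouse|M1", "human|H1"): 200, ("mouse|M1", "human|H2"): 150,
    ("human|H2", "mouse|M1"): 150, ("human|H2", "human|H1"): 180,
    ("human|H3", "human|H1"): 60, ("human|H3", "mouse|M2"): 120,
    ("mouse|M2", "human|H1"): 90, ("mouse|M2", "human|H3"): 120,
}
bbh = bidirectional_best_hits(scores, "human", "mouse")
added = inparalogs(scores, "human|H1")
```

In the toy data, human|H2 has mouse|M1 as its best hit, but mouse|M1 prefers human|H1, so H2 is not part of a BBH pair. It is still added as an in-paralog of H1 because it is closer to H1 than to any mouse gene.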