1. Test 1 Topics
1.1. Intro
1.1.1. Basics of Molecular Biology & evolution
1.1.1.1. DNA -> RNA -> Protein
1.1.1.1.1. cDNA: complementary DNA created from RNA, used in sequencing & biotech (double-stranded)
1.1.1.1.2. mRNA: messenger RNA
1.1.1.1.3. miRNA: micro RNA
1.1.1.1.4. PCR: Polymerase chain reaction, used to amplify DNA so that it can be sequenced
1.1.1.1.5. Introns: Spliced out of RNA
1.1.1.1.6. Exons: Coding regions of DNA
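The DNA -> RNA -> Protein flow above can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is the coding strand and the standard genetic code; the function names `transcribe` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    dna = coding_dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(transcribe("ATGGCCTGA"))  # AUGGCCUGA
print(translate("ATGGCCTGA"))   # MA
```

The sketch ignores introns and ambiguous bases; a real pipeline would work from the spliced mRNA or CDS.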
1.1.2. Bioinformatics
1.1.2.1. Fields of study it is used in
1.1.2.1.1. Proteomics, genomics, taxonomy & evolution, medicine, biophysics, molecular bio
1.1.2.2. Bioinformatics is the application of computers and mathematics to biological data to analyse and obtain useful information.
1.1.3. Algorithm
1.1.3.1. Self contained step-by-step instructions for a computer program to perform a function
1.1.3.1.1. "If/then"
1.1.3.2. CompSci plays a major role, as programming and database administration are needed to create, maintain, and use the giant repositories of genetic and proteomic data.
1.1.3.2.1. Analysis of algorithms
1.1.3.2.2. Data structures & info retrieval
1.1.3.2.3. Software engineering
1.1.4. What is a high level computer language?
1.1.4.1. Python, PERL (interpreting data), Java, C++ (software)
1.2. Databases
1.2.1. Searching bio-databases
1.2.1.1. NCBI: can search by name, paper, author, organism, gene accession #, etc. (Entrez, OMIM)
1.2.2. Types of information on databases
1.2.2.1. Genomic
1.2.2.1.1. Evolutionary info (homologous genes, taxonomic info)
1.2.2.1.2. Genomic info: chromosome length, introns, regulatory regions, shared domains
1.2.2.1.3. Nucleotide database on NCBI gives location info and CDS (coding region)
1.2.2.2. Proteomic
1.2.2.2.1. Structural info: associated protein structures, fold types, structural domains
1.2.2.2.2. Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
1.2.2.3. ESTs
1.2.2.3.1. Represent coding regions, small
1.2.2.3.2. Expression info: expression specific to particular tissues, developmental stages, phenotypes, diseases
1.2.2.4. mRNA
1.2.2.5. unassembled seq of all types
1.2.3. Genome Browser
1.2.3.1. Examples are the UCSC Genome Browser and Ensembl (GenBank is a sequence database, not a browser)
1.2.3.2. Allows user to look through fully sequenced genomes and related information
1.2.4. Homologene
1.2.4.1. Used to find homologous (similar because of common ancestry) genes among other organisms
1.2.5. dbEST
1.2.5.1. EST (Expressed Sequence Tag) database
1.2.6. Uniprot
1.2.6.1. Protein seq database
1.2.7. Boolean search operators
1.2.7.1. AND: search for words together
1.2.7.2. OR: search for words together or singly
1.2.7.3. NOT: exclude this
1.2.7.4. [ ]: field limits (e.g. [orgn] organism, [ab] abstract, [tiab] title/abstract)
1.3. Pairwise Alignments
1.3.1. Dotlet
1.3.1.1. Program for pairwise seq comparison: a diagonal running down is a matching seq, a diagonal running up is a reversed seq, and an X is a palindrome (same forward and back)
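A dot plot is simple enough to sketch directly. The version below is a simplified illustration, not Dotlet's own method: it marks exact single-residue matches only, whereas real tools typically score a sliding window. The name `dot_plot` is hypothetical.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per residue of seq_a; '*' marks a match with seq_b."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


# Comparing a sequence with itself always produces the main (downward) diagonal.
for row in dot_plot("GATTACA", "GATTACA"):
    print(row)
```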
1.3.2. Proteins vs DNA seq in alignments
1.3.2.1. Proteins are preferred because DNA may include non-coding seq, proteins have relatedness going back mil or bil of years, proteins are more informative (20 vs 4 characters)
1.3.3. Orthologs: homologous genes in different species (separated by a speciation event)
1.3.3.1. Paralogs: homologous genes within the same species (arising from gene duplication)
1.3.4. Homology
1.3.4.1. Homologs are similar because they have common ancestry
1.3.5. Global vs local
1.3.5.1. Global: comparing entire sequences (Needleman-Wunsch algorithm), takes longer
1.3.5.2. Local: comparing partial sequences (Smith-Waterman algorithm), better for comparing a shorter seq to a longer seq or to entire databases
1.3.5.2.1. Helps identify new seq
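The difference between global and local alignment shows up clearly in the dynamic-programming recurrence. The sketch below computes only the best score (no traceback), and the match, mismatch, and gap values are illustrative assumptions rather than values from any particular scoring matrix.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Local: best cell anywhere; global: bottom-right corner.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                       # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))  # local
```

The only changes for the local case are the zero floor in each cell and taking the maximum over the whole table, which is why a short shared region can score well even when the flanking sequence is unrelated.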
1.3.6. Scoring matrix
1.3.6.1. Significance
1.3.6.1.1. E-value: the number of hits with a score at least this good that you would expect to find by chance in a database of this size (lower is more significant)
1.3.6.2. Score = SUM(identities, mismatches) - SUM(gap penalties)
1.3.6.2.1. Match = # higher
1.3.6.2.2. Mismatch = # low
1.3.6.2.3. Gap = # lowest
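The score formula above can be applied to an alignment that has already been written out with gap characters. This is a minimal sketch; the default values for `match`, `mismatch`, and `gap_penalty` are assumptions chosen only so that match > mismatch > gap.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 2, mismatch: int = -1, gap_penalty: int = 3) -> int:
    """Score = sum(identities, mismatches) - sum(gap penalties)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total -= gap_penalty   # gap column
        elif x == y:
            total += match         # identity
        else:
            total += mismatch      # mismatch
    return total


# 5 identities (+10), 1 mismatch (-1), 1 gap (-3)
print(score_alignment("GAT-ACA", "GATTACC"))  # 6
```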
1.3.7. False positive/negative
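1.3.7.1. False positive: a match reported between sequences that are not truly related; false negative: a true relationship that the search misses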
1.4. Blast
1.4.1. To compare sequences to the database for homology
1.4.2. Interpretation
1.4.2.1. Gives you a list of homologous sequences along with a % identity (high is good) and an e-value (low is good). Generally want above 75% identity.
1.4.3. How to search
1.4.3.1. Can select distance tree to see relatedness/homology represented as a tree. Can also take FASTAs and align to create phylogeny
1.4.4. Different types of blast
1.4.4.1. blastn: Input DNA, compares to DNA database
1.4.4.1.1. Closely related DNA seq
1.4.4.1.2. Interested in non-coding DNA
1.4.4.1.3. Nucleotide collection nr/nt, optimize for somewhat similar seq
1.4.4.2. blastp: Input AA, compares to protein database
1.4.4.2.1. Discover what protein it codes for and its function
1.4.4.2.2. Use Swiss-Prot, Mask low complexity regions under algorithm parameters, use taxonomy report and distance tree results
1.4.4.3. blastx: Input DNA compares to protein
1.4.4.3.1. Discover proteins encoded by query DNA seq
1.4.4.3.2. If it could contain sequencing errors
1.4.4.4. tblastn: Input protein compares to translated DNA
1.4.4.4.1. Discover new genes encoding simple proteins
1.4.4.4.2. Compare protein with DNA seq translated into their six possible reading frames (3 on each strand)
1.4.4.5. tblastx: Input DNA compares to translated DNA
1.4.4.5.1. Discover new proteins
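The "six possible reading frames (3 on each strand)" used by the translated searches can be enumerated as follows. This sketch only splits the sequence into codons; it does not translate them, and the frame labels (`+1` to `-3`) are a naming choice made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Return the codons of all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGGCCTGA").items():
    print(name, codons)
```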
1.4.5. Challenge: Has error propagation
1.5. Practical Knowledge
1.5.1. Searching GenBank using Entrez
1.5.2. Transcribe/translate DNA seq
1.5.3. Analyze DNA
1.5.4. Find primers for PCR
1.5.5. Emboss
2. Test 2 Topics
2.1. MSA
2.1.1. MSA and phylogenetics
2.1.1.1. A collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
2.1.1.1.1. Done to analyze the homology among a group of related seq from diff species to establish a phylogeny
2.1.1.2. Homologous residues are aligned in columns across the length of the sequence
2.1.1.3. Residues are homologous in an evolutionary sense and a structural sense
2.1.1.4. NCBI Blast known for error propagation because it relies on annotation of the first sequence
2.1.2. Once a gap always a gap
2.1.2.1. Changing gaps gives higher weight to more distantly related species
2.1.2.2. Maintain initial gap choices to trust that those are most believable
2.1.2.3. Gaps are often added to the first two closest seq
2.1.2.4. Where gaps are added is important
2.1.2.5. New gap is much worse than extending gap in scoring
2.1.3. Formula (number of pairwise comparisons among N sequences)
2.1.3.1. ((N-1)*(N))/2
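Assuming the formula counts the pairwise alignments needed for N sequences (the usual reading in progressive MSA), a quick check against an explicit enumeration looks like this. The species names are placeholders.

```python
from itertools import combinations


def pairwise_count(n: int) -> int:
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["human", "mouse", "chicken", "zebrafish"]
pairs = list(combinations(names, 2))
print(pairwise_count(len(names)), len(pairs))  # 6 6
```

The count grows quadratically, which is one reason the slower, more thorough MSA methods are usually reserved for small sets of sequences.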
2.1.4. ClustalW vs ClustalX
2.1.4.1. Both are progressive, like T-coffee
2.1.4.2. ClustalX is downloadable/desktop version
2.1.5. Interpretation
2.1.5.1. Conserved regions or domains signify common ancestry
2.1.6. JalView: graphical representation, can pick out just the most conserved domain
2.1.7. WebLogo: sequence logo (stacked letters) showing the prevailing/conserved base or residue at each position
2.1.8. T-coffee: slowest, normally most accurate, best for small number of alignments
2.1.9. MUSCLE: best for alignments between 100-1000 seq
2.1.10. Clustal Omega: large amounts of seq
2.1.11. Steps:
2.1.11.1. Obtain the seq's of interest in the FASTA format (Homologene)
2.1.11.2. Compute an alignment using a distance matrix
2.1.11.3. Create a guide tree
2.1.11.3.1. Convert similarity scores to distance scores
2.1.11.3.2. tree shows distances between objects
2.1.11.3.3. Use UPGMA
2.1.11.3.4. ClustalW provides syntax to describe tree
2.1.11.3.5. Guide tree =/= phylogenetic tree
2.1.11.4. Progressively align the sequences
2.1.11.4.1. Make MSA based on order in guide tree
2.1.11.4.2. Start w/ 2 most closely related seq
2.1.11.4.3. Add next two closest related
2.1.11.4.4. Continue until all seq are added to the MSA
2.1.12. Basic idea: a significant alignment of the query seq w/ a target seq from PDB is evidence that the query seq has a similar 3-D structure
2.2. Phylogenetics
2.2.1. Placement of species into a tree based on evolutionary patterns of inheritance found in genetic analysis
2.2.2. Phylogenetic tree
2.2.2.1. Using molecular/genetic data
2.2.2.2. Can be used to determine a common ancestor, how distantly related, biogeography
2.2.2.3. Selection of sequences: must be the same gene/protein, similar length
2.2.2.4. Must have conserved MSA (lots of * or :) or conserved domain
2.2.3. Interpretation
2.2.3.1. Most information given by the branch length and locations of nodes
2.2.4. Topology
2.2.4.1. Node: where the branches diverge
2.2.4.2. Taxa: ends of branches
2.2.4.2.1. Operational Taxonomic unit: DNA, AA seq
2.2.4.3. Root: base common ancestor
2.2.4.3.1. Outgroup: a more distantly related taxon, outside the clade of interest, used to root the tree
2.2.4.4. Clade: monophyletic group
2.2.5. UPGMA
2.2.5.1. Unweighted Pair Group Method with Arithmetic mean: a simple bottom-up hierarchical clustering method
2.2.5.1.1. Compute pairwise distances of all proteins
2.2.5.1.2. Number
2.2.5.1.3. Find the 2 proteins with the smallest pairwise distance and cluster, find next, etc. Keep going until all grouped
2.2.5.2. Uses distance matrix & all taxa are an equal distance from root
2.2.5.3. Always rooted, less accurate
2.2.5.4. Creates a phenogram or phenetic tree, a guide tree for a phylogeny
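The UPGMA steps above can be sketched as follows. This is a minimal illustration that returns only the branching order as nested tuples (no branch lengths); the distances are made-up values, and the function name `upgma` and its input format are assumptions for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster with UPGMA and return the tree as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping (label_a, label_b) -> pairwise distance.
    """
    sizes = {name: 1 for name in names}              # cluster -> number of leaves
    d = {frozenset(k): v for k, v in dist.items()}   # order-independent lookup
    while len(sizes) > 1:
        # Find the two clusters with the smallest pairwise distance.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average,
        # i.e. the arithmetic mean over all leaf pairs.
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Hypothetical distance matrix for four proteins.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

A and B are joined first because they are closest, then C joins that pair, and D joins last, which is the order a progressive aligner would follow if this were used as a guide tree.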
2.2.6. Neighbor-joining:
2.2.6.1. Place all taxa in star-like pattern
2.2.6.2. Identify neighbors that are most closely related
2.2.6.3. Connect these neighbors to other OTU's via an internal branch
2.2.6.4. Not always very effective
2.2.7. Cladogram vs phylogram
2.2.7.1. Cladogram: branch lengths same
2.2.7.1.1. OTU's are neatly aligned and nodes reflect time
2.2.7.2. Phylogram: branch lengths drawn to scale
2.2.7.2.1. Branch lengths are proportional to number of AA changes
2.2.8. Bootstrapping
2.2.8.1. Measure of robustness
2.2.8.2. Given a branching order, how consistently does an algorithm find that branching order in randomly resampled versions of the original data set
2.2.8.2.1. Make artificial dataset using randomly selected sampling columns from MSA
2.2.8.2.2. Make dataset same size as original
2.2.8.2.3. Replicates
2.2.8.2.4. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates (>70% is considered significant)
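The resampling step of bootstrapping can be sketched as below. It only builds the artificial datasets by sampling MSA columns with replacement; rebuilding a tree from each replicate and counting clade support is left out. The toy alignment and the name `bootstrap_replicate` are assumptions for this example.

```python
import random


def bootstrap_replicate(msa: list[str], rng: random.Random) -> list[str]:
    """Resample alignment columns with replacement; same size as the original."""
    width = len(msa[0])
    columns = [rng.randrange(width) for _ in range(width)]
    return ["".join(row[c] for c in columns) for row in msa]


msa = ["ACGTAC", "ACGTTC", "ACCTAC"]   # toy alignment, one row per sequence
rng = random.Random(0)                 # fixed seed so runs are repeatable
replicates = [bootstrap_replicate(msa, rng) for _ in range(100)]
print(replicates[0])
```

Because whole columns are sampled, each replicate keeps the residues of a column together across all sequences; only which columns appear, and how often, changes.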
2.3. Genome sequencing, gene finding and annotation
2.3.1. Coding regions in Prokaryotic and Eukaryotic genes
2.3.2. Genome sequencing
2.3.2.1. 1st genome to be seq was a bacteriophage, 1st prokaryote was Haemophilus influenzae, 1st eukaryote was yeast
2.3.2.2. Steps to assembling genes:
2.3.2.2.1. Small reads (ESTs) created
2.3.2.2.2. Compile into contigs; assembly (2 or more seq)
2.3.2.2.3. Combine contigs into scaffolds (very big)
2.3.2.2.4. Join 2 or more scaffolds to create the genome
2.3.3. Genome
2.3.3.1. Collection of the DNA that comprises an organism
2.3.3.2. Can browse genomes on NCBI
2.3.3.3. Criteria for what to seq:
2.3.3.3.1. Genome size
2.3.3.3.2. Cost
2.3.3.3.3. Relevance to human disease
2.3.3.3.4. Relevance to basic biology questions
2.3.3.3.5. Relevance to agriculture
2.3.3.4. Two main strategies:
2.3.3.4.1. Whole genome shotgun: Most common, shred into smaller fragments that are sequenced individually, sequences then ordered based on overlaps in genetic code and reassembled into complete seq
2.3.3.4.2. Hierarchical shotgun: Applied to large overlapping DNA fragments of known location in the genome
2.3.4. Gene finding
2.3.4.1. Find the coding region in DNA seq
2.3.4.1.1. Blast against dbEST to see where the coding regions are
2.3.4.1.2. Use tools/algorithms that predict exons, introns, and other features
2.3.4.2. Different packages/tools have been designed to find gene features in a given seq
2.3.4.3. Depends on organism
2.3.4.3.1. Prokaryotes
2.3.4.3.2. Eukaryotes
2.3.4.4. VecScreen: check to see if seq has any vector contamination
2.3.5. Gene annotation
2.3.5.1. Information content in genomic DNA includes
2.3.5.1.1. Repetitive DNA elements
2.3.5.1.2. Nucleotide composition (GC content)
2.3.5.1.3. Protein-coding genes and other genes
2.3.5.1.4. GC content varies across genomes
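Nucleotide composition is easy to compute directly. The sketch below reports GC content for a whole sequence and for non-overlapping windows; the function names and the window size are illustrative choices.

```python
def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return (dna.count("G") + dna.count("C")) / len(dna)


def windowed_gc(dna: str, window: int) -> list[float]:
    """GC content of consecutive non-overlapping windows."""
    return [
        gc_content(dna[i:i + window])
        for i in range(0, len(dna) - window + 1, window)
    ]


print(gc_content("GGCCAATT"))      # 0.5
print(windowed_gc("GGCCAATT", 4))  # [1.0, 0.0]
```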
2.3.5.2. Use Blast against nr
2.3.5.3. Blast against MSAs of protein families
2.3.5.3.1. PFAM
2.3.5.3.2. Prosite
3. Test 3 Topics
3.1. Protein Structure
3.1.1. What are the levels of protein structure?
3.1.1.1. Primary: AA seq
3.1.1.2. Secondary: Alpha helices & Beta-pleated sheets
3.1.1.2.1. Super-secondary: hairpin turns, B-a-B units
3.1.1.3. Tertiary: Multiple alpha and beta making up the actual folding of the protein
3.1.1.3.1. Domains: compact units w/in the folding pattern of a single chain, fall b/t super secondary and tertiary
3.1.1.3.2. Modular proteins have multiple domains
3.1.1.4. Quaternary: One protein's interaction w/ others, merged (dimerized) proteins (hemoglobin)
3.1.2. The hydrophobic effect: hydrophobic (non-polar) AA tend to be sequestered in protein interiors away from the solvent
3.1.2.1. Hydrophobic Profile: a plot used to show the hydrophobic (and therefore transmembrane) regions
3.1.3. CASP
3.1.3.1. Critical Assessment of Structure Prediction
3.1.3.2. X-ray crystallographers and NMR specialists publish protein seq, the experimental structure and predictions are compared at a conference, closest match wins
3.1.3.2.1. Prediction mostly based on homology: compare similarity to known protein structures
3.1.4. Cn3D
3.1.4.1. Does not predict anything, only shows existing models
3.1.4.2. Examine structure to determine helices, beta sheets, linear, conserved domains
3.1.4.3. Mouse mode -> select columns: highlights an amino acid so you can tell if it's on the surface or inside the molecule, easier to see with space fill
3.1.5. PFam
3.1.5.1. Seq search
3.1.5.2. Shows what protein family the seq is in
3.1.5.3. Analyze e-value (lower better)
3.2. Gene Expression: How much of that gene product is produced. Therefore, to measure it you would look at the # of mRNA in the experiment (count the reads)
3.2.1. cDNA microarray: Full length probes that are cloned/amplified via PCR (prone to cross-hybridization)
3.2.1.1. cDNA microarray experiment steps
3.2.1.1.1. Isolate mRNA
3.2.1.1.2. Tag w/ fluorescent red/green
3.2.1.1.3. mix together
3.2.1.1.4. image
3.2.1.1.5. addressing: locate spots
3.2.1.1.6. segmenting: classify the pixels as signal or noise
3.2.1.1.7. Information extraction: can estimate background by looking for areas where there is not any color (color pixel = signal)
3.2.2. Oligonucleotide array: a unique probe, small segments ~25-75bp long, looks for specific gene or EST
3.2.3. Normalization: the background noise in the data is removed in order to clean up data, remove outliers, condense into symmetrical graphics that are easier to interpret
3.2.4. Inferential statistics: in inferential stats (e.g. t-test) you use a statistical test to draw conclusions about the wider population from your experimental results
3.2.5. Descriptive statistics: clusters your data, looks for patterns
3.2.6. Expression ratios: Calc the ratio of red vs green (experimental vs control) to see where experimental caused more expression (no effect but continued normal expression = 1:1)
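3.2.6.1. Why you should take the log base 2 of them: it makes up- and down-regulation symmetric around 0 (2-fold up = +1, 2-fold down = -1, no change = 0)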
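A short numeric example shows what the log2 transform does to expression ratios. The intensity values are made up for illustration.

```python
import math


def log2_ratio(red: float, green: float) -> float:
    """log2 of the experimental (red) over control (green) intensity."""
    return math.log2(red / green)


# Hypothetical spot intensities: 2-fold up, unchanged, 2-fold down.
for red, green in [(200, 100), (100, 100), (100, 200)]:
    print(red, green, red / green, log2_ratio(red, green))
```

The raw ratios are 2.0, 1.0 and 0.5, which are not evenly spaced around "no change"; the log2 values are +1, 0 and -1, so equal fold changes in either direction sit the same distance from zero.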
3.2.7. Interpret a Scatter plot: Top area is high expression, bottom left is lower expression, diagonal is expressed with no change, top of diagonal is expression in both control and experiment, above diagonal is down-regulation, below diagonal is up-regulation
3.3. Doing Biochemistry on a computer
3.3.1. Finding molecular weight and other info about a protein
3.3.1.1. Expasy -> protparam -> enter accession # or FASTA seq
3.3.1.2. pI (isoelectric point) is the pH at which the protein carries no net charge
3.3.1.3. Half-life so you know if it's worth working on
3.3.1.4. Instability index: beyond 40 is not stable
3.3.2. Finding transmembrane regions
3.3.2.1. ProtScale: Gives hydrophobicity, set window size to 19, cut off is 1.6 threshold
3.3.2.2. TMHMM: Uses hidden Markov models, only accepts FASTA, get seq from SwissProt
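The sliding-window idea behind a hydrophobicity plot can be sketched as follows, using the window of 19 and the 1.6 cut-off from the notes. This assumes the Kyte-Doolittle scale (one of several scales such tools offer) and is a simplification, not a reimplementation of ProtScale or TMHMM; the test sequence is artificial.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list[float]:
    """Average hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(protein: str, window: int = 19,
                         threshold: float = 1.6) -> list[int]:
    """0-based start positions of windows whose average exceeds the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, avg in enumerate(profile) if avg > threshold]


# Artificial sequence: a 19-residue leucine stretch flanked by aspartates.
seq = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_windows(seq))
```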
3.3.3. Predicting protease digestion
3.3.3.1. To see where an enzyme will cut a given protein
3.3.3.2. Useful if you want to separate the domains in your protein or to make sure that your protein isn't sensitive to some endogenous proteases
3.3.3.3. Expasy -> PeptideCutter
3.3.3.4. Says the position of cleavage sites and # of cleavages
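As an illustration of what such a prediction reports, here is a sketch for one enzyme. It assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P); real tools apply more detailed rules and cover many enzymes. The peptide sequence is invented.

```python
def trypsin_digest(protein: str) -> tuple[list[int], list[str]]:
    """Return (cleavage positions, peptides) under a simplified trypsin rule.

    A position n means the cut falls after the n-th residue (1-based).
    """
    protein = protein.upper()
    sites = [
        i + 1
        for i in range(len(protein) - 1)
        if protein[i] in "KR" and protein[i + 1] != "P"
    ]
    bounds = [0] + sites + [len(protein)]
    peptides = [protein[start:end] for start, end in zip(bounds, bounds[1:])]
    return sites, peptides


sites, peptides = trypsin_digest("AAKGGRPLLKMM")
print(sites)     # [3, 10]
print(peptides)  # ['AAK', 'GGRPLLK', 'MM']
```

The R at position 6 is not cut because it is followed by proline, which is why the number of cleavages can be lower than the number of K and R residues.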
3.3.4. Finding domains.
3.3.4.1. * exact match
3.3.4.2. : highly similar
3.3.4.3. . similar
3.3.4.4. InterPro: finds domains and functional classifications
3.3.4.5. CD search server: CD= Conserved Domain, deselect "apply low complexity filter", threshold=1
3.3.4.6. Motif Scan: Database of motifs -> prosite profiles
3.4. EST Analysis
3.4.1. Assembly of EST sequences
3.4.1.1. CAP3
3.4.2. How to interpret EST assemblies (know what genes are highly induced)
3.4.2.1. Blastx to find what it codes for
3.4.2.2. Highly expressed EST will have a lot of reads