Bioinformatics Final


1. Test 1 Topics

1.1. Intro

1.1.1. Basics of Molecular Biology & evolution. DNA -> RNA -> Protein. cDNA: complementary DNA reverse-transcribed from RNA, used in sequencing & biotech (double-stranded). mRNA: messenger RNA. miRNA: micro RNA. PCR: polymerase chain reaction, used to amplify DNA in order to sequence it. Introns: spliced out of the RNA transcript. Exons: retained in the mature RNA; contain the coding regions.

1.1.2. Bioinformatics. Fields of study it is used in: proteomics, genomics, taxonomy & evolution, medicine, biophysics, molecular bio. Bioinformatics is the application of computers and mathematics to biological data in order to analyse it and obtain useful information.

1.1.3. Algorithm. Self-contained step-by-step instructions for a computer program to perform a function ("if/then"). CompSci plays a major role, as programming and database admin are needed to create, maintain, and use the giant repositories of genetic and proteomic data. Relevant areas: analysis of algorithms, data structures & info retrieval, software engineering.

1.1.4. What is a high level computer language? Python, PERL (interpreting data), Java, C++ (software)

1.2. Databases

1.2.1. Searching bio-databases. NCBI: can search by name, paper, author, organism, gene accession #, etc. (Entrez, OMIM)

1.2.2. Types of information on databases Genomic Evolutionary info (homologous genes, taxonomic info) Genomic info: chromosome length, introns, regulatory regions, shared domains Nucleotide database on NCBI gives location info and CDS (coding region) Proteomic Structural info: associated protein structures, fold types, structural domains Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases ESTs Represent coding regions, small Expression info: expression specific to particular tissues, developmental stages, phenotypes, diseases mRNA unassembled seq of all types

1.2.3. Genome Browser. Examples are the UCSC Genome Browser and NCBI's genome viewer (GenBank itself is the underlying sequence database). Allows user to look through fully sequenced genomes and related information

1.2.4. Homologene Used to find homologous (similar because of common ancestry) genes among other organisms

1.2.5. dbEST: EST (Expressed Sequence Tag) database

1.2.6. UniProt: Protein seq database

1.2.7. Boolean search operators. AND: both words must be present. OR: either word (or both) may be present. NOT: exclude this. [ ]: limits the search to a field (ex. [orgn] organism, [ab] abstract, [tiab] title/abstract)

1.3. Pairwise Alignments

1.3.1. Dotlet: program for pairwise seq comparison (dot plot). A diagonal running down is a matching seq, a diagonal running up is a reversed seq, and an X is a palindrome (same forward and back)

1.3.2. Proteins vs DNA seq in alignments Proteins are preferred because DNA may include non-coding seq, proteins have relatedness going back mil or bil of years, proteins are more informative (20 vs 4 characters)

1.3.3. Orthologs: homologous genes in different species, separated by a speciation event. Paralogs: homologous genes within the same species, separated by a gene duplication event

1.3.4. Homology Homologs are similar because they have common ancestry

1.3.5. Global vs local. Global: comparing entire sequences (Needleman-Wunsch algorithm), takes longer. Local: comparing partial sequences (Smith-Waterman algorithm), better for comparing a shorter seq to a longer seq or to entire databases. Helps identify new seq
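To make the global/local distinction concrete, here is a minimal sketch of both dynamic-programming recurrences in one function. The scoring values (match +1, mismatch -1, linear gap -2) and the function name `align_score` are illustrative assumptions, not values from the course; real tools use substitution matrices and affine gaps.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a vs b by dynamic programming.

    local=False -> global alignment (Needleman-Wunsch style)
    local=True  -> local alignment (Smith-Waterman style)
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                score = max(score, 0)
            H[i][j] = score
            best = max(best, score)
    # Global: score of aligning everything. Local: best-scoring sub-alignment.
    return best if local else H[-1][-1]


print(align_score("GATTACA", "GATTACA"))                          # global
print(align_score("TTTGATTACATTT", "CCGATTACACC", local=True))    # local
```

With flanking junk on both sequences, the local score only rewards the shared core, which is why local alignment suits database searches.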

1.3.6. Scoring matrix. Significance: E-value, the number of hits with an equal or better score expected by chance (lower is better). Score = SUM(scores for identities and mismatches) - SUM(gap penalties). Match = highest value, mismatch = low value, gap = lowest value
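The score formula can be sketched as a small function that walks an already-gapped alignment column by column. The numeric values and the name `score_alignment` are assumptions for illustration; opening a gap is penalised more than extending one, and for simplicity adjacent gaps count as one run regardless of which sequence they are in.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length aligned strings, where '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First gap column pays the opening penalty, later ones the extension.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


print(score_alignment("GATTACA", "GA--ACA"))  # 5 matches, one gap of length 2
```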

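1.3.7. False positive/negative. False positive: a reported match that is not a true homolog (similarity due to chance or error). False negative: a true homolog that the search fails to report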

1.4. Blast

1.4.1. To compare sequences to the database for homology

1.4.2. Interpretation Gives you a list of homologous sequences along with a % identity (high is good) and an e-value (low is good). Generally want above 75% identity.

1.4.3. How to search Can select distance tree to see relatedness/homology represented as a tree. Can also take FASTAs and align to create phylogeny

1.4.4. Different types of blast. blastn: input DNA, compares to DNA database; for closely related DNA seq or when interested in non-coding DNA; use Nucleotide collection nr/nt, optimize for somewhat similar seq. blastp: input AA, compares to protein database; discover what the protein is and its function; use Swiss-Prot, mask low complexity regions under algorithm parameters, use taxonomy report and distance tree results. blastx: input DNA (translated), compares to protein database; discover proteins encoded by query DNA seq; useful if the query could contain sequencing errors. tblastn: input protein, compares to translated DNA; discover new genes encoding simple proteins; compares protein with DNA seq translated into their six possible reading frames (3 on each strand). tblastx: input DNA (translated), compares to translated DNA; discover new proteins
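The "six possible reading frames (3 on each strand)" used by the translated searches can be illustrated with a short sketch. It assumes the standard genetic code and uses hypothetical helper names (`translate`, `six_frames`); it is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate frames +1..+3 (given strand) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```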

1.4.5. Challenge: Has error propagation

1.5. Practical Knowledge

1.5.1. Searching GenBank using Entrez

1.5.2. Transcribe/translate DNA seq

1.5.3. Analyze DNA

1.5.4. Find primers for PCR

1.5.5. Emboss

2. Test 2 Topics

2.1. MSA

2.1.1. MSA and phylogenetics. An MSA is a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned. Done to analyze the homology among a group of related seq from diff species to establish a phylogeny. Homologous residues are aligned in columns across the length of the sequence. Residues are homologous in an evolutionary sense and a structural sense. NCBI Blast is known for error propagation because it relies on the annotation of the first sequence

2.1.2. Once a gap always a gap Changing gaps gives higher weight to more distantly related species Maintain initial gap choices to trust that those are most believable Gaps are often added to the first two closest seq Where gaps are added important New gap is much worse than extending gap in scoring

2.1.3. Formula: number of pairwise alignments needed for N sequences = ((N-1)*(N))/2
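A quick check of the formula, assuming it counts the pairwise comparisons needed before building the guide tree. The species names are made up for the example.

```python
from itertools import combinations


def n_pairwise(n):
    """Number of unordered pairs among n sequences: N(N-1)/2."""
    return n * (n - 1) // 2


names = ["human", "mouse", "chicken", "zebrafish"]  # hypothetical sequence labels
pairs = list(combinations(names, 2))

print(n_pairwise(len(names)), len(pairs))  # both count the same pairs
print(n_pairwise(100))                      # grows quadratically with N
```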

2.1.4. ClustalW vs ClustalX: both progressive like T-coffee; ClustalX is the downloadable/desktop version

2.1.5. Interpretation Conserved regions or domains signify common ancestry

2.1.6. JalView: graphical representation, can pick out just the most conserved domain

2.1.7. WebLogo: sequence logo (stacked-letter histogram) that represents the prevailing/conserved base or residue at each position

2.1.8. T-coffee: slowest, normally most accurate, best for small number of alignments

2.1.9. MUSCLE: best for alignments between 100-1000 seq

2.1.10. Clustal Omega: large amounts of seq

2.1.11. Steps: (1) Obtain the seqs of interest in FASTA format (Homologene). (2) Compute an alignment using a distance matrix. (3) Create a guide tree: convert similarity scores to distance scores; tree shows distances between objects; use UPGMA; ClustalW provides syntax to describe the tree; guide tree =/= phylogenetic tree. (4) Progressively align the sequences: make the MSA based on the order in the guide tree; start w/ the 2 most closely related seq; add the next closest related; continue until all seq are added to the MSA

2.1.12. Basic idea: a significant alignment of the query seq w/ a target seq from PDB is evidence that the query seq has a similar 3-D structure

2.2. Phylogenetics

2.2.1. Placement of species into a tree based on evolutionary patterns of inheritance found in genetic analysis

2.2.2. Phylogenetic tree Using molecular/genetic data Can be used to determine a common ancestor, how distantly related, biogeography Selection of sequences: must be the same gene/protein, similar length Must have conserved MSA (lots of * or :) or conserved domain

2.2.3. Interpretation Most information given by the branch length and locations of nodes

2.2.4. Topology. Node: where the branches diverge. Taxa: ends of branches. Operational Taxonomic Unit (OTU): the DNA or AA seq at a branch tip. Root: base common ancestor. Outgroup: a more distantly related taxon, outside the clade of interest, used to root the tree. Clade: monophyletic group

2.2.5. UPGMA: Unweighted Pair Group Method with Arithmetic mean, a simple bottom-up hierarchical clustering method. Compute pairwise distances of all proteins. Find the 2 proteins with the smallest pairwise distance and cluster them, find the next, etc. Keep going until all are grouped. Uses a distance matrix & all taxa are an equal distance from the root. Always rooted, less accurate. Creates a phenogram or phenetic tree, a guide tree for a phylogeny
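The clustering loop described above can be sketched in a few lines. This is a minimal illustration, assuming a symmetric distance matrix given as a list of lists; the function name `upgma` and the nested-tuple output format are choices made for the example, not ClustalW's representation.

```python
def upgma(names, matrix):
    """Cluster taxa bottom-up; matrix[i][j] is the distance between names[i] and names[j].

    Returns nested tuples (left, right, height), where height is the distance
    from that node down to any of its leaves (equal for all leaves below it).
    """
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                        # cluster id -> leaf count
    d = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)
    while len(trees) > 1:
        # Find the two clusters with the smallest pairwise distance.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        # Distance from the merged cluster to each remaining cluster is the
        # arithmetic mean over all leaf pairs (a size-weighted average).
        for k in trees:
            if k in (a, b):
                continue
            dak = d.pop((min(a, k), max(a, k)))
            dbk = d.pop((min(b, k), max(b, k)))
            d[(k, next_id)] = (sizes[a] * dak + sizes[b] * dbk) / (sizes[a] + sizes[b])
        del d[(a, b)]
        trees[next_id] = (trees.pop(a), trees.pop(b), dab / 2)
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        next_id += 1
    return next(iter(trees.values()))


# Toy distances: A and B are closest, C is equally far from both.
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree)
```

Because each merge places the new node at half the cluster distance, every leaf ends up the same distance from the root, which is the molecular-clock assumption that makes UPGMA less accurate on real data.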

2.2.6. Neighbor-joining: Place all taxa in star-like pattern Identify neighbors that are most closely related Connect these neighbors to other OTU's via an internal branch Not always very effective

2.2.7. Cladogram vs phylogram Cladogram: branch lengths same OTU's are neatly aligned and nodes reflect time Phylogram: branch lengths drawn to scale Branch lengths are proportional to number of AA changes

2.2.8. Bootstrapping: a measure of robustness. Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set? Make an artificial dataset by randomly sampling columns from the MSA (with replacement). Make the dataset the same size as the original. Repeat to get many replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates (>70% is significant)
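A minimal sketch of the resampling step follows. The tree-building step is replaced by a hypothetical callback `has_clade`, which in a real analysis would rebuild the tree from the replicate and report whether the clade of interest is present; here a toy Hamming-distance check stands in for it.

```python
import random


def bootstrap_replicate(msa, rng):
    """Sample alignment columns with replacement; the replicate keeps the original width."""
    width = len(msa[0])
    cols = [rng.randrange(width) for _ in range(width)]
    return ["".join(seq[c] for c in cols) for seq in msa]


def bootstrap_support(msa, has_clade, n_replicates=100, seed=0):
    """Percent of replicates in which has_clade(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(bool(has_clade(bootstrap_replicate(msa, rng))) for _ in range(n_replicates))
    return 100.0 * hits / n_replicates


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def first_two_are_closest(msa):
    """Toy stand-in for 'the tree groups sequences 0 and 1 together'."""
    d01 = hamming(msa[0], msa[1])
    return d01 < hamming(msa[0], msa[2]) and d01 < hamming(msa[1], msa[2])


toy_msa = ["ACGTACGT", "ACGTACGA", "TTGTCCGA"]  # made-up aligned sequences
print(bootstrap_support(toy_msa, first_two_are_closest))
```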

2.3. Genome sequencing, gene finding and annotation

2.3.1. Coding regions in Prokaryotic and Eukaryotic genes

2.3.2. Genome sequencing. 1st organism to be seq was a bacteriophage (phi X174), 1st prokaryote was Haemophilus influenzae (a bacterium, not the influenza virus), 1st eukaryote was yeast (Saccharomyces cerevisiae). Steps to assembling a genome: small reads (ESTs) created; compile into contigs by assembly (2 or more seq); combine contigs into scaffolds (very big); join 2 or more scaffolds to create the genome

2.3.3. Genome Collection of the DNA that comprises an organism Can browse genomes on NCBI Criteria for what to seq: Genome size Cost Relevance to human disease Relevance to basic biology questions Relevance to agriculture Two main strategies: Whole genome shotgun: Most common, shred into smaller fragments that are sequenced individually, sequences then ordered based on overlaps in genetic code and reassembled into complete seq Hierarchical shotgun: Applied to large overlapping DNA fragments of known location in the genome

2.3.4. Gene finding Find the coding region in DNA seq Blast against dbEST to see where the coding regions are Use tools/algorithms that predict exons, introns, and other features Different packages/tools have been designed to find gene features in a given seq Depends on organism Prokaryotes Eukaryotes VecScreen: check to see if seq has any vector contamination

2.3.5. Gene annotation. Information content in genomic DNA includes: repetitive DNA elements, nucleotide composition (GC content), protein-coding genes and other genes. GC content varies across genomes. Use Blast against nr, Blast against MSAs of protein families, PFAM, Prosite
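GC content is simple to compute directly. The sketch below also scans a sequence in windows, since GC content varies along a genome; the window and step sizes are arbitrary assumptions, and ambiguous bases such as N are ignored.

```python
def gc_content(seq):
    """Percent G+C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window=100, step=50):
    """(start index, GC percent) for each full window along seq."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


print(gc_content("ATGCGCGCTA"))
print(gc_windows("AAAAGGGG", window=4, step=4))
```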

3. Test 3 Topics

3.1. Protein Structure

3.1.1. What are the levels of protein structure? Primary: AA seq Secondary: Alpha helices & Beta-pleated sheets Super-secondary: hairpin turns, B-a-B units Tertiary: Multiple alpha and beta making up the actual folding of the protein Domains: compact units w/in the folding pattern of a single chain, fall b/t super secondary and tertiary Modular proteins have multiple domains Quaternary: One protein's interaction w/ others, merged (dimerized) proteins (hemoglobin)

3.1.2. The hydrophobic effect: hydrophobic (non-polar) AA tend to be sequestered in protein interiors, away from the solvent. Hydrophobic profile: a plot used to show the hydrophobic (and therefore possibly transmembrane) regions

3.1.3. CASP Critical Assessment of Structure Prediction X-ray crystallographers and NMR specialists publish protein seq, the experimental structure and predictions are compared at a conference, closest match wins Prediction mostly based on homology: compare similarity to known protein structures

3.1.4. Cn3D: does not predict anything, only shows existing models. Examine the structure to determine helices, beta sheets, linear regions, conserved domains. Mouse mode -> select columns: highlights an amino acid so you can tell if it's on the surface or inside the molecule; easier to see with space fill

3.1.5. PFam Seq search Shows what protein family the seq is in Analyze e-value (lower better)

3.2. Gene Expression: How much of that gene product is produced. Therefore, to measure it you would look at the # of mRNA in the experiment (count the reads)

3.2.1. cDNA microarray: Full length probes that are cloned/amplified via PCR (prone to cross-hybridization) cDNA microarray experiment steps Isolate mRNA Tag w/ fluorescent red/green mix together image addressing: locate spots segmenting: classify the pixels as signal or noise Information extraction: can estimate background by looking for areas where there is not any color (color pixel = signal)

3.2.2. Oligonucleotide array: a unique probe, small segments ~25-75bp long, looks for a specific gene or EST

3.2.3. Normalization: the background noise in the data is removed in order to clean up data, remove outliers, condense into symmetrical graphics that are easier to interpret

3.2.4. Inferential statistics: in inferential stats (e.g. t-test) you use a statistical test to draw inferences about the wider population from your experimental results

3.2.5. Descriptive statistics: clusters your data, looks for patterns

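3.2.6. Expression ratios: calc the ratio of red vs green (experimental vs control) to see where the experimental condition caused more expression (no effect, i.e. continued normal expression = 1:1). Why you should take the log base 2 of them: it makes up- and down-regulation symmetric around 0 (2-fold up = +1, no change = 0, 2-fold down = -1)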
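A short worked example shows what the log base 2 does to raw ratios. The intensity values are made up, and the red = experimental, green = control assignment follows the convention stated in the notes.

```python
import math


def log2_ratio(red, green):
    """log2 of experimental (red) over control (green) intensity."""
    return math.log2(red / green)


# Hypothetical spot intensities: (red, green)
spots = {"gene_up": (200, 100), "gene_same": (100, 100), "gene_down": (100, 200)}
for gene, (red, green) in spots.items():
    print(gene, red / green, log2_ratio(red, green))
```

On the raw scale a 2-fold increase is 2.0 but a 2-fold decrease is 0.5, so decreases are squeezed between 0 and 1; after the log transform both sit one unit from zero.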

3.2.7. Interpret a Scatter plot: Top area is high expression, bottom left is lower expression, diagonal is expressed with no change, top of diagonal is expression in both control and experiment, above diagonal is down-regulation, below diagonal is up-regulation

3.3. Doing Biochemistry on a computer

3.3.1. Finding molecular weight and other info about a protein. Expasy -> ProtParam -> enter accession # or FASTA seq. pI (isoelectric point) is the pH at which the protein has no net charge. Half-life tells you if it's worth working on. Instability index: above 40 is predicted not stable

3.3.2. Finding transmembrane regions. ProtScale: gives hydrophobicity, set window size to 19, cut off is 1.6 threshold. TMHMM: uses hidden Markov models, only accepts FASTA, get seq from SwissProt
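The sliding-window idea behind a hydrophobicity plot can be sketched as follows. It assumes the Kyte-Doolittle hydropathy scale and a plain unweighted window average; ProtScale offers several scales and weighting options, so this is an approximation of the approach, not a reimplementation of the tool.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """(1-based centre position, mean hydropathy) for each full window.

    Raises KeyError on letters outside the 20 standard amino acids.
    """
    seq = seq.upper()
    half = window // 2
    profile = []
    for i in range(len(seq) - window + 1):
        avg = sum(KD[aa] for aa in seq[i:i + window]) / window
        profile.append((i + half + 1, avg))
    return profile


def candidate_tm_positions(seq, window=19, cutoff=1.6):
    """Window centres whose mean hydropathy exceeds the cutoff."""
    return [pos for pos, avg in hydropathy_profile(seq, window) if avg > cutoff]


# Made-up sequence: a 19-residue leucine stretch flanked by charged residues.
toy = "D" * 20 + "L" * 19 + "D" * 20
print(candidate_tm_positions(toy))
```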

3.3.3. Predicting protease digestions: to see where an enzyme will cut a given protein. Useful if you want to separate the domains in your protein or to make sure that your protein isn't sensitive to some endogenous proteases. Expasy -> PeptideCutter. Gives the position of cleavage sites and # of cleavages
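As an illustration of what a cleavage predictor reports, here is a sketch for one enzyme under a simplified rule: trypsin cuts after K or R unless the next residue is P. The function names are hypothetical, and real tools apply more detailed exception rules and support many enzymes.

```python
import re


def trypsin_sites(protein):
    """1-based positions of the residues after which trypsin is predicted to cut."""
    protein = protein.upper()
    # K or R not followed by P; a K/R at the very end leaves nothing to cut off.
    return [m.end() for m in re.finditer(r"[KR](?!P)", protein) if m.end() < len(protein)]


def digest(protein):
    """Peptides produced by cutting at every predicted site."""
    protein = protein.upper()
    cuts = [0] + trypsin_sites(protein) + [len(protein)]
    return [protein[a:b] for a, b in zip(cuts, cuts[1:])]


toy = "AAKGGRPCCKDD"  # made-up sequence
print(trypsin_sites(toy), len(trypsin_sites(toy)))
print(digest(toy))
```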

3.3.4. Finding domains. Alignment symbols: * = exact match, : = highly similar, . = similar. InterPro: finds domains and functional classifications. CD search server: CD = Conserved Domain, deselect "apply low complexity filter", threshold=1. Motif Scan: database of motifs -> Prosite profiles

3.4. EST Analysis

3.4.1. Assembly of EST sequences CAP3

3.4.2. How to interpret EST assemblies (know what genes are highly induced) Blastx to find what it codes for Highly expressed EST will have a lot of reads