Practice English Speaking&Listening with: Biological Sequence Analysis I - Andy Baxevanis (2012)

Normal
(0)
Difficulty: 0

Andy Baxevanis: Before we get into this week's lecture, for

those of you who are joining us for the first time, if you didn't pick up a copy of the

syllabus last week, they're up in the corners by the door, so please pick one up on your

way in. Also on the syllabus is the URL for the course website, which, once again, is

the place where you can find information about the course. Most importantly, for those of

you who may miss a lecture, there -- this is the place where we post all of the YouTube

versions of each one of the live lectures. Our goal is to have the lectures up the Thursday

following the live lecture so that if you miss a lecture one week, you at least have

some time to watch last week's lecture before this week's lecture. Just by way of reminder,

if you are interested in CNME credit, please be sure to sign the sign-in sheets in the

back of the room. My disclosure for the week is right there. Okay.

So, in the same way that Dr. Green last week laid the groundwork for upcoming lectures

in the course by providing you a broad overview of the field of human genomics last week,

my job, over the two weeks that I'm with you, is to give you a solid foundation for the

upcoming lectures devoted to bioinformatics. I mentioned last week that genomics and bioinformatics

really do go hand in hand with one another, and Eric reinforced that sentiment when he

showed you this figure from NHGRI's current vision document that was published in Nature

last February, showing the spectrum of genomics-based research, going from understanding human genomes,

just the structure and the actual sequence, going through the biology of what's encoded

in those genomes, how that relates to our understanding of human disease, advancing

that into medical applications to finally, hopefully, improve the effectiveness of health

care.

And our ability to work across this entire spectrum really rests, in large part, on our

ability to use computational approaches to generate the kinds of -- and analyze the kinds

of data that are generated in whole genome sequences studies and genome-wide association

studies, clinical sequencing efforts, pharmacogenomic studies, the whole range of genomic science

that you're going to hear about throughout this course. So given this inextricable relationship

between genomics and bioinformatics, my job, again, during the two weeks I'm going to be

lecturing, is to give you as gentle of an introduction as I can to the major tools that

are used in bioinformatics. You're going to see these tools and the concepts that underly

these tools over and over and over again in the next 11 weeks.

So it's really important to have a basic understanding of the theory that underline these techniques

and to start to take them away from the realm of being the black box. It's very easy to

go to a website, you see a box, you stick your sequence in it, you hit a button, you

get some results and you just assume that those results are correct. And many times

-- in fact, in the vast majority of those times, those results maybe are not biologically

significant. So that's one of the major themes that we're going to hit through all of these

bioinformatics lectures. So we have a lot of ground to cover today. We're probably going

to go right up against the 11:00 mark, so -- but as we start, I hope that you find all

this background information useful. And for those of you who are actively doing these

kinds of studies or using these kinds of tools, at whatever level, you'll hopefully find something

that's of practical use in your own studies.

So what we're going to do today is we're going to start off defining some terms. We're going

to talk about the importance of alignments as one of the most powerful approaches that

we have in the study of sequence data. We're then going to talk about how we actually evaluate

how good alignments are and whether we can infer biological conclusions based on those

alignments, and then we're going to spend the vast majority of our time talking about

two big tools that are used in biological discovery, sequence-based discovery, BLAST

and BLAT. And again, we're going to use this an a foundation for my next lecture with you,

two weeks from now, that will now build upon these approaches.

Okay, so why do sequence alignments -- that's what we're going to be talking about for the

next 90 minutes -- and what these sequence alignments allow you to do is figure out some

measure of relatedness between two protein sequences or two nucleotide sequences. And

when you go through the effort of trying to align two sequences with one another, you

can start to make some interesting biological inferences, whether they have to do with whether

two sequences have a structural relationship to one another, whether their functions are

similar, or whether they have a certain evolutionary relationship with one another. As we start

to talk about these things, it's important to use the right word. So we have to cover

a little bit in the definition department. When we align two sequences to one another,

so we have sequence A and sequence B, the metric that we can use to assess -- the easiest

metric we have to use to assess how good that alignment is, is to simply count how many

positions in that alignment match between sequence A and sequence B. So that observation,

that observable is similarity.

So it's -- the term similarity is usually expressed as a percent identity. You can use

that to quantify changes that occur as two sequences diverge from one another by looking

at substitutions at individual positions, insertions or deletions, and you can also

use them to identify residues that are crucial for maintaining a particular protein's structure

or function. And this is very important when you do a multiple sequence alignment -- we'll

talk about how to do those in two weeks' time -- because you can really pick out, by looking

at these multiple sequence alignments, what's important from a structural standpoint, what's

important from a functional standpoint. What residues do you have to preserve to have the

right alpha helices or beta strands, or to have the right binding pockets or so on.

Now when we see high degrees of similarity between sequence A and sequence B, it might

imply to you either some sort of common evolutionary history, that they have a common ancestor,

or some possible commonality of biological function. So you may have an unknown, a new

protein or a new gene that you've sequenced, you have no idea what it actually does in

the cell, but if you find that it is similar, by virtue of one of these sequence similarity

methods, to something where the function is known or the structure is known, you can infer

a relatedness; you can start to assign function to that newly identified sequence that you

have.

Okay, so this term is pretty easy to understand. The second one, though, is often misused,

and that is the term homology, okay? So genes are or are not homologous. This is never measured

in degrees so you cannot say that something is 15 percent homologous or 50 percent homologous

or 100 percent homologous with something else. So the similarity takes the number, the percentage;

the homology part implies the evolutionary relationship. So to put a little bit of a

finer point on this, a quote from Walter Fitch: "It's worth repeating here that homology,

like pregnancy, is indivisible." You either are homologous -- pregnant -- or you are not.

Okay? And it goes on to say that if 80 percent of the character states, in this case, the

aligned residues, are identical, one should speak of 80 percent identity and not 80 percent

homology. So again, similarity is the metric, homology is the conclusion. Okay? Is that

clear? Good? All right.

So just to make things a little bit more complicated, there are two types of homology, so let's

define both of those. So the term may apply either to two genes that have been separated

by an event of speciation -- those are called orthologs -- or between genes that have been

separated by a genetic duplication event and these are called paralogs. So a little bit

more detail. When we talk about orthologs, these are sequences that have come from a

common ancestor. And because they've come from a common ancestor, they most likely have

similar domain structure, similar three-dimensional structure, and share a common biological function.

Okay? When we talk about paralogs, they're related, again, through a gene duplication

event. And these are kind of interesting from an evolutionary standpoint because this gives

you some insight as to how mother nature has tinkered round with the armamentarium of genes

that are available, to start to use genes in different contexts. So the whole idea of

evolutionary innovation, taking a particular gene that's already being used for one function,

changing it a little bit and co-opting it into either another pathway or into another

function.

Okay, I think this makes more sense when we look at a diagram. So let's say we have this

gene, alpha, here, and we're going to call this the most recent common ancestor of three

genes that are up here: gene 1, gene 2, and gene 3. The relationship between gene 1 and

2, 2 and 3, or 1 and 3 is that all of those are orthologous to one another. Okay? Because

they rose from this common ancestor. All right, now let's say a little further back from A,

we've have a gene duplication event that gave rise to gene alpha and to gene beta. So perhaps

alpha and beta are exact copies of each other, perhaps they're slightly different, but from

that point on, they've had independent lineages arising from them, okay? So beta, in this

case, gave rise to 4, 5, and 6, so these genes, again, because they came from a common ancestor,

4, 5, and 6 are all orthologous to one another. But, when we talk about the relationship of

any of 1, 2, or 3 to any of 4, 5, or 6, so any of the ones in this box to any of the

ones in this box, these are all paralogous to one another because they're related to

each way back here because of this gene duplication event. Okay? Does that make sense? Okay.

So this takes a little while to learn the terms and when I was in graduate school, I

had to post this on the board in front of my desk to finally learn them, but it's important

to know what these terms are, because again, these do imply different kinds of relationships

and unfortunately they're not always used the right way. Okay. So, we're going to put

the definitions behind us and now start to talk about the alignments themselves and how

we look at the quality of those alignments. Okay.

So I want you to imagine, again, two sequences, sequence A and sequence B, and we're going

to try to align those sequences. And if we try to align those sequences across their

entire length, from the N terminal end to the C terminal end of A and B, or, if we're

looking at nucleotide sequences, from the 5'-end to the 3'-ends of both of those, that's

called a global sequence alignment, because you're trying to force an alignment along

the entire length of both of those two sequences. Now this works incredibly well when you have

two sequences of approximately the same length and that are highly similar to one another,

because aligning those things over such a long stretch when they're very similar, it's

an easy proposition. But as you start to see the degree of sequence similarity going down,

doing these global alignments gets harder and harder to do. And when you try to force

these global alignments amongst more dissimilar sequences, you start to miss important biological

relationships. You may miss the alignment of an important protein domain that's responsible

for either structure or function or some other similar feature. So from a biological point

of view, a slightly different approach is better.

So rather than doing global sequence alignments, we do local sequence alignments. So here,

we still start with sequence A and sequence B and try to align them to one another, but

we don't try to force the alignment over the entire length of both of those sequences.

Instead, we look for regions that align well. So the goal here is to find what, in the parlance,

are called paired subsequences, just a glorified term for short regions that align well with

one another. Anything outside the area of those local alignments are excluded, and you

can imagine because we're looking for these local alignment areas that we could have more

than one good alignment for two sequences that we're comparing to one another. So really,

this is, by far, much better for sequences that share some similarity or when you have

sequences of different lengths. And in the business that we're all in, in trying to figure

out what various genes do, how they're related to each other, trying to identify members

of protein families, this is what we're trying to accomplish here. So not necessarily look

for global alignments, but those key regions that are common from protein to protein to

protein or from gene to gene to gene. So this is really the approach of choice for biological

discovery. Okay.

So we're going to try to find these local regions, we're going to align the letters

in sequence A and sequence B the best that we can, but at some point we have to have

some sort of qualitative measure to tell us how good the alignment is. So we talked about

percent similarity before; that's one metric. But there are other metrics that are available

to us as well. And in order to calculate these metrics, we use something called scoring matrices.

So these are empirical weighting schemes that look at the physicochemical properties and

the biological characteristics of nucleotides and of amino acids. So specifically, side

chains, their chemistries, their function, their structure. So when we look at these

scoring matrices, what's being taken into account are factors like, you know, looking

at the fact that cystines and prolines are always important for a structure in function.

You need those cystines in order to make disulphide cross-bridges. Prolines often break alpha

helices, so you'll see either a proline or a proline-proline at the end of an alpha helix.

Things like a tryptophan; big, bulky side chain can only be accommodated in certain

places in proteins. Things like lysines and arginines, they all have -- they both have

a plus charge on them. Things that you would want to conserve when you're comparing two

proteins to one another. Okay.

So the game here, what we're trying to take into account and trying to capture through

these scoring matrices, involve conservation. So which residues can substitute for one another

and not affect the function of the protein? Okay. So things like isoleucines and valines

can substitute freely for one another because they're both small, they're both hydrophobic.

When we look are serines and threonines, again, we often see those substituting for one another

because they're both polar. So we want to try to conserve, in this game, the charge,

the size, the hydrophobicity, and other physicochemical factors. The other thing we take into account

is the frequency of these individual amino acids, asking the question, "How often do

those amino acids occur in -- amongst the entire constellation of proteins?" Things

like glycines, all over the place. Things like tryptophans, the most rare of the amino

acids. So we want to account for the frequency of those amino acids as we start to score

these alignments. Okay.

So why do you care about all this? Why is this actually important to you? When you do

a BLAST search or a BLAT search or any sort of analysis that involves comparing one sequence

to another sequence or one sequence to a set of sequences, these scoring matrices are all

being used in the background. So again, we're talking about the black box here. So when

you do those BLAST searches, these scoring matrices are being used. They're always part

of a sequence analysis. The selection of them is important because each one of them -- there's

not just one, there's a whole slew of these scoring matrices -- and each one of them implicitly

represents a particular evolutionary pattern. So the choice of the matrix really can strongly

influence the outcome of your analysis, the results you get back when you do that BLAST

search. Okay? And for a lot of things that you might want to do, using the default values

on NCBI's BLAST website may not be the right choice. So we're going to spend a good amount

of time talking about these individual scoring matrices, and more importantly, some guidelines

about which ones to choose. Okay, how are we doing so far? Good? All right.

So let's do the easy part first. If we compare two nucleotide sequences to each other, we

have here just a simple match/mismatch scheme. So we've aligned nucleotide sequence A with

nucleotide sequence B and we're just going to look for exact matches. Every time we see

an exact match, we give that particular position two points. Every time we see a mismatch,

we subtract three points. So you want to go, obviously, in favor of the matches. So for

example, if we look at this, this is the actual scoring matrix. If we have a particular position

where a C has aligned with a C, you just look for where these intersect and there's the

two points for the match. And you'll notice on the diagonal, you will always see the value,

two. And then off the diagonal, any place there's a mismatch, you see the minus three.

So very simple, okay? Now this assumes that each of the nucleotides occurs 25 percent

of the time, so here's where the frequency business comes into play. In the nucleotide

world, this is easy, the side chains are very chemically similar to one another, you can

have a very simple scoring scheme. But it looks a little bit different when you get

to the protein side of the world. Okay.

So this unreadable slide, I know, is what is called the BLOSUM62 matrix. This is the

default matrix when you do a BLAST search on the NCBI website. So how do things work

here? So as before, instead of having the nucleotides across the top and down the side,

we've got each one of the amino acids across the top and down the side. The exact matches

are represented on the diagonal -- and it's probably easier to look at the screen than

look at your handouts right now -- and you'll see that the values on the diagonal, unlike

the nucleotide case, are not all the same. So when things are infrequent, they get higher

scores. And in fact, if we look at where -- if we have a particular position in our amino

acid alignment, in our protein alignment, where our tryptophan has aligned with a tryptophan

and we see where that intersects, that particular position would be scored with 11 points. It's

the highest number on the diagonal because the tryptophan is the least common of the

amino acids when you look at this entire constellation of proteins. Okay.

Now, of course, we don't want to just look for exact matches. We want to look for conservative

substitutions as well because we might learn something from proteins that are somewhat

similar to the one that we're looking for. We don't want to just look for the ones that

are exactly the same. So starting with this tryptophan again, let's say we have a particular

position where instead of having the tryptophan, the tryptophan has been substituted by a tyrosine.

So we've got the W up here, we've got the Y over here for tyrosine, and if we now see

where that intersects, two points, okay? Still a positive value to represent that it's a

conservative substitution, that the tyrosine can substitute for the tryptophan, but not

as high as if you would have had the exact match. Okay?

Finally, let's say you just totally mess things up, you've got a position where two things

that should not be aligned are now aligned with each other. So let's say you've got a

cystine that has aligned with a glutamine, in this particular case, minus three. You're

subtracting points because those two should not be aligned with one another. They don't

have the kind of relationship that we're looking for. They can't serve in the same role, functionally

or structurally.

So, simply put, you get the most points for an exact match. We score higher for rare residues

than we do for common residues. Okay? If you have a conservative substitution, you still

earn points, just not as many as if you would have had an exact match. And finally, any

time you have a mismatch, you subtract points. Does that makes sense? Good. All right.

Okay, so this is -- this is what the basic structure of these matrices look like. Now

I've already alluded to the fact that there are more than one of these and this is one

of a family of matrices called the BLOSUM matrices. These were developed by Steve and

Jorja Henikoff at the Hutch, back in 1992. BLOSUM stands for blocks substitution matrix.

And what they did was they took all of the proteins that were known at the time, did

multiple sequence alignments on those protein families to look for the natural patterns

of what could substitute for what, and which amino acids were more frequent than other

amino acids. So this is based on empirical evidence from the sequence databases, so they're

directly calculated. Because they come from real data, they are very sensitive to detecting

structural or functional substitution. So really these matrices have become the benchmark

for whatever we want to assess the quality of any pair-wise sequence alignment.

Many years previous, there was another series -- still is -- called PAM, point associated

mutations, that was developed by Margaret Dayhoff back in the early '60s. She was one

of the founders of the field of bioinformatics. But as you can imagine, in the 1960s, we didn't

really have very many sequences of any type. Most of what we had were globular proteins;

they were very similar to one another. You'll often see this PAM offered to you as a choice

in phylogenetic programs, but just because of the point in time where they were made

and the kinds of proteins they were based upon, they're really not the best choice for

most analyses. So whenever you have a choice between the PAM matrices and the BLOSUM matrices,

pick the BLOSUM matrices. And there'll be some more information on that in your readings

later. Since I hear a phone going off, if you haven't already, could you please put

your phones on stun and that would be great. Thank you.

Okay, so how do you now select which one of these BLOSUM matrices to use? When we say

BLOSUM, it's always followed by a number. The example that I showed you before was BLOSUM62.

What that "n" represents is based on how it was calculated. So the N just says, all right,

it was calculated from sequences sharing no more than N percent identity. So in the case

of a BLOSUM62, the sequences that were used were no more than 62 percent similar to any

of the other sequences in this set that were used to calculate the matrix. I'll show that

example in a minute. More importantly, what this does is it helps to remove bias from

the set. So the contribution of sequences that are higher than the number are clustered

and weighted to one. Okay? When we do whole genome sequencing, the whole history of how

proteins have been sequenced and genes were found in individual laboratories -- these

are not uniformly distributed amongst proteins. They're not uniformly distributed amongst

tact, so we have overrepresentation of certain sequences in the public sequence databases.

So we need to normalize all of that out, remove the bias from the dataset and take those closely

related sequences out so that we don't miss something that might actually be biologically

relevant if we hadn't done that. Okay.

So to hopefully make this make a little bit more sense, here is a multiple sequence alignment.

There's six sequences in the set. You'll notice that there are some absolutely conserved positions,

the ones that are marked with the star and that are in the lighter blue. And if I were

to now say, "Okay, I want you to cluster this alignment into blocks where we're going to

apply an 80 percent cut-off." So I want you to just group this together, all of the sequences

that are 80 percent or more similar to another sequence in the set. And when you do that

-- all right, so this is not similar to any of the others at this percent relationship.

These three are 80 percent or more similar to one another. And finally, these two are

80 percent or more similar to one another. Once this is done, this is counted as one

sequence, these are counted as one sequence, and these are counted as one sequence. So

we've started with a set of six, we've collapsed it to a set of three. That's now going to

be used as the basis for calculating the matrix, okay? So again, this is how the bias is removed

from the set. Okay? Hopefully that made sense. Okay.

All right, so, implications. Okay. Again, when we do this clustering, reduces the contribution

of closely related sequences, we have less bias towards substitutions that are in really

closely related members of a family. At the end of the day, the take home message, reducing

N yields more distantly related sequences. So your first cheat sheet of the day, put

a big star next to it, circle the slide. Okay.

Which one of these do you actually choose, after all of that? You know how they're constructed

now, you know what the -- what the underlying implications are. Which ones do you pick?

Okay. The one in yellow is the default again, and this was determined in studies where the

answers were known in advance, going backwards, to try to figure out which ones should be

the default on the NCBI website. That again is BLOSUM62: it's the most effective one for

finding all potential similarities between two sequences and generally works best when

you're in this kind of percent similarity range, in that 30 to 40 percent range. As

you go further up, you do -- you'll do better finding more closely related sequences, things

that are almost exactly the same. So by the time you get up to BLOSUM90, you're going

to get shorter alignments but they're going to be highly similar in the 70 to 90 percent

similarity range. Going the other way, when you go down, you start to get longer and weaker

local alignments, things that might be less than 30 percent similar to one another, but

this is really where a lot of the biological action is. Okay? You can look by eye and see

two sequences are similar to one another once you have them aligned. Once you get down to

the bottom is where you start making these interesting discoveries, finding out that

two proteins are related to one another, even though there may be very little sequence similarity

between the two of them. A great example of this, homeodomain proteins. I know everybody's

familiar with the homeobox genes. When you look at the protein complement, you're looking

at a 60 amino acid stretch, six or seven of those positions are absolutely conserved over

evolutionary time, and the rest of it, it's kind of willy-nilly. So, but if you were to

just do a routine search and possibly pick the wrong matrix, you'd never find all the

members of the family. Okay? So this is why the choice of these things is important.

So the strategy that I advocate here is one that was originally put forth by Steve Altschul,

at NCBI, the guy who was behind developing BLAST back in 1990. And instead of doing one

BLAST search, I want you to do three BLAST searches, okay? Use the default, pick one

of the more -- higher number one's to give you -- to look for closer similarities, but

also pick one of the lower ones to look for the more distant similarities. So when you

go back to lab, take your sequence of interest, run it using three different matrices. The

vast majority of the results will be the same, admittedly, but you will start to see subtle

differences that you wouldn't have found if you just did one search. So I think it's a

good approach to start to find those relationships that you wouldn't normally otherwise find.

Of course, if you know what you're doing in advance, if you're looking for things that

are really close or really distance, you can just pick one and go from there. But if you're

just really trying to cover the field, blanket all your possibilities, take the default,

take a higher one, take a lower one, as well. Okay?

So really, what I'm trying to get across here is no single matrix is the complete answer

for all sequence comparisons. You need to make this intelligent choice and not again

use that NCBI website as a black box. It's a great resource, but there's a lot of things

you can change along the way to help refine what you get. Okay? Any questions? Okay, great.

All right, so for those of you who would like a little bit more background, there is a unit

in Current Protocols in Bioinformatics, Unit 3.5, that talks more about the PAM matrices

that I really went through very quickly, a lot more on the BLOSUM matrices, and some

more information about specialized scoring matrices. So we're looking at, in this case,

things like transmembrane proteins places, where the usual rules don't apply because

of where they are found in the cell, because they're embedded in membranes and so on. So

the frequencies and substitution patterns are often very different because of those

either compartmentalized or embedded proteins. Okay. All right.

So one last thing we need to talk about before we actually get into BLAST, and those are

gaps. So again, I want you to imagine sequence A and sequence B. You've just aligned them

and there might be some place in that alignment where it would be fortuitous to improve the

alignment to insert a gap. Okay? Now I want you to remember that when you put those gaps

in, it's not just for sake of making a nicer alignment, it's not a word processing exercise,

okay? Each time you put a gap in, it actually represents a biological event. So every gap

represents either an insertion or a deletion. Okay? So because of that, you can't just put

them in willy-nilly again -- I've used willy-nilly twice now today -- okay, you can't just put

them any place you want. You have to keep in mind that they represent biological events

and keep them to some sort of plausible number. A good rule of thumb is about one in every

20 residues, but the fewer number of gaps, the better. Now if you think about your position

by position alignment again and you think about your scoring matrices -- so you've got

an amino acid lined up with another amino acid. You can go to your BLOSUM matrices,

see where the values intersect and assess a number for that particular position. But

when you have a gap, okay, you have a residue and then nothing. Okay? We didn't see gaps

in the scoring matrices before, so we can't simply score those as a match or a mismatch,

as we did before. So instead, there are gap penalties that are assigned.

With apologies, an equation. And basically all this equation is saying is that any time

you insert a gap, you take a penalty, and you also take an additional penalty for the

length of the gap. So one for the gap and one for the length of the gap. So the equation

is just G plus L times n. So G is the penalty for opening the gap. You see the values here

for both nucleotide and protein sequence comparisons. L is the gap extension penalty, this value

right here, so it's two for nucleotides, one for proteins. N is just the length of the

gap. So let's say I have a protein alignment. It is of length 2, okay? So to assess how

many points I'm going to subtract for that, okay, so G for a protein is 11, L times n.

So here's L, the gap extension penalty is 1, the length is 2. So 1 times 2 is 2, 11

plus 2 makes 13. So you subtract 13 points for the gap, okay? If you'll remember, the

tryptophan to tryptophan was 11, so you get plus 11. So to give you some measure of how

these compare to exact matches -- so when you start opening big gaps, you start plummeting

as far as the quality of your alignment goes and the scores of that alignment actually

go. Okay? All right.

So with that, it's time to talk about BLAST. Okay. So how many of you have used it before?

How many of you have used BLAST? The vast majority of people in the room. Great. How

many of you actually know how it works? Okay, a few. [laughs] And from the people who are

left, how many people have actually ever changed the default values on the NCBI website? One,

two, a few. Okay, good. All right, so -- maybe not. So all will now be revealed. We're going

to go through this in gross detail and hopefully you'll all be your lab's experts at this at

the end of the day, and again, use this tool to its -- to much better advantage. Okay.

All right, so BLAST is the basic local alignment search tool, B-L-A-S-T. By far, the work horse

in bioinformatics, the most important tool that we use in our field, because the whole

act of doing these sequence alignments, whether it's one by one or one to a database, really

help us discover these previously unknown relationships. So we're going to, again, hit

this pretty hard today to make sure you know how to use it properly. Part of the reason

for that is not just making sure you know how to use it right, but there are many, unfortunately,

cases in the literature where biological conclusions have been made that are incorrect based on

people's either misinterpretation of BLAST results, they didn't either know how to use

the program or they didn't know how to interpret the results. So -- and that's a pitfall that

I don't want you all to fall into. So one of the things to keep in the back of our heads

as we go through this or any of the techniques is that just because a result appears on a

screen or just because it's printed out on a piece of paper does not automatically mean

it's biologically significant. Okay? And we're going to talk about that a little bit more

as we go on, but just keep that in mind that that does not automatically mean it's a great

result and that you should run with it. Okay.

So when we do this, what we're trying to do -- this is a local alignment technique, so

we're looking for these paired subsequences. In this case, they're called high scoring

segment pairs, HSPs. So a pair of sequences, it can be aligned well with one another, we

are going to try to align them to make them as long as possible. They have to be above

a certain score threshold -- we'll talk about that in a minute -- and they can either be

gapped or ungapped. Because we're -- again, we're looking for local alignments, we can

have more than one good alignment between sequence A and sequence B. All right.

There are various flavors of BLAST that are shown on this slide. The ones that are used

much more frequently are BLASTN -- the N at the end just stands for nucleotide, so we

start with a nucleotide sequence and we search that against a database of nucleotide sequences.

BLASTP -- P is for protein, so we start with a protein query looking against a protein

database. BLASTX is a little bit funky; we start with a nucleotide sequence, we translate

it in six frames and then we take that translation and compare it to the protein databases. Okay?

The two at the bottom are shown for sake of completion; they're not used very often. By

far, you'll be using the three at the top most often and we're going to concentrate

on BLASTP for most of our examples today.

All right, so before we actually pretend to do a search, how does the alignment actually

take place? So this is a query sequence, just an amino acid sequence, and we're going to

use this as the basis of comparison to a public database. And what I've done in the middle

is I've highlighted three residues: the P, the Q, and the G. Okay, so this is what is

called the query word. By default, BLAST will nucleate a search with a three amino acid

stretch. Now of course it starts at the end so it'll start with the GSQ and then go to

the next three SQ and move across, but for the sake of symmetry on the slide, we're going

to just start with the P, the Q, and the G. So what BLAST is now going to do is look in

the public database for any other protein sequence that contains, in that order, the

P, the Q, and the G.

But we also want to find places where there are conservative substitutions, places where

there might be a little wiggle in the sequence but might produce an important, biologically

relevant match, places where there are conservative substitutions. So certainly, we're going to

look for things for the PQG. What I've done for this example is just changed the middle

letter here. So looking for things like the PEG, the PRG, and so on. Each one of these,

you see, has a score next to it. You now know how to calculate this score, so if you were

to take the BLOSUM62 matrices you have in your handouts and just looked at P for P,

Q for Q, G for G, see where they intersect, you'd get 7 and 5 and 6 for a title of 18,

the best match because it's the exact match. The next one down, we have a conservative

substitution in the middle position. We see that the score goes down a little bit and

the scores keep going down as we go through this.

At some point, BLAST is going to say, well, we've gone kind of far enough, you know, we're

now starting to go into territory where the sequences are starting to diverge too much.

That's something called the neighborhood score threshold. In this case, it's 13 points. This

changes on the fly from search to search and depending on where you are along the length

of the sequence. So you don't have to worry about changing that value. You can, but this

is something I don't think I've ever personally changed in a search. Okay.

All right, so we have now established these other three-letter words that are in the neighborhood

and we're going to use this as the basis for continuing into the next step of the BLAST

search. So we're going to look for everything that has a PQ, GPE, GPRG, and so on. All right,

so this is from the previous slide. Here's our PQG again, and in this particular case,

we've aligned with a sequence that instead has a PMG. So it's not an exact match, but

if we look up here, PMG is in the neighborhood, so that's okay. And then the next thing BLAST

will do is, starting from this position, this query word, three across, it's going to go

out in both directions as far as it can, aligning as many of these of these amino acid positions

as it can. Okay, so we're going to go out to the N terminus, we're going to go out to

the C terminus. Okay. And by looking at this, before we talk about the numbers, you can

also do a qualitative assessment of how good the alignment is. In this middle row, any

time you see an exact match, you see that the letter has repeated itself; any place

you see a conservative substitution, you see a plus sign. So just by eye, you can see how

good the alignment actually is and you want as much of that middle filled in as possible.

Okay.

So, now, time to bring the numbers into play. I told you we start with that nucleated PQG,

go out in both directions -- well, how far is too far? Okay, so here's our alignment

again and here is a graph. So the extension on the X axis is just how many positions have

we aligned, how long is the alignment? The cumulative score, which comes from our BLOSUM

matrix, is on the Y axis. Back here, at the first position, this is this part of the alignment,

the PQG with the PMG, the 13 points that we had from the previous slide, above our score

threshold. Okay? So as we start to go in both directions, we're going to assume that we're

going to have more matches than mismatches. So remember, when we match, we're adding points,

when we are mismatching, we're subtracting points, but we're going to assume that we're

going in favor of the exact matches. So the score, as we start to extend, goes up.

At some point, you're going to hit this score threshold S. Okay? This is also set on the

fly, search to search. If you break S, this alignment is returned to you in your BLAST

results. It does not, again, mean that it is biologically significant. It just means

it's going to return the result to you. All right? So, one thing to remember. So we're

going to break this score threshold, it's going to now appear in our BLAST results and

we're going to keep aligning merrily away. But at some point, we're going to go too far.

The mismatches are going to start to outweigh the matches, so we're going to start subtracting

points at a faster rate than we're adding them. We might start having some gap penalties,

again, subtracting points. And at some point, we're going to hit something called a significance

delay. When we go off of a peak like this, a certain amount down, BLAST is going to say,

okay, you've gone a little bit too far. We're going to go back up to the peak, so whatever

that length alignment corresponds to is the length of this alignment at the peak, that

is now your high-scoring segment pair, the maximal length alignment in that region between

sequence A and sequence B. Got it? Okay. All right.

Now I've just made a big deal about the scores. Okay, we need the scores to assess how good

a match is position by position by position. But we don't actually use the scores to assess

biological significance. What we're going to use instead is the result of this equation.

Okay? Don't even worry about what this equation is, okay? This is something called a Karlin-Altschul

equation. What the score allows us to do -- there's the score -- based on the length of the database

we're looking at, the length of the sequences we're looking at, composition, all of that,

it allows us to calculate a normalized probability, and that's the value you're going to look

at. Okay?

So why I am making a big deal out of this? If you now have some understanding of what

those scoring matrices look like, I want you to think of two alignments. So we have maybe

sequence A compared to sequence B and that might be of a certain length, but it's all

composed of common residues. You may have another alignment of sequence C and sequence

D that is of the same length as the first alignment, but it's all composed of rare residues.

Okay? You already know that rare residues score higher than common residues. So you

could end up with a situation where you have two alignments of the same length with the

same number of positions matching up, but because of taking this rare versus common

thing into account, the score of one is going to be vastly higher than the score of the

other one. So you need some way to compare, from alignment to alignment to alignment,

some normalized measure that allows you to say, "Okay, this one is good and this one

is not so good." So that's why the score itself is not a good measure, but it is the key for

us to get to the probability value. I hope that made sense.

Okay, so, at the end of the day, you're going to forget about all of that. This is what's

important. What that E value represents to you is the number of high-scoring segment

pairs that come from comparing sequence 1 to sequence 2 that were found purely by chance:

the number of false positives. Okay? So because this represents the number of false positives,

you want E to be as low as possible. Okay? So lower values of E signify higher similarity.

And again, because this is a normalized value, you can use this as a metric to compare any

two sequences -- excuse me, any two alignments to each other. You don't have to worry about

lengths and other vagaries. Okay.

So rule number 1: okay, if you are doing a nucleotide-based search, you've started with

a nucleotide sequence, you searched against GenBank, the rule I want you to apply here

is you're going to look for values of E that are 10 to the minus six or less. If you are

doing a protein-based search -- so this is BLASTN and BLASTX. Remember, BLASTX, even

though you're starting with a nucleotide sequence, the comparison's at the protein level, you're

looking for values of 10 to the minus 3rd, .001 or smaller. Okay? So that's rule number

one, and we're going to come back to the rules in a moment. All right. Questions? Good. All

right.

Okay, so let's go through an example, finally, after all of that background. And again, we're

going to consider a BLASTP example. We're going to pretend we're at the computer. The

screen captures are all in your handout, so everything I show you, you have, and I would

really recommend that you go through these exercises when you get back to your labs or

to your offices. It's one thing to watch me explain this to you, it's another thing for

your hands to get onto the keyboards and actually do them yourselves. That's the only way you're

going to learn how to do this stuff. Okay.

So here's the NCBI website at ncbi.nlm.nih.gov. Let me orient you to the page a little bit.

There are some common features, some popular resources that are shown in the right sidebar

here. And the very first one on the list is BLAST, so we're just going to pretend that

we've clicked on that. That will take us to the BLAST page, or you could just go directly

to this URL. There are three sections on this page. The first one allows you to search assembled

genomes directly. So if you know that you only want to search against a particular organism,

you can pick one of these off the list and there's actually a link that will take you

to all of the ones that are available to you. At the bottom, there's some specialized BLAST

stuff, we'll come back to that a little later. But what we're going to concentrate on now

is this middle block, basic BLAST, and it says nucleotide BLAST, protein BLAST, and

so on. So we're going to do a BLASTP search, we're going to click on that protein BLAST

link. Before we can actually do the search, we need a sequence. Okay? So again, for you

to reinforce these concepts when you get back to your offices or to your labs, there is

a website at this URL that has all of the sequences that I'm using as the examples for

these lectures. So please go to this website, copy them, go through step by step exactly

what we're doing here. All right.

So this is now the BLAST query page. So at the very top we have a sequence box, so from

that processing page, I just took the first sequence, pasted it into the box. If I just

leave that the way it is, it will search the entire length of that sequence, from position

1 to however long it is, but let's say I only want to use part of the sequence. I can tell

BLAST, all right, only use a subrange: start at position 10 and only go to position 50.

If you have a reason for picking a particular domain, you can do that rather than trimming

the sequence manually. Okay. In the next block down we have some choices to make regarding

what we want to search. We know now what our query is; what's going to be the subject,

the target of our search? By default, it says "non-redundant protein sequences" or "nr,"

so this is what people colloquially call GenBank. Okay? All of the sequences that are publicly

available through NCBI, so the non-redundant database. There are a couple of other interesting

ones, though, that I want to talk to you about. The first one is called RefSeq and the second

we're going to talk about in a couple weeks called Swiss-Prot. But these are specialized

databases because they are highly curated.

So let me tell you a little bit about RefSeq. So RefSeq stands for "reference sequence."

This is NCBI's reference sequence project and the goal of this particular project is

to represent each molecule in the central dogma, each DNA sequence, each protein sequence,

or each mRNA sequence, once and only once. So those of you who have done sequence-based

searches before may have come across the situation where you put in a gene name, DCC, and you

might get one one, five, 10, 20 entries back, and you sort of quizzically look at the page

and you wonder, well, which one is the right one? Which is the canonically accepted sequence

for that particular gene? So that's the whole point of this effort is that you don't have

to worry about trying to figure out which one is "the right one." The curators at NCBI

have already done this for you and identified which sequence is the accepted, correct sequence

for a particular gene, mRNA, or protein. Because of this curatorial effort, by definition,

this is a non-redundant database. Each sequence is represented once and only once. And the

cool thing is that there are curators, I believe somewhere below ground in Building 38A -- I'm

not sure how many subbasements down -- who read the literature and make sure that all

of the things in the entry are kept up to date. Okay.

Now, how do you know when you're looking at a RefSeq entry? So I want to introduce the

concept of the accesion number, and you're going to hear that term many times throughout

the course. So the accesion number is just a unique identifier, that when you use that

as the basis for your search, will return one and only one sequence. So in essence,

it is that sequence's Social Security number. Okay? They take various formats; they have

codes. And the way you can tell which ones are in the RefSeq series is by looking at

the first couple of letters. So the first set I have up here says from curation of GenBank

entries. They all start with an N, so the ones that are NT underscore -- and then six

digits -- these are genomic contigs. They are the sequences that come off of the sequencing

machines n the kinds of large-scale whole genome efforts that Eric Green spoke about

last week, just cranking all of this genome data out. By the same token, the NMs are the

mRNAs; NP, the P stands for proteins; and there's also entries for the non-coding transcripts,

which are shown by NRs. So the thing that all of these have in common is that these

were directly experimentally verified. Somebody had these molecules in their hands at some

point and directly sequenced them.

Now you can imagine a situation where, because of these large-scale genome sequencing efforts

we do, the sequences are being cranked out at just amazing, breakneck speeds. Sometimes

what we do is take these genomic contigs, the data that comes off these sequencing machines,

and we apply computational techniques to try to predict where we think the genes are, where

the introns are, where the exons are, genomic structure, all of that. So what we can start

with one of these genomic contigs and then if one of these computational techniques has

been applied, we would then have a series of model mRNAs, the X series here, so XM for

model mRNAs, and then based on those predictions, in turn, we could predict where the model

proteins -- what the model proteins are, and these are the XPs. So the important takeaway

here is when you do a search against RefSeq, you always want to go in favor of the Ns,

because again, these have all been experimentally verified. If these are not available, take

a look at the Xs, but keep in mind, again, that these are computational predictions;

they have not been verified. The computational predictions are pretty damn good, okay? But

they are not fool-proof. So you can look at those to get an idea of, "Well, I think I

can make some sort of conclusion," but you're going to need some other sort of verification

to support that. So again, the Ns are experimental, the X are predictions. Clear? Okay, very important.

Okay, so we can pick -- we're going back to our BLAST page -- we can pick a page -- excuse

me, a target database that we want to use. Let's say we want to limit our search to particular

organisms or particular types of organisms. We could just put terms in this box here,

so you can say "human" or "bilaterian" or whatever you'd like to limit the results that

you get back. Okay. In this section here, it says "program selection," BLASTP is already

selected for us. You'll see that there are two other choices here -- PSI-BLAST and PHI-BLAST

-- we will talk about those in two weeks' time. Now most people at this point will just

click the BLAST button and go off on their merry way, but if you'll notice underneath,

it says "algorithm parameters." So we're going to pretend that we've now clicked on that.

It will expand the screen, there's more below. And this is where we now get to fine tune

our searches. So all of the stuff that I talked to you about in the first 45 minutes comes

into play here. Okay.

The first thing it asks you, "How many sequences do you want returned to you?" The default

is 100, okay? I've, for the purposes of the example, changed this to 250. I would rather

get too many results back than too few and make the decision myself about what is relevant

and what is not relevant. Okay? And we're going to see in this example that this is

critical. Okay, expect threshold. What E value cut-off do you want to apply? So I've already

given you a rule about the E value -- it's 10 to the minus 3rd or less when it comes

to protein sequence comparisons. The default is 10. For purposes of the example, we're

going to leave it at 10. Okay? But you can change that there if you want. Right below,

you can see the word size is 3, our P, Q, and G. Okay. Going a little further down,

here's now where you can pick the matrices. So remember, I'm advocating you using multiple

matrices; here's where you make that selection. Above that, before we get to the next one,

you can see gap costs-- remember I was telling you about the affine gap penalty -- if you

want to make the gaps more permissive or less permissive, you can change the default values

here. So by default, it's 11 for the -- opening the gap and 1 for the extension. If you want

to make it more permissive, you'd push those values down. If you want to make it less permissive,

you'd push the values up. Okay.

Filters and masking. So always, you should check this particular box off, to filter for

things called low-complexity regions. So what is a low-complexity region? So these are regions

of biased composition. What you'll sometimes see in sequences are just long, homopolymeric

runs, short-period repeats, or just subtle over-representation of the same residues.

So here's an example where we've got a homopolymeric, just run of As and Qs, an alanine/glutamine

track here. And unfortunately, BLAST really relies on having somewhat of a uniform distribution

of residues. So when it sees something like that, it sometimes doesn't know what to do

with it, because you've got all of this, in this case, all these As and Qs. If it's got

another sequence with an equal run like that, it doesn't quite know what register to put

them in. Okay? So it's best to actually remove those, mask those from the sequence to not

have the results get confounded. This usually -- just biologically -- this usually comes

from either DNA replication errors or you might have some unequal crossing over, still

a question mark on that one. But as far as these analyses go, it's important to mask

them because again, it often leads to false positives. So you just get in the habit of

checking that box off when you do these searches. Okay.

Finally, we've gotten to the bottom of the page. It tells us we can go ahead and search.

There's a little check box here that I always check off, that says "Show results in a new

window." Okay? So when we finally click that last BLAST button, by checking that box, your

results will, as it says, appear in a new window. The reason I like doing that is, if

you don't do that, everything's going to be in the same window and if you want to back

up, it's harder to do that. So just from a practical standpoint, you've got your results

in a different window, if you want to now go back and make an adjustment, you still

have your original window without having to put the sequence back in, pick all the same

parameters again. It's just a nicer way of doing it.

Okay, so finally, time to issue the query, and this is what you get back. Okay. Now you'll

recall the default value for the number of results to return from the target database

was 100. If you look at the screen, it says "Distribution of 204 BLAST hits on the query

sequence," so we would have not seen the vast majority of our results, which, in this case,

most of them are actually biologically significant, as you'll see in a moment. Above this box,

you'll see that BLAST has identified some important protein domains -- and again, we'll

come back to this in a couple of weeks.

So let's concentrate on this box here. So this gives you sort of a general overview

of your BLAST results. At the very top, you have a color key that shows you just sort

of a heat map of as the scores get higher, we move from the blacks to the reds. Remember

the score is now what you're looking at; you -- we're going to look at the probability

values, but we'll just work with this for now. The alignments, each one of the hits

that have been found -- so any time I use the word hit, it just means a sequence that

was found in the target database that was deemed similar to the one that we started

with. So the best ones are, by score, are at the top, then going down in descending

score order. If you were to take your mouse and just mouse any place over any of these

bars, there's a window at the very top that would give you the identity of what protein

it is that was hit in this particular analysis. What's next? Okay.

Now, you'll see some of these are connected by these very faint, thin bars. Some of them

are not. Whenever you see two of them connected like this, this means that we have a protein

that was found in the public database where there are multiple high-scoring segment pairs.

Okay? So in the very case at the top, there were two separate alignments for sequence

A and sequence B. Okay? The case where the arrows showing 1, 2, 3, 4. Some places you'll

see this very light vertical bar. And what that is intended to convey to you is that,

all right, in these cases, you know, there's some space between this high-scoring segment

pair and this one. In this case, either the space is too small to show, but it wants to

give you an indication that there is a boundary there, or you have two high-scoring segment

pairs that overlap with one another. Okay? And that happens quite frequently. So in -- let's

find a good one -- well, we'll come back to it later. Okay. Okay, any time you see things

on a line that are not connected, they're unrelated; it's just done to save space. Okay?

So that's sort of the overview. What we're going to get -- I like numbers, so we're going

to see much more useful information when we scroll down a little bit and look at the actual

results. So this is the table that you get back, nicely formatted. The accesion numbers,

the unique identifiers, are in the first column, and if you were to click on any of those,

it would take you into GenBank and you could just see the GenBank entry for that particular

protein. In the description column, you're given a brief one-liner about what was found,

usually just the name of the protein and the organism that it's in. The third column gives

you the score. Remember, the score is not important: the E value here is. And thank

you to NCBI for finally putting them in E value order instead of score order. For many

years, those of you who may have done this before will remember that the results were

always sorted by score instead of E value. You'll notice that there is a small up arrow

here, so the sort is actually the way you want it by default. But if you, for some reason,

want to sort on one of the other values, you can just click on the column header and it

will do that for you.

You have some links in the final column. The cheat sheet is above the table here, so if

you see U, U stands for UniGene, which gives you some information from a gene-centric standpoint

about gene clusters. E is for GEO, for the gene expression omnibus. So if you see a E

next to one of these -- we don't have one here -- that just means that there's gene

expression information available to you as well. The G stands for EntrezGene, a nice

gene-centric view of the world. So if you want o know about -- more about the gene that

encodes for that particular protein, you could go there. A couple of cases here, we have

Ss. S is for structure, so that means that there is a solid, three-dimensional structure,

either by x-ray or NMR for that particular protein on that line. Might not be the whole

protein, might just be a domain, but there is some real structural information available.

M would take you to the NCBI map; you're -- Tyra will cover that very briefly next week. And

finally, if there's some bioassay information available through PubChem, you'll get that

as well. All right. So that's the orientation to the table.

This is the column, again, we're paying attention to. At the very top, you'll notice that the

values are zero. So what zero just means is that it is less than 10 to the minus 1,000.

So basically, there's just not enough room to show the value. And then, you know, so

they'll go out to three digits in the exponent, but not four. So if you see a zero, you're

sitting pretty at that point. Okay. As we go down, the values still look pretty good,

given the criteria that I had given you before and we're going to pretend to -- oh, I'm sorry

-- just so you know how to read these, if you're not used to the notation, if you see

something like 4e minus 97, that just means 4 times 10 to the minus 97. So the part after

the e is the exponent. Okay. All right.

So now we're going to pretend to scroll down and the values, if this is the E value column,

again, they look pretty good, we're going, we're going, we're going, but at some point,

we now are going to apply that .001 criteria that I gave you and draw a cut-off line. Okay?

So just a couple things to reinforce here. We changed how many hits to return. If we

had stopped at 100, we would have only seen something maybe further up here and not seen

any of the ones at the bottom. We left the E value at 10 and I think that's actually

okay, as long as you know what the cut-off values are so you can draw the line yourself.

So in this particular case, the line would go here because this one, 10 to the minus

7, that one's okay, but .006 is just slightly above our cut-off guide. So everything from

there up, we're going to accept for now, and everything below, we're going to reject. We're

going to come back to this point in a minute because you'll see in -- if you look at these

def lines, you see a protein called Prospero above the line, but you also see some below

the line. So we're going to talk about that in a minute. Just keep that in mind.

All right. We're now going to imagine we've scrolled further down. We're now looking at

one of the alignments. And so you'll always see something like this. At the very top,

it tells you what the hit is. So again, here's protein Prospero. In this case, the first

high-scoring segment pair starts at position 17 of the query that's aligned with position

317 in the subject, and you can see how the numbers go across as you go down. The part

to pay particular attention to is right here. So expect -- that's your E value -- again,

this one is zero, that one's good. You see two metrics here, identities and positives.

So identities are the number of exact matches. Okay? Percent identity. In this case, it's

100 percent identical. The positives are the exact matches plus the conservative substitutions.

Okay? In this case, because the identity is 100, the positives have to be 100, but we'll

see those numbers changing momentarily. So here's where guideline number two comes into

play. This is a protein-based search. What I want you to look for is greater than 25

percent identity between the two sequences. If it was a nucleotide search, 70 percent

or more. Okay? You'll see that some of these amino acids are in lower case and grayed out

a bit; that represents the low-complexity regions that we talked about before. So these

have been masked out for purpose of the comparison.

Now we're going to pretend again, we're going to scroll down just a little bit. This is

the bottom of the piece that we just looked at, ending at position 704. And you'll notice

here, now, is another block of statistics. But unlike -- let's just go back one real

fast -- unlike the previous one, we don't see any def line information, we don't see

any protein information being given to us. So what that means is that it's the second

high-scoring segment pair for that same protein. Okay? So our protein query against the same

sequence, different alignment, that was found. As before, we can see, you know, where the

sequences align. We have our statistics, so here, 93 percent identical and again, 93 percent

positives. Here's our first case of a gap, so you can see what that looks like when you

have a gap in the alignment. Okay? And in this particular case, there's actually a third

one in this set as well.

So if we look at this overview again, here's just the two little blocks of scores from

the alignments in the two previous slides. So the very first one we had and the second,

our E values were zero, so that beats our 10 to the minus 3rd cut-off. We look at the

percent identities, we want this to be more than 25. Well, both of those values beat our

cut-off. And if we just look at where these are, here's the first high-scoring segment

pair from 17 to 704 in the query, this bar right here. Here is the second one, going

from this black mark going all the way over, and the third one, which we didn't take a

look at, is right here as well. Okay. Questions? Okay. It's -- I know it's starting to get

a little bit overwhelming at this point.

Okay, so cheat sheet number two, to try to bring some rhyme and reason to all of this.

Okay, so just by way of reminder. If we're looking at nucleotide sequences, we want an

E value better than 10 to the minus 6th, sequence identity 70 percent or better. If we're looking

at protein sequences, we want an E value 10 to the minus 3rd or better, 25 percent sequence

identity or more. Okay? Now, these are guidelines, okay? And I told you to pay attention to what's

above and below the line. I don't want you to use these cut-offs blindly. These are just

guidelines as a starting point, as you get more and more experienced with using BLAST.

Okay? You have to pay attention to what's on either side of the line.

So if you see things that are below the line, that you have biological evidence that supports

that that thing below the line, that protein below the line, is actually related to the

one that you started with, the one that you used as your query, your biological evidence

will always trump the computational prediction and you will accept that as being related.

So whenever you've got biological information, you always use that first. Okay? And then

you use these to either confirm the relationship or again, to discover new relationships. But

don't just say, "Oh, Andy told me I have to draw a line here and so I'm not going to consider

this." Always make sure to dig back into what you know from the literature, what you know

from your own studies, when you're looking at things on either side of the line. There

might be things slightly above it that you want to reject, there might be things slightly

below it that you want to accept. Okay? So that will help you get -- not fall into the

pit, again, if you keep your biology into account. Okay.

Just very quickly, some things to keep in mind that can mess up a BLAST search. So again,

we've already talked about low complexity regions. Repetitive elements also, for the

same reason, because of the repeats, tend to confound the analyses, and I give you some

ways of dealing with those here. Low-quality sequence hits are sort of less of an issue

now that we're cranking out sequences at the pace that we are, but they still exist in

the databases, so you may, occasionally, if you're doing nucleotide searches, get hits

to expressed sequence tags, to EST's, or to single-pass sequence reads. There's a Poisson

distribution that governs how good a single-pass read is, as far as accuracy goes, with respect

to genome sequencing. So when you only sequence once, the accuracy is on the order of something

like, what is it, 63 percent, okay? When you do the second pass, it goes up to 87 and then

so on. So that's why we do genomes at 10 or 12x coverage, to make sure we've got it right.

So just keep that in the back of your mind. If you hit one of these lower-quality reads,

you may want to discount the result. Okay. All right.

So that is how you use BLAST, and hopefully that has helped to take it a little bit more

out of the realm of the black box. There are some variations on B LAST; we're going to

talk about one of them today, called BLAST 2 sequences. So let's say you've sequence

A and sequence B. You don't want to search against the databases; you just want to do

an alignment of A with B. So this is the tool that you would use for that, either at the

protein level or on the nucleotide level. All of the BLAST programs are available to

you. You can, as before, select any of the BLOSUM matrices, any of the PAM matrices;

the scoring works exactly the same. Okay. So we're back to our BLAST homepage at this

URL. We're now going to look at this bottom block here, the specialized BLAST searches,

and just click where it says "Align two sequences using BLAST."

Okay, once you click on that, the screen looks very similar to what we saw before. At the

very top, BLASTP has been highlighted, the BLASTP tab. But what's slightly different

from before, before we had a query sequence for one sequence, because we're now comparing

two pre-identified sequences: sequence A, sequence B. Okay? And as before, if we want

to select a subrange of that sequence, we can. BLASTP is selected by default, it's the

only thing you can do here. Again, as before, we can take a look at the algorithm parameters,

so this will take us further down the page. In the first block, how many sequences do

we want returned. E value, the E value is somewhat irrelevant here, as far as performing

the search, because you've, in advance, identified which two sequences you want to align. But

you do want to look at the result later to get the measure of statistical significance.

You can change the word size and so on. As before, matrices are selected the exact same

way. You've got your low-complexity regions here. You're going to make sure you always

check that box, check the box to get the results in the new window, and then finally, go ahead

and BLAST. Okay.

And the result format is exactly the same as what we've seen before, with minor exception.

We only were comparing one sequence to one other sequence, so there is only one line

here in our overview. The results formatted exactly the same way, so this was the sequence

in the second box, which was used as the target. It tells us the E value is 4e to the minus

24, so we accept that, using our criteria. And if, as before, we look at our percent

identities, the cut-off is 25 percent or more. Here, these two sequences in this region are

46 percent identical to one another, 74 percent similarity. Okay.

So just a very quick way to align two sequences. Here's the first high-scoring segment pair,

here's a very teeny one as well. Okay? This one, we would reject because here, if you

look at the E value, 1.9, it's way above our threshold. Okay. Another way to look at these,

here's just a graphical view of where those two alignments line up. So this is sequence

1, starting at position 1 going to position 178. Here's sequence 2, going from 1 to 134.

This was the statistically significant high-scoring segment pair that we identified going from

95 to 178 along here, going from 51 to 134 going up here. This is the one that we rejected,

but you can see that it overlapped this one as well. So we're going to, again, by basis

of our statistics, just reject this one outright and accept this as the best alignment of the

two sequences. Okay.

All right, very quickly, we're going to talk about nucleotide BLAST. So two methods: one's

called MegaBLAST, the other one's called BLASTN. We're not -- I'm not going to walk you through

an example, I'm just going to show you how to actually do it. So again, we're going back

to our BLAST homepage, the center section. Nucleotide BLAST is the first option in the

second section, it takes us to our BLAST page, and what I want to zero in on here is a section

you have not seen before where it asks you which nucleotide comparison program do you

want to use, MegaBLAST, discontiguous MegaBLAST, or BLASTN? Okay? And that's, of course, incredibly

obvious, so here we go. [laughs]

So here's a table that just gives you, all right, what the word lengths are, the plus/minus,

so for matches and mismatches, what the scores are, how the gaps are scored. So affine is

what I described to you before, where it's a penalty to open the gap plus the extension.

When it's linear, you don't asses a penalty to open the gap, you just score the length

of the gap, you subtract points for the length of the gap. If you have two sequences that

are highly similar to one another, really long stuff, chromosome length, you know, big

contig length stuff, you want to use the first one, which is called MegaBLAST. So this is

what you use if you've got very long sequences or highly similar sequences, I mean, really,

highly similar sequences, 95 percent or greater. You're going to get very long alignments back

and it's incredibly fast. I'm going to jump to the lower block real fast. So the lower

block is if you've got really short stuff that you want to make sure you've got almost

exact matches for. These are the parameters that NCBI recommends for this, but they don't

recommend putting a cut-off in, so this is some place where manual inspection is very

important.

In the vast majority of cases, you're in the middle, where you've got things that are somewhat

similar to one another. The results that you would get using either a discontiguous MegaBLAST

or BLASTN are going to be not quite exact, but they're going to be pretty much very similar

to one another. The only difference is that MegaBLAST is going to run much faster in relative

terms. It's like, how fast can you word process, right? So you might want to just do both of

those methods to see if you get slightly different results when you do your BLASTN searches.

So that's all we're going to talk about on the nucleotide side of the house and we're

going to keep coming back to this throughout the course, so you'll see some more treatment

of this later on.

Okay, so finally, in the last 11 minutes, we're going to get to the last thing on the

list, which is called BLAT. So everything we're looked at so far, the scoring matrices,

all of that stuff, has revolved around NCBI's basic local alignment search tool. But the

same basic premises of how the scoring matrices are computed, how they're used to assess similarities,

how searches are nucleated to find other sequences that are similar to one another, also apply

to BLAT.

So BLAT is a tool called the BLAST-Like Alignment Tool that's made available by UC Santa Cruz.

And what this is designed to do, like MegaBLAST, is to really rapidly align longer nucleotide

sequences, so things that are longer than 40 in length that have really, really high

sequence similarity. So again, we're talking things on more of a genomic scale at this

point. It really is, you know, a method of choice when you're looking for exact matches

in nucleotide databases. It's incredibly speedy, 500 times faster than BLAST for either mRNA

or DNA-based searches. Because it is pushing in the direction of exact matches, you might

miss divergent or shorter sequence alignments. That's when you're going to go back and use

BLASTN instead, over at NCBI. It can be used on protein sequences; truth be told, I've

never actually done that. I've always used this on the nucleotide side, but just that

you're aware that you can do that as well.

All right. So when do you use this instead of one of the BLAST algorithms? All right,

you might have a particular unknown gene or you might have a sequence fragment, so it'll

allow you to place it onto a chromosome to find its genomic coordinates. It allows you

to help determine the gene structure by virtue of this comparison; where are the introns,

where are the exons? Or it might help you find some markers of interest in the vicinity

of a sequence, so there might be upstream regulatory elements or other important genetic

features of -- note, Laura Elnitski will talk about this in great length when she talks

about epigenetics and upstream regulatory regions. You, as already inferred, use it

to find highly similar sequences, so other gene family members or putative homologs.

You know what a homolog is now. And also, just to display your sequences as a specific

track. So, track is a new concept. It's just a way to describe linearly a bunch of sequences

that are related to one another, and we'll see an example of that in a moment.

Okay, so we now leave Building 38A, we go to the West Coast, genome.ucsc.edu. This is

UCSC's genome bioinformatics homepage. All of the tools that are available to you are

on the left-hand side. The fourth one down is BLAT, so we're going to pretend we're clicking

on that for purposes of the example. Again, I've just pasted a sequence into the box from

the sample sequence page that I told you about earlier. This is a rat CDNA clone, so I've

changed the genome to rat. It's using the latest assembly, so by default, it will give

you the latest set of information available. Tyra will touch on that a little bit next

week and why it's important to keep track of the assemblies. The query type, it's a

DNA sequence. And here, we're just going to submit it, okay? There are ways to change

the parameters, but we're just going to go ahead and click on the submit button and see

what we get back.

And we end up getting back a very simple format on our results. And they are, in this case,

shown by -- in score. You don't see probability values here. Okay? The best hit is the one

at the top, so -- because it has the highest score. It is also the longest one, so the

span in the last column tells you how long the alignment is. So in this case, we started

with a sequence. It aligned from position 1 to position 733: 98 percent identity. It's

on chromosome 5 on the plus strand and it gives us the genomic coordinates of where

it landed. It's one thing to look at it that way. We can now see it in the context of the

genome browser by clicking on the browser link. So this is now what a standard UCSC

genome browser page looks like, and again, Tyra will make you experts in how to use this

by the end of next week. Very simply, here's a red bar on this ideogram here, so we can

see that we are at 5Q 31, is where our sequence landed. It's right here, your sequence from

BLAT search, and the accesion number is at the front. You'll remember that this was 98.1

percent similar to the chromosomal sequence, so there are some gaps out here to show you

where there have been gaps inserted. You'll also see, very faintly, there are some red

bars that are in there as well, and those just indicate to you the positions of mismatches.

Okay? On your handout, I've given you a color key, so red genome and query sequence have

different bases at this position. You can read for yourselves when you see something

in orange, in purple, in green, what those all represent. So different characteristics

of the alignment. Okay.

There are a slew of other tracks here that are of great interest. So there are some rat

EST, some conservation tracks that tell you how well the sequences line up between, in

this case, mouse and human and dog and so on, in this particular region. There's SNP

information available to you, a whole slew of really cool genomic information, and again,

Tyra will show you all of that next week. Okay.

There's another way to look at these results, by clicking the second link over here, the

details link, so we get a more text-based result here. The first sequence here is the

one we started with. This was the query that I pasted into the BLAT search page. This was

the result. You'll see that some of these are capitalized, they're in different colors.

It tells you at the very top what all of that means, so matching bases and CDNA and genomic

sequences are colored blue and capitalized, light blue bases mark the boundaries of gaps

in either sequence. And if we just pretend we scroll down a little bit more, here is

the alignment in the more traditional way that you're now used to looking at these alignments.

Okay?

So again, very powerful tool, gives you chromosomal context, whether you're looking at human sequences,

rat sequences, what have you, and allows you to now start thinking about things, not just

between the two sequences lined up, but a whole just amazing collection of genomic information,

and may help you think about how that particular gene is implicated in certain biological processes

and perhaps in certain disease processes as well.

Okay, the last item. It's called FASTA. So FASTA actually predates BLAST as a sequence

similarity method. It, like BLAST, identifies region of local alignment. It uses a very

complex algorithm called a Smith-Waterman algorithm to do the comparison. So the method

is significantly different than the one I described to you earlier in this lecture,

but it's an incredibly robust method. While we don't have time to talk about it today,

I wanted to point it out because, again, if we think about where those cut-off lines are,

if you want confirmatory evidence by some other method for some of those BLAST hits,

you could go ahead and do a FASTA search or use some of the methods I will tell you about

in Week 4 to help you confirm or reject results that are around the cut-off line. So the same

way that in the lab you might use a second or a third method to confirm your results

or your conclusions, you do the same thing in the computational world as well. Here's

where you can find some online implementations of that.

And I know we covered a lot of ground in the last hour and a half. For those of you who

would like some more information on how to use these methods, some more of the fine points

of how these methods are utilized to answer specific biological questions, you can read

Chapter 11 in my own textbook, which talks about both BLAST and FASTA. In the Mount textbook,

Bioinformatics Chapter 6 talks about sequence similarity as well. Both of these books are

available in the NIH library, so you can just check them out at your convenience.

All right, so that's it for this week. What we're going to do two weeks from now is build

on all of this stuff to start thinking about things in more context, looking at profiles

and patterns and motifs. We're going to talk about structure a little bit, something that

usually scares people but actually is pretty cool and is not that hard to do, and talk

about how to start constructing multiple sequence alignments. So just by way of reminder, next

week, Tyra Wolfsberg will pick up from here, talk to you about genome browsers, the stuff

that I alluded to towards the end of today's lecture. So thank you for coming, I'm glad

to take any questions at the podium. We'll see you again next week. All right.

[applause]

The Description of Biological Sequence Analysis I - Andy Baxevanis (2012)