Genome Browsers - Tyra Wolfsberg (2012)

Tyra Wolfsberg: Can you hear me? Okay. I'm Tyra Wolfsberg.

I'm one of the course organizers and also the associate director of the Bioinformatics

Core at NHGRI. So what I'm going to be talking to you about today is how to get data out

of genome browsers.

This is something -- oh, sorry -- before I start I need to say that I have no relevant

financial relationships with commercial interests, for those of you doing CME credit.

So what I'm going to talk about today is how to get data from the three publicly available

genome sequence browsers, one at UC Santa Cruz, one at Ensembl and one at NCBI. This

is something that I do pretty much on a daily basis and a lot of our work in the core, in

the Bioinformatics Core, is based on getting data out of these different online resources

and either helping people to analyze it or to display it themselves.

So before I get into the browsers themselves, let me go over a couple of details about the

types of data that you're going to see. So all the browsers start with the same information

and that is genomic sequence. And I'll show you in a couple of slides that, for the most part,

it's the same genomic sequence but not always. Then each of the three browser teams independently

annotates the genome with relevant information. Those annotations can be different because

they are done independently by the three different teams. The types of things they annotate would

of course be genes, which they can annotate using RefSeq mRNAs that Andy talked about

last week and I'll talk -- I'll touch on a bit more today, GenBank mRNAs, other sources

of transcript sequences and also ab initio gene predictions. They also annotate things

like SNPs and, for example, non-coding functional elements. And we'll go through some of the

different types of annotations that the three browsers make available to you.

Again, before jumping in, I'd just like to go over a very quick overview of how genome

sequences get generated in the first place. This is taken from an older review that's

11 years old at this point, going over two modes of genome sequencing that were available

at the time. This clone-by-clone shotgun sequencing is what was used for the publicly funded Human

Genome Project, that is the NIH Human Genome Project, and what they did was to take the

human genome, which you might imagine as an encyclopedia where each chromosome is a single

volume in that encyclopedia and break each chromosome up into a series of about 350 kilobase

inserts which they cloned into vectors called BACs.

Before they even started sequencing, they first made a map of these BACs along each

chromosome. So they had one BAC that started at the left end of chromosome one, then another

BAC overlapping that a bit, another BAC, another BAC, until they had a whole tiling pattern

of BACs across each chromosome. Once they had mapped these BACs, then they sequenced

them. The BAC inserts of about 350 kilobases, as I said, are too big to stick in the sequencing

machine so they were broken up into smaller pieces by a process called shotgunning, which

generates sequences of a hundred nucleotides and those were the pieces of sequence that

were actually put in the sequencing machines.

So you end up with a sequence of each of these BACs, shown here in different shades of blue,

and then using a genome assembly program, you can stitch these pieces back together,

looking at the overlapping sequences of letters, until you end up with a sequence of each BAC.

Because you have made a map of the BACs along the chromosome before you've started, you

then know the order of these different color blue pieces and you can end up with a chromosome


The opposite strategy, which was taken by Celera, which sequenced the privately funded

human genome, was to dispense with this whole BAC mapping process. Rather, they just took

the entire human genome and shotgunned it into all these little pieces and then wrote

computer programs to assemble these pieces back into the sequences of individual chromosomes.

There was a lot of controversy at the time about whether such a method was actually possible,

whether you could have sequences that come from chromosome one and chromosome X all in

one big pile and manage to figure which comes from which, but it was a successful strategy

and they ended up with a relatively complete genome map.
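
The idea of stitching shotgun reads back together by their overlapping letters can be sketched with a toy example. This is only an illustration under simplifying assumptions (error-free reads, one strand, no repeats); the function names and the three short reads are made up, and real assemblers are far more sophisticated.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` letters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # Nothing left to merge: what remains are separate contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads that tile a 13-letter sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

If the reads share no sufficiently long overlap, the function simply returns them as separate contigs, which is the situation the mapping or paired-end steps are meant to resolve.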

This process worked so well that it was used for most of the other genome projects until

fairly recently. So for example, mouse and rat are pretty much all done by this whole

genome shotgun sequencing.

Newer strategies are based on next generation sequencing methods that you'll hear about

in a couple of weeks from Elaine Mardis. But just as an overview, this procedure is quite

different. It's a lot faster. You would take -- this is one example of a procedure that

we're using to sequence a genome in our group. You take the genome, break it up into smaller

pieces and sequence those pieces on a machine called the GS454, which generates sequences

of a few hundred nucleotides in length. And those sequences are assembled into these contigs,

longer pieces of sequence. Because you haven't done a mapping step, you don't know the order

of these contigs along the chromosome but you can fix that by generating what are called

paired-end reads off a machine called an Illumina machine which generates much shorter sequences.

So basically you would have -- you would have clones and rather than sequencing the whole

clone, you would just sequence in the two ends of the clone shown here as these blue

lines. Because you know that these two pieces came from -- were contiguous in the genome

you can pull these two pieces, these two contigs together with a blue sequence and end up making

a longer scaffold, perhaps with some gaps that you would go back and do more sequencing on. So

this is a strategy that is starting to be used over the last year or so for other genomes.

For example, the panda genome was sequenced using this methodology.

A bit more about the sequence assemblies. It's pretty complicated to generate a sequence

assembly from the available data, so even though data are continuously being -- are

still being generated for genomes which have been declared finished like the human genome,

assemblies are still being calculated. So the human genome gets updated every couple

of years. The mouse genome gets updated every couple of years even though they've been declared

finished. The assemblies are not always displayed simultaneously

on the three different genome browsers. I'll show you an example of that in a minute. If

you don't see a version of an assembly that you're looking for, both Santa Cruz and Ensembl

maintain these pre-release websites -- I've shown you the URLs here -- and they may have

genome assemblies or gene annotations that are more recent than the ones -- not more

recent, more, I guess more recent than the ones that you see on the regular genome browser

but that are still sort of in a test phase so you wouldn't want to trust them entirely.

Both Santa Cruz and Ensembl will provide online archives of older assemblies so you can always

get back to your old data. So say, for example, you're working on or looking at a region of

a genome in one of the browsers, and all of a sudden one day you come in and they've updated

the genome assembly. And all of a sudden the coordinate system is going to change, some

of the gene predictions may change. It can really be difficult to figure out what you

were doing the previous day. So both Santa Cruz and Ensembl, you can always get back

to your old data. NCBI provides only limited archives.

So bottom line, if you're -- if you're trying to compare data from the different -- from

different genome browsers, you need to make sure you're looking at the same assembly.

And this can be a bit cryptic because all three browsers name their assemblies using

slightly different fashions. So I'm just showing some examples of the different naming conventions

shown here. And for example, the dog genome at present, NCBI is displaying the most recent

version of the dog genome where Santa Cruz and Ensembl are still one assembly behind.

So you're not going to be able to directly compare data between NCBI and the other two

genome browsers. I'd like to encourage you to ask questions

as I go through so if you have anything that you want to ask at this point, wave your hand,

or as I go through later please stop me and I'll stop.
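
One way to guard against comparing data from different assemblies is to translate each browser's name into a common label before comparing. The sketch below is a minimal illustration: the alias table is deliberately tiny and only encodes the human pairings hg19 = GRCh37 and hg18 = NCBI Build 36; the function names are hypothetical, and a real table would need an entry for every organism and assembly you use.

```python
# Illustrative alias table: browser-specific names -> one canonical label.
# Only two human assemblies are listed; extend it for your own organisms.
ASSEMBLY_ALIASES = {
    "hg19": "GRCh37",
    "grch37": "GRCh37",
    "hg18": "NCBI36",
    "ncbi36": "NCBI36",
}


def canonical_assembly(name: str) -> str:
    """Map a browser-specific assembly name to its canonical label."""
    key = name.strip().lower()
    if key not in ASSEMBLY_ALIASES:
        raise KeyError(f"unknown assembly name: {name!r}")
    return ASSEMBLY_ALIASES[key]


def same_assembly(name_a: str, name_b: str) -> bool:
    """True if two names refer to the same underlying assembly."""
    return canonical_assembly(name_a) == canonical_assembly(name_b)


print(same_assembly("hg19", "GRCh37"))  # same assembly, different names
print(same_assembly("hg18", "GRCh37"))  # different assemblies
```

Raising an error on an unknown name is a deliberate choice here: silently treating an unrecognized assembly as "different" or "same" would hide exactly the kind of mismatch this check is meant to catch.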

So, jumping -- oh, sorry, one more thing. I want to just briefly touch on the reference

sequences. Andy talked about these last week but I'll go over them again. So basically

the RefSeq project was initiated by NCBI to come up with one good copy of each mRNA, protein,

and non-coding sequence in the genome. And this was just a way to eliminate confusion.

So for example there are at least 20 different cDNA sequences for human beta-actin in GenBank.

And you -- as a researcher, if you want a sequence for beta-actin, you really have no

idea which one to pick. Well NCBI has, using a combination of computational and manual

methods, selected which one they think is the best version of the beta-actin sequence,

copied it over, assigned it a new accession number which looks like this. It has an NM,

underscore, and a series of numbers, and they put that out for people to get and also for

the genome browsers to use as part of their gene prediction pipelines.

You can recognize these RefSeqs because they always contain two letters, an underscore,

and then a string of numbers. The ones that start with the letter N are ones that are

derived from GenBank submissions and have real mRNA sequences backing you up or real

-- I should say -- real sequence data backing them up. Things that start with the letter

X are predictions based on an annotation pipeline which may or may not have real sequence data

backing them up and may or may not be as good quality. So those should always be viewed

with a bit of skepticism.
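
The naming rule just described (two letters, an underscore, a string of numbers, with N-prefixed records backed by sequence data and X-prefixed records coming from a prediction pipeline) can be checked with a short pattern match. This is a sketch of that rule only: the function name and the returned labels are made up, the optional `.version` suffix is an added assumption, and the example accessions are used purely as strings.

```python
import re

# Two capital letters, an underscore, digits, and an optional ".version".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession: str) -> str:
    """Classify an accession by the first letter of its RefSeq-style prefix."""
    match = _REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return "not RefSeq-style"
    prefix = match.group(1)
    if prefix.startswith("N"):
        return "backed by sequence data"
    if prefix.startswith("X"):
        return "pipeline prediction"
    return "other RefSeq prefix"


for acc in ["NM_001101.3", "XM_123456", "AB123456"]:
    print(acc, "->", classify_refseq(acc))
```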

And here is just an example of the beta-actin reference sequence. Looks very much like a

normal sequence except you have this weird style accession number. You get an overview

summary which was written by the NCBI staff telling you what this does and it also tells

you which original GenBank accession numbers this particular sequence was derived from,

which can be handy if you want to go back to the original record for some reason.

Okay, so jumping into Santa Cruz, I'll tell you that I'm going to proceed -- if you've

downloaded the handouts you already know this -- but I've downloaded the screen shots that

you would use that you would generate yourself if you were going through these examples.

So I encourage you, as time permits after the class is over, to go back and try some

of these examples because it's one thing to sit here and watch me go through this; it's

quite a different thing to try to do this on your own, if you haven't tried that already.

So we're going to start out with a very common query at Santa Cruz, which is to view a region

of the genome by querying with a gene symbol. This is the Santa Cruz home page. They give

you news right up front here, as well as a variety of tools across the top and down the

sides. If you just want to do a simple query, you click on the genome browser link. In the

search box you would enter your search term. Before you do that, you'd need to choose what

clade and what genome you're looking for -- we're doing human -- and also what assembly. You'll

see that they make available four different human genome assemblies going back to 2003,

up until 2009. Most people at this point are using the 2009 assembly but there are some

reasons why you might want to use the 2006 assembly because there's a lot more annotation

data available for that. And I'll go into -- get into that a bit later.

So you put in your -- you put in your query. There's a list of potential queries that you

can do down here. I'm searching by gene; you can also search by accession number, chromosomal

coordinate, a keyword, etc. You'd press the submit button and come back to a list of entries.

So what Santa Cruz does, when you do a search in the general search boxes, it's just looking

for a text string. I typed in the word ADAM2 so it's matching any instance of ADAM2 that

it can find, which includes not only the ADAM2 gene but also ADAM20, 21, 22, 23, et cetera,

because those all have the text string ADAM2 in them. The data are organized into what

are called tracks based on the source of annotation. Up here at the top we have the UCSC genes

track, so these are genes which are predicted using a variety of online transcripts. They

use RefSeqs, they use UniProt, they use something called CCDS that I'll get into a bit later.

These down here called the RefSeq genes, only contain genes predicted by the location of

RefSeq transcripts. I tend to prefer these because the RefSeqs are manually curated and

I trust what I'm getting out of them. Some of the sequences up here are not as -- some

of the transcripts they have up here are not as well curated and you get some sort of weird-looking

splicing coming back.
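
The text-string behavior of the search box described above, where a query for ADAM2 also returns ADAM20, ADAM21 and so on, is the difference between a substring match and an exact match. A minimal sketch, with a made-up symbol list and hypothetical function names (it does not reproduce how the browser's search is actually implemented):

```python
# Hypothetical list of gene symbols to search through.
GENE_SYMBOLS = ["ADAM2", "ADAM20", "ADAM21", "ADAM22", "ADAM23", "ADAM9"]


def text_search(query: str, symbols: list[str]) -> list[str]:
    """Substring match: every symbol containing the query string."""
    return [s for s in symbols if query in s]


def exact_search(query: str, symbols: list[str]) -> list[str]:
    """Exact match: only symbols identical to the query."""
    return [s for s in symbols if s == query]


print(text_search("ADAM2", GENE_SYMBOLS))
print(exact_search("ADAM2", GENE_SYMBOLS))
```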

So for the purposes of this example I would select the ADAM2 link in the RefSeq gene and

go on to the next slide which is giving you a basic overview of what you're going to get

from Santa Cruz. The data are organized into tracks so up at the top here we see where

we are in the chromosome, which is also shown here in the schematic, and then we have our

genes. The first track is called UCSC genes so as I said these are based on a variety

of different sources. There is one gene, yet four different transcripts, four different

spliced forms. So the reason I know this is that these tick marks right here, those represent

exons and the horizontal line connecting them represents the introns. So if you look carefully

you'll see four different splice forms which are either using or splicing out a variety

of different exons.

The direction of transcription of the gene is shown by the very small arrowheads that

are present on the exon sequence. It's a bit hard to see in this view but in this case

all the arrowheads are pointing to the left. I mean, the gene starts over here on the right

and points to the left. So that's something to keep in mind when you're used to looking

at a text book representation of the gene, they always start on the left and go over

to the right. But in the genome, the genes can point either direction. The genome doesn't

care which way the genes are going. Some of them point this way, some of them point that

way, and you need to make sure when you're looking at them that you know what the orientation

is or you may get very confused and think that this is the first exon rather than the last

exon of the gene.
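
The point about orientation can be made concrete: the first exon of a transcript is the leftmost one only for a gene on the forward strand. The sketch below orders exons in the direction of transcription; the function name and the coordinates are made up for illustration.

```python
def exons_in_transcript_order(exons: list[tuple[int, int]],
                              strand: str) -> list[tuple[int, int]]:
    """Order (start, end) genomic intervals from first exon to last exon.

    On the '+' strand the first exon has the smallest coordinate;
    on the '-' strand it has the largest.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return sorted(exons, key=lambda exon: exon[0], reverse=(strand == "-"))


# Made-up coordinates for a three-exon gene.
exons = [(100, 200), (500, 600), (300, 400)]
print(exons_in_transcript_order(exons, "+"))  # leftmost exon comes first
print(exons_in_transcript_order(exons, "-"))  # rightmost exon comes first
```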

Underneath the known -- the UCSC genes track -- we have the RefSeq genes track. Again, this

only has one transcript. As we noticed before -- actually this only has one transcript with

one set of exons. And then down below are some other annotations that I'll be talking

about a bit more later.

If you click on a track name, so for example here, if you click on any of these ADAM2 genes,

you're going to get more information. So this is an example of what you might get if you

go to the known genes track description. You get all types of information. The page is

very, very long. I'm just showing here one particular example which is some microarray

data that you can get right off the bat.

If, on the other hand, rather than clicking on the UCSC genes track, you click on the

RefSeq genes track you get a different look. Again, these data are coming from NCBI so

you get a lot of links up here back to various NCBI resources which I'll be talking about


But what I want to point out is down near the bottom where you can retrieve sequence,

in particular, the genomic sequence. So what we want to do here is get the promoter sequence

of this particular transcript. That's a question I get quite a bit. People want a promoter

sequence of a gene so they can copy and paste it into their favorite transcription factor

binding site prediction program but they need to have the sequence before they can do that.

So we're going to say we want the promoter sequence and some other defaults that I've

checked down here. And when you click that you very quickly get the 1,000 nucleotides

upstream of your gene of interest. And note that this is just doing it one gene at a time.

I'll show you a bit later how you can do all the transcripts in the genome.
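
What "1,000 nucleotides upstream" means depends on the strand of the gene, which is easy to get wrong when doing it by hand. The sketch below is an illustration, not the browser's own code: it assumes 0-based, half-open coordinates, treats the transcript start as the left end on the '+' strand and the right end on the '-' strand, and uses made-up function names and a toy sequence.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_interval(tx_start: int, tx_end: int, strand: str,
                      upstream: int = 1000) -> tuple[int, int]:
    """Genomic interval covering `upstream` bases before the transcript start."""
    if strand == "+":
        return max(0, tx_start - upstream), tx_start
    if strand == "-":
        # On the minus strand the transcript starts at its right-hand end.
        return tx_end, tx_end + upstream
    raise ValueError("strand must be '+' or '-'")


def promoter_sequence(chrom_seq: str, tx_start: int, tx_end: int,
                      strand: str, upstream: int = 1000) -> str:
    """Upstream sequence, reported 5'->3' relative to the gene."""
    start, end = promoter_interval(tx_start, tx_end, strand, upstream)
    seq = chrom_seq[start:min(end, len(chrom_seq))]
    return reverse_complement(seq) if strand == "-" else seq


# Toy 16-base "chromosome" with a gene occupying positions 4..12.
toy = "AAAACCCCGGGGTTTT"
print(promoter_sequence(toy, 4, 12, "+", upstream=4))
print(promoter_sequence(toy, 4, 12, "-", upstream=4))
```

For a minus-strand gene the upstream bases lie to the right of the gene in genome coordinates and are reverse-complemented, so that the returned sequence reads in the same direction as the gene.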

So how do you move around the genome browser? What we're looking at right now is just a zoomed

in view of ADAM2. Actually, before I do that let me mention one other thing. You can now

move tracks around on the browser. So say, for example, you didn't like this UCSC genes

track up here; you wanted to move it further down so you could align it, say, next to the

SNP track which is down here, you could just highlight it so it turns green and drag it

down and it would show up there.

Another thing you can do is flip the orientation of a gene, so if it really disturbs you that

this particular gene is going from right to left, you can hit the reverse button down

here and that's going to flip the display so now the gene is going from left to right.

The way you know this is that the gene annotations, the transcript names are no longer on the

left side of the display; they're now on the right side of the display. So up to you whether

you like it this way. Just keep in mind, once you have clicked that reverse button, everything

that you look at is going to be reversed. So I find this actually very confusing because

all of a sudden you're in a different orientation from what you might be expecting. So I would

say use the reverse button with a bit of caution.

If you want to navigate around the genome, you can use these move buttons to move to

the left or to the right to see more details. You can zoom in and you can also zoom out,

which is what we're going to do here. So if you zoom out by threefold, you're now going

to be seeing a longer region of the genome. So ADAM2 is in the middle now and you're seeing

the two flanking genes on either side. If you want to zoom in, as I said, you can use

these buttons up here. Or, for a more directed zoom, what you probably want to do is what

I've shown right here. So I'm just going to -- I've put my mouse in the very top track

up here where the numbers are and I've sort of dragged it over to the right and that highlights

a section of the genome in purple. And what's going to happen on the next screen is I'm

going to focus in just on this region that I've highlighted in purple, which is shown

now right here. So we're just looking at the very five prime end of the IDO1 transcript.

Something I want to point out here is that this particular exon looks weird. Rather than

just being a single vertical box like most of the other exons are, it looks sort of like

a top hat flipped on its side. So what Santa Cruz is trying to show you there is the difference

between the translated and the untranslated regions of the transcript. So any place where

you see a tall box like right here, that indicates that part of the exon is translated or made

into protein. Any place where you see a shorter box would be the untranslated region of that

particular exon. So the subsequent exons off the screen down here would all be tall; those

would all be translated exons. And you would see the opposite orientation at the three

prime end of the transcript, the hat would be flipped on its side to show the three prime

UTR way off on the right.
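
The tall-box versus short-box drawing corresponds to splitting each exon at the boundaries of the coding region. A minimal sketch of that split, assuming 0-based half-open coordinates and a known coding start and end for the transcript; the function name and the numbers are made up.

```python
def split_exon(exon_start: int, exon_end: int,
               cds_start: int, cds_end: int) -> dict[str, list[tuple[int, int]]]:
    """Split one exon into translated ('coding') and untranslated ('utr') parts."""
    coding_start = max(exon_start, cds_start)
    coding_end = min(exon_end, cds_end)
    if coding_start >= coding_end:
        # The exon lies entirely outside the coding region.
        return {"coding": [], "utr": [(exon_start, exon_end)]}
    utr = []
    if exon_start < coding_start:
        utr.append((exon_start, coding_start))
    if coding_end < exon_end:
        utr.append((coding_end, exon_end))
    return {"coding": [(coding_start, coding_end)], "utr": utr}


# Made-up transcript whose coding region runs from 180 to 5000.
print(split_exon(100, 300, 180, 5000))  # "top hat": part short, part tall
print(split_exon(400, 500, 180, 5000))  # fully translated exon
```

Whether an untranslated piece is the five prime or the three prime UTR depends on the strand of the gene, which this sketch does not track.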

So what you saw there was just the default tracks that Santa Cruz shows you. I want to

point out one thing down here. There's a track called common SNPs. So this is basically showing

you all the SNPs in this region of the genome; each one is indicated by a tick mark.

So let's see how we change that display and also add tracks to this browser in general.

If you scroll down to the bottom of the screen, so below where you had that nice graphic,

you end up with a very long list of other data which you can add to this display. If

you click on the names of any of these tracks, you get an information page which explains

what it is. I'm going to be concentrating down here on the SNP tracks. There's now four

different SNP tracks at Santa Cruz -- it's a bit confusing -- which are described here.

I'm not going to read through those, but the one I'm going to focus on here is this thing

called common SNPs, so these are SNPs that have a minor allele frequency of greater than

one percent.
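
The "common" cutoff can be expressed as a filter on minor allele frequency. In this sketch the minor allele frequency is taken to be the frequency of the second most common allele, the threshold follows the "greater than one percent" wording above, and the SNP identifiers and allele counts are invented for illustration.

```python
def minor_allele_frequency(allele_counts: dict[str, int]) -> float:
    """Frequency of the second most common allele (0.0 if there is none)."""
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return 0.0
    frequencies = sorted((count / total for count in allele_counts.values()),
                         reverse=True)
    return frequencies[1]


def common_snps(snps: dict[str, dict[str, int]],
                threshold: float = 0.01) -> list[str]:
    """IDs of SNPs whose minor allele frequency exceeds the threshold."""
    return [snp_id for snp_id, counts in snps.items()
            if minor_allele_frequency(counts) > threshold]


# Invented identifiers and counts, not real dbSNP records.
example = {
    "snp_a": {"A": 950, "G": 50},   # minor allele at 5 percent
    "snp_b": {"C": 999, "T": 1},    # minor allele at 0.1 percent
}
print(common_snps(example))
```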

If you want to change the display mode of a particular track, you would click on the

pull down menu underneath it. And you'll notice that there's five different ways of showing

that particular track. By default, most of the tracks are hidden, which means, I guess

intuitively, that they are not displayed, which is a good thing because if you display

all the tracks your window's just going to get longer and longer and longer. So you want

to hide most of the tracks that you're not interested in.

The opposite of hide is full, and that displays your data -- all of your data. That's normally

the mode that you want to do your display in, with the exception of the SNP track and

I'll explain that to you in a minute. Dense, squish and pack are just different ways of

condensing the data so it fits better on a single line so you see it all but it's not

shown in great detail.

So let's change the SNP track to full and see what happens. You would change it on this

pull down; you would then click the refresh button and go to a new view of your browser.

So if you remember a minute or so ago, the SNP track was all on one line with lots of

tick marks; now you have the SNPs individually displayed. Each one of these is the accession

number of a SNP from NCBI. Most of them are black -- [clears throat] excuse me -- a couple

of them are red and green.

So, you may wonder, what is this color coding of the SNPs? What does that mean and how do

I change it? Well, if you click on the link here called common SNPs, that's going to bring

you to a page where it explains what the common SNPs track is and also allows you to change

the configuration of that track. Different tracks will have different types of configurations

that you can change. In this case, we can change the color so we can turn all the SNPs

to being shown in black except for coding SNPs shown in green -- [clears throat] sorry

-- synonymous SNPs shown in green and non-synonymous SNPs shown here in red. I'm also going to

set the display mode to pack. I'll spare you the details but in general, the SNP track

displays best when it's in pack mode as opposed to the other modes. And when you come back

you see a display very much like what you had before, except that you now understand

that the red SNPs are non-synonymous and the green SNPs are synonymous.

So the next thing that I want to talk about briefly are the ENCODE tracks which are available

at Santa Cruz. So if you remember two weeks ago, Eric Green talked about the ENCODE project.

This is an NHGRI-initiated project to annotate all the genes and all the regulatory regions

in the human genome. Much of that data is submitted to Santa Cruz, so it will be viewable

in the context of other genomic features. You can recognize these ENCODE tracks because

they have a double -- [clears throat] excuse me -- a double helix, the NHGRI logo, next

to them. So all these things with a double helix, these are all coming off the ENCODE


A number -- there are now a growing number of ENCODE tracks available on the most recent

genome assembly which is from 2009, also called HG19. Some of the tracks were originally generated

on the HG18 version of the genome sequence which is from a couple years ago and then

ported over to the current version. Those are highlighted here with this number 18 next

to them and I think there are some ENCODE tracks that came from the previous assembly,

although I'm not seeing them right now.

And there's a large variety of ENCODE tracks, which I'm not going to go into in any detail.

The one I want to focus on is this one here which is called ENCODE regulation, which is

on by default. But that's a super track which is composed of data from six different

ENCODE experiments. Three of them are in hide mode; we're not seeing them by default. The

other three are on by default. And if you look at them, here's what you get: so remember

I said one of the goals of ENCODE is to figure out where the regulatory regions are in the

genome. So what I've focused on here is this region between the ADAM2 gene and the IDO1

gene where both genes are pointing in opposite orientations. So ADAM2, the five prime end

is here; in the IDO1, five prime end is here so they're pointing like this. So you would

imagine there should be a lot of regulatory data in this region because that's a promoter

region which would control the transcription of both of these genes. And indeed that's

what you see.

In this ENCODE super track, we have histone acetylation marks that are often found near

regulatory elements. Those are shown as these different color histograms. We have a lot

of DNase hypersensitive regions; DNase hypersensitive regions mark regions which are accessible

to protein binding, and we also have a variety of transcription factors binding in this region.

For more detail on the ENCODE data, I point you to this manuscript that Eric Green pointed

out two weeks ago, the User's Guide to the ENCODE, published in PLoS Biology last year,

and they help walk you through many different examples showing you different types of tracks

and the type of data that you can get out of them. For example, in this particular example

we have a SNP which is upstream of the oncogene MYC and it's associated with a number of different

regulatory elements which are all characteristics of enhance -- all have characteristics of

enhancer sequence. The histone modifications as well as transcription factor binding all

point to this being an enhancer region, so this particular SNP, which I forgot to mention

is a cancer associated SNP, might be involved in the regulation of MYC through enhancer

activity or this region; it might be involved in enhancer activity.

I would also point you to a seminar tomorrow which is being given here in Lipsett at 11:00

by Ewan Birney who is basically the head of Ensembl, and he's going to be talking about

the ENCODE projects and the latest and greatest results that aren't yet published. So if you

want to know more about ENCODE I highly recommend coming back tomorrow, in 25 hours.

So, a couple more things to touch on at Santa Cruz: one of them is attempting to use a Santa

Cruz BLAT engine to find the chicken homolog of a human protein. So, if you remember, last

week when Andy Baxevanis went through different methods of sequence alignment, he talked about

both BLAT and BLAST. BLAT -- so BLAST is a traditional sequence alignment tool developed

by NCBI. It's very sensitive. It's used for comparing both sequences that are close evolutionarily

as well as sequences that are more distant. Very sensitive, but a workhorse for many years.

The downside is that it's slow. So Santa Cruz developed a program called BLAT which is meant

to find a very -- sequences that are very highly similar to each other, either nucleotide

sequences against the genome or you can even take protein sequences and BLAT them against

the translated genome. It does a great job of finding things that are within the same

species but not so good at finding hits between species.

So we are going to attempt to use it here to find the chicken homolog of a human protein.

We're going to get that human protein sequence from NCBI, from RefSeq, paste it into the

BLAT search. If this were a normal Santa Cruz page, not a BLAT page, there'd be a link up

here called BLAT. You'd paste in your sequence, select the genome, which is chicken, and the

assembly. I should say that this is a rather contrived

example which I'm showing just to make a point about BLAT. This particular older chicken

assembly is no longer available off the main Santa Cruz page. I had to go to the genome

preview page I told you about earlier in the talk to get this older assembly. The newer

assembly is from 2006.

So you press -- you submit this data and very, very quickly, it goes off and it takes that

human protein and compares it to the entire translated chicken genome and returns you

four results. You get two links, one to the browser and one to the details.

So let's take a look at what the browser link looks like. This opens up a screen on the

chicken genome browser and this line right here shows you your BLAT hit. So these solid

rectangles indicate regions where the human protein sequence aligns to the chicken genome.

The regions between them indicate regions where there is no alignment.

So at first glance, you might think this looks pretty good. If you take a protein sequence

and compare it to a genome sequence, what do you expect? You don't expect one big block

of sequence, rather you expect individual exons, because remember the exons are split

-- are stitched together to make the mRNA and later the protein. So you might think

that these individual blue blocks that you're seeing here actually represent individual

exons in the human genome. However, if you were to look at the details link which shows

the sequence alignment between the human protein and the translated genome, you'll find out

that you've been sorely misled. The region -- the three boxes that you were getting are

actually these very, very short regions of alignments, and anything that's in blue aligns,

anything that's in black does not align. So here we have the human protein and it's showing

in blue the bits that align to the chicken genome. This doesn't look very good. So here's

the chicken genome showing you the three regions of alignment and here's one example of the

alignment. This would be something that would just be a spurious alignment. You can get

-- there happens to be a bit of sequence overlap between these two sequences but nothing of

any importance that you would really care about.

So I just want to make the point that you need to -- just because you get a result back

on a genome browser doesn't mean that you have some interesting biological result. You

always need to go back and look at the underlying data, look at the sequence alignment, look

at whatever you need to look at to convince yourself that what you're seeing is actually

the correct answer.

If you had looked at this in a bit more detail, you might have noticed the following: so these

columns here tell you the start and end positions of the sequence alignment here on the protein

and here in the genome. So for example, we have a 600-nucleotide protein that we started

with -- sorry, sorry -- a 735-nucleotide -- 735-amino acid protein that we started with. The alignment

only goes from amino acid 539 to 600 so there's only about 70 amino acids worth of sequence

alignment, which by most standards would not be enough to predict some sort of homology

between one protein and the genome you're comparing it to. Questions?
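
The sanity check described here, comparing the aligned span with the full length of the query, is quick to compute. The sketch below takes the positions 539 and 600 quoted above as 1-based, inclusive coordinates on the 735-amino acid protein (an assumption; the talk does not say whether the end is inclusive), and the function name is made up.

```python
def query_coverage(q_start: int, q_end: int, q_length: int) -> tuple[int, float]:
    """Aligned length and fraction of the query covered by one alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if not 1 <= q_start <= q_end <= q_length:
        raise ValueError("expected 1 <= q_start <= q_end <= q_length")
    aligned = q_end - q_start + 1
    return aligned, aligned / q_length


aligned, fraction = query_coverage(539, 600, 735)
print(f"{aligned} residues aligned, {fraction:.1%} of the protein")
```

Under these assumptions the aligned stretch is a little over 60 residues, well under a tenth of the protein, which is consistent with the conclusion that the hit is not convincing evidence of homology.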

So another thing that you can do at Santa Cruz, it's actually used a lot by the genomics

community, is to add your own custom tracks. So this is basically adding your own data

to the genome browser such that you can see it in the context of other genome browser

data. All you need to do is to format your data in the right format. There's a number

of different acceptable formats but in short, they're all some sort of tab-delimited text

where you would have the chromosome number, a start position and a stop position. If you

wanted to try this yourself, this particular text file is available at this URL down below.

And when you put it into the browser appropriately, you end up with something that looks like

this. So from the red box down is all the normal Santa Cruz data that anybody would

see at Santa Cruz. What's highlighted here in the red box are the data that you added

yourself. So you've created four different tracks, you color-coded them in different

colors and each of your data points is indicated as a little tick mark.
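
As a rough sketch of what such a file can look like, the Python below writes a minimal BED-style custom track: a "track" header line with a name and color, followed by tab-separated chromosome, start and end columns. The track name, color and coordinates are invented for illustration, and the sketch assumes the BED convention of 0-based starts with the end position excluded. It is not the file shown on the slide.

```python
def bed_track(name, color_rgb, features):
    """Return custom-track text: one 'track' header line plus tab-separated BED rows."""
    red, green, blue = color_rgb
    lines = [f'track name="{name}" color={red},{green},{blue}']
    for chrom, start, end in features:
        if start < 0 or end <= start:
            raise ValueError(f"bad interval: {chrom} {start} {end}")
        lines.append(f"{chrom}\t{start}\t{end}")
    return "\n".join(lines) + "\n"


# Hypothetical data points; every coordinate here is made up.
text = bed_track("my_tags", (255, 0, 0), [("chr15", 1000, 1020), ("chr15", 5000, 5020)])
print(text, end="")
```

Each additional track would get its own header line and color, which is how the four color-coded tracks on the slide would be kept apart.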

There are four different ways, basically, of sharing your custom track data or getting

your own data into Santa Cruz. One of them is that you can upload your own data from

your computer. The upside to that is you can see it but it's not available to anybody else. You

can have your top-secret data. You can look at it and it's not going to stay around.

Another option is that you can post your annotation data, that text file I showed you, to your

website and if you create the right type of URL or web link that feeds that data into

the Santa Cruz genome browser, you will have the view that I showed you before as well

and you can then send that link out to other people. You can make a companion page for

a manuscript if you want to display your data. If you want to have a Santa Cruz track, so you

say in your manuscript, look, here's where you can get to them, that's one way to do it.


A third option is to create something called a session which configures your browser with

specific track combinations including custom tracks. This is a great place if you're sort

of at the process of analyzing your own data, you can create a session without making it

publicly accessible and you can come back day after day and look at this data in the

context of the genome browser. Once you decide it's more ready, you can then share that session

with a limited group of people if you so choose. Or if you have data which seems very appropriate

you can actually contribute it to the Santa Cruz genome browser team and they will make

it available for the whole world to see on their website. And there's more information

about how to deal with these custom tracks at the two URLs at the bottom of the screen.

The final thing I want to talk about at Santa Cruz is something called a table browser.

So this is basically a way to get the underlying data out of the Santa Cruz databases. So you're

seeing this nice web display that shows you your data in nice graphical format but underlying

that is a database that contains all your data in text format that can be extracted,

and if you're doing any sort of computer programming you probably want to be able to extract the

data as text and then run your own programs on it. So the table browser is a very powerful

way to do that.

Types of things you can do: you can get back DNA sequence in a track. I'll show you how

to do that. You can calculate the intersections between tracks, so for example, if you want

to find all the SNPs in a particular gene you would intersect the SNP track with the

gene track to pull out the data that's shared between those two tracks. And you can also

filter the track data based on certain criteria so for example, show RefSeq genes that only

contain one exon.
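
To make the intersection and filter ideas concrete, here is a small Python sketch of what the table browser is doing conceptually. It is not the table browser's own code. The gene and SNP records are toy data invented for this example, and the coordinates are assumed to be 0-based with the end excluded.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


class Gene(NamedTuple):
    span: Interval
    exon_count: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def snps_in_genes(snps, genes):
    """Intersect a SNP track with a gene track: (SNP name, gene name) pairs."""
    return [(s.name, g.span.name) for s in snps for g in genes if overlaps(s, g.span)]


def single_exon_genes(genes):
    """Filter a gene track down to the genes with exactly one exon."""
    return [g.span.name for g in genes if g.exon_count == 1]


# Toy data, invented for illustration only.
genes = [
    Gene(Interval("chr15", 1000, 5000, "geneA"), exon_count=4),
    Gene(Interval("chr15", 8000, 9000, "geneB"), exon_count=1),
]
snps = [
    Interval("chr15", 1500, 1501, "snp1"),
    Interval("chr15", 6000, 6001, "snp2"),
    Interval("chr15", 8999, 9000, "snp3"),
]

print(snps_in_genes(snps, genes))
print(single_exon_genes(genes))
```

The nested loop is fine for a handful of records; for whole-genome tracks you would sort the intervals first or let the table browser do the work.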

So an example I'm showing, we want to get all the promoter sequences for all of the

genes in RefSeq. So a couple of slides ago I showed you how to get promoter sequence

for one gene. Here's how you do it for all of them. To make a long story short, you would

select the appropriate tracks, output format as sequence, you go through a couple of selection

screens which look a lot like the screen we saw before where you tell it what kind of

sequence data you want and out the bottom will come a list of promoter sequences. I've

chosen to just show 200 nucleotides here because that way I can fit a couple on the screen

but you probably want something bigger.
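
If you end up with gene coordinates as text and want to work out the promoter intervals yourself, the arithmetic might look like the Python sketch below. It assumes 0-based, end-excluded coordinates and treats the promoter simply as a fixed number of bases upstream of the transcription start; the function name and the example coordinates are made up. Note that for a gene on the minus strand the upstream region lies past the transcript end in genome coordinates.

```python
def promoter_interval(tx_start: int, tx_end: int, strand: str, upstream: int = 200):
    """Half-open genome interval covering `upstream` bases 5' of the transcription start."""
    if strand == "+":
        # Plus strand: transcription begins at tx_start, so look to the left of it.
        return max(0, tx_start - upstream), tx_start
    if strand == "-":
        # Minus strand: transcription begins at tx_end, so look to the right of it.
        return tx_end, tx_end + upstream
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


# Hypothetical transcript coordinates.
print(promoter_interval(1000, 5000, "+"))
print(promoter_interval(1000, 5000, "-"))
```

Sequence pulled for a minus-strand interval would still need to be reverse complemented to read in the direction of transcription.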

So that was all I have for Santa Cruz. Do I have any questions from anybody before I

move on to the next browser? Okay.

So the next browser is Ensembl, which is created by -- it's a collaboration between the EBI

and the Wellcome Trust Sanger Centre and it's based in Cambridge, England.

I should say before I do this that my distinct preference for genome browsers is really the

Santa Cruz genome browser. It's very easy to use, it's user-friendly, it's pretty intuitive

without having to read any documentation. That's my perspective. I'm sure other people

feel very differently. Ensembl is another popular genome browser. I think it has a higher

learning curve but once you get through that learning curve there's a lot of interesting

data and I'm going to go through that now as well.

So here we have the Ensembl homepage. We have links to a number of different genomes; there's

a very long list of genomes available. And then -- excuse me -- also to different things

you can do with those genomes as links across the page. We're going to do a BLAT/BLAST search

by clicking at the link at the top of the page. And in my example here, what I'm doing

is I'm taking a short sequence tag, a 20-nucleotide sequence tag and I want to compare that -- I

want to find its location, where it maps in the human genome. So I paste my sequence in

the box, choose my organism. You can select the search tool so it allows you to do BLAT,

which is going to be quick. It also allows you to do BLASTn, which is going to be more

sensitive. BLAT is probably not going to work for a 20-nucleotide sequence tag because it's

too short for BLAT to handle so you really need to do a BLASTn.

And they have a number of pre-canned parameters. I'm going to pick the one here called near-exact

matches to [unintelligible]. So these are going to set some of the configuration parameters

that Andy told you about last week. It's going to do that for you automatically. One could

argue that may or may not be a good idea, that you may want more control over your BLAST

searches than this gives you, but if you want more control you can always click on the configure

button and you'll see a lot of the same controls that Andy showed you last week with NCBI.

So the search runs for a couple of minutes and you come back with results that look like

this. You get a schematic of all the karyotypes in the human genome showing you the two locations

where that particular sequence hits, one here on chromosome eight and one here on chromosome 15.


Down below, you see details of those match -- matches, both of them hit at 100 percent

identity, so that's good but only one of them hits over the full length. So it says query

start and query end, that tells you where in that 20 nucleotides the sequence alignment

goes. Both of them start at nucleotide one. The first one extends to nucleotide 20; the

second one only goes to nucleotide 17 so that's telling you that only the first 17 nucleotides

in that query sequence align with the genome on chromosome eight. So we're obviously more

interested in a hit on chromosome 15 because you get a full alignment.
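
The same reasoning can be written down as a short filter. The Python below is just a sketch using the two hits described in the talk (both at 100 percent identity, one covering query positions 1 to 20 and the other 1 to 17); the dictionary keys are names chosen for this example, not Ensembl's column names.

```python
def full_length_hits(hits, query_length):
    """Keep only the hits whose alignment spans the whole query sequence."""
    return [
        hit for hit in hits
        if hit["query_start"] == 1 and hit["query_end"] == query_length
    ]


# The two hits from the talk, for a 20-nucleotide query.
hits = [
    {"chrom": "15", "query_start": 1, "query_end": 20, "percent_identity": 100.0},
    {"chrom": "8", "query_start": 1, "query_end": 17, "percent_identity": 100.0},
]

best = full_length_hits(hits, query_length=20)
print([hit["chrom"] for hit in best])
```

Percent identity alone would not separate these two hits, which is why the query start and end columns matter.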

A number of different links that you can choose here. The one you probably want is this thing

called C for contig view because that is going to take you to a view that looks like this.

This looks a lot like Santa Cruz, but yet different.

So let me walk you through it. The display's organized into a couple of sections. So up

here at the top, we have an overview so this is showing you the context of your BLAST hit

which I believe is this red line right here. And it's showing you that there is some stuff

around it but at a very 10,000-foot view. Down below is showing you more details. So

again, here's your BLAST hit and here are some genes. We're hitting the TCF12 gene which

has lots and lots of different transcripts. So each of these transcripts is indicated

by a separate line. The translated exons are the solid boxes. The untranslated exons are

open boxes so that's showing you where the UTRs are compared to the coding sequence.

The reason there are so many different transcripts is that Ensembl uses a variety of ways to

predict genes. The blue guys and the -- sorry, the red guys and the blue guys are the normal

Ensembl transcripts predicted by their annotation pipeline. The red ones are coding. The blue

ones are non-coding. Then there's some shown here in yellow which are called merged Ensembl-HAVANA.

What that means is -- HAVANA is a project to manually curate genes in the human genome

so these yellow ones -- I think they're called gold -- are ones that are both predicted by

Ensembl and then manually curated by HAVANA.

Then there's a third type of somewhat curated data represented up here by these green tracks.

This is something called a CCDS set or the consensus coding sequence set. The logo is

right here. So those are -- the CCDS project is a joint collaboration between a couple

of different genome centers to create a good set of protein coding genes. So you'll notice

there's no untranslated sequence here, just the translated exons. So that's another source

of well annotated transcripts in addition to the RefSeqs and in addition to these gold

Ensembl-HAVANA transcripts. It gets a bit confusing, I do admit.

There is this blue line down here at the bottom and that's showing you the contig to which

these things have been mapped, so the human chromosome is still in some sense divided

into individual contigs or individual long chunks of sequence that all make up a chromosome.

Any transcripts that are shown above the blue line are pointing in this direction, the normal

direction from left to right. Transcripts that are shown below the blue line like this

guy here, are pointing in the opposite direction. A bit confusing, but that's the way they show it.



Male Speaker: The thicker green line on top is what? As

opposed to the --

Tyra Wolfsberg: I'm not sure, to tell you the truth. It's

not part of the CCDS and I can't read from here what it says.

Male Speaker: It says --

Tyra Wolfsberg: It says --

Male Speaker: [inaudible]

Tyra Wolfsberg: Yeah, I'm honestly not sure but if you want

to come down afterwards we can look at -- we can look into that one together. Any other questions?


If you want to add tracks to this view, you do that by clicking on configure this page

and that brings up a window where there's a long list of tracks sort of like what you

saw at Santa Cruz but perhaps not as well -- well, perhaps not as intuitive as to what's

going on. If you click on one of these track categories such as dbSNP, you get a list of

types of dbSNP data that they have. We're going to turn on the dbSNP variants by clicking

on the box and you get this sort of weird splotchy display that's supposed to show you

the display is now turned on. And when you go back to your viewer, you've added on this

SNP track down here with a variety of SNPs with different colors indicating different

types of SNPs which I'll show you in just a minute.

To navigate in the browser, you would use these links up here. You can zoom in, you

can zoom out. You can move right, you can move left. We're going to move to the right

a little bit and go into this exon right here which is now a completely translated exon

and here are SNPs down below again.

So I've brought up the SNP color coding legend. At least for these SNPs we have blue ones

that are introns. Not surprising, this is a big intronic region right here. We have

green that are synonymous and yellow that are non-synonymous that you're probably having

a hard time seeing.

If you click on any of these SNPs you will bring up a little pop-up menu that looks like

this. It gives you a bit more detail about this particular SNP and it also allows

you to click on this link up here which takes you to more information about that particular

SNP. What I have clicked on is the yellow, non-synonymous SNP and down here -- I'm sorry,

I clicked on the green synonymous SNP so you could actually see it. It tells you it's synonymous

and gives you some more information here. So if you click on this link right here, you

get to a page that gives you various links to various different properties of the SNP.

So there's eight possible different views that Ensembl will give you of this particular

SNP. I'm not going to go through all eight of them; I'm just going to focus on the two

that I thought might be the most interesting. Of course the background information, what

type of SNP this is, is always present on each page so you know it's an A/G SNP. And

this was a synonymous SNP so there's no -- there's no protein change in this particular view.

Up here, we see -- sorry -- I clicked on this link here called genomic context. And what

that brings up is a page that looks like this showing you a very nice picture of this

region of the genome with the two exons and all the SNPs down below color coded. If you

ever needed to make a figure of the SNPs in a particular gene this might possibly be a

place to come.

Another link you can click on is this one here called population genetics. This is going

to show you the distribution of that SNP in different populations. So that's this link

down here. So again, this is our A/G SNP and this is showing you -- it's just the frequency

distribution in different populations. CH -- CHB plus JPT, that's the Han Chinese and

Japanese populations from the 1000 Genomes project which Eric Green talked -- touched

briefly on two weeks ago. In this particular population, As are represented at 70 percent.

The G allele is seen about 30 percent of the time. That's a contrast with the YRI population.

This is -- these are the Yoruban ethnic group from Nigeria. They have a much higher frequency

of As than do the Asian populations and you can get this information for any of the SNPs.

Something -- other types of things you might want to get out of Ensembl: if we go back

to this particular view, where we're seeing all of our transcripts, if you click on any

of these transcripts you get a pop-up menu that allows you to do a variety of different

choices. You can link to the transcript. You can link to the gene. You can link to the

protein product. So we are going to explore more about the TCF12 gene by clicking on this

gene link and that opens a page that looks like this.

To orient you, all of the Ensembl pages are organized into tabs. So we were earlier

on the location tab because this is showing you the overall genomic context of what we're

looking at. We're now in something called the gene tab which shows you information about

this particular gene and there's also a transcript tab which I'll touch on a bit more later.

So this gene is TCF12. At Ensembl it has this accession number. Let me just digress a minute

about the Ensembl style accession numbers. They all start with the letters ENS and then

they have a gene, a G for gene, a T for transcript or a P for protein. So that's how you recognize

whether it's a gene, a transcript or a protein. They also give you the species information,

so by default, human is the original Ensembl annotation. Human doesn't have any species

orientation, any species information. But if you were looking at mouse genes they would

be ENS for Ensembl, MUS for mouse and then G for gene, T for transcript, P for protein.

So if you're smart about it you can sort of figure out what type of Ensembl product you're

looking at. So this is just one gene, the TCF12 gene,

and as I said earlier, this gene has a lot of transcripts numbered from one down to 20

and I think there's even more going off the screen. Each of these transcripts gets its

own identifier because each one of them is going to have a slightly different splice

form. And a good number of these have corresponding protein products although some of them are

just processed transcripts that aren't actually translated.
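
The naming scheme for Ensembl accession numbers described above can be decoded with a small regular expression. The Python below is a sketch under simplifying assumptions: it only knows the three species mentioned in this talk (no prefix for human, MUS for mouse, DAR for zebrafish), it assumes a three-letter species prefix, and the numeric parts of the example identifiers are placeholders, not real accessions.

```python
import re

# ENS, an optional three-letter species prefix, a feature letter, then digits.
ENSEMBL_ID = re.compile(r"^ENS([A-Z]{3})?([GTP])(\d+)$")

# Only the species mentioned in the talk; None means no prefix, which is human.
SPECIES = {None: "human", "MUS": "mouse", "DAR": "zebrafish"}
FEATURE = {"G": "gene", "T": "transcript", "P": "protein"}


def describe_ensembl_id(identifier: str):
    """Return (species, feature type) for an Ensembl-style accession number."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an Ensembl-style identifier: {identifier!r}")
    prefix, feature, _digits = match.groups()
    species = SPECIES.get(prefix, f"unknown species prefix {prefix}")
    return species, FEATURE[feature]


# Placeholder identifiers; the digits are made up.
for example in ("ENSG00000000001", "ENSMUST00000000001", "ENSDARP00000000001"):
    print(example, describe_ensembl_id(example))
```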

There's a variety of things that you can do from the Ensembl gene tab, one of them is

look for orthologs of this gene. So these are homologies that are automatically calculated

by Ensembl and they would link to homologies in all the different organisms that Ensembl

has genomes for and supports. So for example, just off the top of the list we have links

to alpaca, a lizard, armadillo, a bush baby and the list goes on. So if you don't want

to do any BLAST searches yourself but you want to very quickly get homology information

from the gene that you're looking at to a different species, this is one place to come.

Female Speaker: Can you explain the [inaudible]?

Tyra Wolfsberg: I'm sorry, what was that?

Female Speaker: What the difference between targeting [inaudible]?

Tyra Wolfsberg: So the question is, what's the difference

between the target and the query? Is that what you're asking? And where are you seeing that?


Female Speaker: Well it's on the card, the two percentages

that they give you.

Tyra Wolfsberg: That's a very good question. So the question

is, what are these two percentages? There's a target percent ID and a query percent ID

and why are they different from each other? I don't know the answer to that one, either,

but I'll give you the same answer I offered the other gentleman which is if you want to

come down afterwards and we can discuss it after my talk and try to get the information.

Sorry. Anybody else want to stump me while I'm on a roll here?


Okay. So going back to the view that we had -- oops, wrong direction -- I'm not going

to go through all the different possible things that we can do but one other thing that -- hello

-- one other display that I think is nice is this thing called the variation image.

So what that gives you is, again, another pretty picture that you could potentially use in

a manuscript and that's showing the locations of all the variants in the gene. So what we

have here is a blown-up view of all the potential exons in use by the TCF12 gene and each of

these exons is shown by a vertical brown bar. And it goes through transcript by transcript

-- this display goes through transcript by transcript showing you all the possible -- all

the possible transcripts. So this is the first transcript here and this is showing you the

exons in use by this transcript. So you can see there it uses some of them indicated by

these solid rectangles. Other exons are skipped in this transcript, indicated just by a horizontal

bar and then in each one of these exons it's showing you what variants are -- have been

documented. So again, potentially a nice display. You also get down here some PROSITE profiles

and other protein domains that you will be learning about next week. Yes.

Okay. So as I mentioned, we were on the gene tab. If we skip to the transcript tab for

one particular transcript it looks very similar to the gene tab, although if you have very

good eyes, you'll notice the list of things on the left that you can do is a bit different.

One thing I want to show you is supporting evidence. So each of the transcripts in Ensembl

was predicted in some way using the Ensembl pipeline which is using some computational

methods and also using some transcript alignment data. So here's how you get the transcript

data. On the supporting evidence tab, is -- again, the transcript is displayed, all the exons

of that transcript are displayed up here. And down here is a long laundry list of all

the transcripts that went into annotating this particular gene. The place -- here are

the accession numbers of each transcript and the place where you see either the green or

yellow boxes are the places where that particular transcript had an exon. So if you focus in

on this third coding exon right here, you'll see it's actually quite rare in the transcript

data. It's present in this yellow transcript right here. The accession number starts with

the letter B and I can't read beyond that, repeated here and then it's present in one

-- I'm in the right column here -- one other transcript right here. So that transcript

isn't -- that particular exon in this predicted transcript doesn't have a whole lot of supporting

evidence. So if you were looking at this transcript in detail, you might or might not actually

believe this particular transcript just because that exon doesn't have much biological data

supporting it.

Something else I want to point out at Ensembl is the protein sequence for this particular

transcript. So if you click on the protein link, that will take you to the translated

protein sequence which is nicely color coded. Each exon is a different color and the splice

junction -- the splice sites, the junctions between the exons and the introns is shown

in red.

While I'm in this view, I want to point out the link called view in archive site. So as

I mentioned at the very beginning, you can always get to old Ensembl data and at the

bottom of every Ensembl page is this link called view in archive site which brings up

a page like this. So you can choose from a variety of different Ensembl genome builds

going back to August 2007. So I need to make a distinction between Ensembl

versions and genome assemblies. So remember I said way back in the beginning, the human

genome, the mouse genome, whatever genome you name it, are assembled every couple of

years. So the old -- some of the older data from 2007 up to May of 2009 is annotated on

NCBI36, that is NCBI build 36, or the previous version of the human genome. The later data

is all coming on this other assembly called GRCh37, which is the newer assembly. It's

very cryptic; if you didn't know this it probably wouldn't be at all obvious.

Now on a given genome assembly, Ensembl updates all their annotations every couple months

and these annotation sets get sequentially numbered. So Ensembl 46 was back in August

of 2007, so that was an annotation set on NCBI build 36. They sequentially number these

annotation sets up to -- the current one is 65 so the previous one is 64. So what I want

to point out is, this group of Ensembl annotations from 46 up to 54, those are all based on one genome

assembly so the underlying data are the same but the annotations are different because

they perhaps updated their computer algorithms. So even though the bottom-line genome assembly

hasn't changed, the annotations may be different and there may be reasons why you want to go

back to an older annotation because perhaps you think the gene prediction in that particular

region of the genome was better a year ago than it is today. So bottom line, you can always

get back to it through this archive page.
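
The distinction between Ensembl versions and genome assemblies can be captured in a small lookup. The Python sketch below encodes only what is stated in this talk for human: versions 46 through 54 sit on NCBI36, and later versions up to the then-current 65 sit on GRCh37. Anything outside that range is deliberately rejected, because the talk says nothing about it.

```python
def human_assembly_for_release(release: int) -> str:
    """Map an Ensembl version number to the human assembly it annotates (per this talk)."""
    if 46 <= release <= 54:
        return "NCBI36"
    if 55 <= release <= 65:
        return "GRCh37"
    raise ValueError(f"release {release} is outside the range covered in the talk (46-65)")


for release in (46, 54, 55, 65):
    print(release, human_assembly_for_release(release))
```

Two versions that map to the same assembly share coordinates but can still differ in their gene annotations.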

Okay. So the next thing we want to do on Ensembl is try to find a human -- a chicken homolog

of our human protein. So remember, we tried this using BLAT at Santa Cruz and it did not

work so let's try this at Ensembl using BLAST. The interface is similar to what you saw before

but we're pasting in a protein sequence, we are going to go against the chicken genome

and we're running a version of BLAST called TBLASTn, which takes a protein sequence and

compares it to a translated genome sequence. And you get back a lot more results than you

did by BLAT.

If you're -- now that you've learned a little bit about what these results may look like,

you can study this page before you actually look at the alignment data. So if you'll notice,

the top three hits all have fairly decent length alignment. So for example, on the first

hit the alignment starts at amino acid four of the protein and goes to amino acid 600

and some odd. The second one starts at amino acid -- I think it's six and then goes to

amino acid 500 and some odd so they're decent length alignments, which is good. Remember

when we were back in the BLAT view it was only a 70-amino acid alignment. And if we

look at one of these particular alignments in greater detail by clicking on the A link,

you get a page that looks like this. There is not -- which I will just say is actually

a fairly decent alignment between human and chicken. There's not a lot of conserved sequence

but you wouldn't expect human and chicken to be all that similar to each other. In this

particular example they're about 30 percent identical.

So we have our query sequence up on top. This is a human protein sequence. Our translated

chicken genome sequence is down below and anyplace that you see a capital letter in the space

in between, that represents a position at which the sequences are identical. If you see a plus

sign you have conservative substitutions. So bottom line, BLAST works a lot better than

BLAT if you're trying to go between species.
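
Reading the middle line of a BLAST alignment this way is easy to automate. The Python below is a sketch that counts letters as identities and plus signs as conservative substitutions; the three-line alignment is a toy example invented for illustration, not the human-chicken alignment from the slide, and the sketch assumes the middle line has not had its spaces trimmed.

```python
def midline_stats(midline: str):
    """Count identities (letters) and conservative substitutions ('+') in a BLAST midline."""
    identical = sum(1 for symbol in midline if symbol.isalpha())
    conservative = midline.count("+")
    fraction_identical = identical / len(midline)
    return identical, conservative, fraction_identical


# Toy alignment, made up for this example.
query = "MKTAYIAKQR"
midline = "MK A+IA QR"
subject = "MKSAFIAHQR"

identical, conservative, fraction = midline_stats(midline)
print(f"{identical} identical, {conservative} conservative, {fraction:.0%} identity")
```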

The final thing I want to touch on on Ensembl is using a search engine or a database called

-- a database interface called BioMart which is a wonderfully -- wonderful resource for

cross referencing data from different sources. I find that people don't really know about

this but it's really a nice thing, a nice tool to add to your arsenal. So in this particular

example, we're going to use BioMart to start with the list of Ensembl gene identifiers

shown here. And we want to pull out their genomic coordinates, the gene symbol, as well

as the RefSeq accession. So we're going to cross link between Ensembl and NCBI's RefSeq.

So you basically first need to choose your database, which is -- these are zebrafish

genes, Danio rerio, so we're going to choose zebrafish. We paste in our gene identifiers.

So remember the key that I told you earlier: ENS is Ensembl, G is a gene so these are all

gene identifiers. The DAR stands for Danio rerio so these are zebrafish gene identifiers.

We paste them in the box and then we -- so what's a bit confusing is you put your input

into this section here called filters and there's a long list of different types of

filters that you can use. I'm filtering based on gene IDs. Then to select what you want

to get out of the process, you need to select these attributes shown on the next screen.

The attributes that are always on by default are the Ensembl gene and transcript IDs. These

are always there, although you can take them off.

And then you can add additional attributes. So I'm adding on the chromosome name, the

gene start and stop, as well as the associated gene name and those are all found under this

section called features. You can also go to a section further down the page called external

references. So this is how Ensembl has correlated itself with other databases. In our case,

the database that we're interested in is the NCBI data so we're telling it we want the

RefSeq mRNA and the RefSeq mRNA predicted.

So all of those things together, you click on the link up at the top which is sort of

cut off, called results, and you get back a page that looks like this. So we have our

starting Ensembl gene identifier from zebrafish. We get back the transcript identifier. Each

gene has multiple transcripts which is why you see the genes repeated with unique -- different

unique transcript identifiers. You get the chromosome to which each one maps, the gene's

start and end -- and this would be the transcript start and ends, the different transcripts.

Actually, I take that back. These are the gene's start and end, the gene name, as well

as the RefSeq accessions where those are available. So you get -- for some of these you are getting

an NM RefSeq, which again are the nicely curated RefSeqs that are coming from GenBank sequences.

In some cases you're getting XM RefSeqs, which are the ones coming out of the gene prediction

pipeline at NCBI. So not all of the Ensembl genes are going to map to NCBI. And conversely,

if you were to perform a similar search at NCBI -- I'm not quite sure how you'd do that

-- but if you were to do a similar mapping, not all of the NCBI RefSeqs would map to an

Ensembl identifier, making my point again that the different groups are doing their

annotations independently and you're going to get different data depending on where you look.
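
For anyone who would rather script this than click through the web pages, BioMart also accepts queries written as a small XML document with the same three parts: a dataset, filters and attributes. The Python below only builds such a document and does not send it anywhere. The dataset name and the attribute and filter names are assumptions about BioMart's internal naming and may differ between Ensembl versions, and the gene identifiers are placeholders.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, gene_ids, attributes):
    """Build a BioMart-style XML query: one dataset, one gene ID filter, several attributes."""
    query = ET.Element(
        "Query",
        {"virtualSchemaName": "default", "formatter": "TSV", "header": "1", "uniqueRows": "1"},
    )
    dataset_node = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters are the input: here, a comma-separated list of Ensembl gene IDs.
    ET.SubElement(dataset_node, "Filter", {"name": "ensembl_gene_id", "value": ",".join(gene_ids)})
    # Attributes are the output columns.
    for attribute in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": attribute})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


# Assumed names (not confirmed by the talk): zebrafish dataset and column names.
xml_text = build_biomart_query(
    "drerio_gene_ensembl",
    ["ENSDARG00000000001", "ENSDARG00000000002"],  # placeholder IDs
    [
        "ensembl_gene_id",
        "ensembl_transcript_id",
        "chromosome_name",
        "start_position",
        "end_position",
        "refseq_mrna",
        "refseq_mrna_predicted",
    ],
)
print(xml_text)
```

The split between filters (what you put in) and attributes (what you get out) mirrors the web interface described above.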


Anecdotally, I would say that Ensembl is a much better source of gene annotations for

zebrafish than is NCBI. Ensembl has a much more active zebrafish annotation program

than NCBI.

A second quick BioMart example is, we're going to start with that same list of zebrafish

gene IDs and say you want to get back the IDs of the human -- the predicted human orthologs

of those genes. We would go click on the -- put in the data the same way we did before, click

on the homologs link, select the organism for which you want homologs -- the alphabetical

list starts with Atlantic cod and goes all the way down to -- I don't know what's at

the bottom but human is somewhere in the middle. You say you want the gene and protein IDs

from human, the percent identity and you get back the predicted orthologs right there.

You'll notice that not all of the genes have human orthologs so this particular one, there's

no calculated ortholog. So again, it's a very handy way of very quickly getting data from

different resources all integrated together.

Questions? Yes.

Female Speaker: Is there a way to search [inaudible] number


Tyra Wolfsberg: So you want to start with the RefSeq number

and get out an ENS number, I'm pretty sure that there is. Under the filter option, I

think there's a way to choose external identifiers and you would paste in your RefSeq accession

numbers and then do the converse when you get to the attributes. You would export your

Ensembl links. If you want to come talk to me afterwards, again, we can work through

this for sure but I'm pretty sure I've done that.

Any other questions? I'm glad I could actually answer one. Does anybody else have a nice question?



Okay. So moving on to NCBI. I'm going to go through this fairly quickly. I will say the

-- I'm hoping there's not too many NCBI people in the audience -- the NCBI browser is by

far my least favorite. I'm using this -- I'll show you in very brief detail how to use NCBI

but I'm also using it as a way to highlight some other NCBI resources that I think you

should be familiar with.

So this is the NCBI map viewer homepage. A whole lot of organisms available. You choose

the organism you're searching in and our search that we're doing today is looking for a region

between two SNPs. So you've done -- so say you've done some sort of association study.

You've narrowed down your critical region to a region between these two SNPs and you

want to know what genes are in that region. So you do your search. You get back that you

hit -- these two SNPs hit on the short arm of chromosome eight. And they hit two different

genome assemblies. This is something we haven't encountered at either Santa Cruz or at Ensembl.

So I should say these are two different human genome assemblies. This one here called reference,

that's the normal genome assembly, the build 37 of the genome or hg19 or whatever your

naming system is.

This other assembly down here is called HuRef primary assembly and if you Google this you

will find out this is actually Craig Venter's genome. So Craig Venter, for those of you

too young to remember all the controversies, Craig Venter was the founder of Celera Genomics

that did the privately-funded human genome project. He's decided to make his personal

genome available in the genome browser. This reference genome -- I didn't say this earlier

-- this is a composite of about 20 or so different people's genomes so it doesn't represent one

person. This one does.

We're going to stick with the conventional reference assembly and we're going to look

at all the matches on chromosome eight for these two SNPs. You come up with a display

that looks like this, which is sort of similar to what you're seeing in Santa Cruz and Ensembl,

hopefully becoming somewhat familiar to you, although at this point the display is organized

vertically instead of horizontally.

Over here in our variation track we have a list of SNPs. The one that we started with

is up at the top. We queried for two SNPs. The first one is up here at the top; the second

one would be off the screen at the bottom.

Other tracks that we have, we have a UniGene track which represents a synthesis of EST

sequences. I'm not completely sure why that's shown here because I don't find it to be an

overly useful track. The track I think you ought to be more interested in is the one

here called genes on sequence, which are the -- basically the NCBI gene predictions.

So let's try to change the track display a bit. You would click on the maps and options

button on the left which brings up a screen that looks like this. For those of you who've

used the NCBI map viewer in the past, you will be aware that this is a newer version of the

maps and options box. It used to look very different but the function is basically the

same. Over here on the right you have a list of the displayed tracks. Over here on the

left you have a list of the tracks you can add to your display. I used to have the UniGene

track up here because that was on by default. I clicked -- there was a minus button next

to it -- I clicked on the minus; it went away.

If I wanted to add tracks, I would click on their names over here. So for example you

could add the Ensembl genes or Ensembl transcripts to your view at NCBI if you wanted to. You

can also reorder your tracks so, here, by default because I searched for SNPs, the SNP

track or the variation track is showing up on the right. That's sometimes called the

master map and that has the most detail available for -- that particular map has the most detail

-- I'll go into that a bit later.

I'm switching things around. Rather than have the rightmost map be the variation track,

I want it to be the gene map and I just did that by dragging the maps with respect to

each other. So my new view looks like this. I've gotten rid of the UniGene track and I've

moved the gene track over to the right.

So you'll notice when the gene track is on the right, the master map, you've got a lot of

links to NCBI resources. So let's explain the -- explore some of those in a bit more

detail. We're going to look at what happens when you click on the gene symbol, what happens

when you click on the OMIM link and what happens when you click on the hm link.

So the gene symbol first: this takes you to an NCBI resource called Entrez Gene which

I hope that you're all familiar with and if you're not, I suggest you look into it. This

is a great curated catalog of all information available for a particular gene. So you get

-- we're looking now at the ERLIN2 gene -- ERLIN2 gene. You have links to some resources. You

have usually -- I think I've cut it off down here -- you'll have a description of what

this particular gene does which is manually curated, manually written by NCBI staff. And

you have what I find the most useful is a link to the RefSeqs for this particular transcript.

So rather than searching Entrez Nucleotide to find these -- to find the RefSeqs -- you

can come directly to Entrez Gene and they're right here. You have the NM transcripts, the

corresponding NP proteins, if there are multiple isoforms, they all show up. And there's a

long list of other features that are available in here as well which I'm not going to go

through but I encourage you to look at on your own to see all the information that NCBI

has brought together.


Female Speaker: [inaudible]

Tyra Wolfsberg: The NC number -- another one I can answer

-- down here -- so that is the reference -- the RefSeq accession for the chromosome. So this

is NC underscore, a bunch of zeros and an eight. That is human chromosome number eight.

So the chromosomes are NCs. There's scaffold -- contigs or scaffolds, I can't remember

which -- are called NTs, NT underscore and a string of numbers. You'll sometimes see

those as well. But it's nice to have chromosomes because the coordinates that you get are then

chromosomal coordinates that start with nucleotide one and go to the very end of the chromosome.
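
[Editor's illustration] The accession prefixes mentioned in this answer and just before it (NC for chromosomes, NT for contigs or scaffolds, NM for transcripts, NP for proteins) can be told apart programmatically. The Python sketch below is only an illustration: the function name and the lookup table are hypothetical, and the table covers only the prefixes named in the talk, not the full RefSeq scheme.

```python
import re

# Prefix meanings as given in the talk. NT_ is listed as "contig or scaffold"
# because the speaker was unsure which term applies.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NT": "genomic contig or scaffold",
    "NM": "mRNA transcript",
    "NP": "protein",
}

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_refseq(accession: str) -> str:
    """Return the kind of record a RefSeq-style accession points to."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    prefix = match.group(1)
    return REFSEQ_PREFIXES.get(prefix, "other RefSeq record type")


print(describe_refseq("NC_000008"))  # the chromosome 8 accession from the talk
```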

I don't know who asked the question but did I answer it? Yes, another one? Okay. Sorry,

it's very bright down here; I can't actually see anybody.

Okay. So another thing you can link to is the OMIM record for this particular entry

so this is the OMIM display for this gene. OMIM is Online Mendelian Inheritance in Man

that I believe is still handled by Hopkins.

Male Speaker: Yes.

Tyra Wolfsberg: Yes. It is manually curated by staff at Johns

Hopkins University who bring -- read the literature and then bring in lots of information from

the literature, specifically medically-oriented information about a particular gene. So you'll

see right off the bat, this particular gene is associated with the phenotype spastic paraplegia

18, I think that says. And there's links to other things here as well.

I hesitate about where this was based. It used to be that OMIM did the annotations and

it was displayed at NCBI. OMIM has recently taken back the handling of the online display

so it looks a bit different from what you might have seen before but I believe the data

are the same as what they were a couple months ago. So this could be a great place to come

if you just want a synthesis of the literature about a particular gene.

Another link that was available off the map viewer page, the link called hm, goes to something

called HomoloGene. So this is NCBI's tool to find predicted homologs between the protein

that you started with and other proteins in the NCBI databases. So here we're getting

predicted homologs between the human gene, mouse, dog, zebrafish, worm, etc. And further

down the page you get more details. Up here you're getting links to the accession numbers.

So this is sort of similar to what you saw at Ensembl for the ortholog prediction. NCBI

doesn't have as extensive a list of species that they're doing their ortholog or homolog

comparison between, so it's going to be a shorter list.

Okay. So how do we navigate around NCBI? You can zoom in and out using this zoom control

over on the left or you can put your mouse over a particular track and click on it and

it will allow you to zoom in or zoom out by a specific amount. I want to zoom in on one

particular region of this ERLIN2 gene so I'm going to show 100K, which is going to zoom

the display in, and I didn't show it here, after I did that I did another zoom in, 4X,

just to really zoom in. And I ended up with a display that looked like this, where we're

focusing in on two exons of ERLIN2. Here's one of them. Here is the other one.

The direction of transcription of this gene is rather cryptically noted by this little

arrowhead next to the gene that shows you the gene is pointing down or if it were sideways

it would be pointing left to right, so the traditional direction. And then over here,

you can see we have all of our variants and you're now seeing individual variants rather

than a smear that you were seeing at a higher -- a higher, more expanded resolution.

If you want to change the order of the maps, if you want to now bring this variation map

over to the right to see more details, you have two options. One is to go back into the

maps and options and change the master map there. Another one is to click on the arrowhead

next to the map name and it will jump over to the right which we shall do. And now we're

seeing the variants listed in greater detail.

So these are each of the SNPs. They have little lines going from the accession number to the

actual position at which they appear on the genome so you don't get confused just because

the accession number is right here. It doesn't mean it corresponds to a SNP right here. There's

actually an arrowhead that's pointing down a little bit further down the chromosome.

Next to each of these SNPs you have an L, a T and a C. The L tells you that the SNP

is within a locus or within a gene. The T tells you that it is within a transcript.

What they mean by transcript is within an exon. So you can be on an intron and you'll

get the L button lighting up but not the T button. And the C tells you that it's inside

a coding sequence. So most of the SNPs within transcripts are going to be within coding

sequence unless you're in the five prime or three prime untranslated region.
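
[Editor's illustration] The L, T and C indicators follow a simple nesting rule: L if the SNP falls anywhere in the gene, T if it falls in an exon, and C if it falls in the coding part of an exon. The Python sketch below shows that logic under stated assumptions. The function name, the way the coding region is passed in (as a single start/end pair in genomic coordinates) and all coordinates are hypothetical, and this is not how NCBI computes the flags internally.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def _inside(pos: int, intervals: List[Interval]) -> bool:
    return any(start <= pos <= end for start, end in intervals)


def ltc_flags(pos: int, gene: Interval, exons: List[Interval], cds: Interval) -> str:
    """Return a string made of 'L', 'T' and 'C' for a SNP at genomic position pos."""
    flags = ""
    if _inside(pos, [gene]):
        flags += "L"  # within the locus (gene)
    in_exon = _inside(pos, exons)
    if in_exon:
        flags += "T"  # within an exon of the transcript
    if in_exon and cds[0] <= pos <= cds[1]:
        flags += "C"  # exonic and between the start and stop codons
    return flags


# Hypothetical gene: three exons, with UTR sequence at both ends.
gene = (100, 1000)
exons = [(100, 200), (400, 500), (800, 1000)]
cds = (150, 900)

for snp_pos in (120, 300, 450, 950):
    print(snp_pos, ltc_flags(snp_pos, gene, exons, cds))
```

In this toy gene, position 300 is intronic and gets only L, while positions 120 and 950 sit in the UTRs and get L and T but not C.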

To get more detail about a particular SNP, you would click on its accession number and

that takes you to NCBI's dbSNP. I'm just showing some highlights from that here. Up at the

top we get more information about a particular allele, the nomenclature that's used in different

transcripts and I've scrolled way down and here's sort of the meat of what you want.

It's a missense mutation. Missense means non-synonymous, meaning the SNP is changing the sequence of

a protein. You're changing an A to a G, showing you the two codons here and here showing you

the actual amino acid change and then here showing a slightly different view of what

that looks like in its genomic context.
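
[Editor's illustration] Whether a coding SNP is missense can be checked by translating the reference and variant codons and comparing the amino acids. The Python sketch below uses the standard genetic code. The function name is hypothetical and the example codons are chosen for illustration, not taken from the dbSNP record on the slide.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (third base varies fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # the change introduces a stop codon
    if ref_aa == "*":
        return "stop-lost"  # the change removes a stop codon
    return "missense"       # a different amino acid is encoded


print(classify_coding_snp("ATG", "GTG"))  # A to G in the first position: Met to Val
print(classify_coding_snp("AAA", "AAG"))  # A to G in the third position: Lys to Lys
```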

So I'm basically done. I'll point you to another couple of sources of online help. All the

genome browsers have various tutorials and help documents online which are available

here. And there's also -- through Current Protocols in Bioinformatics, there is a chapter

on each of these genome browsers available. If you are logged in through your -- on the

NIH network, you can get to these units for free using the URL down here. You can also

get to them through the NIH library although that link is currently broken so I'm using

this link directly to the Current Protocols website.

So I'm happy to take any questions although before I do that I should announce that the

next lecture is given by Andy Baxevanis called Biological Sequence Analysis II and that will

be held here next week. Thank you.

