
Concerted evolution in SM50, a gene with unusual repeat structure


Material Information

Title:
Concerted evolution in SM50, a gene with unusual repeat structure
Physical Description:
Book
Language:
English
Creator:
Hussain, Sofia
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
Molecular evolution
Sea urchin
DNA
Spicule matrix genes
Neutral evolution
Dissertations, Academic -- Biology -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: Genes present in multiple copies and genes that contain regions of repetitive sequences can undergo concerted evolution, which results in homogenization of the nucleotide sequence of the genes or repetitive regions. In regions of tandem repeats, this occurs through misalignment of repeat units followed by unequal crossover, which generates two products with differing numbers of repeat units. Gene conversion is thought to lead to one of these products becoming fixed in a species. The homogeneous sequence of previously studied genes that have been thought to undergo this process has made it difficult to determine the exact models involved. Here I examine concerted evolution in SM50, a sea urchin gene that encodes a protein involved in biomineralization. The repetitive region in the SM50 gene varies in length between species, and there is variability in each repeat unit as well. I examine the codon usage in SM50 in a variety of species, and discuss how purifying selection, substitutions, concerted evolution, and selection at the level of DNA sequence have played a role in the evolution of this gene. I also examine the structure and sequence of the repeat units, and propose models that have led to the evolution of the repeat pattern seen in the different species examined. Finally, I have found variation in the number of repeat units within several species. This has allowed us to deduce the specific models of unequal crossover that led to this variation. The unique variation in the repetitive region of SM50 has enabled us to describe a model of how substitutions affect misalignment and unequal crossover.
Thesis:
Thesis (M.S.)--University of South Florida, 2005.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Sofia Hussain.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 124 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001709521
oclc - 68903378
usfldc doi - E14-SFE0001401
usfldc handle - e14.1401
System ID:
SFS0025721:00001




Full Text

PAGE 1

Concerted Evolution in SM50, a Gene with Unusual Repeat Structure

by

Sofia Hussain

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science
Department of Biology
College of Arts and Sciences
University of South Florida

Major Professor: Brian T. Livingston, Ph.D.
Stephen A. Karl, Ph.D.
James R. Garey, Ph.D.

Date of Approval: September 2nd, 2005

Keywords: molecular evolution, sea urchin, DNA, spicule matrix genes, neutral evolution

Copyright 2005, Sofia Hussain

PAGE 2

Acknowledgments

There are a lot of people I need to thank, and I shall endeavor to do my best to name them all here. First, I would like to thank all the people who helped me collect sea urchin samples. Ocean Fresh in Ft. Bragg, CA allowed me to visit my first sea urchin fishery and collect sea urchin samples directly off the assembly line. Dr. Christiana Biermann allowed me to share her DNA samples from Norway and helped me collect samples from Friday Harbor. Dr. Fred Wilt, Dr. Ron Burton, and Dr. Vic Vacquier also sent me samples from their labs. Bill Dent and Dr. Dave Duggins guided me through all scientific diving protocols while collecting using SCUBA. I wish to thank all my dive partners too, especially Jen Rhora, who was out there with me in the gulf almost every time.

Second, I would like to thank all the places that gave me scholarships to attend meetings or conduct research. This includes the Blinks Fellowship at Friday Harbor Laboratories, SICB for student support at meetings, and the BGSO, GPSO, and the Biology Department at USF for money to be sure I arrived and was fed at those meetings.

Third, I would humbly like to thank all my professors for their guidance and understanding when I was lost or frustrated, and for giving me a good kick when I was a slacker. Without their mentorship I doubt I could have made it through this process.

Fourth, I would like to thank all my friends for being patient with me when things got rough.

Fifth, I would like to thank my lab mates who taught me everything, especially Rae Reuille, who taught me how to do lab work, Mary Harmon for laughing, and Gio DeSilva for singing. Also the Karl lab and the Garey lab for making me feel welcome when our lab had just moved to USF, and the graduate student community at USF for sponsoring activities. And my SCA fencing crewe for allowing me to beat up on them when I just needed some violence in my life.

And, finally, my whole family for supporting me even when they didn't understand exactly what I was doing... especially my dad, who recently in a phone call reminded me RNA is a cool thing...

PAGE 3

Table of Contents

List of Tables iii
List of Figures iv
Abstract vi
Introduction 1
  Our study of SM50 1
  The theory of neutral evolution 3
  The theory of concerted evolution 5
  Mechanisms and effects of molecular drive 6
  Analysis of codon usage frequencies can illustrate neutral and selective forces 9
  Additional forces of selection at the codon usage level are plentiful 10
  Spicule matrix proteins 13
  The sea urchins in this study 18
Chapter One: Codon usage analysis suggests concerted evolution, substitutions, and selection influence the evolutionary history of the SM50 repeat array in various Strongylocentrotidae and Lytechinus sea urchins. 25
  Introduction 25
  Materials and Methods 26
    Species utilized 26
    Genomic DNA isolation 26
    Polymerase Chain Reaction (PCR) and cloning 28
    Sequencing 28
    Total codon usage frequencies 29
    Total codon usage frequencies by position in the SM50 repeat 29
    tRNA frequencies 30
    GC Content 30
    Altered mRNA sequences 30
  Results 31
    Codon usage frequencies of the WG are similar to the CLD 31
    Codon usage frequencies of the SM50 repeat array are different than the WG and the CLD 32
    Codon bias by position further illustrates evidence of concerted evolution 35
    Analysis of the most frequent codon (by amino acid and by position) illustrates clade-specific patterns 36

PAGE 4

    tRNA frequencies 38
    GC content 39
    mRNA secondary structure 39
  Discussion 41
    There is evidence of concerted evolution influencing the synonymous codon usage in the SM50 repeat array. 41
    tRNA frequencies cannot account for the unusual codon usage frequencies. 46
    GC content cannot account for the unusual codon usage frequencies. 47
    There may be selection against a highly stable secondary structure in the mRNA 48
  Conclusions 50
Chapter Two: Models of concerted evolution. 52
  Introduction 52
  Materials and Methods 54
    Species utilized and genomic DNA isolation 54
    Polymerase Chain Reaction (PCR) and cloning 55
    Sequencing 55
    Analysis of DNA sequences using dot plot analysis 55
  Results 56
    Analysis of SM50 repeat duplications in representative species in three clades 56
    Analysis from additional species 58
    Patterns of duplication within species. 59
  Discussion 61
    Models of unequal crossover 65
  Conclusions 69
Summary 67
References 99
Appendix 1: DNA sequences of the SM50 repeat array 107
Appendix 2: Amino acid sequences of the SM50 repeat array 111
Appendix 3: Amino acid sequences of the SM50 repeat array of alleles used in models 113
Appendix 4: Summary of alleles used in study. 116

PAGE 5

List of Tables

Table 1 Codon usage frequencies calculated by amino acid of the SM50 repeat array, C-type lectin domain, and a sample of the whole genome 71
Table 2 Codon usage frequency of the SM50 repeat array calculated by position. 74
Table 3 Codon usage frequency of tRNA genes found in the S. purpuratus genome 76
Table 4 The length and percentage of GC content of the SM50 repeat array. 77

PAGE 6

List of Figures

Figure 1 Diagram of the protein-coding portion of the SM50 gene in S. purpuratus. 69
Figure 2 Schematic relationships of sea urchin species used in this study 70
Figure 3 Change in secondary structure stability produced by an altered mRNA sequence. 78
Figure 4 Dot plot analysis of the SM50 repeat array of selected species compared to themselves. 82
Figure 5 The SM50 repeat arrays of S. droebachiensis, S. pallidus, and S. nudus alleles. 84
Figure 6 Dot plot analysis of S. nudus, S. droebachiensis, and S. pallidus alleles 88
Figure 7 Schematic of larger order duplications in SM50 repeats. 91
Figure 8 Model for the creation of S.nudJP17 from the misalignment and crossover of two S.nudJP18 alleles. 92
Figure 9 Model for the creation of S.nudJP18 from the misalignment and crossover of two S.nudJP17 alleles. 93
Figure 10 Model for the creation of S.droWA28a from the misalignment and crossover of two S.droWA30 alleles. 94
Figure 11 Model for the creation of S.droWA30 from the misalignment and crossover of two S.droWA28a alleles. 95
Figure 12 Model for the creation of S.pallNO27 from the misalignment and crossover of two S.pallNO30 alleles. 96

PAGE 7

Figure 13 Model for the creation of S.palWA24 from misalignment and crossover of two S.palWA32 alleles. 97
Figure 14 Model for the creation of S.palWA24 and S.palWA32 from the misalignment and crossover of a hypothetical product, S.palHYP 98

PAGE 8

Concerted Evolution in SM50, a Gene with Unusual Repeat Structure

Sofia Hussain

ABSTRACT

Genes present in multiple copies and genes that contain regions of repetitive sequences can undergo concerted evolution, which results in homogenization of the nucleotide sequence of the genes or repetitive regions. In regions of tandem repeats, this occurs through misalignment of repeat units followed by unequal crossover, which generates two products with differing numbers of repeat units. Gene conversion is thought to lead to one of these products becoming fixed in a species. The homogeneous sequence of previously studied genes that have been thought to undergo this process has made it difficult to determine the exact models involved. Here I examine concerted evolution in SM50, a sea urchin gene that encodes a protein involved in biomineralization. The repetitive region in the SM50 gene varies in length between species, and there is variability in each repeat unit as well. I examine the codon usage in SM50 in a variety of species, and discuss how purifying selection, substitutions, concerted evolution, and selection at the level of DNA sequence have played a role in the evolution of this gene. I also examine the structure and sequence of the repeat units, and propose models that have led to the evolution of the repeat pattern seen in the different species examined. Finally, I have found variation in the number of repeat units within several species. This has allowed us to deduce the specific models of unequal crossover

PAGE 9

that led to this variation. The unique variation in the repetitive region of SM50 has enabled us to describe a model of how substitutions affect misalignment and unequal crossover.

PAGE 10

Introduction

Our study of SM50

Mutations diversify DNA sequences and are influenced by two forces of evolution. Mutations that affect an organism's fitness can be subjected to selection, whereas mutations that do not affect the fitness of an individual are said to be neutral. The interplay of these forces shapes the diversity of DNA sequences seen in nature. DNA sequences that contain repetitive elements can undergo a specialized form of change, called concerted evolution, which causes these sequences to evolve differently than the rest of the genome.

Even protein-coding genes can be subject to mechanisms of concerted evolution. There are many protein-coding multigene families with interesting functions that undergo concerted evolution. Mechanisms of concerted evolution have the effect of changing the number of repeated elements and homogenizing the sequences of genes within these families (Walsh 1987a; Dover 1982; 1993; Elder and Turner 1995; Liao 1999; Ohta 2000). Concerted evolution has also been observed in tandemly repeated elements within the coding region of single-copy protein-coding genes. These include fertilization genes (Biermann 1998; Swanson and Vacquier 1998), spider silk protein genes (Hayashi and Lewis 2001; Craig and Riekel 2002), and spicule matrix genes in sea urchins (Meeds et al. 2001). Understanding the degree to which these sequences are influenced by concerted evolution can also help us understand selection pressures and functional aspects of these genes.

PAGE 11

In addition, there are many human diseases thought to be due directly to unequal crossover, a mechanism of concerted evolution. β-Thalassemia, globin fusion genes, and the deletion of GH1, which encodes human growth hormone, all result from unequal crossover between homologous genes (Lupski 1998). Red-green color blindness is also due to misalignment and crossover of the tandem array of a red opsin gene and one or more green opsin genes, which produces a dysfunctional hybrid gene (Lupski 1998). Fragile X syndrome, spinobulbar muscular atrophy, and Huntington's disease are also thought to arise from unequal crossover and gene conversion (reviewed by Baldi et al. 1999; Parniewski and Staczek 2002). A better understanding of the interplay between unequal crossover, base substitutions, and purifying selection will help in the treatment of these diseases.

The repeats found in multiple-copy genes and genes with repetitive elements in natural populations are often all the same, thereby leaving no evidence of the interactions between forces of evolution. Thus it has been difficult to propose models of evolution that led to the observed sequences. The protein-coding portion of SM50, a spicule matrix gene found in many species of sea urchins, enables us to study the interplay between purifying selection, neutral substitutions, and concerted evolution (due to unequal crossover) because the repetitive elements are not completely homogeneous in natural populations. As a protein-coding gene, SM50 is subject to purifying selection and neutral mutations. In addition, the SM50 gene contains a region of 5-7 amino acids that is repeated in tandem 17-32 times, depending on the species, and that is subject to the homogenizing and length-altering effects of concerted evolution. But because the biomineralization function of the SM50 protein relies on the overall shape of the

PAGE 12

molecule rather than the exact primary sequence, some variation in amino acid sequence and repeat length remains in natural populations (Wilt et al. 2003; Wilt 2002; Meeds et al. 2001; Berman et al. 1988; 1990; 1993; Emlet 1982). This variation enables the unraveling of the evolutionary events that have altered the SM50 gene (Meeds et al. 2001).

I analyze codon usage frequencies to examine how concerted evolution, purifying selection, and neutral substitutions have interacted to shape the evolutionary history of the SM50 gene. Concerted evolution produces a high level of codon usage bias because it homogenizes the codons found in repetitive sequences. But other forces, including selection, can also produce a codon usage bias. Comparison of the codon usage bias between species indicates how much of this bias is due to concerted evolution and how much is due to selection. The comparison allows me to propose models of concerted evolutionary events that led to the organization of the SM50 gene in different species. I propose that substitutions have constrained the misalignment during unequal crossover, and therefore have altered the mechanisms of concerted evolution in each species. Finally, variations in the SM50 gene exist within several species, enabling the proposal of models of unequal crossover events that have occurred since speciation. These models support the hypothesis that substitutions constrain concerted evolution.

The theory of neutral evolution

Mutations cause a change in the sequence of DNA and provide alternate variations that evolve over time. If a mutation produces a product either more or less

PAGE 13

favorable than the original, its evolution is governed largely by selective forces. In contrast, some changes neither increase nor decrease the function of the DNA sequence, yet these sequences still change over time (Kimura 1968). The neutral mutation-random drift hypothesis of molecular evolution and polymorphism can account for the changes in DNA variation that are not due to selective forces (Kimura 1977). Under this hypothesis, alternate forms of a DNA sequence are removed from a population by genetic drift rather than by being selected against (Nei 1987). In this way much of the variation seen at the DNA level can be explained by neutral evolution (Kimura 1986).

Many DNA sequences are influenced by both selection and neutral evolution. When a DNA sequence is functional, any mutations that would alter the function of the sequence are subject to selection. But not all mutations alter the function of the DNA. Those that conserve the function of the DNA sequence are subject to neutral evolution instead. In this way neutral evolution can act within the confines of a selective force (Kimura 1976). Even mutations that slightly increase or decrease function are thought to be nearly neutral in both random genetic drift and selection, and thus are best explained by neutral models of evolution (Ohta 1997; Zuckerkandl 1997).

Evolution of the protein-coding regions of a gene provides a good example of neutral evolution within a selective constraint. Mutations in the protein-coding region of a gene are of two types: those that alter the amino acid composition (nonsynonymous) and those that conserve the amino acid composition (synonymous). Under purifying selection, nonsynonymous mutations that alter the function of the resulting protein are subjected to selection (Nei 1987). Synonymous changes do not alter the amino acid sequence, and thus are more often subject to neutral evolution (Sharp et al. 1995).
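The distinction between synonymous and nonsynonymous mutations can be illustrated with a short sketch. The code below is not part of the analysis in this thesis; it is a minimal illustration that assumes the standard genetic code, and the names `CODON_TABLE`, `classify_substitution`, and `synonymous_counts` are hypothetical helpers introduced only for this example.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base substitution in a codon.

    position is 0, 1, or 2 (2 is the third, or wobble, position).
    Returns "synonymous" if the encoded amino acid is unchanged,
    otherwise "nonsynonymous" (this includes changes to or from a stop).
    """
    codon, new_base = codon.upper(), new_base.upper()
    if codon not in CODON_TABLE:
        raise ValueError(f"not a DNA codon: {codon!r}")
    if position not in (0, 1, 2) or new_base not in BASES:
        raise ValueError("position must be 0-2 and new_base one of T, C, A, G")
    if codon[position] == new_base:
        raise ValueError("new_base equals the original base; no substitution")
    mutant = codon[:position] + new_base + codon[position + 1:]
    unchanged = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if unchanged else "nonsynonymous"


def synonymous_counts(codon):
    """Count, for each codon position, how many of the three possible
    single-base substitutions are synonymous."""
    codon = codon.upper()
    counts = []
    for position in range(3):
        alternatives = [base for base in BASES if base != codon[position]]
        counts.append(sum(
            classify_substitution(codon, position, base) == "synonymous"
            for base in alternatives
        ))
    return counts


if __name__ == "__main__":
    for codon in ("CCT", "CTA", "ATG"):
        print(codon, CODON_TABLE[codon], synonymous_counts(codon))
```

For a proline codon such as CCT, every third-position substitution is synonymous while every first- and second-position substitution is nonsynonymous, which is why synonymous change is concentrated at the third position for amino acids of this kind.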

PAGE 14

Because of this, the effects of selection are often detected by the difference in the frequency distribution of nonsynonymous and synonymous mutations in protein-coding genes (Fay and Wu 2001). In fact, the first strong evidence for the neutral theory emerged from DNA (or RNA) sequence data, where it was observed that synonymous mutations were more common than nonsynonymous mutations (Kimura 1986). Based on this, synonymous mutations would evolve according to the neutral theory and nonsynonymous mutations would evolve due to selection. Unfortunately, not all nonsynonymous mutations are subject to selection, and not all synonymous mutations are subject to neutral evolution. Because many amino acids have similar properties, a nonsynonymous mutation may not alter the function of the protein, and therefore such changes are subject to neutral evolution (Fay and Wu 2001; Kimura 1986; Zuckerkandl 1997). There are also some models of selection that act upon synonymous mutations (see Chapter 1). Although the interaction of neutral and selective models of molecular evolution is quite complicated, the neutral theory still provides a framework to analyze the evolution of synonymous mutations in protein-coding genes (Fay and Wu 2003).

The theory of concerted evolution

Genomes contain substantial numbers of repeated elements of DNA, both coding and non-coding. In fact, as much as 33% of the human genome is made up of repeated elements (Liao 1999). Sometimes these repeated elements are more similar to each other within a species than between species (Dover 1982). In this case, the gene copies are thought to evolve in concert, which is why they are said to undergo concerted evolution

PAGE 15

(reviewed in Elder and Turner 1995). Concerted evolution was originally discovered in studies of various rRNA genes, but regulatory sequences, microsatellites, and any other region of DNA containing repeated sequences can also undergo concerted evolution (Dover 1982; Lupski 1998; Liao 1999). Concerted evolution has been found in both prokaryotes and eukaryotes (Elder and Turner 1995). Many possible mechanisms for concerted evolution have been proposed and collectively termed molecular drive (Dover 1982). These mechanisms include DNA transposition, gene conversion, and unequal crossover (Dover 1982). These mechanisms can occur in tandemly repeated sequences, repeated segments in different locations on a chromosome, and sequences on different chromosomes (Dover 1982).

Mechanisms and effects of molecular drive

Transposons are mobile elements that can cause double-strand breaks in DNA when they move (Thompson-Stewart et al. 1994). These breaks are repaired using homologous sequences, sometimes found on the sister strand of DNA (Thompson-Stewart et al. 1994). If the break is in the middle of a tandem array of repeated elements, the repair process can cause the addition or deletion of repeated elements (Thompson-Stewart et al. 1994). Constant duplication of repeated elements would cause them to have similar sequences. Transposons can act anywhere in the genome, yet they require the specific transposon sequences to homogenize the repeat regions.

Gene conversion can homogenize repeated elements regardless of their location in the genome. Gene conversion can act within a tandem array, on repeated elements on the same chromosome, and on repeated elements on different chromosomes. As the name

PAGE 16

suggests, gene conversion converts one copy of a repeated element to the other. Not all of its mechanisms are well understood, but there is evidence that it does occur (Teshima and Innan 2003; Dover 1982). In fact, gene conversion has been considered the most important mechanism for homogenizing duplicated genes (Teshima and Innan 2003). By making all the copies of a repeated element within a genome identical, gene conversion can homogenize the sequence of a repeated element within a population. In the special case of gene conversion acting upon a repeat array, all repeated elements will have identical sequences and the arrays will contain the same number of elements (Dover 1982). Even very low levels of gene conversion are effective at homogenizing tandem arrays of repeats (Elder and Turner 1995).

Unequal crossover occurs only in arrays of tandemly repeated elements of DNA and has the effect of homogenizing the DNA sequence of each repeated element (Dover 1982). During normal meiosis or mitosis these repeat arrays align perfectly before crossover (Smith 1976). When the repeated elements are very similar in sequence, however, one element can align with a different repeated element in the tandem array, causing the entire repeat array to misalign. Crossover within the misaligned repeat array produces alleles that differ in length from the original. This explains the presence of alleles containing repeat arrays of varying lengths within a population (Elder and Turner 1995). The longer allele will contain the repeated elements absent in the shorter allele and will, therefore, appear to have duplicated those repeated elements. Over time, the expansion and contraction of repeat arrays due to unequal crossover (followed by gene conversion) will homogenize the repeated elements within a repeat array. Therefore, although repeat arrays may vary in length, the individual repeated elements within them

PAGE 17

have similar sequences due to concerted evolution (Dover 1982). Repeat arrays that have undergone homogenization due to unequal crossover are usually found in regions that have high rates of normal crossover (Elder and Turner 1995).

The mechanisms of molecular drive themselves are altered by other mechanisms of evolution, including base pair mutation, genetic drift, and selection, all of which interact to shape repeated regions of DNA (Elder and Turner 1995). For example, unequal crossover requires misalignment of similar repeated elements. Base pair substitutions counteract this effect by diversifying the repeated elements, thereby limiting the locations of unequal crossover and the resulting concerted evolution (Smith 1976; Brunner et al. 1986; Dover 1986; Murti 1992; Thomas 1998). In addition, purifying selection may limit the diversity and total number of repeated elements in a repeat array (Smith 1976). Yet, as long as unequal crossover and gene conversion can occur in the presence of substitutions, there will be some homogenization due to concerted evolution (Parkin and Butlin 2004). Therefore, to understand the mechanisms of molecular drive, the other forces of evolution acting upon a given region of DNA must be studied as well.

Repeated elements in DNA are commonly used in genetic studies and, therefore, understanding the effects of molecular drive will enrich these studies. As much as 10% of the human genome is thought to be composed of tandemly repeated arrays subjected to the forces of concerted evolution (Liao 1999). Also, genes that have been used to determine phylogenetic relationships, including 18S and 28S ribosomal genes, are found in many copies within the genome and are therefore also subject to concerted evolution (Parkin and Butlin 2004; Elder and Turner 1995). Finally, microsatellites (each repeat unit is 2 bp-10 bp) and minisatellites (each repeat unit is 10 bp-100 bp) may increase and decrease

PAGE 18

in length by concerted evolution (Yauk 2004). The high level of variation in the length of these sequences, and an understanding of the mechanisms producing the variation, has allowed them to be used in population studies (reviewed by Ugarkovic and Plohl 2002). Therefore studying the interaction between molecular drive and other forces of evolution is important to many genetic studies.

Analysis of codon usage frequencies can illustrate neutral and selective forces

In protein-coding genes, all amino acids except two (methionine and tryptophan) are represented by multiple codons. The first and second base pairs of the codon, with a few exceptions, are usually conserved (Ohta 1997). Variation in the third position often causes a synonymous change, and thus the third position is called a wobble position (Ohta 1997). If there is little to no selective advantage to which base pair is in the wobble position, then at equilibrium substitution patterns at this site should be identical to mutational processes (Duret 2002; Sharp et al. 1995).

All regions of DNA are subject to random base pair mutations, and if that is the only mutational process acting on a gene, there should be an equal frequency of each base in the wobble position (Duret 2002). This condition would produce equal frequencies of synonymous codons in a protein-coding gene (Duret 2002). For example, there are four synonymous codons for the amino acid proline. If the codon usage of proline is completely dependent on random base pair mutations, the expected frequency of each codon should be 25%. But if other models of evolution are acting on codon usage, the codon usage frequencies will deviate from this predicted value (Duret 2002). Thus, equal

PAGE 19

frequencies of codon usage can be used as a null hypothesis to test for evidence of alternate models of evolution (Fay and Wu 2003).

Unequal frequencies of codon usage may be due to neutral mutational processes, other forces of neutral evolution, and/or selective forces (Duret 2002). The mutational process of concerted evolution is one of many forces that can produce unequal frequencies of codon usage. Concerted evolution homogenizes repeated elements within a repeat array, and thus homogenizes the codons that are used within that repeat region. This homogenization increases the codon usage frequency of a single codon. In this case, all the codons possible for a given location in a repeat element have the same probability of fixation because none has a selective advantage (Duret 2002). Thus closely related species will have a similar set of codons that have gone to fixation, while more distant species will not. The codon bias produced by selection is different in that the codon with the greatest frequency should be the same for all species (if the selective forces are the same in all species) regardless of the time elapsed. Thus, to determine whether a bias in codon usage frequencies is due to concerted evolution or to other forces of evolution, it is important to compare closely related species.

Additional forces of selection at the codon usage level are plentiful

There are many models of selection that produce codon usage frequencies greater than or less than those predicted by random base-pair mutations (Fay and Wu 2003). Genes that undergo splicing of mRNA products, for example, are subject to selection for specific codon usage (Willie and Majewski 2004). In genes where conservation of primary structure is essential, there can also be selection for codons that either mutate

PAGE 20

less or are more likely to undergo a synonymous (rather than nonsynonymous) mutation, although there are currently many counterexamples to this idea (Sharp et al. 1995). Codon usage may also be influenced by GC content in certain regions of DNA, which produces a bias towards codons with either G/C or A/T in the wobble position. For example, the mRNAs of many spider silk proteins are GC rich due to the requirement of specific codons that code for the required functional amino acids (Craig and Riekel 2002). In these cases, 80%-90% of the codons used contain an A or T in the third position, possibly to offset the GC-richness of the codons used (Craig and Riekel 2002). The majority of the examples that show selection on codon usage, however, occur directly at the level of translation and include the concentration of tRNA present in a cell and mRNA secondary structure.

The concentrations of tRNAs present in a cell are not equal, and thus not all codons are translated at the same rate. Rare tRNAs encounter ribosomes less frequently than abundant tRNAs. Therefore, a codon corresponding to a rare tRNA will take longer to translate than a codon corresponding to a more common one. If there is a selective advantage to either a slower or faster rate of translation, there can be a selective advantage for a codon usage bias within a gene. For example, codons that correspond to rare tRNAs may be selected for in bacteria because they pause translation and allow proper protein folding (Guisez et al. 1993). In contrast, there are examples where codons that correspond to tRNAs present in high concentrations are selected for because they allow more efficient and rapid translation of the gene. For example, spider silk genes contain codons that correspond to the most frequent tRNAs found within the cells to ensure rapid production of spider silk proteins (Sharp et al. 1995). Further, the presence

PAGE 21

of codons that correspond to rare tRNAs in these silk fibroin genes seems to increase the discontinuous translation of silk fibroin, although this also appears to be influenced by secondary structure (see below) (Lizardi et al. 1979). In spite of this evidence that codon usage is influenced by the frequencies of tRNAs, many genetic studies, including those in fruit flies, have not found a general correlation between genes that are translated very frequently and codon usage frequencies (Sharp et al. 1995). Nevertheless, selection for or against codons corresponding to rare tRNAs can, indeed, cause unequal codon usage frequencies in protein-coding genes.

The nucleotide sequence of transcribed mRNA relies on the codons present in the open reading frame. Transcribed mRNA is single stranded, and thus base pairs can form within the molecule based on its nucleotide sequence. The resulting secondary structure can interfere with translation and, therefore, facilitate protein folding or control the total level of protein produced (Katz and Burge 2003). Therefore selection can influence the shape and/or stability of the mRNA secondary structure by influencing codon usage. In bacteria, mRNA secondary structure has been shown to affect translation (Guisez et al. 1993). The mRNA of spider silk protein genes and silk fibroin mRNA in silkworms appear to have significant secondary structures that may have been selected to inhibit cDNA synthesis and therefore facilitate proper folding of the protein (Hayashi and Lewis 2001). But too much stability in the secondary structure of an mRNA molecule can produce a rigid structure that would be selected against (Mita et al. 1988). Limiting translation by mRNA secondary structure is thought to be the cause of the human condition Fragile X syndrome (Schmittgen et al. 1994). In a manipulation experiment, when the codon usage in mRNA from a human gene was substituted to

have a more stable secondary structure (yet conserve the amino acid sequence), the rate of translation decreased (Schmittgen et al. 1994). Further, mutations that lowered the stability of the proposed stem structure in an mRNA molecule in bacteria resulted in a nearly 3-fold increase in the synthesis rate of protein (Klionsky et al. 1986). This evidence suggests that selection for or against the shape or stability of the mRNA secondary structure may also produce a bias in the codon usage frequencies of a gene.

Spicule matrix proteins

Calcium carbonate skeletons in invertebrates

Calcium carbonate-based biomineralization is widespread in invertebrates and is therefore a rich field of interest (Wilt 2002). Urochordates, arthropods, and mollusks all have structures that are the result of calcium carbonate biomineralization (Wilt et al. 2003). An overwhelming majority of the work on biomineralization, however, has been done in echinoderms (Wilt et al. 2003). Biomineralization in echinoid embryos is an ideal system to study because their skeletons are less complicated than vertebrate bones or teeth yet are governed by similar principles; because of this, biomineralization in echinoids has been studied for over 120 years (Wilt et al. 2003; Wilt 2002; Killian and Wilt 1996).

Biomineralization in echinoderms

All five extant classes of echinoderms contain adults that produce calcium carbonate skeletons (Wilt et al. 2003). The skeleton consists of small isolated ossicles scattered throughout the body wall in sea cucumbers (Holothuroidea), articulated ossicles in sea stars (Asteroidea) and brittle stars (Ophiuroidea), a complete test, teeth, and spines in

sea urchins (Echinoidea), and structures in sea lilies (Crinoidea) (Wilt et al. 2003). In contrast, the larvae of asteroids and crinoids do not contain skeletons, and in holothuroids the larval skeleton is dramatically reduced. Echinoids and ophiuroids are the only echinoderms whose larvae contain complete skeletons (Wilt et al. 2003). In echinoderm larvae, the skeletal tissues are made by the descendants of the primary mesenchyme cells (PMCs) (Wilt et al. 2003). The PMCs form only skeletal tissue and do so even if they are separated from the rest of the embryo and placed in sea water and horse serum alone (Okazaki 1975; Wilt 2002; Wilt et al. 2003). In most sea urchins the PMCs come from the large micromeres that result from the unequal 5th cell division in the vegetal pole of the embryo, although there are species of sea urchins (including pencil urchins and direct-developing sea urchins) that do not form micromeres yet still contain PMCs that express spicule matrix proteins (Makabe et al. 1995; Wilt 2002; 2003; Davidson et al. 1998). The 3-D pattern of the skeleton is laid down during and after gastrulation and depends on interaction with the inner ectodermal wall of the blastocoel (Davidson et al. 1998). The descendants of the PMCs migrate into the blastocoel during gastrulation and are fused by slender cytoplasmic cables (Arnone et al. 1997; Makabe et al. 1995; Wilt 2002; Davidson et al. 1998). The PMCs then form the spicules in the space between the fused cells (Wilt 2002). In this way the calcium carbonate-based endoskeletons are bound by an epithelium, yet they are not formed within a cell themselves (Wilt et al. 2003). Cells at the tips of the spicules continue to participate in biomineralization in this way throughout the rest of larval life (Wilt 2002).

Spicule matrix proteins

The spicules in sea urchins are made of roughly 95% mineralized calcite in the form of CaCO3 containing 5% MgCO3, with about 0.1% glycoprotein (Wilt et al. 2003). The precise molecular interaction of the proteins is unclear because the proteins are difficult to study while they are associated with the calcite (Wilt et al. 2003; Xu and Evans 1999; Zhang et al. 2000). Some proteins from the calcium carbonate material, however, have been isolated and studied using 2-D gel analysis (Wilt 2002; Killian and Wilt 1996). It is estimated that there are about 45 different spicule matrix proteins, although only four have been studied so far through direct isolation from the spicules (SM30, SM37, PM27, and SM50). Three more potential spicule matrix proteins (SpSM29, SpSM32, and SpC-lectin) have been identified by scanning an EST library (Illies et al. 2002; Wilt 2003). Those discovered from the EST library have yet to be confirmed by localization to the spicule matrix tissue, and one of them, SpSM32, is so close to an SM50 transcript that it may be a splicing variant (Illies et al. 2002; Wilt 2003). SM50, PM27, SM37, and possibly SpSM29 and SpSM32 appear to be nonglycosylated, alkaline, secreted proteins, and are therefore unlike the majority of the proteins isolated from the 2-D gel analysis (Benson et al. 1987; Killian and Wilt 1996; Wilt et al. 2003). These proteins are also characterized by a C-type lectin domain (a calcium-dependent region that selectively binds to specific carbohydrate structures) and proline-rich repeat regions, both of which have been found in mineralized tissues and structures in vertebrates and other invertebrates (Drickamer 1988; Illies et al. 2002; Wilt 2002; Wilt et al. 2003). SpC-lectin is an acidic secreted protein that lacks a proline-rich repeat

domain and a consensus N-glycosylation site (Illies et al. 2002; Wilt et al. 2003). Further studies are needed to confirm the role of SpC-lectin and the other proteins discovered from the EST library, as well as the acidic and N-glycosylated proteins yet to be isolated and studied (Wilt et al. 2003). The spicule matrix proteins, in general, appear to increase the flexural strength of the mineral so that it behaves like a hard glass-like material rather than a crystal (Wilt 2002; Wilt et al. 2003; Emlet 1982; Berman et al. 1988; 1990; 1993). For example, the concentration of proteins in a particular area of a sea urchin tooth correlated with the hardness of that area, indicating that the spicule matrix proteins influence the strength of the calcium carbonate matrix (Stock et al. 2002). In SM50, the repeated proline, methionine, and glycine amino acids are known to confer a beta-spiral configuration thought to account for the flexible nature of the sea urchin spicules and to help the spicule resist fracturing (Xu and Evans 1999; Zhang et al. 2000; Wustman et al. 2002; Wilt et al. 2002; Wilt 2003). In addition, it is thought that the proline-rich repeat regions help various spicule matrix proteins link up to each other (Xu and Evans 1999; Zhang et al. 2000; Wustman et al. 2002; Wilt et al. 2002).

Structure of the SM50 Gene

SM50 is an ideal gene for studying the properties of spicule matrix genes because it has a structure similar to other spicule matrix genes. It contains a C-type lectin domain located towards the 5′ end of the protein-coding region (Figure 1), followed by a series of tandemly repeated sequences rich in proline and glycine that function in biomineralization (Illies et al. 2002). These repeated sequences are referred to individually as SM50

repeats, and the portion of SM50 that includes them in tandem as the SM50 repeat array. Each SM50 repeat is either 15 bp, 18 bp, or 21 bp long (5, 6, or 7 amino acids, respectively) and is imperfectly duplicated in tandem 14-32 times depending on the species (Meeds et al. 2001). In the functional protein, the SM50 repeat array is predicted to form an elastic beta-spiral structure (Livingston et al. 1991). Recent structural studies support this prediction and also indicate that interactions between amino acid residues in adjacent SM50 repeats stabilize the final structure (Xu and Evans 1999). As in most protein-coding genes, limited synonymous and nonsynonymous substitutions in SM50 appear not to affect the functionality of the product and thus are present between species (Meeds et al. 2001). The SM50 repeat array is unusual in that the overall physical structure of the protein it encodes appears to be functionally more important than the exact amino acid sequence of each SM50 repeat in the array (Livingston et al. 1991; Meeds et al. 2001). This accounts for the variation in the length of the SM50 repeat array found in six different species of euechinoids that all produce functional products (Meeds et al. 2001; Katoh-Fukui et al. 1992). In fact, hybrid embryos of S. purpuratus and L. pictus expressed both copies of the SM50 gene although the length of the SM50 repeat array differs between the species (Brandhorst and Davenport 2001). Therefore, relaxed selection on the primary sequence of the SM50 protein allows many variants to persist in nature (Meeds et al. 2001). The tandem SM50 repeats appear to evolve by concerted evolution; the SM50 repeats within a species are more similar to one another than to the equivalent SM50 repeat in related species (Meeds et al. 2001). The structure of the SM50 repeat array differs between species, and concerted evolution appears to have been

influenced by the degree of sequence divergence between the SM50 repeats, thereby influencing the model and frequency of unequal crossover (Meeds et al. 2001). This makes the SM50 repeat array a unique model for studying how repetitive regions of DNA evolve under the three forces of selection, concerted evolution, and base pair substitution.

The sea urchins in this study

What are sea urchins?

Sea urchins belong to the phylum Echinodermata. Echinoderm means "spiny skin," and all members of this phylum contain specialized spicules that fossilize well. Because of this, over 25 classes in this phylum have been recognized in the fossil record dating back to the early Cambrian (Brusca and Brusca 2003). Only five classes of exclusively marine species remain today: the Crinoidea (feather stars and sea lilies), Asteroidea (sea stars), Ophiuroidea (brittle stars and basket stars), Holothuroidea (sea cucumbers), and Echinoidea (heart urchins, sand dollars, and sea urchins) (Brusca and Brusca 2003). All members of class Echinoidea have a globular or disk-like body, skeletal plates that form a solid test, moveable spines, and an internal jaw apparatus (Aristotle's lantern) (Brusca and Brusca 2003). In addition, sea urchins are globular (Brusca and Brusca 2003).

Why study sea urchins?

Echinoderms became model organisms for developmental study because of the qualities of their life history from adult to larva. Adults are extremely fertile, and it is possible to collect 30 billion eggs in a single season (Auffray et al. 2003). Collecting

these eggs in many species involves only injections of KCl, exposure to gametes from the same or closely related species, or simply perturbation by manual shaking. The eggs are fully mature when released and can be fertilized instantly. Once fertilized, most will develop into mobile larvae in 72 hours (Auffray et al. 2003). This quick process and the large numbers of eggs allow purification and identification of proteins and transcription factors that are expressed in small amounts in the embryo (Auffray et al. 2003; Rast 2003). In addition, the eggs can be manipulated to study the interactions of genes in greater detail by altering gene expression and creating transgenic larvae using current molecular biology tools (Auffray et al. 2003; Rast 2003). The family Strongylocentrotridae contains species whose development has been studied for over a century (Biermann et al. 2003). As a model species, S. purpuratus has enabled us to study the interaction between genes during development (Davidson et al. 2003). Relationships between species have been determined by mitochondrial and genomic DNA studies (Biermann et al. 2003; Lee 2003; see Figure 2), giving a framework for comparing the properties of closely related species. Finally, the S. purpuratus genome has been sequenced and is currently being annotated (Cameron et al. 2000). Therefore, any insights into the molecular evolution of genes within this species will help with the analysis of this enormous data set.

Sea urchins in this study

The sea urchins in this study belong to two families, Toxopneustidae (Lytechinus pictus and L. variegatus) and Strongylocentrotridae (Pseudocentrotus depressus, Hemicentrotus pulcherrimus, Allocentrotus fragilis, Strongylocentrotus franciscanus,

S. nudus, S. purpuratus, S. droebachiensis, and S. pallidus). The two families are thought to have diverged 30-40 million years ago based on fossil and molecular data (Smith 1988). A consensus tree of the relationships among all ten species was made from all available data (Figure 2).

Lytechinus species

Lytechinus pictus is found in the Pacific Ocean from Santa Barbara, CA to Cedros Island, Mexico, while L. variegatus is found in the Atlantic Ocean from the Gulf of Mexico to the Cape Verde Islands (Emlet 1995). The rise of the Isthmus of Panama 3.1 million years ago likely divided the genus Lytechinus, causing the two species to diverge (Zigler and Lessios 2004). A molecular clock of the COI gene calibrated to other tropical echinoids, however, suggests the split between Atlantic and Pacific Lytechinus may have predated the rise of the Isthmus of Panama (Zigler and Lessios 2004).

Strongylocentrotridae species

The family Strongylocentrotridae contains two distinct clades based on mtDNA data (Biermann et al. 2003; Lee 2003). One clade consists of P. depressus, S. franciscanus, and S. nudus, and the other includes A. fragilis, S. purpuratus, S. intermedius, S. droebachiensis, S. pallidus, S. polyacanthus, and H. pulcherrimus (Biermann et al. 2003; Lee 2003). These findings are also supported by nuclear genes and indicate that the genus Strongylocentrotus is not monophyletic (Biermann 1998; Biermann et al. 2003). The molecular divergence between the two clades within the family Strongylocentrotridae is great enough that Lee (2003) suggested the assignment of

a new genus-level classification to the clade of S. nudus and S. franciscanus. For clarity, I shall refer to the clade containing P. depressus, S. franciscanus, and S. nudus as the S. franciscanus clade and the clade including S. purpuratus, S. intermedius, S. droebachiensis, S. pallidus, A. fragilis, and H. pulcherrimus as the S. purpuratus clade. The rapid cladogenesis of this family may have taken place in the North Pacific during the late Miocene and Pliocene (Smith 1988). The divergence time between the S. franciscanus clade and the S. purpuratus clade is estimated to be 13 million years ago (Lee 2003). This is a refinement of the previous estimate of 3.5-20 million years ago by Smith (1988) for the family Strongylocentrotridae.

S. franciscanus clade

Strongylocentrotus franciscanus is found in the northeastern Pacific Ocean from Kodiak/Sitka, AK to Cedros Is., Mexico (Emlet 1995). The other two species in this group are located in the northwestern Pacific Ocean, with S. nudus endemic to the Sea of Japan and P. depressus found from Nagasaki to Tokyo Bay, Japan (Emlet 1995; Bazhin 1998). P. depressus was first placed as the basal taxon to the genus Strongylocentrotus by allozyme data; however, the study may not have had the resolution to classify the species into the two main clades within Strongylocentrotus (Matsuoka 1987). Biermann et al. (2003) used mtDNA data and found P. depressus to be the most basal member of the family Strongylocentrotridae yet clearly within the S. franciscanus clade. S. nudus and S. franciscanus are therefore sister taxa in this clade (Biermann et al. 2003).

Because Lee (2003) did not include a P. depressus sample in his study, it is unclear when this species may have diverged from S. nudus and S. franciscanus, but it must have been after the divergence of the two clades (13-19 million years ago) and before the divergence of S. franciscanus and S. nudus (5.7-8.1 million years ago) (Lee 2003). Genetic diversity studies indicate that there is no population distinction in S. franciscanus (Palumbi and Wilson 1990; Debenham et al. 2000) and very little genotypic diversity in S. nudus (Manchenko and Yakovlev 2001).

S. purpuratus clade

Members of the S. purpuratus clade are divided into three groups based on species range. S. intermedius and H. pulcherrimus are found only in the northwest Pacific Ocean (Emlet 1995; Bazhin 1998). S. purpuratus and A. fragilis are found in the northeast Pacific Ocean, while S. polyacanthus is common to the Aleutian Islands (Emlet 1995; Bazhin 1998). The last two species in this clade, S. droebachiensis and S. pallidus, are circumarctic (Emlet 1995; Bazhin 1998; Biermann et al. 2003). It appears that the rapid diversification of the crown group, together with fluctuations in sea level, may have led to the partition and colonization of different habitats (Biermann et al. 2003; Lee 2003). In the northwest Pacific Ocean, H. pulcherrimus may have undergone allopatric speciation due to the sea level change (Lee 2003). In the northeastern Pacific Ocean, S. purpuratus and S. droebachiensis are found in the intertidal and shallow subtidal areas, but S. droebachiensis is more common at higher latitudes and extends a little deeper, to about 300 m (Emlet 1995). S. pallidus is found at depths to 1000 m, and A. fragilis is a strictly deep-water species seldom found above

200 m (Emlet 1995). S. pallidus and S. polyacanthus are most abundant at high latitudes (Bazhin 1998). H. pulcherrimus is thought to be basal in this clade in both published mtDNA phylogenies (Lee 2003; Biermann et al. 2003). S. intermedius is either the sister taxon to H. pulcherrimus (Biermann et al. 2003) or part of a polytomy with S. purpuratus and the S. droebachiensis / S. pallidus clade (Lee 2003). S. pallidus and S. droebachiensis are thought to be the most recently diverged species pair and form a monophyletic group (Lee 2003); however, it is unclear whether A. fragilis is the sister taxon to this pair or the sister taxon to S. pallidus (Biermann et al. 2003). Estimates for the time of divergence of H. pulcherrimus range from 7.2-14 million years ago (Lee 2003). The divergence times for the rest of this clade are still questionable due to the apparent rapid speciation of the crown group. S. intermedius is thought to have diverged between 4.6-6.6 million years ago, although an older estimate of 8.5-12 million years ago is possible (Lee 2003). The divergence of the rest of this clade, based on molecular clocks, is thought to have occurred between 2-5 million years ago (Lee 2003; Manchenko and Yakovlev 2001; Palumbi and Kessing 1991). The fossil record of S. droebachiensis and S. pallidus, however, indicates that these species moved into the Atlantic Ocean soon after the Bering Seaway first opened about 3.5 million years ago, which means these species likely diverged prior to that (Durham and MacNeil 1967).

Population diversity and gene flow of S. purpuratus, S. droebachiensis, and S. pallidus

Although there may be slight differentiation in S. purpuratus populations south of Point Conception, CA, no clear evidence of population subdivision has been found

(Burton 1998). Many closely related genotypes, each represented by a small number of individuals, were found in mtDNA samples of S. purpuratus (Palumbi and Wilson 1990). This means there is a high percentage of variation in S. purpuratus as a species, but this variation lacks geographic organization. There appear to be three genetically distinct populations of S. droebachiensis: one in the Pacific Ocean, one in the northwestern Atlantic Ocean, and one in the northeastern Atlantic Ocean (Addison and Hart 2005; Addison and Hart 2004; Biermann et al. 2003; Palumbi and Wilson 1990). Sporadic migration events have prevented any population from becoming genetically isolated (Addison and Hart 2005). Gene flow between the Pacific Ocean population and the northwest Atlantic Ocean population was thought to be more common than between those populations and the northeast Atlantic Ocean population (Addison and Hart 2004; Biermann et al. 2003; Palumbi and Wilson 1990). The study by Addison and Hart (2005), however, indicated that there are two patterns of gene flow between these three populations. One pattern includes all three populations, while the other includes gene flow only between the populations bordering North America. The sampling in previous studies may have included only individuals involved in the second gene flow pattern, which would be consistent with their results. Populations of the sea urchin S. pallidus on opposite coasts of North America and from Norway are remarkably similar genetically (Biermann et al. 2003; Palumbi and Kessing 1991). In contrast, high amounts of variation were found in populations of S. pallidus in Japan (Manchenko and Yakovlev 2001).

Chapter One: Codon usage analysis suggests concerted evolution, substitutions, and selection influence the evolutionary history of the SM50 repeat array in various Strongylocentrotridae and Lytechinus sea urchins.

Introduction

Analysis of codon usage frequencies is a powerful tool for examining the effects of concerted evolution and selection in a gene with an imperfect repeat array. Both selection and neutral evolution are capable of producing a bias in codon usage frequencies, but detailed analysis of the patterns of bias between and within species allows the two to be distinguished (Duret 2002; Fay and Wu 2001). It has previously been suggested that the SM50 gene in sea urchins is subject to concerted evolution, and therefore we expect a bias in codon usage frequencies due to that model of neutral evolution (Meeds et al. 2001). If there is no selective difference between the codons within this gene, all should have the same probability of fixation within the repetitive region (Duret 2002). Therefore, closely related species will have a similar set of most frequent codons, while more distant species may have different most frequent codons. Any departure from this pattern may suggest selection on codon usage in SM50 (Fay and Wu 2003; Duret 2002). Selection due to tRNA frequencies, GC content, or mRNA secondary structure is evaluated as a possible cause of bias in codon usage frequencies.

Materials and Methods

Species utilized

Ten species of sea urchins were used in this study: eight species of Strongylocentrotridae sea urchins [S. purpuratus, S. pallidus, S. droebachiensis, A. fragilis, S. franciscanus (all from California), S. nudus, H. pulcherrimus (Japan), and P. depressus (Korea)] and two species of Lytechinus [L. pictus (California) and L. variegatus (Florida)]. Mitochondrial sequence separates the Strongylocentrotridae sea urchins into two clades. The S. purpuratus clade includes S. pallidus, S. droebachiensis, and A. fragilis as closely related species, followed by S. purpuratus and H. pulcherrimus (Biermann et al. 2003; Lee 2003; Figure 2). The S. franciscanus clade includes S. franciscanus and S. nudus as sister species, followed by P. depressus (Biermann et al. 2003; Lee 2003; Figure 2). Divergence times for species within the family Strongylocentrotridae are likely to lie within the range of 3.5-20 million years ago, while the genus Lytechinus and the family Strongylocentrotridae diverged some 30-40 million years ago (Smith 1988).

Genomic DNA isolation

DNA was isolated from three species of Strongylocentrotridae sea urchins, S. purpuratus, S. pallidus, and S. droebachiensis (all from California), and from two species of Lytechinus, L. pictus (California) and L. variegatus (Florida). S. purpuratus samples were collected as fresh or frozen gonad or sperm, and 20-30 µg of each sample was homogenized with 167 µL of Qiagen (Qiagen, CA) buffer C1

(1.28 M sucrose, 40 mM Tris-Cl pH 7.5, 20 mM MgCl2, 4% Triton X-100), 167 µL of Qiagen buffer PBS, and 500 µL of nano-pure H2O, then incubated on ice for 10 minutes. The solution was centrifuged at 12,000 x g for 15 minutes and the supernatant removed. The remaining pellet was homogenized with 134 µL of Qiagen buffer C1 and 400 µL of nano-pure H2O, and incubated at 0°C for 10 minutes. The solution was centrifuged again at 12,000 x g for 15 minutes and the supernatant removed. The pellet was then homogenized in 667 µL of Qiagen buffer G2 (800 mM guanidine HCl, 30 mM Tris-Cl pH 8.0, 30 mM EDTA pH 8.0, 5% Tween-20, and 0.5% Triton X-100) and incubated at 0°C for 20 minutes. Fifteen microliters of Proteinase K (20 mg/ml) was added and the solution was incubated at 50°C for 2-24 hours to ensure digestion of proteins. The solution was then incubated at 95°C for 2 minutes. If visible fragments remained, the solution was filtered through sterile cheesecloth. A Qiagen-tip 100 (Qiagen, MA, USA) was equilibrated according to the manufacturer's protocol for plasmid purification, and extraction of DNA from the prepared solution followed. DNA was stored in TE buffer at -20°C. Spines and connective tissue, gonads, or sperm were obtained from S. droebachiensis, S. pallidus, and S. purpuratus and frozen or stored in ethanol. Samples stored in ethanol were dried prior to extraction. Twenty micrograms of each sample was processed using the Wizard SV 96 Genomic DNA Purification System (Qiagen, MA, USA) according to the manufacturer's protocol for extracting DNA from mouse tail tips. All samples of purified DNA were quantified by electrophoresis on a 1.4% agarose gel.

DNA from two S. nudus individuals was shipped from Japan in ethanol. Samples were dried and resuspended in TE. Samples of A. fragilis and P. depressus were sent as purified DNA stored in TE.

Polymerase Chain Reaction (PCR) and Cloning

SM50 sequences were amplified from genomic DNA with a proof-reading Taq enzyme and materials from the MasterTaq Kit by Eppendorf (Brinkmann Instruments, Hamburg, Germany) using the company's protocols and as described by Meeds et al. (2001), with the following modifications. Three hundred to 500 ng of DNA (rather than the 50-150 ng specified in the MasterTaq Kit protocol) was required for PCR amplification, and the touch-down cycles were limited to 65°C to 57°C to reduce PCR artifacts. PCR products were purified using Montage PCR Centrifugal Filter Devices (Millipore, MA) and cloned using the TOPO TA Cloning Kit for Sequencing (Invitrogen, CA, USA) with One Shot Competent E. coli (both TOP10 Chemically Competent and TOP10 Electrocomp cells were used) from Invitrogen (CA, USA). Plasmids were extracted and purified using the Perfectprep Plasmid Mini kit by Eppendorf (Brinkmann Instruments, Hamburg, Germany).

Sequencing

Sequencing of samples was performed on an ABI Prism 377 with the dRhodamine Terminator Cycle Sequencing Kit (Perkin Elmer Applied Biosystems) or on a Beckman Coulter CEQ 8000 Genetic Analysis System with the CEQ DTCS-Quick Start Kit (Beckman Coulter) at the University of South Florida. Additional sequencing was provided by

Macrogen Inc. (Kasang-Dong, Korea), and by SeqWright DNA Technology Services (Houston, TX). Sequences were assembled, cleaned, and polished using SeqMan II (version 5.03, DNASTAR, Inc.), then aligned with Clustal (version 1.81) and completed by hand. Additional SM50 sequences were collected from GenBank (m16231, S48755, X59616) and Meeds et al. (2001).

Total Codon Usage Frequencies

The codon frequency of the SM50 repeat array was compared to the codon frequency of the C-type lectin domain (CLD) and to a sample of protein-coding genes from the whole genome (WG) in ten species of sea urchins (Table 1). The CLD includes 400 bp at the 5′ end and is followed by the SM50 repeat array, which includes the 15-21 bp imperfectly repeated units (Figure 1). Codon usage frequencies of the CLD and the SM50 repeat array were counted and calculated by hand, and codon usage frequencies for the WG were taken from Nakamura et al. 2000 (Table 1).

Codon Usage Frequencies by Position in the SM50 Repeat

The SM50 repeat array was divided into smaller SM50 repeat units similar to the ones found in Meeds et al. (2001). Each SM50 repeat unit was arbitrarily started with the codons for glutamine, proline, and glycine because this sequence of amino acids is highly conserved within the SM50 repeat array in all species. There are three types of SM50 repeats. The predominant one contains 7 amino acids with the sequence Q P G M/V/F/W G Q/R and is found in all species examined. The other SM50 repeats are truncated at the 3′ end. A six amino acid SM50 repeat containing the sequence Q P G

F/W G N is present in all species within the S. purpuratus clade, while in S. nudus a six amino acid SM50 repeat of the sequence Q P G M G G is present only once and a five amino acid SM50 repeat of the sequence Q P G M G is present three times. Frequencies of codon usage by position are shown in Table 2.

tRNA Frequencies

The S. purpuratus genome project is near completion, and therefore the frequency of tRNA genes located in the genome can be determined. The number of tRNA genes found thus far in the genome was obtained from Statija and Wray (personal communication). The tRNA genes were organized by the codons that each tRNA would recognize. The frequency of each tRNA gene was calculated as a fraction of the total number of tRNA genes that code for the same amino acid, and recorded in Table 3.

GC Content

To determine if the codon usage frequencies are due to a preference in GC content, the GC percentage of the SM50 repeat array was calculated using Gene Boy (Copyright 2003, Cold Spring Harbor Laboratory, http://www.dnai.org/c/index.html ) for all ten species (Table 4).

Altered mRNA sequences

To determine if codon usage frequencies altered the stability of the mRNA secondary structure in the SM50 repeat array, DNA sequences representing alternate mRNAs were first created on a computer. In Katz and Burge (2003), creating alternate mRNA sequences involved randomizing the codons used but conserving the amino acid

sequence of the genes analyzed. Because concerted evolution is homogenizing the codon usage at a given position within the SM50 repeat, altered mRNA sequences were created by changing all codons at a single position to an alternate codon that encodes an amino acid already present at that position. In all cases, only one position was changed at a time, and no new amino acids were introduced at any position. To calculate the stability of the secondary structures produced by the altered mRNA, the sequences were submitted to RNAfold (Hofacker 2003) and the free energy was calculated for each sequence. To compare the stability of the altered mRNA to the natural mRNA, the free energy of the altered mRNA sequence was subtracted from the free energy of the natural mRNA sequence. The resulting difference was divided by the free energy of the natural mRNA sequence and then multiplied by 100 to give a percent change. The results were graphed according to position and species (Figure 3A-G).

Results

Codon usage frequencies of the WG are similar to the CLD

The codon usage frequencies of the whole genome (WG) in six species of sea urchins representing the three clades examined (S. droebachiensis, S. purpuratus, H. pulcherrimus, S. franciscanus, L. pictus, and L. variegatus) were compared to those of the C-type lectin domain of SM50 (CLD) (Figure 1; Table 1). In all six species studied, the most frequent codon in the WG is also the most frequent in the CLD, with a few exceptions (Tables 1A-C). In S. franciscanus, the only exception was found in arginine, where the AGG codon is the most frequent in the WG but absent in the CLD (Table 1B). In both Lytechinus species the glutamine codon CAA, the phenylalanine codon UUC, and

the arginine codon AGA are the most frequent codons in the CLD, but not in the WG. The deviation from the WG in codon usage frequencies in the CLD could be due to the small sample size of codons present in the CLD. Still, because the codon usage frequencies of the CLD are, in general, similar to the WG in the S. purpuratus and S. franciscanus clades, the CLD codon usage frequencies can serve as a reflection of the WG codon usage frequencies for species in these clades without WG data available.

Codon usage frequencies of the SM50 repeat array are different from the WG and the CLD

Codon usage frequencies of all degenerate codons present in the SM50 repeat array were calculated and compared to the CLD (and the WG when possible) in all ten species of sea urchins, separated into three clades: the S. purpuratus clade (S. droebachiensis, S. purpuratus, H. pulcherrimus, S. pallidus, and A. fragilis; Table 1A), the S. franciscanus clade (S. franciscanus, S. nudus, and P. depressus; Table 1B), and a clade containing the two Lytechinus species (L. variegatus and L. pictus; Table 1C). Examination revealed that synonymous codon usage in the SM50 repeat is very different from that in non-repetitive sequences (Tables 1A-C). Synonymous codon usage frequencies for all amino acids are more homogeneous in the SM50 repeat than in the CLD and the WG. In most cases within the SM50 repeat array, only one or two codons are found at high frequencies for any amino acid, while the others are rare or absent (Table 1). In contrast, nearly all synonymous codons are found at relatively equal frequencies in the WG and the CLD (Table 1).
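
The comparison above rests on a simple tabulation: for each amino acid, the share of its occurrences encoded by each synonymous codon. In this study the counts were made by hand; the following is only a minimal Python sketch of the same arithmetic. The function name `synonymous_codon_frequencies` and the toy sequence are hypothetical illustrations, not data or software from this work, and the sketch assumes an in-frame sequence and the standard genetic code.

```python
from collections import Counter

# Standard genetic code, written with RNA letters to match the codon tables.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def synonymous_codon_frequencies(sequence):
    """Return {amino_acid: {codon: fraction}} for an in-frame coding sequence.

    Hypothetical helper: each fraction is the count of one codon divided by
    the count of all codons for the same amino acid. Amino acids that never
    occur in the sequence are left out of the result.
    """
    rna = sequence.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    # Codons containing ambiguous bases are not in the table and are skipped.
    counts = Counter(c for c in codons if c in CODON_TABLE)

    frequencies = {}
    for amino_acid in set(CODON_TABLE.values()):
        synonyms = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
        total = sum(counts[c] for c in synonyms)
        if total == 0:
            continue
        frequencies[amino_acid] = {c: counts[c] / total for c in synonyms}
    return frequencies


# Invented two-repeat toy array (Q P G M G Q, Q P G F G R), not real SM50 data.
toy_array = "CAACCAGGAAUGGGACAA" + "CAACCAGGUUUUGGACGA"
usage = synonymous_codon_frequencies(toy_array)
print(usage["G"])
```

Running the same function on a repeat array and on a non-repetitive region of the same gene would give the two columns being compared; a strongly skewed distribution in one and a flatter distribution in the other is the pattern described in the text.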

The predominant synonymous codon in the SM50 repeat array is also not the most prevalent in the CLD or the WG in all species examined. In all five members of the S. purpuratus clade, the CAA codon for glutamine is the most frequent in the SM50 repeat array while the alternate CAG codon is most frequent in the WG and the CLD (Table 1A). The CCA codon for proline is the most frequent in the SM50 repeat array while all synonymous proline codons are in equal frequencies in the WG and the CLD. The UUU codon for phenylalanine and CGA for arginine are used exclusively in the SM50 repeat array. In the WG and the CLD, however, the UUC codon for phenylalanine is most frequent while all synonymous arginine codons are found in frequencies less than 37.5%. Therefore, the codon usage frequencies of the SM50 repeat array are drastically different from those in the WG or CLD in the S. purpuratus clade.

Codon usage frequencies in the S. franciscanus clade are similar to the S. purpuratus clade, although there is more variation between species in all data sets (Table 1B). The CAA codon is again most frequent in the SM50 repeat array in all three species while the CAG codon is the most frequent in the WG and the CLD in S. franciscanus and S. nudus. In P. depressus, by contrast, the CAA codon is most frequent in the CLD. The CCA codon is also the most frequent in the SM50 repeat array while none of the proline codons in the WG and the CLD are found in high frequencies. Although phenylalanine is only used once in S. franciscanus and twice in S. nudus, the UUU codon is used exclusively in the SM50 repeat array while the UUC codon is most frequent in the WG and CLD. Also, only the CGA codon of arginine is found in the SM50 repeat array in all three species while all codons are present in the WG and the CLD. The three S.

franciscanus clade species also have drastically different codon usage frequencies in the SM50 repeat array than in the WG or CLD.

The Lytechinus species have different codon usage frequencies than the Strongylocentrotidae species, yet the codon usage in the SM50 repeat array is still different from that in the WG or CLD (Table 1C). In the WG, the CAG codon for glutamine is the most frequent, but the CAA codon is the most frequent in the CLD. Once again the CAA codon for glutamine is the most frequent codon in the SM50 repeat array. Unlike the Strongylocentrotidae species, the CCU codon for proline is most frequent, although still no proline codon is found in high frequencies in the WG and CLD. For phenylalanine, UUC is the most frequent in both Lytechinus species in the SM50 repeat array as well as in the CLD and the WG. Just as in the other eight Strongylocentrotidae species examined, however, only CGA is present in the SM50 repeat array while all arginine codons are present in the WG and the CLD.

The homogenization of the SM50 repeat array is not consistent in all amino acids. In some amino acids one codon is most frequent, while in others two codons are found in equally high frequencies. For example, two codons for phenylalanine are found in high frequencies in L. pictus (Table 1C). Also, contrary to the other codon usage frequencies within the SM50 repeat array, all four glycine codons in all ten species appear to be present in equal frequencies. Both of these exceptions suggest that analysis of homogenization in the SM50 repeat must take the mechanism of concerted evolution into consideration; this is addressed by analyzing codon usage by position in the SM50 repeat.
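Tabulating codon usage by position rather than by amino acid can be sketched as follows. The sketch assumes the repeat array has already been split into aligned units of up to seven codons, with None marking a position that is missing from a shorter unit; the function name and the toy units in the usage note are invented for illustration and are not taken from Table 2.

```python
from collections import Counter


def usage_by_position(repeat_units):
    """Return {position (1-based): {codon: percent of units using it}}.

    repeat_units is a list of aligned repeat units, each a list of
    codons; None marks a position absent from a shorter unit.
    """
    n_positions = max(len(unit) for unit in repeat_units)
    table = {}
    for pos in range(n_positions):
        counts = Counter(
            unit[pos]
            for unit in repeat_units
            if pos < len(unit) and unit[pos] is not None
        )
        total = sum(counts.values())
        table[pos + 1] = {c: 100.0 * n / total for c, n in counts.items()}
    return table
```

With the two invented units ["CAA", "CCA", "GGC", "AUG", "GGU", "GGA", "CAA"] and ["CAA", "CCA", "GGC", "GUG", "GGU", "GGG", "CGA"], positions 3 and 5 each use a single glycine codon (GGC and GGU respectively), a pattern that would be hidden if all glycine codons were pooled.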

Comparison of codon usage by position further illustrates evidence of concerted evolution

The SM50 repeat array was organized into smaller SM50 repeat units similar to the ones described by Meeds et al. (2001). Each SM50 repeat unit was arbitrarily started with codons for glutamine, proline, and glycine because this sequence of amino acids is highly conserved within the SM50 repeat array in all species.

Comparison of glycine codons by position in the SM50 repeat unmasks evidence of concerted evolution. Glycine is the only amino acid that is present in three positions (positions 3, 5, and 6) within each SM50 repeat (Table 2). When the positions are examined separately, it is clear that the codon usage has been homogenized independently in each position, since they utilize different codons. In position 3, the GGC codon is used predominantly in the S. franciscanus clade and in the two Lytechinus species. GGU is the only other codon used in these two groups, and it is present in very few SM50 repeats (10.5%-16.7% of the codons in the S. franciscanus clade, and 6.7%-7.1% in the two Lytechinus species) (Table 1). In the S. purpuratus clade there is more diversity in codon usage, and there are more differences between members of the clade. S. purpuratus, S. droebachiensis, S. pallidus, and A. fragilis all make use of all codons except GGG; H. pulcherrimus, however, uses primarily GGC, with GGU making up the remainder.

In position 5, the GGU codon is most frequent in the S. purpuratus clade and the S. franciscanus clade, while the GGG codon is absent (except for S. franciscanus, where it is used once). In contrast, the two Lytechinus species use three codons (GGC, GGG, and GGU), and they also have codon usage frequencies different from each other. L. pictus uses the GGG codon most often and L. variegatus uses GGC. In position 6 in all species, the GGA and the GGG codons

are found almost exclusively. The GGC codon is found only in the S. purpuratus clade in very low numbers (0%-3.3%), and the GGU codon is found only in H. pulcherrimus (4.2%) and P. depressus (6.7%). In summary, glycine codons are, indeed, homogenized just as other codons are in the SM50 repeat, although they are homogenized by position instead of by amino acid.

Two positions (4 and 7) can encode more than one amino acid. In these positions, homogenization seems to be dependent on amino acid rather than on position. Position 4 can encode one of several non-polar amino acids: methionine, valine, phenylalanine, or tryptophan. When phenylalanine is present in Strongylocentrotidae species, the UUU codon is used exclusively. In L. variegatus, UUC is used primarily (69.2%) over UUU and the AUG codon for methionine. Position 7 can encode either glutamine or arginine. Only a single codon (CGA) is utilized for arginine in all species examined. This differs by one base from the CAA codon for glutamine that predominates in position 1 in all species (Table 2).

Analysis of the most frequent codon (by amino acid and by position) illustrates clade-specific patterns. The most frequent codon for each amino acid in the SM50 repeat array is the same within a clade, yet different between clades. The two clades of the Strongylocentrotidae family, in general, share similar codon usage frequencies while the two Lytechinus species differ. For example, the most frequent proline codon in the Strongylocentrotidae family is CCA (above 77.4% in all species), but CCU is predominantly used in the two Lytechinus species (93.3%, 92.3%; Table 1). Codon usage frequencies by position

illustrate differences between all three clades (Table 2). In position 3, the GGA codon is found in the S. purpuratus clade yet is absent in the other two clades. Codons for valine and tryptophan are also found only in the S. purpuratus clade. Homogenization of codons in the SM50 repeat array correlates with clades, indicating that the divergence times of species may have some influence on codon usage frequencies.

Levels of homogenization in the SM50 repeat array are not equal in all amino acids, nor in all positions. In some cases, two codons are found in high, almost equal frequencies (Table 2). In position 4, roughly equal amounts of AUG for methionine and GUG for valine exist in the S. purpuratus clade, while in L. pictus both phenylalanine codons (UUU and UUC) are used in similar amounts. In position 6 in the S. purpuratus clade as well, the GGA and GGG codons are found in almost equal frequencies.

The use of two most frequent codons, however, is not the only example of unequal homogenization. There are a number of amino acids where a single codon is used almost exclusively in all ten species and the other codons are never found in the SM50 repeat array, although they are present in the WG and the CLD. Valine, in the S. purpuratus clade, is almost always coded for by the GUG codon (69.2%-100%; Table 1). The most extreme example, however, is found in arginine, where only one of the six possible codons (CGA) is found in all ten species examined.

In some cases the exclusion of certain codons is position-specific. The GGG codon in position 3, and the GGC and GGU codons in position 6, are three such examples. The CAA codon for glutamine seems to be found in unusually high frequencies (greater

than 79.2%) in all ten species examined (Table 1). This seems to be an artifact of a position-specific effect. In position 1, the CAA codon is found almost exclusively (86.7%-100%) in all species, while in position 7 the frequency is more moderate (Table 2).

The exclusion of certain codons by position is suspicious. Our hypothesis is that there may be an additional selective force that is influencing the codon usage frequencies of this region. To investigate this, we examined tRNA frequencies, GC content, and the potential to form RNA secondary structures as possible explanations for what may be non-neutral evolution of the SM50 repeats.

tRNA frequencies

There is about an equal frequency of tRNA genes for each codon of glutamine, although the tRNA that recognizes CAG is more frequent. In proline, the tRNAs that recognize CCA and CCU are most frequent. In glycine, the tRNAs for GGC and GGA are most common. There are nearly twice as many tRNAs that recognize the UUC codon of phenylalanine as recognize UUU. The tRNA that recognizes the GUG codon of valine is most common, followed by GUA. Of the six possible codons for arginine, there are many tRNAs that recognize AGA, CGA, and CGU in similar frequencies, while the tRNA for AGG is less frequent, and so far only one tRNA gene has been discovered that recognizes CGC and CGG.

GC content

The GC content was calculated for the SM50 repeat array in all ten species (Table 4). In all species, the GC content was 59.34%-65.31%, being lowest in the S. purpuratus clade and highest in the two Lytechinus species.

mRNA secondary structure

The percent change of the free energy of the altered SM50 mRNA was calculated using a computer-simulated test. Positive changes reflect an increase in stability of the secondary structure of the altered mRNA, while negative changes mean the altered mRNA has a less stable secondary structure than the natural mRNA. If there is selection against a more stable mRNA structure, those codons that are seen in low frequencies should produce a dramatic increase in the percent change (Figure 3A-G).

Figure 3A reflects the change in free energy when only the codons in position 1 are altered. According to Table 2, glutamine is the only amino acid present in this position, and therefore only the CAA and CAG codons were tested. In all species, increasing the frequency of the CAA codon produces little change in the stability of the secondary structure, but increasing the frequency of the CAG codon produces a >10% increase in the stability of the secondary structure in all species.

In Figure 3B, the results of altering the codon usage frequency in position 2 are illustrated. Proline is the only amino acid present in this position, and therefore altered mRNAs with all four possible proline codons were tested. Although the CCG and CCU codons are found in the natural mRNA in low frequencies, increasing their frequencies to 100% causes only a slight decrease in the stability of the secondary structure. The CCC

codon produces an increase in stability in all species examined and is absent in the S. franciscanus clade and the two Lytechinus species examined, although in the S. purpuratus clade it is found in moderate frequencies (~20%).

Glycine is the only amino acid present in position 3, and all four codons possible for that amino acid were tested in Figure 3C. The majority of the codons produce either a less stable secondary structure or a very slight increase in stability (as with the GGC codon). However, in the two Lytechinus species the GGG codon (which is a rare codon in these species) does produce an increase in the stability of the secondary structure.

Position 4 (Figure 3D) is the position with the most diversity in amino acids and codons possible in the natural mRNA (Table 2). However, not all of the amino acids are present in all three clades examined. The S. purpuratus clade contains phenylalanine (UUC and UUU codons), methionine (AUG), valine (GUA, GUC, GUG, GUU), and tryptophan (UGG). In the natural mRNA only the GUG codon is used for valine. Increasing the frequencies of the phenylalanine and methionine codons does not produce dramatic results, but increasing the frequency of the GUC valine codon did (>10% increase in stability). The species in the S. franciscanus clade and the two Lytechinus species only contain phenylalanine and methionine in position 4. In the S. franciscanus clade, increasing the frequency of either phenylalanine codon to 100% increases the stability of the secondary structure, while the same test does not have a dramatic effect on the two Lytechinus species.

Figure 3E illustrates the change in stability of the secondary structure when the glycine in the fifth position is altered. None of the four possible codons produce an

increase in stability of the secondary structure when their frequency was increased to 100%. In Figure 3F, however, two of the four possible glycine codons in the sixth position do produce an increase of >15% in the stability of the secondary structure in all species examined. These two codons, GGC and GGU, are present in very low frequencies in all of the natural mRNAs.

In position 7 (Figure 3G), there are two possible amino acids, glutamine (CAA and CAG) and arginine (AGA, AGG, CGA, CGC, CGG, CGU), and all eight possible codons were tested. In all species, increasing the frequency of either glutamine codon does not increase the stability of the secondary structure. Increasing the frequency of three of the arginine codons (CGC, CGG, CGU), however, does produce an increase in the stability of the secondary structure. None of these three codons are present in any of the species examined. The other three codons (AGA, AGG, and CGA) do not produce an increase in the stability of the secondary structure. The CGA codon is present in all species examined, but the AGA and the AGG codons are not.

Discussion

There is evidence of concerted evolution influencing the synonymous codon usage in the SM50 repeat array

There are many cases of codon usage being influenced by selection. Codon usage of many protein-coding genes is influenced by the codon usage bias of the WG (Perriere and Thioulouse 2002). If the most frequent synonymous codon in the SM50 repeat array is also the most frequent synonymous codon in the WG, selection due to WG

may be the cause. In addition, there are many selective pressures that alter the codon usage bias for the entire length of a protein-coding gene that do not correlate with WG codon usage frequencies (Archetti 2004; Sharp et al. 2005). In all these cases, the codon usage of the entire protein-coding region of the gene is under selection. Therefore, if SM50 is under a similar selective pressure, the most frequent codon in the non-repetitive region (CLD) should be the same as in the SM50 repeat array.

In all sequences studied, the codon usage in the SM50 repeat is more homogeneous than that of the CLD or the WG, confirming that the region is influenced by concerted evolution (Table 1). In most cases, the most frequent synonymous codon in the SM50 repeat array is not the most frequent in the WG or the CLD (Table 1). The lack of correlation indicates that selection for specific codons, whether due to WG frequencies or acting on the SM50 gene itself, is not the cause of the codon usage bias in the SM50 repeat array. This confirms that the codon usage frequencies in the SM50 repeat array are governed by concerted evolution. Concerted evolution alone, however, cannot account for the codon usage frequencies, because the degree of homogenization is not equal in every amino acid.

The mechanism of concerted evolution influences homogenization of codons

The mechanism of concerted evolution in SM50 involves unequal crossover followed by gene conversion (Meeds et al. 2001). In order to keep the repeat units intact, misalignment must involve an SM50 repeat unit, or a multiple thereof (see Chapter 2). Thus, only those codons that are in the same location within the repetitive element will be homogenized. Considering this mechanism, the SM50 repeat array was divided by the

smallest repetitive unit of seven amino acids (Figure 1). Separation by position clarifies the inconsistencies of the previous analysis. For example, the apparent lack of concerted evolution in the glycine codons can be explained by the multiple locations of the glycine codons (Table 2). Glycine occupies positions 3, 5, and 6, and a different glycine codon is most frequent in each position in all species. This confirms that the frequencies of glycine codons in each position are evolving independently of each other, and that they have indeed undergone concerted evolution. This is similar to the multiple codons found in spider silk (Hayashi and Lewis 2000).

In a number of species, there are positions in the SM50 repeat where two codons are used in high frequency. These are usually distributed in a regular pattern that can be attributed to expansion of multiples of SM50 repeats due to concerted evolution (Meeds et al. 2001; Table 2). Positions 2, 3, and 4 in the S. purpuratus clade are examples of this (Table 2). The AUG that encodes methionine in position 4 differs by only one base from the GUG codon that encodes valine, and there are roughly equal amounts of both codons. This likely reflects a substitution that was then propagated by the models of concerted evolution (larger blocks of duplications) discussed in Chapter 2. In L. pictus, the two codons for phenylalanine are present in roughly equal amounts in the SM50 repeats (Table 1C). The presence of two codons in equal amounts reflects a model of concerted evolution involving duplication of a pair of SM50 repeats within this species (see Chapter 2). Therefore it is possible that a repeat unit involving multiple SM50 repeats has been involved in concerted evolution.

Substitutions following speciation and additional concerted evolutionary events can also alter the degree of homogenization. In the SM50 repeat array glutamine codons

in position 1 seem to be an example. While glutamine in this position is predominantly encoded by CAA, there are one or two CAG codons in this position (Table 2). The SM50 repeats that contain CAG codons do not correspond between different species (Meeds et al. 2001). In addition, the S. purpuratus clade has the highest variation in codon usage of the groups examined, which may be due to a high rate of substitution followed by local expansions and contractions of parts of the SM50 repeat array (Biermann et al. 2003).

When examined at the amino acid level, the two Lytechinus species appear to be almost identical in the sequence of the SM50 repeat array; however, codon usage gives evidence of substitution following speciation (Meeds et al. 2001). When the codon usage is examined at position 4, the increased prevalence of the UUC codon in L. variegatus clearly shows that this species has undergone concerted evolution following speciation. In position 5, L. pictus uses the GGG codon most often and L. variegatus uses GGC. The divergence here would indicate that substitution and concerted evolution occurred following speciation in Lytechinus.

Codon usage leaves the footprints of concerted evolution

Examination of the pattern of codon usage across species and clades allows the inference of what substitutions occurred following species divergence. For example, in position 2, the members of the S. purpuratus clade all utilize the CCA codon for proline, but also utilize the CCC codon. The CCC codon is present in the central region of the SM50 repeat array in all of these species (Appendix 1), and has been expanded through concerted evolution, although the model and pattern of concerted evolution differ between species (Table 1; Meeds et al. 2001; Chapter 2). A substitution in the ancestor to the S. purpuratus clade

must have converted a CCA to a CCC in one SM50 repeat, and this served as a template during concerted evolution. The substitution occurred following divergence from the S. franciscanus clade, since its members do not utilize CCC (Table 1). The Lytechinus species instead utilize CCU in this position. It is unclear what codon was utilized in the last common ancestor to Strongylocentrotidae and Lytechinus, but in the last common ancestor to the two Lytechinus species, an SM50 repeat with CCU in position 2 was used as the template for concerted evolution. The ancestor to all species examined must have utilized GGC to encode glycine in position 3, but in the S. purpuratus clade two separate substitutions occurred that were amplified through concerted evolution (Table 2; Meeds et al. 2001; Chapter 2). Codon usage patterns at each position of the SM50 repeat support existing phylogenetic relationships, including recent reports that S. franciscanus, S. nudus, and P. depressus constitute a separate clade (Biermann et al. 2003; Lee 2003; Figure 2).

At some positions in the SM50 repeat, the pattern of codon usage does not appear to be due to neutral evolution. By looking at 10 species in two families, we can see that there have been substitutions within the SM50 repeats that have altered codon usage. These have often been expanded in number through concerted evolution. Variation in codon usage between species indicates that the substitution process itself is neutral. The phylogenetic history of a species clearly has influenced the codon usage as well. Even though the different species have gone through different types of concerted evolution, we can trace substitution events back to common ancestors in many cases.
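The scenario described above, in which a single substitution in one repeat is later copied into neighboring repeats, can be illustrated with a toy model of unequal crossover. This is only a sketch under simplifying assumptions: each repeat unit is reduced to a label (here its position-2 codon), misalignment is by a whole number of units, and neither gene conversion nor fixation in the population is modeled. The function name and parameters are hypothetical.

```python
def unequal_crossover(x, y, offset, breakpoint):
    """Recombine two repeat arrays misaligned by `offset` repeat units.

    Unit i of array y is paired with unit i + offset of array x, and
    the exchange happens just before unit `breakpoint` of y. Returns
    the two products: one that gains `offset` units and one that
    loses `offset` units.
    """
    if offset < 0 or breakpoint < 0:
        raise ValueError("offset and breakpoint must be non-negative")
    if breakpoint > len(y) or breakpoint + offset > len(x):
        raise ValueError("breakpoint lies outside the aligned region")
    longer = x[:breakpoint + offset] + y[breakpoint:]
    shorter = y[:breakpoint] + x[breakpoint + offset:]
    return longer, shorter


# Toy array: one repeat carries a CCA -> CCC substitution in position 2.
array = ["CCA", "CCA", "CCC", "CCA"]
longer, shorter = unequal_crossover(array, array, offset=1, breakpoint=2)
print(longer)   # the CCC repeat is now present twice
print(shorter)  # the CCC repeat has been lost
```

An offset of 1 corresponds to misalignment by a single repeat unit; an offset of 2 duplicates or deletes a pair of adjacent units, which is the kind of event that would leave two codons alternating at the same position.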

Not all inconsistencies in the homogenization of codon usage frequencies, however, can be explained by a mechanism of concerted evolution involving multiple SM50 repeats, or by high frequencies of substitution. In some positions, one codon is found almost exclusively and all others are conspicuously absent. For example, CAA encodes glutamine in extremely high frequencies in position 1, yet both codons are found in moderate amounts in position 7. Also in position 7, only the CGA codon for arginine is found in all species. In positions 3, 5, and 6 some glycine codons are never utilized in most species. Some valine codons in position 4 are never or rarely observed as well. These patterns suggest that there may be some selective force acting on codon usage within the SM50 repeat. The three possibilities advanced for selection at the DNA level are tRNA frequencies (Sharp et al. 1995), GC content (Craig and Riekel 2002), and RNA secondary structure (Katz and Burge 2003). Each of these is discussed below.

tRNA frequencies cannot account for the unusual codon usage frequencies

Genes that are highly translated, genes that have unusual expression patterns or repetitive regions, and genes of certain lengths all use restricted sets of codons (Sharp et al. 1995). This is often attributed to the tRNA frequencies present in the cell. A classic example of this is the silk glands of B. mori, which increase the production of tRNAs for the codons utilized in silk proteins when the silkworm starts producing silk (Sharp et al. 1995; Lizardi et al. 1979). Efficient translation of these types of genes is thought to be facilitated by the presence of codons in the gene that are recognized by the most prevalent tRNA.

Comparison of the codons used in the SM50 repeat array in S. purpuratus with the prevalence of tRNA genes in the S. purpuratus genome indicates that this is not the case with the SM50 gene. The tRNA genes encoding the anticodons that recognize CAA and CAG are present in similar numbers (Table 3), while CAA is the predominant codon. For proline, the tRNA that recognizes the CCA codon (the most frequent proline codon in the SM50 repeat array) is one of the two most prevalent genes for tRNA-Pro. However, the tRNA genes that recognize CCU are just as prevalent as those for CCA, but the CCU codon is not used in Strongylocentrotidae (Table 1). In no case is there a correlation between the codon usage in the SM50 repeat array and the prevalence of tRNA genes in the genome. Therefore this is not a selective force on the evolution of the SM50 genes.

GC content cannot account for the unusual codon usage frequencies

Selection against GC content has been proposed to account for the biased codon usage frequencies in genes that use amino acids whose codons are GC rich, as in silk proteins (Craig and Riekel 2002; Galtier et al. 2001). When a protein-coding region contains codons for amino acids that have a high content of G or C, the codons that contain A or T in the wobble position should have the highest frequency. The SM50 repeat array is indeed rich in amino acids with a G or C in the first or second positions, creating a GC content of about 60% in all ten species. The normal GC content varies from 40% to 80% (Galtier et al. 2001). The codons that have the highest frequencies in the SM50 repeat array, however, do not always have an A or T in the wobble position (calculated from data in Table 1). Therefore we conclude that selection on GC content cannot account for the codon usage frequencies either.
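Both quantities used in this argument, the overall GC content and whether a codon carries A or T (U in mRNA) in the wobble position, are simple to compute. The sketch below is a minimal illustration; the function names are hypothetical and the codons in the usage note are those named in the text, not a recalculation of Table 1 or Table 4.

```python
def gc_content(seq):
    """Percent of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def wobble_is_a_or_t(codon):
    """True if the third (wobble) base of the codon is A, T, or U."""
    return codon[2].upper() in "ATU"
```

Applied to the predominant codons named earlier, CAA, CCA, UUU, and CGA end in A or U, whereas GGC (position 3 in the S. franciscanus clade and Lytechinus) and GUG (valine in the S. purpuratus clade) do not, which is the mixed pattern the paragraph above describes.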

There may be selection against a highly stable secondary structure in the mRNA

Secondary structures that perform inhibitory functions have been found in the coding sequences of some genes (Katz and Burge 2003). For example, the secondary structure of the mRNA produced by the ASH1 gene inhibits translation (Chartrand et al. 1999). For this reason, it is commonly believed that secondary structures should be avoided in protein-coding genes (Katz and Burge 2003). Still, mRNA secondary structures are sometimes unavoidable due to the amino acids that the mRNA must encode. The mRNA of spider silk, for example, has a high degree of secondary structure due to the use of the GC-rich amino acids proline, glycine, and alanine (Andersen 1970). Selection on codon usage is thought to have conserved the existing secondary structure (Mita et al. 1988). Sea urchin spicule matrix genes also contain GC-rich amino acids, and the mRNAs are capable of forming stable secondary structures. If there is indeed a selective force for maintaining the mRNA secondary structure (in SM50), there should be a correlation between rare codons and an increase in secondary structure (Carlini et al. 2001).

Some positions show evidence of codon usage restricted by selection while others do not

Altering the codon usage frequencies at positions 1 and 6 produces a large increase in the stability of the secondary structure in all ten species examined (Figure 3A and 3F). The rarity of the CAG codon in position 1 and of the GGC and GGU codons in position 6 may be due to selection at the mRNA level.

In positions 7 and 4, some rare codons may have been selected against while others may be the result of the neutrality of concerted evolution. Only one arginine

codon (CGA) is present in position 7 in any of the ten species examined (Table 2). Increasing the frequency of the CGC and CGU codons produces mRNAs with secondary structures of increased stability, and these codons may therefore be selected against (Figure 3G). In contrast, increasing the frequencies of the AGA, AGG, and CGG codons did not consistently produce mRNAs with secondary structures more stable than the natural mRNA. It is possible that in position 7 the lack of the CGC and CGU codons is due to selective constraints, while the lack of the AGA, AGG, and CGG codons is due to models of neutral evolution instead. Valine is only found in the S. purpuratus clade in position 4. The secondary structure analysis indicates that the GUC codon, which is absent from all species, may be selected against (Figure 3D). In contrast, increasing the frequency of either of the two rare codons (GUU and GUA) produces a relatively small change, suggesting that negative selection would be weak and is likely not a factor. The predominance of the GUG codon over the other two codons likely lies in its derivation from an AUG (methionine) codon, but there may be some selection against the GUC codon in the Strongylocentrotidae family. This suggests that although there may be some selective force preventing the usage of some codons, others may be rare or absent due to neutral evolution instead.

In other positions, not a single rare codon seems to alter the secondary structure of the mRNA, and the rarity of these codons therefore cannot be due to selection. Position 5 illustrates this very clearly (Figure 3E). None of the codons possible in position 5 produce a more stable secondary structure when their frequencies were increased to 100%. Therefore, some of the codon usage is, indeed, truly neutral.
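The secondary-structure comparisons discussed above rest on two simple operations: replacing every codon at one repeat position with a single alternate codon, and expressing the resulting free energy as a percent change relative to the natural mRNA. The sketch below illustrates both. It assumes the repeat array is already split into aligned units, and it takes the free energies as inputs (for example, values reported by RNAfold) rather than computing them. Because folding free energies are negative, the sign of the percent change depends on the order of subtraction; this sketch subtracts the natural value from the altered value so that a positive result corresponds to a more stable altered structure, which is the convention stated in the Results. That ordering, like the function names, is an assumption of the sketch.

```python
def alter_position(repeat_units, position, new_codon):
    """Return copies of the units with `position` (1-based) set to new_codon.

    Only one position is changed, mirroring the one-position-at-a-time
    design; units too short to contain the position are left unchanged.
    """
    altered = []
    for unit in repeat_units:
        unit = list(unit)
        if position - 1 < len(unit):
            unit[position - 1] = new_codon
        altered.append(unit)
    return altered


def percent_change(natural_dg, altered_dg):
    """Percent change in folding free energy relative to the natural mRNA.

    Assumed sign convention: positive means the altered mRNA folds into
    a more stable (more negative free energy) structure.
    """
    if natural_dg == 0:
        raise ValueError("natural free energy must be non-zero")
    return 100.0 * (altered_dg - natural_dg) / natural_dg
```

For example, with invented free energies of -100.0 for the natural mRNA and -115.0 for an altered mRNA, percent_change returns 15.0, while an altered value of -90.0 gives -10.0.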

An additional complication arises when increasing the frequency of the rare or absent codons produces mRNAs with more stable secondary structures in all species within a clade, but not in all clades. This suggests there may be clade-specific limitations. In position 4, all species have the amino acid phenylalanine (Table 2), but it is only in the S. franciscanus clade that increasing the frequency of either codon for that amino acid (UUC or UUU) produces an mRNA with increased secondary structure stability (Figure 3D). It is possible that the low frequencies of the two codons in this clade are due to a selective constraint, but that constraint is not found in the S. purpuratus or the Lytechinus clades. Increasing the frequency of two codons in position 3 (GGA and GGG) and one in position 2 (CCG) also produces mRNAs with increased secondary structure stability in the two Lytechinus species, but not the others (Figure 3). It is possible that the absence or low frequency of these codons in the two Lytechinus species is a result of selection against stable RNA secondary structure, but that the sequence differences in the S. purpuratus and S. franciscanus clades remove the ability of these codons to stabilize secondary structures.

Conclusions

Our data suggest that the main force driving codon usage in the SM50 repeat arrays examined is neutral evolution based on substitutions and concerted evolution. The need to encode a functional protein with a defined structure limits the substitutions that are allowed. Substitutions that occur following concerted evolution cause sequence diversity in isolated repeats, while substitutions that are propagated by concerted evolution create a discernible pattern in the repeats. There is some evidence that usage of

some codons in certain positions of the SM50 repeat array could allow the SM50 mRNA to form a stable secondary structure that could inhibit translation. This would select against the use of these codons, and we propose that this has played a role in codon usage at some positions in the SM50 repeat.

Chapter Two: Models of concerted evolution

Introduction

Concerted evolution has been well documented in microsatellite DNA (reviewed by Ugarkovic and Plohl 2002), tandemly repeated genes, and protein-coding multigene families (reviewed by Liao 1999). Concerted evolution through unequal crossover of tandemly repeated segments of DNA also occurs within the coding region of single-copy protein genes (Bierman 1998; Swanson and Vacquier 1998; Hayashi and Lewis 2000; Meeds et al. 2001). Unequal crossover homogenizes sequences and changes the number of repeated segments in these tandem arrays (Dover 1982; Walsh 1987b; Dover 1993; Elder and Turner 1995; Ohta 2000). Base pair substitutions counteract unequal crossover by diversifying each repeated segment. Sufficient diversification will hinder misalignment, thereby preventing unequal crossover, and thus causing affected repeats to stabilize in number and evolve independently (Smith 1976; Dover 1986; Murti et al. 1992; Thomas 1998). If this tandem array of DNA produces a functional protein, unequal crossover and base substitution must operate within the constraints of purifying selection, or result in non-functional proteins. This can produce diseases such as fragile X syndrome, spinobulbar muscular atrophy, and Huntington's disease (Lupski 1998; Baldi et al. 1999; Parniewski and Staczek 2002). Although unequal crossover, base substitutions, and selection may all be influencing a single tandem array, often one force is strong enough to overwhelm the effects of the others. Because of this, most examples of repetitive genes that have undergone concerted evolution contain little sequence variation between repeats. It is difficult to determine when and how concerted evolution takes


place in these sequences. Thus, the interplay of concerted evolution (through unequal crossover), base pair substitutions, and purifying selection is not completely understood. This study examines concerted evolution in a gene that, while under purifying selection, still contains variation in repeat number and diversity in nucleotide sequence, thereby providing insights into the balance of models involved.

SM50 is a spicule matrix gene involved in the development of the embryonic skeleton of sea urchins (Wilt 2002; Wilt et al. 2003). All studied spicule matrix genes contain a C-type lectin domain and a series of tandemly repeated sequences rich in proline and glycine, suggesting that the repeated sequences are functionally involved in biomineralization (Illies et al. 2002). In SM50, each repeated sequence (SM50 repeat) is 15 bp, 18 bp, or 21 bp long and is imperfectly tandemly duplicated 14-30 times, depending on the species (Meeds et al. 2001). Although the SM50 repeats are functional, limited synonymous and non-synonymous substitutions are found both within and between species (Meeds et al. 2001). This indicates that the overall physical structure of the encoded protein is more important than the exact amino acid sequence of each SM50 repeat. In fact, in hybrid embryos of S. purpuratus and L. pictus, both copies of SM50 are expressed and functional, indicating that large amounts of variation can be tolerated (Brandhorst and Davenport 2001). Thus, although selection is acting on the tandem array of SM50 repeats, it is relaxed enough to allow variation to persist.

The repeat array in SM50 gives further insight into how concerted evolution, base pair substitution, and purifying selection interact during the evolution of small repeats within the protein-coding region of a single-copy gene. The pattern of base substitution should direct the expansion of the tandem array of SM50 repeats. In each species,


different types of unequal crossover events involving single repeat units, double repeat units, and larger repeat units may have occurred (see Chapter 1). In this study the pattern of repeats is examined in an attempt to understand how concerted evolution has taken place in a variety of species. Variation within selected species is also examined for recent events of base pair substitution and unequal crossover, and alleles that differ in the number of repeats have been found in three species. The sequences of the SM50 repeats in these alleles are similar enough to propose models for how misalignment and unequal crossover took place to give rise to them.

Materials and Methods

Species utilized and genomic DNA isolation

DNA sequences of S. purpuratus, H. pulcherrimus, S. franciscanus, L. pictus, and L. variegatus were used in the analysis (Appendix 1; Chapter 1). Additional DNA was isolated from individuals of S. purpuratus, S. pallidus, S. droebachiensis, and S. nudus using the extraction methods previously mentioned. Individuals of S. purpuratus, S. droebachiensis, and S. pallidus were collected from various regions along the Pacific coast of North America, additional samples of genomic DNA of S. pallidus were supplied from Norway, and samples of S. nudus were supplied from Japan (Appendix 4).

Polymerase Chain Reaction (PCR) and Cloning

SM50 sequences were amplified from genomic DNA as described above (see Chapter 1; Meeds et al. 2001) with the following modification. Primer


5'-ACGGATCCTTYTCXCARGAYAACCARATGGARATGGA-3' was replaced by 5'-MRGAYAACCARATGGAGAAYGAGGTT-3' in the S. pallidus sample from Norway. PCR products were purified, cloned, and extracted as before (see Chapter 1).

Sequencing

Sequencing was provided by Macrogen Inc. (Kasang-Dong, Korea) and by SeqWright DNA Technology Services (Houston, TX). Sequences were assembled, cleaned, and polished as before (see Chapter 1).

Analysis of DNA sequences using dot plot analysis

Blocks of duplications were identified through the use of dot plot analysis, which compares two sequences through a sliding window and places a dot on a graph wherever the two sequences meet the required level of similarity (identical or within a few base pairs). Parallel lines on the dot plots represent locations where duplications occur that would be difficult to find by eye (Thomas 1998). The SM50 repeat array of each sample was compared to itself in Dottup (EMBOSS) or Dotmatcher (EMBOSS) run from Institute Pasteur, Catherine Letondal (http://bioweb.pasteur.fr/seqanal/interfaces/dottup.html#input and http://bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html#input). All sequences were placed in Dottup first to find exact sequence matches, and then placed in Dotmatcher to allow for mismatches due to substitutions. Because SM50 in all species contains long 21 bp SM50 repeats and some contain additional truncated 18 bp SM50 repeats (Meeds et al. 2001), dot plot analyses using windows built from combinations of 21 bp and 18 bp, and allowing up to 3 bp


mismatches were made. Dot plots that illustrate duplications with the greatest clarity are shown. Dot plot analysis was used to compare an allele to itself and to compare different alleles found in the same species. When comparing alleles of different lengths, the longer allele is always on the horizontal axis and the shorter is on the vertical axis. The center-line, shown in bold, indicates where the alleles meet the required sequence identity. The break in the center-line indicates the region where extra SM50 repeats occur in the longer allele but not in the shorter. Grey dotted lines indicate where on the longer allele the extra SM50 repeats are located. Because the longer allele is always on the horizontal axis, overlapping center-lines indicate that the additional SM50 repeats are tandemly repeated in the longer allele. Parallel lines that are discussed are in bold.

Results

Analysis of SM50 repeat duplications in representative species in three clades

Dot plot analysis was used to examine the pattern of repeat duplications in representatives of the three clades used in this and previous studies (Meeds et al. 2001). The sequences used were published previously (Meeds et al. 2001) and are included in Appendix 1.

Species in the S. purpuratus clade all have truncated 18 bp SM50 repeats interspersed among the 21 bp SM50 repeats shared with all other species examined (Figure 1). The pattern of short and long SM50 repeats differs in the two species, as does the sequence diversity within the repeats. S. purpuratus has alternating 18 bp and 21 bp SM50 repeats at the 5' end of the repeat array, and a pattern of two 21 bp and one 18 bp SM50 repeats at the 3' end as


illustrated by looking at the direct sequence (Appendix 1). In the center of the array there is a combination of the two patterns. The dot plot analysis with a window of 39 bp illustrates the alternating 18 bp and 21 bp SM50 repeats (Figure 4). Previous work, however, showed that the sequence diversity in the SM50 repeats in S. purpuratus is the greatest seen in any species examined (Meeds et al. 2001). Because of this, dot plot analysis does not illustrate the pattern of two 21 bp and one 18 bp SM50 repeats (data not shown). The dot plot analysis indicates that a large duplication occurred recently in the center of the S. purpuratus SM50 repeat array (Figure 4). The 99 bp analysis (allowing 3 mismatches) suggests that a combination of three 21 bp and two 18 bp SM50 repeats may have duplicated in a single event followed by three substitutions (Figures 4, 7). The dot plots using smaller window sizes pick up subsets of this region.

H. pulcherrimus has a longer sequence of alternating 18 bp and 21 bp SM50 repeats at the 5' end, perfectly duplicated in tandem (Meeds et al. 2001; Appendix 1). The 3' end of the SM50 repeat array consists of long 21 bp SM50 repeats that vary in sequence. The dot plot analysis using a 39 bp window shows that the pattern of 18 bp and 21 bp SM50 repeats at the 5' end of the SM50 repeat array has been duplicated several times (Figure 4). It is possible that a higher order duplication in a single event caused this pattern as well, as illustrated by the duplication seen in the 78 bp window (Figures 4, 7).

In S. franciscanus, there are no truncated SM50 repeats of 15 bp or 18 bp (Meeds et al. 2001; Appendix 1). The central portion of the SM50 repeat array consists of two consecutive 21 bp SM50 repeats of differing sequence duplicated in tandem several times. These regions are illustrated in the 42 bp window dot plot analysis (Figure 4).
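The sliding-window comparison described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EMBOSS Dottup or Dotmatcher implementation: the function names dot_plot and diagonal_offsets and the toy sequence are hypothetical, and similarity is scored here as a simple count of mismatched bases. The toy array alternates a 21 bp and an 18 bp unit, so a 39 bp window should place parallel lines at offsets that are multiples of 39 bp.

```python
def dot_plot(seq_a, seq_b, window, max_mismatches=0):
    """Return (i, j) window start positions where seq_a and seq_b are similar.

    A dot is recorded when the two windows differ at no more than
    max_mismatches positions (0 means exact matching only).
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            win_a = seq_a[i:i + window]
            win_b = seq_b[j:j + window]
            mismatches = sum(1 for x, y in zip(win_a, win_b) if x != y)
            if mismatches <= max_mismatches:
                dots.append((i, j))
    return dots


def diagonal_offsets(dots):
    """Collect the offsets (j - i > 0) of the lines parallel to the center-line."""
    return sorted({j - i for i, j in dots if j > i})


# Toy array (hypothetical sequence, not SM50 data): a 21 bp unit and an
# 18 bp unit alternate three times, giving a 39 bp period.
long_unit = "CAACCAGGCATGGGTGGACGA"   # 21 bp
short_unit = "CAACCAGGCGTGGGTAAC"     # 18 bp
array = (long_unit + short_unit) * 3

# Self-comparison with a 39 bp window; expected offsets: [39, 78]
print(diagonal_offsets(dot_plot(array, array, window=39)))
```

Raising max_mismatches in this sketch plays the role that the mismatch allowance plays in the analyses above: diverged copies of a block remain visible as parallel lines.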


This pattern could have been generated by several duplications of 42 bp, or by a higher order duplication, as suggested by the dot plot analysis with an 84 bp window (Figures 4, 7).

The two Lytechinus species also have primarily 21 bp SM50 repeats (L. pictus has a single 18 bp SM50 repeat) (Meeds et al. 2001; Appendix 1). The 21 bp dot plot analyses, however, differ between the two species, indicating that the most recent changes have occurred in different places within the SM50 repeat array of each species (Figure 4). Duplications of a pair of 21 bp SM50 repeats have also occurred in each species, as illustrated by the dot plot analysis with a 42 bp window (Figures 4, 7).

Analysis from additional species

Although no variation in the size of the SM50 repeat array has been detected in S. purpuratus (data not shown), variation was found in the length of the SM50 repeat array in S. droebachiensis, S. pallidus, and S. nudus (Figures 5A-D). The new S. droebachiensis sequences were identical to previously published sequences and to each other, except for a few missing or additional SM50 repeats. The S. pallidus SM50 repeat array is more similar to that of S. droebachiensis than to that of any other species, although the patterns of 21 bp and 18 bp SM50 repeats are different, suggesting that concerted evolution occurred after speciation. The S. pallidus alleles from Washington and from Norway also vary, within each population, only in the number of SM50 repeats. The Washington and Norway alleles, however, are different enough from each other to indicate that concerted evolution has occurred separately in the two populations. The S. nudus alleles retain some of the alternating 21 bp SM50 repeats seen in S.


franciscanus, but there has been divergence, and the S. nudus alleles contain unusual 15 bp repeats in the center of the SM50 repeat array (Figure 5A).

Patterns of duplication within species

Two alleles of S. nudus were found that differed by a single 21 bp SM50 repeat (Figure 5A). When the two alleles were compared to each other using dot plot analysis, a region of extra sequence was identified in the long allele about 300 bp into the SM50 repeat array from the 5' end (Figure 6Aa). The overlap seen in the center-line indicates that this SM50 repeat has the same sequence as an SM50 repeat next to it. Comparison of the longer allele (S.nudJP18) to itself indicates a region near the extra SM50 repeat where the DNA sequence is identical (Figure 6Ab), which is a possible location of misalignment during unequal crossover (Figure 8). Comparison of the shorter allele (S.nudJP17) to itself also indicated an area of possible misalignment (Figures 6Ac, 9). Therefore, the difference between these two alleles could be explained either by a deletion of a single SM50 repeat giving rise to the short allele or by a duplication giving rise to the long allele.

Two alleles of different lengths were also found in S. droebachiensis (Figure 5B). A dot plot comparison between the two alleles identified a region about 400 bp into the SM50 repeat array that was dissimilar between the two alleles (Figure 6Ba). The center-lines overlap, indicating a region of DNA sequences that are repeated identically in tandem. Comparing the long allele (S.droWA30) to itself identified a region of identical sequence (Figure 6Bb) and therefore a possible location of misalignment (Figure 10). Analysis of the short allele (S.droWA28a) against itself yields the same result (Figures 6Bc,


11). Again, the difference between the long and short alleles could have been the result of either a deletion or a duplication event.

Two alleles were found in the S. pallidus clones from Norway. When compared to each other, these clones showed a section where a sequence of 63 bp is present in the long allele but not in the short allele (Figure 6Ca). In this case, there is a gap between the center-lines, indicating that the sequence found in the long allele is not a tandem duplication. Therefore, the shorter allele must be the result of a deletion in the longer allele; a duplication in the shorter allele would not produce the longer allele. The repetitive region that led to the deletion can be seen when the long allele (S.palNor30) is compared to itself (Figure 6Cb). The region deleted from the long allele to give rise to the short allele (S.palNor27) is repeats 25-27 (Figures 5C, 12).

Two alleles of S. pallidus from Washington were found that had a much larger difference in the number of SM50 repeats than any alleles previously studied. The SM50 repeat array in the long allele (S.palWA32) has a regular arrangement of a single 18 bp SM50 repeat followed by three 21 bp SM50 repeats, which have duplicated as a block of 81 bp (Figure 1D). In the long allele, this 81 bp block of four SM50 repeats occurs four times, followed by a single 18 bp SM50 repeat, five 21 bp SM50 repeats, and one more block of 81 bp (Figure 1D). The short allele (S.palWA24) is missing two of the 81 bp blocks, for a total of two 18 bp and six 21 bp SM50 repeats (Figure 1D). This is illustrated in the comparison of these two alleles using a 162 bp window dot plot analysis (Figure 6Da). The break in the center-line seen when using the 162 bp window suggests that the difference between the two alleles can only be explained by a deletion event in the first allele (Figures 6Db, 13).
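The block and window sizes quoted in this chapter follow from simple arithmetic on the two common repeat lengths. The short sketch below recomputes them. The helper name block_length is hypothetical, and reading the numeric suffixes of the allele names (S.palWA32, S.palWA24) as repeat counts is an assumption, although it is consistent with the eight-repeat difference described above.

```python
LONG_BP, SHORT_BP = 21, 18  # lengths of the long and truncated SM50 repeats


def block_length(n_short, n_long):
    """Length in bp of a block of n_short 18 bp and n_long 21 bp repeats."""
    return n_short * SHORT_BP + n_long * LONG_BP


# Window sizes used in the dot plots, expressed as repeat combinations.
windows = {
    39: block_length(1, 1),    # one short and one long repeat
    42: block_length(0, 2),    # a pair of long repeats
    78: block_length(2, 2),    # two short/long pairs
    81: block_length(1, 3),    # S. pallidus (Washington) block
    84: block_length(0, 4),    # two pairs of long repeats
    99: block_length(2, 3),    # S. purpuratus central duplication
    162: block_length(2, 6),   # two 81 bp blocks
}
assert all(size == bp for size, bp in windows.items())

# Assumed repeat counts, read from the allele names S.palWA32 and S.palWA24.
missing_repeats = 32 - 24
print(missing_repeats, "repeats,", block_length(2, 6), "bp")
```

The 63 bp difference between the Norway alleles fits the same arithmetic: three repeats can only total 63 bp if all three are 21 bp long.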


Because the 162 bp consist of two identical 81 bp blocks of SM50 repeats, it is possible that the difference between the two alleles arose in two steps of 81 bp instead of one step of 162 bp. Comparison of the long allele (S.palWA32) and the short allele (S.palWA24) using an 81 bp window produces an additional center-line within the gap (Figure 6Dc). This indicates that the changes could have happened in two steps (either duplications or deletions) that gave rise to an intermediate differing from the original by a single 81 bp block of SM50 repeats. Comparisons of the long allele (S.palWA32) with itself (Figure 6Dd) and of the short allele (S.palWA24) with itself (Figure 6De) using an 81 bp window indicate areas of possible misalignment that could have produced the intermediate product. One possible sequence of this hypothetical intermediate was proposed (S.palHYP, Figure 5D). Comparison of this hypothetical sequence to itself indicates that it does have a location of potential misalignment (Figure 6Df). The overlap in the center-lines when it is compared to either the long allele (S.palWA32) (Figure 6Dg) or the short allele (S.palWA24) (Figure 6Dh) indicates that the short allele could be produced by a deletion in the hypothetical sequence, and the long allele by a duplication in the hypothetical sequence (Figure 14).

Discussion

Concerted evolution has happened in different patterns since speciation

The patterns of dot plot analysis are very different in each of the eight species examined, indicating that concerted evolution has occurred in different locations within the SM50 repeat array since speciation (Figures 4, 6). Both S. purpuratus and H. pulcherrimus have the alternating short and long SM50 repeats characteristic of the S.


purpuratus clade, suggesting that the appearance of the truncated 18 bp SM50 repeat forced misalignment to occur by combinations of short 18 bp and long 21 bp SM50 repeats. The duplication of these short and long SM50 repeats likely led to the arrangement at the 5' end of the H. pulcherrimus repeat array (Figure 4). This could have happened in steps of 39 bp, or through higher order duplications as illustrated in the 78 bp window analysis (Figures 4, 7). Similar duplications of short and long SM50 repeats occurred in S. purpuratus, and the most obvious duplication detected by the dot plot analysis is a 99 bp duplication in the center of the SM50 repeat array, indicating that a higher order duplication took place (Figures 4, 7).

In S. franciscanus, substitutions led to a 21 bp SM50 repeat with novel sequence at positions 5 and 7 (Chapter 1; Meeds et al. 2001; Appendix 1). This SM50 repeat would only be able to align with itself, and therefore concerted evolution involving this repeat would have to loop out two SM50 repeats, leading to the duplication of a 42 bp block made of two 21 bp SM50 repeats. This 42 bp block was then duplicated several times, producing a pattern of alternating sequences (Figures 4, 7). It is unclear whether the sequence of the SM50 repeat array observed in our samples was the result of three individual duplications of the 42 bp block or of a higher order duplication.

The dot plot analysis of L. pictus indicates that duplications occurred almost uniformly over the entire length of the SM50 repeat array, while the same analysis for L. variegatus indicates little duplication towards the 3' end of the SM50 repeat array (Figure 4). L. pictus has only 21 bp SM50 repeats. Duplication appears to have involved a single repeat in one instance, and a pair of repeats in another (Figure 4). The two Lytechinus species were thought to have undergone concerted evolution using similar models


(Meeds et al. 2001), but the variation in the patterns of dot plot analysis suggests otherwise.

Concerted evolution may be occurring differently in separate populations of S. pallidus

Larger duplications occurred at the beginning of the SM50 repeat array in the S. pallidus Norway alleles, in a different location than in the S. pallidus alleles from Washington (Figures 6C, D). The two types of alleles have clearly arisen through different concerted evolution events, which could happen if there were an absence of gene flow between populations. In a previous study, microsatellites were successfully amplified in populations of S. droebachiensis from the northeast Pacific Ocean but not in populations from the northwest Atlantic Ocean, suggesting restricted gene flow between the populations (Addison and Hart 2004). Although this study is the first indication of restricted gene flow in S. pallidus populations, there is not enough evidence to say this definitively.

Unequal crossover could have occurred in blocks of pairs or multiple SM50 repeats rather than single units

It is suggested that the deletions of codons leading to 15 bp and 18 bp SM50 repeats, together with single base pair substitutions, lead to duplications of these blocks of SM50 repeats because of the constraints put on misalignment during unequal crossover (Figure 7). The variation within species allows the development of models illustrating exactly how the SM50 repeat array has undergone concerted evolution (Figures 8-14). In species where


there is little sequence diversity between the SM50 repeats, misalignment can occur in many different sections of the SM50 array. In this case, duplication and deletion of a single SM50 repeat is most likely. Based on dot plot analysis, regions of the SM50 repeat arrays in many of the species examined are expected to undergo concerted evolution in this way. The variation we found in S. nudus samples involved the duplication/deletion of a single SM50 repeat from the longer allele within a region of similar 21 bp repeats (Figures 8, 9).

The alleles also show where substitutions can affect misalignment and unequal crossover. In S. droebachiensis, duplication/deletion events have occurred by misaligning a pair of adjacent SM50 repeats (Figures 6B, 10, 11). In S. pallidus from Norway, we were able to devise a model involving a single deletion of 63 bp from the longer allele (Figure 12). Sequence diversity in the SM50 repeats, however, prevents the misalignment of the shorter allele with itself, and therefore there is no model for the creation of the longer allele by the duplication of 63 bp in the shorter allele (see Figure 6C). The S. pallidus from Washington exhibits evidence for a higher order deletion/duplication event, either a single event of 162 bp or two events involving 81 bp each (Figures 5D, 6Da, 6Dc, 13, 14).

Truncation and mutation of SM50 repeats further regulate misalignment

In addition to base pair mutation preventing misalignment, members of the S. purpuratus clade contain truncated 18 bp SM50 repeats (Appendix 1). This requires misalignment to involve a combination of the long 21 bp and truncated 18 bp SM50 repeats. The dot plot analysis indicated that this has occurred in this clade (Figure 4), and the models proposed for the S. pallidus alleles from Washington confirm this (Figures 13,


14). In both models, one or two blocks consisting of one 18 bp SM50 repeat followed by three 21 bp SM50 repeats were duplicated or deleted by a single concerted evolution event.

The length of the looped out region during misalignment may be a factor in the ability to undergo concerted evolution

S. purpuratus also contains an area of sequence similarity where misalignment could have occurred (Figure 4). No length variation, however, was detected in any samples regardless of the location collected (Appendix 4), indicating either that variation has been eliminated from the population through drift or selection, or that this model of concerted evolution is no longer possible. Examination of the S. purpuratus sequence indicates that a misalignment could be generated, but it would require a 99 bp loop-out (not shown). It may be that a loop-out of 99 bp cannot form. If this were indeed the case, we would reject the model for S. pallidus (Washington) involving the deletion of eight SM50 repeats (Figure 13) and favor the two-step model involving 81 bp each (Figure 14).

The balance of these three forces in the array of tandem SM50 repeats has allowed the observation of the interactions between concerted evolution, substitution, and selection. It is this unique balance that has allowed us to piece together possible steps that created the alleles we have seen in the population. Based on these models, it appears that concerted evolution can affect selected parts of the SM50 repeat array at different times and in different ways, and that mutations will diversify the SM50 repeats and change the locations where they are able to misalign. This will change the degree and the patterns of duplications and deletions of multiple SM50 repeats. Understanding the interactions


between these three forces will help us understand the evolution of protein-coding tandem arrays of DNA.
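The interplay of misalignment, unequal crossover, and substitution discussed in this chapter can be illustrated with a deliberately simplified sketch. It is not the set of models shown in Figures 8-14: each repeat is treated as an indivisible labeled unit, an exchange is allowed only where the two paired units are identical, and the function names misalignment_sites and unequal_crossover are hypothetical.

```python
def misalignment_sites(array):
    """List (offset, index) pairs where unit index matches unit index - offset.

    Each pair is a place where two copies of the array, shifted by
    offset units, could pair identical repeats.
    """
    sites = []
    for offset in range(1, len(array)):
        for index in range(offset, len(array)):
            if array[index] == array[index - offset]:
                sites.append((offset, index))
    return sites


def unequal_crossover(array, offset, index):
    """Exchange two copies of array that are misaligned by offset units.

    Unit index of the first copy pairs with unit index - offset of the
    second copy, and the exchange takes place within that pair.
    Returns (long_product, short_product).
    """
    if not 0 < offset <= index < len(array):
        raise ValueError("offset and index do not describe a valid pairing")
    if array[index] != array[index - offset]:
        raise ValueError("paired repeats differ; misalignment is blocked")
    long_product = array[:index] + array[index - offset:]
    short_product = array[:index - offset] + array[index:]
    return long_product, short_product


# Homogeneous region: an a-b pair repeated in tandem can misalign by two units.
similar = ["a", "b", "a", "b", "c"]
print(misalignment_sites(similar))
long_allele, short_allele = unequal_crossover(similar, offset=2, index=2)
print(len(long_allele), len(short_allele))

# Diverged region: substitutions have made every unit unique.
diverged = ["a", "b", "c", "d"]
print(misalignment_sites(diverged))
```

In this toy model one exchange yields a product that gains the looped-out units and a product that loses them, while an array whose units have all diverged offers no site for misalignment.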


Summary

Two powerful methods of analysis were used to study the interaction of neutral evolution and selection in SM50: codon usage frequencies and dot plot analysis.

Codon usage frequencies confirm that concerted evolution is the driving force shaping the SM50 repeat array. Concerted evolution produces a bias in codon usage frequencies that is more similar in closely related species than in distantly related ones. Base pair substitutions provide some diversity in codon usage frequencies. However, in some positions in the SM50 repeat, all species examined have similar codon usage bias, which would not be expected if concerted evolution were the only force acting on codon usage bias. Analysis of mRNA secondary structure stability indicates that there may be selection against codons that increase the stability of the structure. This selective force, however, is only evident in certain positions within the SM50 repeat. Therefore, there is evidence that codon usage is shaped by the forces of concerted evolution and substitution, but is limited by selection.

Dot plot analysis also confirms that concerted evolution by unequal crossover has occurred since speciation. The pattern of duplicated SM50 repeats is different in each species, indicating that rounds of unequal crossover occurred in different locations involving different SM50 repeats. These duplications occurred in single SM50 repeats, but they may also have occurred in pairs or in higher order duplication/deletion events. Substitutions in the SM50 repeats limit the locations of misalignments, restricting the unequal crossover events. With this in mind, models of unequal crossover in three


species were developed to explain the diversity discovered in the SM50 repeat array. The models illustrate how substitutions limit concerted evolution.

SM50 has enabled us to study the interaction between neutral evolution and selection. I have examined the footprints of molecular evolution in a protein-coding region of tandem repeats, and I hope this study leads to a greater understanding of how these sequences of DNA came to be and how they will evolve over time.


Figure 1

[Diagram legend: C-type lectin domain; SM50 repeat array; PCR primers]

Example repeat sequence (long and short repeats are labeled in the figure): QPGVGGRQPGFGNQPGMGGQQPGMGGQQPGVGNQPGVGGR

Position        1  2  3  4      5  6  7
Long repeat     Q  P  G  M/V/F  G  G  Q/R
Short repeat    Q  P  G  M/V/F  G  N

Figure 1: Diagram of the protein-coding portion of the SM50 gene in S. purpuratus. Primers were designed to amplify a 1.1 kb segment of S. purpuratus including about 400 bp of the C-type lectin domain and the entire SM50 repeat array. The SM50 repeat array is made of 17-32 SM50 repeats in tandem, depending on the species. The SM50 repeats are 15-21 bp long, and all start with the amino acid sequence Q P G in the first three positions.
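Because every SM50 repeat begins with Q P G, a translated repeat array can be divided into its units by splitting at each occurrence of that motif. The sketch below applies this to the example sequence printed in Figure 1. The function name split_repeats is hypothetical, and the approach assumes that Q P G occurs only at the start of a repeat.

```python
import re

# Example amino acid sequence as printed in Figure 1.
FIGURE_1_EXAMPLE = "QPGVGGRQPGFGNQPGMGGQQPGMGGQQPGVGNQPGVGGR"


def split_repeats(protein):
    """Split a translated repeat array into units that each start with QPG."""
    # Lazy match from one QPG up to (not including) the next QPG or the end.
    return re.findall(r"QPG.*?(?=QPG|$)", protein)


units = split_repeats(FIGURE_1_EXAMPLE)
for unit in units:
    # Each amino acid corresponds to one codon, i.e. 3 bp.
    print(unit, len(unit) * 3, "bp")
```

For the Figure 1 example this gives units of 21, 18, 21, 21, 18, and 21 bp, matching the long and short repeats labeled in the figure.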


Figure 2: Schematic relationships of species used in this study, based on Biermann et al. (2003) and Lee (2003). Mitochondrial sequence data place A. fragilis, H. pulcherrimus, and P. depressus within the genus Strongylocentrotus (Biermann et al. 2003; Lee 2003). This genus can be separated into two clades, one involving S. franciscanus, S. nudus, and P. depressus (S. franciscanus clade) and the other including S. purpuratus, S. pallidus, A. fragilis, S. droebachiensis, and H. pulcherrimus (S. purpuratus clade). In the S. franciscanus clade, S. nudus and S. franciscanus are sister taxa (Biermann et al. 2003; Lee 2003). In the S. purpuratus clade there exists a polytomy of S. pallidus, A. fragilis, and S. droebachiensis, with S. purpuratus being the sister taxon to this and H. pulcherrimus. Divergence times for species within the genus Strongylocentrotus are likely to lie within the range of 3.5-20 Mybp, while the genera Lytechinus and Strongylocentrotus (including H. pulcherrimus) diverged some 30-40 Mybp (Smith 1988).

[Tree tip labels. S. purpuratus clade: S. pallidus, A. fragilis, S. droebachiensis, S. purpuratus, H. pulcherrimus. S. franciscanus clade: S. franciscanus, S. nudus, P. depressus. Lytechinus clade: L. pictus, L. variegatus.]


Table 1A

Values are percent (count). Repeat = SM50 repeat array; CLD = C-type lectin domain; Genome = whole genome (number given beneath the column heading).

           H. pulcherrimus                  S. purpuratus                    S. droebachiensis                A. fragilis           S. pallidus
AA  Codon  Repeat     CLD        Genome     Repeat     CLD        Genome     Repeat     CLD        Genome     Repeat     CLD        Repeat     CLD
                                 (24630)                          (135693)                         (729)
Q   CAA    90 (27)    40 (4)     40         94.6 (35)  25.0 (2)   41         92.7 (38)  25.0 (2)   16         95.1 (39)  25.2 (2)   97.6 (41)  12.5 (1)
    CAG    10 (3)     60 (6)     60         5.4 (2)    75.0 (6)   59         7.3 (3)    75.0 (6)   84         4.9 (2)    75.0 (6)   2.4 (1)    87.5 (7)
P   CCA    83.3 (20)  38.5 (10)  33         83.3 (25)  36.4 (8)   34         82.8 (24)  42.9 (9)   31         77.4 (24)  42.9 (9)   77.4 (24)  42.9 (9)
    CCC    12.5 (3)   23.0 (6)   25         13.3 (4)   18.1 (4)   23         13.8 (4)   19.0 (4)   25         19.4 (6)   19.0 (4)   12.9 (4)   32.8 (5)
    CCG    4.2 (1)    -- (0)     15         3.3 (1)    9.1 (2)    13         3.4 (1)    -- (0)     14         3.2 (1)    -- (0)     3.2 (1)    -- (0)
    CCU    -- (0)     38.5 (10)  27         -- (0)     36.4 (8)   30         -- (0)     38.1 (8)   31         -- (0)     38.1 (8)   6.5 (2)    33.3 (7)
G   GGA    6.3 (4)    47.6 (10)  29         21.7 (18)  61.5 (8)   41         18.5 (15)  61.5 (8)   36         17.4 (15)  61.5 (8)   15.9 (14)  61.5 (8)
    GGC    39.1 (25)  19.0 (4)   27         26.5 (22)  23.1 (3)   20         24.7 (20)  23.1 (3)   22         23.3 (20)  23.1 (3)   23.9 (21)  23.1 (3)
    GGG    17.2 (11)  9.5 (2)    14         12 (10)    -- (0)     12         12.3 (10)  -- (0)     16         12.8 (11)  -- (0)     17 (15)    -- (0)
    GGU    37.5 (24)  23.8 (5)   30         39.8 (33)  15.4 (2)   28         44.4 (36)  15.4 (2)   26         46.5 (40)  15.4 (2)   43.2 (38)  15.4 (2)
F   UUC    -- (0)     77.8 (7)   63         -- (0)     80.0 (4)   62         -- (0)     80.0 (4)   50         -- (0)     80.0 (4)   -- (0)     100 (5)
    UUU    100 (9)    22.2 (2)   37         100 (6)    20.0 (1)   38         100 (5)    20.0 (1)   50         100 (4)    20.0 (1)   100 (5)    -- (0)
V   GUA    25 (1)     11.8 (2)   18         -- (0)     22.2 (2)   18         -- (0)     22.2 (2)   9          -- (0)     22.2 (2)   -- (0)     -- (0)
    GUC    -- (0)     35.3 (6)   35         -- (0)     44.4 (4)   33         -- (0)     44.4 (4)   30         -- (0)     44.4 (4)   -- (0)     50.0 (4)
    GUG    75 (3)     17.6 (3)   26         100 (9)    11.2 (1)   26         100 (11)   11.2 (1)   38.3       100 (13)   11.2 (1)   69.2 (9)   25.0 (2)
    GUU    -- (0)     35.3 (6)   21         -- (0)     22.2 (2)   23         -- (0)     22.2 (2)   23         -- (0)     22.2 (2)   30.8 (4)   25.0 (2)
R   AGA    -- (0)     16.7 (2)   26         -- (0)     14.3 (1)   24         -- (0)     12.5 (1)   9          -- (0)     12.5 (1)   -- (0)     12.5 (1)
    AGG    -- (0)     25.0 (3)   26         -- (0)     14.3 (1)   22         -- (0)     12.5 (1)   15         -- (0)     12.5 (1)   -- (0)     25.0 (2)
    CGA    100 (10)   8.3 (1)    14         100 (13)   14.3 (1)   14         100 (11)   12.5 (1)   13         100 (15)   12.5 (1)   100 (14)   12.5 (1)
    CGC    -- (0)     25.0 (3)   11         -- (0)     14.3 (1)   13         -- (0)     12.5 (1)   32         -- (0)     12.5 (1)   -- (0)     12.5 (1)
    CGG    -- (0)     -- (0)     7          -- (0)     14.3 (1)   8          -- (0)     12.5 (1)   6          -- (0)     12.5 (1)   -- (0)     -- (0)
    CGU    -- (0)     25.0 (3)   16         -- (0)     28.5 (2)   19         -- (0)     37.5 (3)   26         -- (0)     37.5 (3)   -- (0)     37.5 (3)


Table 1B

Values are percent (count). Repeat = SM50 repeat array; CLD = C-type lectin domain; Genome = whole genome (number given beneath the column heading).

           S. franciscanus                  S. nudus              P. depressus
AA  Codon  Repeat     CLD        Genome     Repeat     CLD        Repeat     CLD
                                 (7034)
Q   CAA    91.7 (22)  37.5 (3)   48         79.2 (19)  37.5 (3)   87 (20)    62.5 (5)
    CAG    8.3 (2)    62.5 (5)   52         20.8 (5)   62.5 (5)   13 (3)     37.5 (3)
P   CCA    88.2 (15)  36.8 (7)   40         100 (18)   35.0 (7)   93.8 (15)  35.0 (7)
    CCC    -- (0)     26.4 (5)   22         -- (0)     30.0 (6)   -- (0)     30.0 (6)
    CCG    11.8 (2)   -- (0)     10         -- (0)     5.0 (1)    6.3 (1)    5.0 (1)
    CCU    -- (0)     36.8 (7)   29         -- (0)     30.0 (6)   -- (0)     30.0 (6)
G   GGA    28.3 (15)  50.0 (8)   37         31.4 (16)  53.3 (8)   19.1 (9)   46.2 (6)
    GGC    34 (18)    31.1 (5)   19         33.3 (17)  26.7 (4)   29.8 (14)  23.1 (3)
    GGG    15.1 (8)   6.3 (1)    9          11.8 (6)   6.7 (1)    14.9 (7)   12.5 (2)
    GGU    22.6 (12)  12.6 (2)   35         23.5 (12)  13.3 (2)   36.2 (17)  18.2 (2)
F   UUC    -- (0)     100 (5)    61         -- (0)     100 (5)    -- (0)     100 (5)
    UUU    100 (1)    -- (0)     39         100 (2)    -- (0)     -- (0)     -- (0)
R   AGA    -- (0)     28.6 (2)   22         -- (0)     25.0 (2)   -- (0)     25.0 (2)
    AGG    -- (0)     -- (0)     28         -- (0)     -- (0)     -- (0)     -- (0)
    CGA    100 (8)    28.6 (2)   8          100 (5)    25.0 (2)   100 (6)    12.5 (1)
    CGC    -- (0)     14.2 (1)   6          -- (0)     12.5 (1)   -- (0)     12.5 (1)
    CGG    -- (0)     -- (0)     4          -- (0)     -- (0)     -- (0)     12.5 (1)
    CGU    -- (0)     28.6 (2)   32         -- (0)     37.5 (3)   -- (0)     37.5 (3)


Table 1C

Values are percent (count). Repeat = SM50 repeat array; CLD = C-type lectin domain; Genome = whole genome (number given beneath the column heading).

           L. pictus                        L. variegatus
AA  Codon  Repeat     CLD        Genome     Repeat     CLD        Genome
                                 (4913)                           (31907)
Q   CAA    84 (21)    87.5 (7)   39         95.8 (23)  100 (8)    44
    CAG    16 (4)     12.5 (1)   61         4.2 (1)    -- (0)     56
P   CCA    -- (0)     41.2 (7)   32         -- (0)     43.8 (7)   36
    CCC    -- (0)     11.8 (2)   27         -- (0)     18.8 (3)   21
    CCG    6.7 (1)    11.8 (2)   13         7.1 (1)    6.3 (1)    14
    CCU    93.3 (14)  35.3 (6)   28         92.9 (13)  31.3 (5)   29
G   GGA    11.4 (5)   33.3 (4)   39         7.1 (3)    40 (4)     35
    GGC    43.2 (19)  16.7 (2)   25         47.6 (20)  10 (1)     21
    GGG    36.4 (16)  16.7 (2)   13         31 (13)    10 (1)     13
    GGU    9.1 (4)    33.3 (4)   24         14.3 (6)   40 (4)     31
F   UUC    57.1 (8)   42.9 (3)   66         81.8 (9)   100 (5)    57
    UUU    42.9 (6)   57.1 (4)   34         18.2 (2)   -- (0)     43
R   AGA    -- (0)     40 (2)     21         -- (0)     33.3 (2)   28
    AGG    -- (0)     20 (1)     23         -- (0)     16.7 (1)   23
    CGA    100 (3)    20 (1)     8          100 (2)    16.7 (1)   12
    CGC    -- (0)     -- (0)     13         -- (0)     -- (0)     11
    CGG    -- (0)     -- (0)     5          -- (0)     16.7 (1)   9
    CGU    -- (0)     20 (1)     30         -- (0)     16.7 (1)   17

Tables 1A-C: Codon usage frequency was calculated by amino acid for the SM50 repeat array and compared to the C-type lectin domain (CLD) in five species in the S. purpuratus clade (Table 1A), three species in the S. franciscanus clade (Table 1B), and two Lytechinus species (Table 1C). The SM50 repeat array includes the 15-21 bp imperfectly repeated units (see Figure 1). The C-type lectin domain includes 400 bp of the gene upstream of the SM50 repeat array (Figure 1). Where possible, codon usage frequencies created from a sample of protein-coding genes were also used to represent the codon usage frequencies of the whole genome. Whole genome codon usage frequencies were taken from the International DNA Sequence Databases: Status for the Year 2000 (Nakamura et al. 2000, http://www.kazusa.or.jp/codon/). The number in italics corresponds to the number of genes from GenBank used in the current calculations.
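The per-amino-acid percentages in Tables 1A-C can be computed along the following lines. This is a minimal sketch, not the procedure actually used for the tables: the function name codon_usage and the toy sequence are hypothetical, the codon table is deliberately restricted to the amino acids tabulated above, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter, defaultdict

# Partial codon table covering only the amino acids tabulated above
# (an illustrative subset, not the full genetic code).
CODON_TO_AA = {
    "CAA": "Q", "CAG": "Q",
    "CCA": "P", "CCC": "P", "CCG": "P", "CCU": "P",
    "GGA": "G", "GGC": "G", "GGG": "G", "GGU": "G",
    "UUC": "F", "UUU": "F",
    "GUA": "V", "GUC": "V", "GUG": "V", "GUU": "V",
    "AGA": "R", "AGG": "R", "CGA": "R", "CGC": "R", "CGG": "R", "CGU": "R",
}


def codon_usage(dna):
    """Return {codon: (percent among synonymous codons, count)}.

    The sequence is read in frame from its first base; codons that are
    not in CODON_TO_AA are ignored.
    """
    rna = dna.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TO_AA)
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TO_AA[codon]] += n
    return {
        codon: (round(100 * n / totals[CODON_TO_AA[codon]], 1), n)
        for codon, n in counts.items()
    }


# Toy sequence (hypothetical, not SM50 data): one long and one short repeat.
toy = "CAACCAGGCATGGGTGGACGA" + "CAACCAGGCGTGGGTAAC"
for codon, (percent, count) in sorted(codon_usage(toy).items()):
    print(codon, percent, count)
```

Restricting the input to the codons found at one position of each repeat would give position-specific frequencies instead of per-amino-acid frequencies.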


Table 2

Position 1 (Glutamine)
Species             CAA         CAG
H. pulcherrimus     96.0 (24)   4.0 (1)
S. purpuratus       96.8 (30)   3.2 (1)
S. droebachiensis   93.3 (28)   6.7 (2)
A. fragilis         96.9 (31)   3.1 (1)
S. pallidus         100.0 (32)  -- (0)
S. franciscanus     100.0 (17)  -- (0)
S. nudus            94.1 (16)   5.9 (1)
P. depressus        100.0 (14)  -- (0)
L. pictus           86.7 (13)   13.3 (2)
L. variegatus       100.0 (14)  -- (0)

Position 2 (Proline)
Species             CCA         CCC         CCG         CCU
H. pulcherrimus     83.3 (20)   12.5 (3)    4.2 (1)     -- (0)
S. purpuratus       83.3 (25)   13.3 (4)    3.3 (1)     -- (0)
S. droebachiensis   82.8 (24)   13.8 (4)    3.4 (1)     -- (0)
A. fragilis         77.4 (24)   19.4 (6)    3.2 (1)     -- (0)
S. pallidus         77.4 (24)   12.9 (4)    3.2 (1)     18.2 (2)
S. franciscanus     88.2 (15)   -- (0)      11.8 (2)    -- (0)
S. nudus            100.0 (18)  -- (0)      -- (0)      -- (0)
P. depressus        93.8 (15)   -- (0)      6.3 (1)     -- (0)
L. pictus           -- (0)      -- (0)      6.7 (1)     93.3 (14)
L. variegatus       -- (0)      -- (0)      7.1 (1)     92.9 (13)

Position 3 (Glycine)
Species             GGA         GGC         GGG         GGU
H. pulcherrimus     -- (0)      70.8 (17)   -- (0)      29.2 (7)
S. purpuratus       16.1 (5)    58.1 (18)   -- (0)      25.8 (8)
S. droebachiensis   3.3 (1)     63.3 (19)   -- (0)      33.3 (10)
A. fragilis         6.3 (2)     59.4 (19)   -- (0)      34.4 (11)
S. pallidus         9.4 (3)     59.4 (19)   3.1 (1)     28.1 (9)
S. franciscanus     -- (0)      89.5 (17)   -- (0)      10.5 (2)
S. nudus            -- (0)      83.3 (15)   -- (0)      16.7 (3)
P. depressus        -- (0)      87.5 (14)   -- (0)      12.5 (2)
L. pictus           -- (0)      93.3 (14)   -- (0)      6.7 (1)
L. variegatus       -- (0)      92.9 (13)   -- (0)      7.1 (1)

Position 4
                    Phenylalanine           Methionine  Valine                                          Tryptophan
Species             UUC         UUU         AUG         GUA         GUC         GUG         GUU         UGG
H. pulcherrimus     -- (0)      36.0 (9)    48.0 (12)   4.0 (1)     -- (0)      12.0 (3)    -- (0)      -- (0)
S. purpuratus       -- (0)      19.4 (6)    41.9 (13)   -- (0)      -- (0)      29.0 (9)    -- (0)      9.7 (3)
S. droebachiensis   -- (0)      16.7 (5)    33.3 (10)   -- (0)      -- (0)      36.7 (11)   -- (0)      13.3 (4)
A. fragilis         -- (0)      12.5 (4)    31.3 (10)   -- (0)      -- (0)      40.6 (13)   -- (0)      15.6 (5)
S. pallidus         -- (0)      15.6 (5)    31.3 (10)   -- (0)      -- (0)      28.1 (9)    12.5 (4)    12.5 (4)
S. franciscanus     -- (0)      5.9 (1)     94.1 (16)
S. nudus            -- (0)      11.8 (2)    88.2 (15)
P. depressus        -- (0)      -- (0)      100.0 (16)
L. pictus           53.3 (8)    40.0 (6)    6.7 (1)
L. variegatus       69.2 (9)    15.4 (2)    7.7 (1)


Table 2 (continued)

Position 5 (Glycine)
Species             GGA         GGC         GGG         GGU
H. pulcherrimus     4.0 (1)     32.0 (8)    -- (0)      64.0 (16)
S. purpuratus       9.7 (3)     9.7 (3)     -- (0)      80.6 (25)
S. droebachiensis   13.3 (4)    -- (0)      -- (0)      86.7 (26)
A. fragilis         9.4 (3)     -- (0)      -- (0)      90.6 (29)
S. pallidus         9.4 (3)     -- (0)      -- (0)      90.6 (29)
S. franciscanus     33.3 (6)    5.6 (1)     5.6 (1)     55.6 (10)
S. nudus            38.9 (7)    11.1 (2)    -- (0)      50.0 (9)
P. depressus        12.5 (2)    -- (0)      -- (0)      87.5 (14)
L. pictus           -- (0)      33.3 (5)    46.7 (7)    20.0 (3)
L. variegatus       -- (0)      50.0 (7)    14.3 (2)    35.7 (5)

Position 6 (Glycine)
Species             GGA         GGC         GGG         GGU
H. pulcherrimus     12.5 (3)    -- (0)      45.8 (11)   4.2 (1)
S. purpuratus       33.3 (10)   3.3 (1)     33.3 (10)   -- (0)
S. droebachiensis   34.5 (10)   3.4 (1)     34.5 (10)   -- (0)
A. fragilis         33.3 (10)   3.3 (1)     36.7 (11)   -- (0)
S. pallidus         25.8 (8)    3.2 (1)     45.2 (14)   -- (0)
S. franciscanus     56.3 (9)    -- (0)      43.8 (7)    -- (0)
S. nudus            60.0 (9)    -- (0)      40.0 (6)    -- (0)
P. depressus        46.7 (7)    -- (0)      46.7 (7)    6.7 (1)
L. pictus           35.7 (5)    -- (0)      64.3 (9)    -- (0)
L. variegatus       21.4 (3)    -- (0)      78.6 (11)   -- (0)

Position 7
                    Glutamine               Arginine
Species             CAA         CAG         AGA         AGG         CGA         CGC         CGG         CGU
H. pulcherrimus     20.0 (3)    13.3 (2)    -- (0)      -- (0)      66.7 (10)   -- (0)      -- (0)      -- (0)
S. purpuratus       26.3 (5)    5.3 (1)     -- (0)      -- (0)      68.4 (13)   -- (0)      -- (0)      -- (0)
S. droebachiensis   45.5 (10)   4.5 (1)     -- (0)      -- (0)      50.0 (11)   -- (0)      -- (0)      -- (0)
A. fragilis         33.3 (8)    4.2 (1)     -- (0)      -- (0)      62.5 (15)   -- (0)      -- (0)      -- (0)
S. pallidus         37.5 (9)    4.2 (1)     -- (0)      -- (0)      58.3 (14)   -- (0)      -- (0)      -- (0)
S. franciscanus     33.3 (5)    13.3 (2)    -- (0)      -- (0)      53.3 (8)    -- (0)      -- (0)      -- (0)
S. nudus            25.0 (3)    33.3 (4)    -- (0)      -- (0)      41.7 (5)    -- (0)      -- (0)      -- (0)
P. depressus        40.0 (6)    20.0 (3)    -- (0)      -- (0)      40.0 (6)    -- (0)      -- (0)      -- (0)
L. pictus           61.5 (8)    15.4 (2)    -- (0)      -- (0)      23.1 (3)    -- (0)      -- (0)      -- (0)
L. variegatus       75.0 (9)    8.3 (1)     -- (0)      -- (0)      16.7 (2)    -- (0)      -- (0)      -- (0)

Table 2: Codon usage frequencies of the SM50 repeat array calculated by position. Because the amino acid sequence Q P G is conserved in all SM50 repeats, we arbitrarily assigned these amino acids to positions 1-3 (see Figure 1). All codons, regardless of the amino acid they code for, are included in the frequencies for each position. Methionine and tryptophan have non-degenerate codons. Valine and tryptophan are not present in the S. franciscanus clade and Lytechinus species and are therefore excluded from calculations in these clades.
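The by-position tally described in the legend to Table 2 can be sketched as below. The function name `codon_usage_by_position` is hypothetical, and the sketch assumes that shorter repeats simply lack the trailing positions (so an 18 bp repeat contributes nothing to position 7); the thesis does not state how shorter repeats were aligned, so treat that as an assumption. The input is repeats 1-4 of S.nudJP18 from Figure 5A.

```python
from collections import Counter


def codon_usage_by_position(repeats):
    """Tally codons at each position across repeat units.

    `repeats` is a list of repeat units, each a list of codons with
    position 1 first. Returns one {codon: (percent, count)} dict per position.
    Assumption: short repeats lack the trailing positions.
    """
    n_positions = max(len(r) for r in repeats)
    table = []
    for pos in range(n_positions):
        counts = Counter(r[pos] for r in repeats if len(r) > pos)
        total = sum(counts.values())
        table.append(
            {codon: (round(100.0 * n / total, 1), n) for codon, n in counts.items()}
        )
    return table


# Repeats 1-4 of S.nudJP18 (Figure 5A).
repeats = [
    "CAA CCA GGC ATG GGT GGA".split(),
    "CAA CCA GGC ATG GGA GGG CGA".split(),
    "CAA CCA GGC ATG GGC GGA CAA".split(),
    "CAA CCA GGT TTT GGA GGG CGA".split(),
]
table = codon_usage_by_position(repeats)
for position, usage in enumerate(table, start=1):
    print(position, usage)
```

Because every codon at a position enters the same denominator, the percentages at a position sum to 100 across amino acids, which matches the legend's statement that all codons are included regardless of the amino acid encoded.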


Table 3

Amino           tRNA genes
acid    Codon   identified in genome
Q       CAA     45.45 (20)
        CAG     54.55 (24)
P       CCA     47.37 (27)
        CCC     4.48 (3)
        CCG     14.93 (10)
        CCU     40.30 (7)
G       GGA     42.31 (33)
        GGC     50.56 (40)
        GGG     12.36 (11)
        GGU     -- (0)
F       UUC     64.29 (9)
        UUU     35.71 (5)
V       GUA     35.00 (14)
        GUC     5.56 (4)
        GUG     44.44 (32)
        GUU     -- (0)
R       AGA     27.00 (27)
        AGG     17.00 (17)
        CGA     23.00 (23)
        CGC     1.00 (1)
        CGG     1.00 (1)
        CGU     31.00 (31)

Table 3: Codon usage frequencies of tRNA genes found in S. purpuratus. Frequencies of tRNA genes were calculated based on the tRNA genes found to date in the S. purpuratus genome (Statija and Wray, personal communication).


Table 4

                    Length of SM50      Percentage
Species             repeat array (bp)   GC content
H. pulcherrimus     498                 59.34
S. purpuratus       618                 60.12
S. droebachiensis   609                 60.35
A. fragilis         651                 61.39
S. pallidus         651                 60.46
S. franciscanus     390                 63.01
S. nudus            357                 61.80
P. depressus        324                 62.33
L. pictus           321                 64.75
L. variegatus       303                 65.31

Table 4: The length of the SM50 repeat array and the percentage GC content of the SM50 repeat array calculated in all ten species.
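The two quantities reported in Table 4 are simple to compute from a sequence. The following is a minimal sketch; the helper name `gc_content` is hypothetical, and the input shown is a single 18 bp repeat from Figure 5A rather than a full repeat array.

```python
def gc_content(seq):
    """Return (length in bp, percentage of G and C bases) for a DNA sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    return len(seq), round(100.0 * gc / len(seq), 2)


# Repeat 1 of S.nudJP18 (Figure 5A): 11 of its 18 bases are G or C.
length, percent_gc = gc_content("CAACCAGGCATGGGTGGA")
print(length, percent_gc)
```

Applied to a whole repeat array, the same function would yield one row of the table.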


Figure 3A: Position 1. [Plot of % free energy change (axis from -25 to 65) for H. pulcherrimus, S. purpuratus, S. droebachiensis, A. fragilis, S. pallidus, S. franciscanus, S. nudus, P. depressus, L. pictus, and L. variegatus; series P1CAA, P1CAG.]

Figure 3B: Position 2. [Plot of % free energy change (axis from -25 to 65) for the same ten species; series P2CCA, P2CCC, P2CCG, P2CCU.]


Figure 3C: Position 3. [Plot of % free energy change (axis from -25 to 65) for each of the ten species; series P3GGA, P3GGC, P3GGG, P3GGU.]

Figure 3D: Position 4. [Plot of % free energy change (axis from -25 to 65) for each of the ten species; series P4UUC, P4UUU, P4AUG, P4GUA, P4GUC, P4GUG, P4GUU, P4UGG.]


Figure 3E: Position 5. [Plot of % free energy change (axis from -25 to 65) for each of the ten species; series P5GGA, P5GGC, P5GGG, P5GGU.]

Figure 3F: Position 6. [Plot of % free energy change (axis from -25 to 65) for each of the ten species; series P6GGA, P6GGC, P6GGG, P6GGU.]


Figure 3G: Position 7. [Plot of % free energy change (axis from -25 to 65) for each of the ten species; series P7CAA, P7CAG, P7AGA, P7AGG, P7CGA, P7CGC, P7CGG, P7CGU.]

Figures 3A-G: Graphical representation of the change in mRNA secondary structure stability produced by an altered mRNA sequence. The original mRNA sequence for a representative of each species was altered to reflect a codon usage frequency not found in nature. In all cases, only one position was changed, by substituting all the codons in that position with a single codon. Only codons for amino acids that were found in that position were used. The altered mRNA sequence was submitted to RNAfold (Hofacker et al. 2000, http://rna.tbi.univie.ac.at/, updated 2003) and the free energy of the thermodynamic ensemble was calculated. The free energy of the altered mRNA sequence was divided by the free energy of the real mRNA sequence and then multiplied by 100 to give a percent change.
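The substitution and percentage steps in the legend to Figure 3 can be sketched as follows. All function names are hypothetical, RNAfold itself is not called, and the two energies are made-up numbers used only to show the arithmetic. As worded, the legend's formula returns 100 when the energies are equal; because the plotted axes run from -25 to 65, the sketch also offers a variant that subtracts 100, which is an assumption about what was plotted rather than something the legend states.

```python
def substitute_position(repeats, position, codon):
    """Replace the codon at 1-based `position` in every repeat that has it."""
    idx = position - 1
    return [
        r[:idx] + [codon] + r[idx + 1:] if len(r) > idx else list(r)
        for r in repeats
    ]


def percent_of_real(altered_energy, real_energy):
    """Legend formula as worded: altered / real * 100."""
    return altered_energy / real_energy * 100.0


def percent_change(altered_energy, real_energy):
    """Assumed plotted quantity: the legend formula minus 100."""
    return percent_of_real(altered_energy, real_energy) - 100.0


# Two repeats of S.nudJP18 (Figure 5A): repeat 1 (18 bp) and repeat 4 (21 bp).
repeats = [
    "CAA CCA GGC ATG GGT GGA".split(),
    "CAA CCA GGT TTT GGA GGG CGA".split(),
]
altered = substitute_position(repeats, 3, "GGA")
altered_mrna = "".join("".join(r) for r in altered)
print(altered_mrna)

# Hypothetical ensemble free energies (kcal/mol), NOT values from the thesis.
real_dg, altered_dg = -200.0, -250.0
print(percent_of_real(altered_dg, real_dg), percent_change(altered_dg, real_dg))
```

In practice the altered sequence string would be passed to a folding program and the returned ensemble free energy fed into one of the two percentage functions.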


Figure 4. [Dot plots. Species shown: S. purpuratus, H. pulcherrimus. Window sizes shown: 21, 39, 42, 78, 99*.]


Figure 4 (continued). [Dot plots. Species shown: S. franciscanus, L. pictus, L. variegatus. Window sizes shown: 21, 42, 84.]

Figure 4: Dot plot analysis of the SM50 repeat array of selected species compared to themselves. The number on the left corresponds to the window of similarity analyzed. Dot plots were run for windows corresponding to various combinations of 18 bp and 21 bp SM50 repeats, using thresholds allowing up to three substitutions. Only the informative plots are shown here. All plots shown are perfectly matched except for the S. purpuratus 99 bp analysis, where three mismatches were allowed (marked by an asterisk). Parallel lines represent locations where the sequences meet the required amount of similarity (identical or within a few base pairs) within the analysis. The regions of duplication are in different locations in each species, illustrating that concerted evolution occurred after speciation. In addition, duplications are seen in larger analysis windows, suggesting the SM50 repeat arrays could have expanded and contracted through different patterns of duplications greater than a single SM50 repeat.
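The windowed comparison behind a dot plot can be sketched as below. This is a generic illustration of the technique, not the software used for Figure 4; the function name `dot_plot` and its parameters are assumptions. The example compares a tandem pair of identical 21 bp repeats (S.nudJP18 repeats 15 and 16, Figure 5A) against itself with a 21 bp window and no mismatches allowed.

```python
def dot_plot(seq_a, seq_b, window, max_mismatches=0):
    """Return (i, j) window start pairs whose windows differ at no more
    than `max_mismatches` sites. Each pair is one dot on the plot."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            win_b = seq_b[j:j + window]
            mismatches = sum(1 for x, y in zip(win_a, win_b) if x != y)
            if mismatches <= max_mismatches:
                hits.append((i, j))
    return hits


# S.nudJP18 repeats 15 and 16 are identical 21 bp units in tandem.
unit = "CAACCAGGCATGGGTGGACAG"
seq = unit * 2
hits = dot_plot(seq, seq, window=21)

# Dots off the main diagonal mark the tandem duplication.
off_diagonal = [(i, j) for i, j in hits if i != j]
print(off_diagonal)
```

Off-diagonal dots offset by one repeat length are what appear as parallel lines in the figure; raising `max_mismatches` corresponds to the relaxed thresholds mentioned in the legend.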


Figure 5A

SM50
repeat #  S.nudJP18                    S.nudJP17
1         CAA CCA GGC ATG GGT GGA      CAA CCA GGC ATG GGT GGA
2         CAA CCA GGC ATG GGA GGG CGA  CAA CCA GGC ATG GGA GGG CGA
3         CAA CCA GGC ATG GGC GGA CAA  CAA CCA GGC ATG GGC GGA CAA
4         CAA CCA GGT TTT GGA GGG CGA  CAA CCA GGT TTT GGA GGG CGA
5         CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
6         CAA CCA GGT TTT GGA GGG CGA  CAA CCA GGT TTT GGA GGG CGA
7         CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
8         CAG CCA GGT ATG GGA GGG CGA  CAA CCA GGT ATG GGA GGG CGA
9         CAA CCA GGC ATG GGT GGA CAG  CAA CCA GGC ATG GGT GGA CAG
10        CAA CCA GGC ATG GGT GGA      CAA CCA GGC ATG GGT GGA
11        CAA CCA GGC ATG GGA          CAA CCA GGC ATG GGA
12        CAA CCA GGC ATG GGA          CAA CCA GGC ATG GGA
13        CAA CCA GGC ACT GGA          CAA CCA GGC ACT GGA
14        CAA CCA GGC ATG GGC GGG CGA  CAA CCA GGC ATG GGC GGG CGA
15        CAA CCA GGC ATG GGT GGA CAG  CAA CCA GGC ATG GGT GGA CAG
16        CAA CCA GGC ATG GGT GGA CAG  CAA CCA GGC ATG GGT GGA
17        CAA CCA GGC ATG GGT GGA      CGA CCA GGC ATG GGT GGG CAG
18        CGA CCA GGC ATG GGT GGG CAG


Figure 5B

SM50
repeat #  S.droWA30                    S.droWA28a
1         CAG CCG GGC ATG GGA          CAA CCG GGC ATG GGA
2         CAA GGC GGC TTT GGT AAT CAA  CAA GGC GGC TTT GGT AAT CAA
3         CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
4         CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
5         CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
6         CAA CCA GGC TGG GGT GGA CAA  CAA CCA GGC TGG GGT GGA CAA
7         CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGA GGG CGA
8         CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
9         CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
10        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
11        CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
12        CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
13        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
14        CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGA GTG GGT GGG CGA
15        CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
16        CAA CCC GGC ATG GGT GGA CAA  CAA CCC GGC ATG GGT GGA CAA
17        CAA CCA GGT GTG GGA GGG CAA  CAA CCA GGT GTG GGA GGG CAA
18        CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
19        CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
20        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
21        CAA CCA GGT GTG GGT GGA CGA  CAA CCA GGT GTG GGT GGA CGA
22        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT
23        CAA CCA GGT GTG GGT GGA CGA  CAG CCA GGT GTG GGT GGA CAA
24        CAA CCA GGC TTT GGT AAT      CAA CCA GGC ATG GGT GGA CAA
25        CAG CCA GGT GTG GGT GGA CAA  CAA CCA GGT GTG GGA GGG CGA
26        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT
27        CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGT GGG CGA
28        CAA CCA GGC TTT GGT AAT      CAA CCA GGC ATG GGT GGC CAG
29        CAA CCA GGT GTG GGT GGG CGA
30        CAA CCA GGC ATG GGT GGC CAG


Figure 5C

SM50
repeat #  S.palNor30                   S.palNor27
1         CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
2         CAA CCA GGC TGG GGT GGA CAA  CAA CCA GGC TGG GGT GGA CAA
3         CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGA GGG CGA
4         CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
5         CAA CCC GGT GTT GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
6         CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
7         CAC CCA GGA GTG GGT GGG CGA  CAA CCA GGA GTG GGT GGG CGA
8         CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
9         CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
10        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
11        CAA CCC GGA GTG GGT GGG CGA  CAA CCC GGA GTG GGT GGG CGA
12        CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
13        CAA CCC GGT GTG GGT GGA CAA  CAA CCC GGT GTG GGT GGA CAA
14        CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGA GTT GGT GGG CGT
15        CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
16        CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGC GTG GGT GGG CGA
17        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
18        CAA CCA GGT GTG GGT GGA CGA  CAA CCA GGT GTG GGT GGA CGA
19        CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
20        CAA CCA GGT GTG GGT GGA GGA  CAA CCA GGT GTG GGT GGA CGA
21        CAA CCC GGC ATG GGT GGA CAA  CAA CCC GGC ATG GGT GGA CAA
22        CAA CCA GGC GTG GGT GGG CGA  CAA CCA GGC GTG GGT GGG CGA
23        CAA CCC GGC ATG GGT GGA CAA  CAA CCC GGC ATG GGT GGA CAA
24        CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGT GGG CGA
25        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGG CAG
26        CAA CCC GGC GTG GGT GGG CGA  CAA CCA GGC ATG GGC GGG CGA
27        CAA CCA GGC ATG GGT GGG CAG  CAA CCA GGC ATG GGT GGG CAG
28        CAA CCA GGC ATG GGT GGG CAG
29        CAA CCA GGC ATG GGC GGG CGA
30        CAA CCA GGC ATG GGT GGG CAG


Figure 5D

SM50                                                                S.palHYP (hypothetical
repeat #  S.palWA32                    S.palWA24                    product, not seen)
1         CAA CCG GGC ATG GGA          CAA CCG GGC ATG GGA          CAA CCG GGC ATG GGA
2         CAA GGC GGC TTT GGT AAT CAA  CAA GGC GGC TTT GGT AAT CAA  CAA GGC GGC TTT GGT AAT CAA
3         CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
4         CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
5         CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
6         CAA CCA GGC TGG GGT GGA CAA  CAA CCA GGC TGG GGT GGA CAA  CAA CCA GGC TGG GGT GGA CAA
7         CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGA GGG CGA
8         CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
9         CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
10        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
11        CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGA GTG GGT GGG CGA
12        CAA CCA GGC TGG GGT AAT      CAA CCA GGC TTT GGT AAT      CAA CCA GGC TGG GGT AAT
13        CAA CCC GGT GTT GGT GGG CGA  CAA CCA GGT GTT GGT GGG CGA  CAA CCC GGT GTT GGT GGG CGA
14        CAA CCA GGG ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGG ATG GGT GGA CAA
15        CAA CCA GGA GTG GGT GGG CAA  CAA CCT GGT GTG GGT GGG CGA  CAA CCA GGA GTG GGT GGG CAA
16        CAA CCA GGC TGG GGT AAT      CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT
17        CAA CCC GGT GTT GGT GGG CGA  CGA CCA GGT GTG GGT GGA CGA  CAA CCA GGT GTT GGT GGG CGA
18        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT      CAA CCA GGC ATG GGT GGA CAA
19        CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGT GTG GGT GGA CGA  CAA CCT GGT GTG GGT GGG CGA
20        CAA CCA GGC TTT GGT AAT      CAA CCC GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGA CAA
21        CAA CCA GGT GTT GGT GGG CGA  CAA CCT GGC GTT GGA GGG CGA  CAA CCA GGT GTG GGT GGA CGA
22        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
23        CAA CCT GGT GTG GGT GGG CGA  CAA CCA GGT GTG GGT GGG CGA  CAA CCA GGT GTG GGT GGA CGA
24        CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGC CAG  CAA CCC GGC ATG GGT GGG CAA
25        CAA CCA GGT GTG GGT GGA CGA                               CAA CCT GGC GTT GGA GGG CGA
26        CAA CCA GGC TTT GGT AAT                                   CAA CCA GGC TTT GGT AAT
27        CAA CCA GGT GTG GGT GGA CGA                               CAA CCA GGT GTG GGT GGG CGA
28        CAA CCC GGC ATG GGT GGG CAA                               CAA CCA GGC ATG GGT GGC CAG
29        CAA CCT GGC GTT GGA GGG CGA
30        CAA CCA GGC TTT GGT AAT
31        CAA CCA GGT GTG GGT GGG CGA
32        CAA CCA GGC ATG GGT GGC CAG

Figure 5: The SM50 repeat arrays of S. droebachiensis, S. pallidus, and S. nudus alleles used in this study. Figure 5A: The longer allele in S. nudus (S.nudJP18) differs from the shorter allele (S.nudJP17) by a single 21 bp SM50 repeat. Figure 5B: The pattern of 18 and 21 bp SM50 repeats in S. droebachiensis is conserved, but the longer allele (S.droWA30) differs from the shorter allele (S.droWA28a) by two SM50 repeat units. Figure 5C-D: In S. pallidus, two different patterns of SM50 repeats were found. The alleles from Norway (S.palNor30 and S.palNor27) differ by 63 bp (Figure 5C). The alleles from Washington (S.palWA32 and S.palWA24, Figure 4B) differ by 8 SM50 repeats (Figure 5D). Possible models for the creation of the smaller alleles from the larger are based on these alleles (Figures 8-14).


Figure 6A
a. 21 bp window size, perfect match
b. 21 bp window size, perfect match
c. 18 bp window size, perfect match

Figure 6B
a. 42 bp window size, perfect match
b. 42 bp window size, perfect match
c. 21 bp window size, 1 mismatch


Figure 6C
a. 63 bp window size, 1 mismatch
b. 63 bp window size, 4 mismatches

Figure 6D
a. 162 bp window size, 2 mismatches
b. 162 bp window size, 4 mismatches


Figure 6D (continued)
c. 81 bp window size, 2 mismatches
d. 81 bp window size, 2 mismatches
e. 81 bp window size, 3 mismatches
f. 81 bp window size, 2 mismatches
g. 81 bp window size, 2 mismatches
h. 81 bp window size, 2 mismatches

Figure 6A-D: Dot plot analysis of alleles found in S. nudus (S.nudJP17, S.nudJP18; Figure 6A), S. droebachiensis (S.droWA30, S.droWA28a; Figure 6B), and S. pallidus from Norway (S.palNor27, S.palNor30; Figure 6C) and Washington (S.palWA32, S.palWA24; Figure 6D), illustrating the differences between similar alleles from the same species. Alleles were compared to each other to identify regions of duplication or deletion between them, and then compared to themselves to identify areas of sequence similarity that allow misalignment. The window size and number of mismatches are recorded below each dot plot. The center-line, shown in bold, indicates where the alleles meet the required sequence identity. The break in the center-line indicates the region where extra SM50 repeats occur in the longer allele but not in the shorter. Grey dotted lines indicate where on the longer allele the extra SM50 repeats are located, or regions of possible misalignment. Because the longer allele is always on the horizontal axis, overlapping center-lines mean that the extra SM50 repeats are tandemly repeated in the longer allele. Parallel lines that are discussed are in bold.


Figure 7: Schematic of larger-order duplications in SM50 repeats. Only the SM50 repeat array portion of the genes is shown. The larger, open rectangles represent 21 bp SM50 repeats; the smaller, dark rectangles represent 18 bp SM50 repeats. Open or dark rectangles are not meant to convey sequence information; the sequences vary in those SM50 repeats, and the sequence relationship among those SM50 repeats is not shown here. The patterned rectangles indicate that similar patterns share sequence identity. The brackets above each gene indicate where larger duplications may have occurred. In S. franciscanus there are two possible duplication events that could have led to the observed pattern.


Figure 8A

Repeat #  S.nudJP18              S.nudJP17
14        CAACCAGGCATGGGCGGGCGA  CAACCAGGCATGGGCGGGCGA
15        CAACCAGGCATGGGTGGACAG  CAACCAGGCATGGGTGGACAG
16        CAACCAGGCATGGGTGGACAG  ---------------------
17        CAACCAGGCATGGGTGGA     CAACCAGGCATGGGTGGA
18        CGACCAGGCATGGGTGGGCAG  CGACCAGGCATGGGTGGGCAG

[In the original figure, amino acid letters with wobble-position superscripts appear below the alignment.]

Figure 8B. [Diagram: two copies of S.nudJP18 misaligned, with looped-out regions and a crossover point.]

Figure 8C.
Long product (not observed): repeats 13, 14, 15, 16, 16, 17, 18
Short product (S.nudJP17): repeats 13, 14, 15, 17, 18

Figure 8: Model for the creation of S.nudJP17 from the misalignment and crossover of two S.nudJP18 alleles. Figure 8A: Sequences of the SM50 repeat array in S.nudJP18 and S.nudJP17 are aligned and numbered according to the longer allele (S.nudJP18). Letters below the alignment correspond to the amino acids present in the protein, with the base at the wobble position indicated by a superscript. S.nudJP17 is lacking 21 bp found in SM50 repeat #16 of S.nudJP18 (Figure 5A). Figure 8B: The amino acid sequence and dot plot analysis (Figure 6A) were used as a guide to misalign two copies of the longer allele (S.nudJP18). Areas that are looped out to create the misalignment are in italics and indicated by braced lines. The location of a possible crossover event within the area of misalignment is illustrated by crossed lines. Figure 8C: The amino acid sequences of the two products that would result from the crossover illustrated in Figure 8B. The shorter product is missing SM50 repeat #16 and therefore contains the same sequence as S.nudJP17. The longer product is not observed.
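The misalignment-and-crossover model in Figure 8 can be sketched at the level of whole repeat units. This is a deliberate simplification and an assumption of the example: the figure places the crossover within a misaligned region rather than strictly at a repeat boundary, and the function name `unequal_crossover` and its `shift` and `i` parameters are hypothetical. The input is S.nudJP18 repeats 14-18 as given in Figure 8A.

```python
def unequal_crossover(units, shift, i):
    """Cross over two copies of the same repeat array misaligned by `shift` units.

    Unit `i` of one copy (0-based) is paired with unit `i + shift` of the
    other, and the exchange happens just after that pair. Returns
    (short_product, long_product), which lose and gain `shift` units.
    """
    if shift < 1 or i < 0 or i + shift >= len(units):
        raise ValueError("pairing falls outside the repeat array")
    short_product = units[:i + 1] + units[i + shift + 1:]
    long_product = units[:i + shift + 1] + units[i + 1:]
    return short_product, long_product


# S.nudJP18 repeats 14-18 (Figure 8A).
jp18 = [
    "CAACCAGGCATGGGCGGGCGA",  # 14
    "CAACCAGGCATGGGTGGACAG",  # 15
    "CAACCAGGCATGGGTGGACAG",  # 16
    "CAACCAGGCATGGGTGGA",     # 17
    "CGACCAGGCATGGGTGGGCAG",  # 18
]
# The same region of S.nudJP17, which lacks one 21 bp repeat.
jp17 = [jp18[0], jp18[1], jp18[3], jp18[4]]

# Pair repeat 15 of one copy with repeat 16 of the other (offset of one unit).
short, long_product = unequal_crossover(jp18, shift=1, i=1)
print(short == jp17, len(long_product))
```

Under these assumptions the short product reproduces the S.nudJP17 region and the long product carries an extra copy of the 21 bp repeat, corresponding to the unobserved product in Figure 8C.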


Figure 9A

Repeat #  S.nudJP18              S.nudJP17
14        CAACCAGGCATGGGCGGGCGA  CAACCAGGCATGGGCGGGCGA
15        CAACCAGGCATGGGTGGACAG  CAACCAGGCATGGGTGGACAG
15        CAACCAGGCATGGGTGGACAG  ---------------------
16        CAACCAGGCATGGGTGGA     CAACCAGGCATGGGTGGA
17        CGACCAGGCATGGGTGGGCAG  CGACCAGGCATGGGTGGGCAG

[In the original figure, amino acid letters with wobble-position superscripts appear below the alignment.]

Figure 9B. [Diagram: two copies of S.nudJP17 misaligned, with looped-out regions and a crossover point.]

Figure 9C.
Short product (not observed): repeats 13, 14, 16, 17
Long product (S.nudJP18): repeats 13, 14, 15, 15, 17, 18

Figure 9: Model for the creation of S.nudJP18 from the misalignment and crossover of two S.nudJP17 alleles. Figure 9A: S.nudJP18 and S.nudJP17 sequences are aligned and numbered according to the SM50 repeat numbers given to S.nudJP17 (Figure 5A). Letters below the alignment correspond to the amino acids present in the protein, with the base at the wobble position indicated by a superscript. SM50 repeat #15 is found twice in S.nudJP18. Figure 9B: The amino acid sequence and dot plot analysis (Figure 6A) were used as a guide to misalign two copies of the shorter allele (S.nudJP17). Areas that are looped out to create the misalignment are indicated by italics and braced lines. The location of a possible crossover event within the area of misalignment is illustrated by crossed lines. Figure 9C: The amino acid sequences of the two products that would result from the crossover illustrated in Figure 9B. The longer product contains a duplicate SM50 repeat #15 and therefore contains the same sequence as S.nudJP18. The shorter product is not observed.


Figure 10A

Repeat #  S.droWA30              S.droWA28a
19        CAACCCGGTGTGGGTGGGCGA  CAACCCGGTGTGGGTGGGCGA
20        CAACCAGGCATGGGTGGACAA  CAACCAGGCATGGGTGGACAA
21        CAACCAGGTGTGGGTGGACGA  CAACCAGGTGTGGGTGGACGA
22        CAACCAGGCATGGGTGGACAA  ---------------------
23        CAACCAGGTGTGGGTGGACGA  ---------------------
24        CAACCAGGCTTTGGTAAT     CAACCAGGCTTTGGTAAT
25        CAGCCAGGTGTGGGTGGACAA  CAGCCAGGTGTGGGTGGACAA

[In the original figure, amino acid letters with wobble-position superscripts appear below the alignment.]

Figure 10B. [Diagram: two copies of S.droWA30 misaligned, with looped-out regions and a crossover point.]

Figure 10C.
Long product (not observed): repeats 19, 20, 21, 22, 23, 22, 23, 24, 25
Short product (S.droWA28a): repeats 19, 20, 21, 24, 25

Figure 10: Model for the creation of S.droWA28a from the misalignment and crossover of two S.droWA30 alleles. Figure 10A: S.droWA30 and S.droWA28a sequences are aligned and numbered according to S.droWA30 (Figure 5B). Letters below the alignment correspond to the amino acids present in the protein, with the base at the wobble position indicated by a superscript. S.droWA28a is lacking 42 bp found in SM50 repeats #22 and #23 of S.droWA30. Figure 10B: The amino acid sequence and dot plot analysis (Figure 6B) were used as a guide to misalign two copies of the longer allele (S.droWA30). Areas that are looped out to create the misalignment are indicated by italics and braced lines. The location of a possible crossover event within the area of misalignment is illustrated by crossed lines. Figure 10C: The amino acid sequences of the two products that would result from the crossover illustrated in Figure 10B are shown. The shorter product is missing SM50 repeats #22 and #23 and therefore contains the same sequence as S.droWA28a. The longer product is not observed.


Figure 11A

Repeat #  S.droWA30              S.droWA28a
19        CAACCCGGTGTGGGTGGGCGA  CAACCCGGTGTGGGTGGGCGA
20        CAACCAGGCATGGGTGGACAA  CAACCAGGCATGGGTGGACAA
21        CAACCAGGTGTGGGTGGACGA  CAACCAGGTGTGGGTGGACGA
20        CAACCAGGCATGGGTGGACAA  ---------------------
21        CAACCAGGTGTGGGTGGACGA  ---------------------
22        CAACCAGGCTTTGGTAAT     CAACCAGGCTTTGGTAAT
23        CAGCCAGGTGTGGGTGGACAA  CAGCCAGGTGTGGGTGGACAA

[In the original figure, amino acid letters with wobble-position superscripts appear below the alignment.]

Figure 11B. [Diagram: two copies of S.droWA28a misaligned, with looped-out regions and a crossover point.]

Figure 11C.
Long product (S.droWA30): repeats 19, 20, 21, 20, 21, 22, 23
Short product (not observed): repeats 19, 22, 23

Figure 11: Model for the creation of S.droWA30 from the misalignment and crossover of two S.droWA28a alleles. Figure 11A: S.droWA30 and S.droWA28a sequences are aligned and numbered according to S.droWA28a (Figure 5B). Letters below the alignment correspond to the amino acids present in the protein, with the base at the wobble position indicated by a superscript. S.droWA30 contains a duplicate of the 42 bp found in SM50 repeats #20 and #21 of S.droWA28a. Figure 11B: The amino acid sequence and dot plot analysis (Figure 6B) were used as a guide to misalign two copies of the shorter allele (S.droWA28a). Areas that are looped out to create the misalignment are indicated by italics and braced lines. The location of a possible crossover event within the area of misalignment is illustrated by crossed lines. Figure 11C: The amino acid sequences of the two products that would result from the crossover illustrated in Figure 11B are shown. The longer product contains duplicate SM50 repeats #20 and #21 and therefore contains the same sequence as S.droWA30. The shorter product is not observed.


Figure 12A

Repeat #  S.palNor30             S.palNor27
24        CAACCAGGTGTGGGAGGGCGA  CAACCAGGTGTGGGTGGGCGA
25        CAACCAGGCATGGGTGGACAA  ---------------------
26        CAACCCGGCGTGGGTGGGCGA  ---------------------
27        CAACCAGGCATGGGTGGGCAG  ---------------------
28        CAACCAGGCATGGGTGGGCAG  CAACCAGGCATGGGTGGGCAG
29        CAACCAGGCATGGGCGGGCGA  CAACCAGGCATGGGCGGGCGA
30        CAACCAGGCATGGGTGGGCAG  CAACCAGGCATGGGTGGGCAG

[In the original figure, amino acid letters with wobble-position superscripts appear below the alignment.]

Figure 12B. [Diagram: two copies of S.palNor30 misaligned, with looped-out regions and a crossover point.]

Figure 12C.
Long product (not observed): repeats 24, 25, 26, 27, 25, 26, 27, 28, 29, 30
Short product (S.palNor27): repeats 24, 28, 29, 30

Figure 12: Model for the creation of S.palNor27 from the misalignment and crossover of two S.palNor30 alleles. Figure 12A: S.palNor27 and S.palNor30 sequences are aligned and numbered according to S.palNor30 (Figure 5C). Letters below the alignment correspond to the amino acids present in the protein, with the base at the wobble position indicated by a superscript. S.palNor27 is lacking 63 bp found in SM50 repeats #25-#27 of S.palNor30. Figure 12B: The amino acid sequence and dot plot analysis were used to misalign two copies of the longer allele (S.palNor30). Areas that are looped out to create the misalignment are indicated by italics and braced lines. The location of a possible crossover event within the area of misalignment is illustrated by crossed lines. Figure 12C: The amino acid sequences of the two products that would result from the crossover illustrated in Figure 12B. The shorter product is missing SM50 repeats #25-#27 and therefore contains the same sequence as S.palNor27. The longer product is not observed. Mismatches in the misaligned region are in bold.


Figure 13A

Repeat #  S.palWA32              S.palWA24
7         CAACCAGGTGTGGGAGGGCGA  CAACCAGGTGTGGGAGGGCGA
8         CAACCAGGCTGGGGTAAT     CAACCAGGCTGGGGTAAT
9         CAACCCGGTGTGGGTGGGCGA  CAACCCGGTGTGGGTGGGCGA
10        CAACCAGGCATGGGTGGACAA  CAACCAGGCATGGGTGGACAA
11        CAACCAGGAGTGGGTGGGCGA  CAACCAGGAGTGGGTGGGCGA
12        CAACCAGGCTGGGGTAAT     ------------------
13        CAACCCGGTGTTGGTGGGCGA  ---------------------
14        CAACCAGGGATGGGTGGACAA  ---------------------
15        CAACCAGGAGTGGGTGGGCAA  ---------------------
16        CAACCAGGCTGGGGTAAT     ------------------
17        CAACCCGGTGTTGGTGGGCGA  ---------------------
18        CAACCAGGCATGGGTGGACAA  ---------------------
19        CAACCAGGAGTGGGTGGGCGA  ---------------------
20        CAACCAGGCTTTGGTAAT     CAACCAGGCTTTGGTAAT
21        CAACCAGGTGTTGGTGGGCGA  CAACCAGGTGTTGGTGGGCGA

[In the original figure, amino acid letters with wobble-position superscripts appear below the alignment.]

Figure 13B. [Diagram: two copies of S.palWA32 misaligned, with looped-out regions and a crossover point.]

Figure 13C.
Long product (not observed): repeats 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 12, 13, 14, 15, 16, 17, 18, 19, 20
Short product (S.palWA24): repeats 7, 8, 9, 10, 11, 20, 21

Figure 13: Model for the creation of S.palWA24 from misalignment and crossover of two S.palWA32 alleles. Figure 13A: S.palWA32 and S.palWA24 sequences are aligned and numbered according to S.palWA32 (Figure 5D). Letters below the alignment correspond to the amino acids present in the protein, with the base at the wobble position in superscript. S.palWA24 is lacking 162 bp found in SM50 repeats #12-#19 of S.palWA32. Figure 13B: The amino acid sequence and dot plot analysis were used to misalign two copies of S.palWA32. Italics and braced lines indicate areas that are looped out to create the misalignment. Crossed lines illustrate the location of a possible crossover event. Figure 13C: The amino acid sequences of the two products that would result from the crossover illustrated in Figure 13B. The shorter product is missing SM50 repeats #12-#19 and therefore contains the same sequence as S.palWA24. The longer product is not observed.


[Figure 14, panels A-C: sequence alignment, misaligned allele copies, and crossover products. Image content not reproduced.]
Figure 14: Model for the creation of S.palWA24 and S.palWA32 from the misalignment and crossover of a hypothetical product (S.palHYP, shown in gray).
Figure 14A: S.palWA32, S.palHYP, and S.palWA24 sequences are aligned and numbered according to S.palWA32 (Figure 5D). Numbers in parentheses indicate SM50 repeats that are identical in sequence (#12-#15 are identical to #16-#19). Letters below the alignment correspond to the amino acids present in the protein, with the base of the wobble position in superscript. S.palWA24 is lacking SM50 repeats #12-#19. S.palHYP is lacking SM50 repeats #16-#19.
Figure 14B: Italics and braced lines indicate areas that are looped out to create the misalignment. Crossed lines illustrate the location of a possible crossover event.
Figure 14C: The amino acid sequences of the two resulting products. The longer product contains additional SM50 repeats #12-#15, which are the same sequence as #16-#19, and therefore contains the same sequence as S.palWA32. The shorter product is missing SM50 repeats #12-#19 and therefore contains the same sequence as S.palWA24. Mismatches in the misalignment region are in bold.


References

Addison, J. A., and M. W. Hart 2005. Colonization, Dispersal, and Hybridization Influence Phylogeography of North Atlantic Sea Urchins (Strongylocentrotus droebachiensis). Evolution 59(3):532-543.
Addison, J. A., and M. W. Hart 2004. Analysis of population genetic structure of the green sea urchin (Strongylocentrotus droebachiensis) using microsatellites. Marine Biology 144:243-251.
Andersen, S. 1970. Amino acid composition of spider silks. Comparative Biochemistry and Physiology 35:705-711.
Archetti, M. 2004. Selection on codon usage for error minimization at the protein level. Journal of Molecular Evolution 59:400-415.
Arnone, M. I., L. D. Bogarad, A. Collazo, C. V. Kirchhamer, A. R. Cameron, J. P. Rast, A. Gregorians, and E. H. Davidson 1997. Green Fluorescent Protein in the sea urchin: New experimental approaches to transcriptional regulatory analysis in embryos and larvae. Development 124:4649-4659.
Auffray, C., S. Imbeaud, M. Roux-Rouquié, and L. Hood 2003. From Functional Genomics to Systems Biology: Concepts and Practices. Comptes Rendus Biologies 326:879-892.
Baldi, P., S. Brunak, Y. Chauvin, and A. G. Pedersen 1999. Structural basis for triplet repeat disorders: a computational analysis. Bioinformatics 15(11):918-929.
Bazhin, A. 1998. The sea urchin genus Strongylocentrotus in the seas of Russia: taxonomy and ranges. In: Mooi, R. and Telford, M. (Eds), Echinoderms: San Francisco. A.A. Balkema, Rotterdam, Netherlands. 563-566.
Benson, S., H. Sucov, L. Stephens, E. H. Davidson, and F. H. Wilt 1987. A lineage-specific gene encoding a major matrix protein of the sea urchin embryo spicule. Authentication of the cloned gene and its developmental expression. Developmental Biology 120(2):499-506.
Biermann, C. H. 1998. The Molecular Evolution of Sperm Bindin in Six Species of Sea Urchins (Echinoida: Strongylocentrotidae). Molecular Biology and Evolution 15(12):1761-1771.


Biermann, C. H., B. D. Kessing, and S. R. Palumbi 2003. Phylogeny and development of marine model species: Strongylocentrotid sea urchins. Evolution and Development 5(4):360-71.
Brusca, R. C., and G. J. Brusca 2003. Invertebrates. Sunderland, Massachusetts: Sinauer Associates, Inc.
Burton, R. S. 1998. Intraspecific phylogeography across the Point Conception biogeographic boundary. Evolution 52(3):734-745.
Cameron, R. A., G. Mahairas, J. P. Rast, P. Martinez, T. R. Biondi, S. Swartzell, J. C. Wallace, A. J. Poustka, B. T. Livingston, G. A. Wray, C. A. Ettensohn, H. Lehrach, R. J. Britten, E. H. Davidson, and L. Hood 2000. A Sea Urchin Genome Project: Sequence scan, virtual map, and additional resources. Proceedings of the National Academy of Sciences 97(17):9514-9518.
Carlini, D. B., Y. Chen, and W. Stephan 2001. The relationship between third codon position nucleotide content, codon bias, mRNA secondary structure, and gene expression in the drosophilid alcohol dehydrogenase genes Adh and Adhr. Genetics 159(2):623-633.
Craig, C. L., and C. Riekel 2002. Comparative architecture of silks, fibrous proteins and their encoding genes in insects and spiders. Comparative Biochemistry and Physiology Part B 133:493-507.
Davidson, E. H., A. Cameron, and A. Ransick 1998. Specification of cell fate in the sea urchin embryo: summary and some proposed models. Development 125:3269-3290.
Davidson, E. H., D. R. McClay, and L. Hood 2003. Regulatory gene networks and the properties of the developmental process. Proceedings of the National Academy of Sciences USA 100:1475-1480.
Dover, G. A. 1982. Molecular drive: a cohesive mode of species evolution. Nature 299:111-117.
Dover, G. A. 1986. Molecular drive in multigene families: how biological novelties arise, spread and are assimilated. Trends in Genetics 2:159-165.
Dover, G. A. 1993. Evolution of genetic redundancy for advanced players. Current Opinion in Genetics and Development 3(6):902-10.
Drickamer, K. 1988. Two Distinct Classes of Carbohydrate-recognition Domains in Animal Lectins. Journal of Biological Chemistry 263(20):9557-9560.


Elder, J. F. Jr., and B. J. Turner 1995. Concerted evolution of repetitive DNA sequences in eukaryotes. The Quarterly Review of Biology 70(3):297-320.
Emlet, R. B. 1995. Developmental mode and species geographic range in regular sea urchins (Echinodermata: Echinoidea). Evolution 49(3):476-489.
Galtier, N., G. Piganeau, D. Mouchiroud, and L. Duret 2001. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159:907-911.
Graur, D., and W. Li 2000. Fundamentals of molecular evolution. Sinauer Associates, Sunderland, MA.
Guisez, Y., J. Robbens, E. Remaut, and W. Fiers 1993. Folding of the MS2 coat protein in Escherichia coli is modulated by translational pauses resulting from mRNA secondary structure and codon usage: A hypothesis. Journal of Theoretical Biology 162:243-252.
Hayashi, C. Y., and R. V. Lewis 2001. Spider flagelliform silk: lessons in protein design, gene structure, and molecular evolution. BioEssays 23:750-756.
Hayashi, C. Y., and R. V. Lewis 2000. Molecular architecture and evolution of a modular spider silk protein gene. Science 287(5457):1477-9.
Hofacker, I. L. 2003. Vienna RNA secondary structure server. Nucleic Acids Research 31(13):3429-3431. http://rna.tbi.univie.ac.at/
Hofacker, I. L., W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster 1994. Fast folding and comparison of RNA secondary structures (The Vienna RNA Package). Monatshefte für Chemie (Chemical Monthly) 125:167-188.
Illies, M. R., M. T. Peeler, A. M. Dechtiaruk, and C. A. Ettensohn 2002. Identification and developmental expression of new biomineralization proteins in the sea urchin Strongylocentrotus purpuratus. Development Genes and Evolution 212(9):419-31.
Jinks-Robertson, S., and T. Petes 1970. Experimental determination of rates of concerted evolution. Methods in Enzymology 224:631-646.
Katoh-Fukui, Y., T. Noce, T. Ueda, Y. Fujiwara, N. Hashimoto, S. Tanaka, and T. Higashinakagawa 1992. Isolation and characterization of cDNA encoding a spicule matrix protein in Hemicentrotus pulcherrimus micromeres. International Journal of Developmental Biology 36(3):353-61.


Katz, L., and C. B. Burge 2003. Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Research 13:2042-2051.
Killian, C. E., and F. H. Wilt 1996. Characterization of the proteins comprising the integral matrix of Strongylocentrotus purpuratus embryonic spicules. The Journal of Biological Chemistry 271(15):9150-9159.
Killian, C. E., and F. H. Wilt 1989. The accumulation and translation of a spicule matrix protein mRNA during sea urchin embryo development. Developmental Biology 133(1):148-56.
Kimura, M. 1976. How genes evolve; a population geneticist's view. Annales de Génétique 19(3):153-168.
Kimura, M. 1977. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276.
Kimura, M. 1986. DNA and the neutral theory. Philosophical Transactions of the Royal Society of London, Series B 312:343-354.
Klionsky, D. J., D. G. Skalnik, and R. D. Simoni 1986. Differential translation of the genes encoding the proton-translocating ATPase of Escherichia coli. Journal of Biological Chemistry 261(18):8096-8099.
Lee, Y. H. 2003. Molecular phylogenies and divergence times of sea urchin species of Strongylocentrotidae, Echinoida. Molecular Biology and Evolution 20(8):1211-1221.
Lerat, E., P. Capy, and C. Biemont 2002. Codon usage by transposable elements and their host genes in five species. Journal of Molecular Evolution 54:625-637.
Liao, D. 1999. Concerted evolution: molecular model and biological implications. The American Journal of Human Genetics 64:24-30.
Liao, D., T. Pavelitz, J. R. Kidd, K. K. Kidd, and A. M. Weiner 1997. Concerted evolution of the tandemly repeated genes encoding human U2 snRNA (the RNU2 locus) involves rapid intrachromosomal gene conversion. The EMBO Journal 16(3):588-98.
Livingston, B. T., R. Shaw, A. Bailey, and F. H. Wilt 1991. Characterization of a cDNA encoding a protein involved in formation of the skeleton during development of the sea urchin Lytechinus pictus. Developmental Biology 148:473-480.


Lizardi, P. M., V. Mahdavi, D. Shields, and G. Candelas 1979. Discontinuous translation of silk fibroin in a reticulocyte cell-free system and in intact silk gland cells. Proceedings of the National Academy of Sciences 76(12):6211-6215.
Lupski, J. R. 1998. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends in Genetics 14(10):417-422.
Makabe, K. W., C. V. Kirchhamer, R. J. Britten, and E. H. Davidson 1995. Cis-regulatory control of the SM50 gene, an early marker of skeletogenic lineage specification in the sea urchin embryo. Development 121:1957-1970.
Manchenko, G. P., and S. N. Yakovlev 2001. Genetic divergence between three sea urchin species of the genus Strongylocentrotus from the Sea of Japan. Biochemical Systematics and Ecology 29:31-44.
Matsuoka, N. 1987. Biochemical study on the taxonomic situation of the sea urchin Pseudocentrotus depressus. Zoological Science 4:339-347.
Meeds, T., E. Lockard, and B. T. Livingston 2001. Special evolutionary properties of genes encoding a protein with a simple amino acid repeat. Journal of Molecular Evolution 53:180-190.
Mita, K., S. Ichimura, M. Zama, and T. C. James 1988. Specific codon usage patterns and its implications on the secondary structure of silk fibroin mRNA. Journal of Molecular Biology 203:917-925.
Murti, J. R., M. Bumbulis, and J. C. Schimenti 1992. High-frequency germ line gene conversion in transgenic mice. Molecular and Cellular Biology 12(6):2545-52.
Nakamura, Y., T. Gojobori, and T. Ikemura 2000. International DNA sequence databases: status for the year 2000. Nucleic Acids Research 28:292. http://www.kazusa.or.jp/codon/
Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
Ohta, T. 1997. The meaning of near-neutrality at coding and noncoding regions. Gene 205(1-2):261-7.
Ohta, T. 2000. Evolution of gene families. Gene 259:45-52.
Palumbi, S. R., and A. C. Wilson 1990. Mitochondrial DNA diversity in the sea urchins Strongylocentrotus purpuratus and S. droebachiensis. Evolution 44(1):403-415.


Palumbi, S. R., and B. D. Kessing 1991. Population biology of the trans-Arctic exchange: MtDNA sequence similarity between Pacific and Atlantic sea urchins. Evolution 45(8):1790-1805.
Parkin, E. J., and R. K. Butlin 2004. Within and between individual sequence variation among ITS1 copies in the meadow grasshopper Chorthippus parallelus indicates frequent intrachromosomal gene conversion. Molecular Biology and Evolution 21(8):1595-1601.
Parniewski, P., and P. Staczek 2002. Molecular models of TRS instability. AMEDEO 516:125.
Perrière, G., and J. Thioulouse 2002. Use and misuse of correspondence analysis in codon usage studies. Nucleic Acids Research 30(20):4548-4555.
Rast, J. P. 2003. Development gene networks and evolution. Journal of Structural and Functional Genomics 3:225-234.
Sharp, P. M., M. Averof, A. T. Lloyd, G. Matassi, and J. F. Peden 1995. DNA Sequence Evolution: The Sounds of Silence. Philosophical Transactions of the Royal Society B: Biological Sciences 349:241-247.
Smith, A. B. 1988. Phylogenetic relationship, divergence times, and rates of molecular evolution for camarodont sea urchins. Molecular Biology and Evolution 5(4):345-365.
Smith, G. P. 1976. Evolution of repeated DNA sequences by unequal crossover. Science 191:528-535.
Swanson, W. J., and V. D. Vacquier 1998. Concerted evolution in an egg receptor for a rapidly evolving abalone sperm protein. Science 281:710-712.
Teshima, K. M., and H. Innan 2003. The effect of gene conversion on the divergence between duplicated genes. Genetics 166:1553-1560.
Thomas, G. H. 1998. Molecular evolution of spectrin repeats. BioEssays 20(7):600.
Thompson, J. D., D. G. Higgins, and T. J. Gibson 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22(22):4673.
Thompson-Stewart, D., G. H. Karpen, and A. C. Spradling 1994. A transposable element can drive the concerted evolution of tandemly repetitious DNA. Proceedings of the National Academy of Sciences 91:9042-9046.


Ugarkovic, D., and M. Plohl 2002. Variation in satellite DNA profiles: causes and effects. EMBO Journal 21(22):5955-5959.
Walsh, J. B. 1987a. Persistence of tandem arrays: implications for satellite and simple-sequence DNAs. Genetics 115:553-567.
Walsh, J. B. 1987b. Sequence-dependent gene conversion: can duplicated genes diverge fast enough to escape conversion? Genetics 117(3):543-57.
Willie, E., and J. Majewski 2004. Evidence for codon bias selection at the pre-mRNA level in eukaryotes. Trends in Genetics 20(11):534-538.
Wilt, F. H. 1999. Matrix and mineral in the sea urchin larval skeleton. Journal of Structural Biology 126:216-226.
Wilt, F. H. 2002. Biomineralization of the spicules of sea urchin embryos. Zoological Science 19(3):253-61.
Wilt, F. H., C. E. Killian, and B. T. Livingston 2003. Development of calcareous skeletal elements in invertebrates. Differentiation 71(4-5):237-50.
Xu, G., and J. S. Evans 1999. Model peptide studies of sequence repeats derived from the intercrystalline biomineralization protein, SM50. I. GVGGR and GMGGQ repeats. Biopolymers 49:303-312.
Yauk, C. L. 2004. Advances in the application of germline tandem repeat instability for in situ monitoring. Reviews in Mutation Research 566:169-182.
Zigler, K. S., and H. A. Lessios 2004. Speciation on the coasts of the new world: phylogeography and the evolution of bindin in the sea urchin genus Lytechinus. Evolution 58(6):1225-1241.


Appendix 1: DNA sequences of the SM50 repeat array used to calculate codon usage frequencies and used in dot plot analysis. In species where length variation has been detected, the longer allele was used to increase sample size. The SM50 repeat array begins directly after the unique sequence of CCG GAA found in all species, and ends with a proline, asparagine repeated section.

SM50 Repeat #  H. pulcherrimus              S. purpuratus
               GGC CAA                      GGC CAA
1              CAA CCG GGC ATG GGA          CAA CCG GGC ATG GGA
2              CAA GGC --- TTT GGC AAT CAA  CAA GGC GGC TTT GGT AAT CAA
3              CAA CCA GGC TTT GGT AAT      CAA CCA GGC ATG GGT GGG CGA
4              CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC TTT GGT AAT
5              CAA CCA GGC TTT GGC AAT      CAA CCA GGA ATG GGT GGG CGA
6              CAA CCA GGT ATG GGT GGG CGA  CAA CCA GGC TTT GGT AAT
7              CAA CCA GGC TTT GGC AAT      CAA CCA GGA ATG GGA GGG CGA
8              CAA CCA GGT ATG GGT GGG CGA  CAA CCA GGC TGG GGT AAT
9              CAA CCA GGC TTT GGC AAT      CAA CCC GGT GTG GGT GGG CGA
10             CAA CCA GGT GTG GGT GGG CGA  CAA CCA GGC ATG GGT GGA CAA
11             CAA CCA GGC TTT GGT AAT      CAA CCA GGC TGG GGT AAT
12             CAA CCC GGC ATG GGT GGG CGA  CAA CCC GGT GTG GGT GGA CGA
13             CAA CCA GGC TTT GGC AAT      CAA CCA GGC ATG GGT GGA
14             CAA CCA GGT GTG GGT GGG CGA  CAA CCA GGA GTG GGC GGG CGA
15             CAA CCA GGC TTT GGC AAT      CAA CCA GGC TTT GGT AAT
16             CAA CCA GGC ATG GGT GGA CAA  CAA CCC GGC ATG GGT GGA CAA
17             CAA CCA GGT GTG GGT GGG CGA  CAA CCA GGC ATG GGT GGA CAA
18             CAG CCA GGC TTT GGT AAT      CAA CCA GGC TGG GGT AAT
19             CAA CCA GGT ATG GGT GGA AAC  CAA CCC GGT GTG GGT GGG CGA
20             CAA CCC GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA
21             CAA CCA GGC ATG GGC GGG CGA  CAA CCA GGA GTG GGC GGG CGA
22             CAA CCC GGC GTA GGT GGT CGA  CAA CCA GGT GTG GGT GGA CGA
23             CAA CCA GGC ATG GGT GGG CAG  CAA CCA GGC TTT GGT AAT
24                                          CAG CCA GGT GTG GGT GGA CGA
25                                          CAA CCA GGC ATG GGT GGA CAA
26                                          CAA CCA GGT ATG GGT GGA
27                                          CAA CCA GGA GTG GGC GGG CGA
28                                          CAA CCA GGT ATG GGA GGG CGA
29                                          CAA CCA GGC TTT GGT AAT
30                                          CAA CCA GGT GTG GGT GGG CGA
31                                          CAA CCA GGC ATG GGT GGC CAG


Appendix 1 (Continued)

SM50 Repeat #  S. droebachiensis            A. fragilis                  S. pallidus
               GGC CAA                      GGC CAA                      GGC CAA
1              CAG CCG GGC ATG GGA          CAA CCG GGC ATG GGA          CAA CCG GGC ATG GGA
2              CAA GGC GGC TTT GGT AAT CAA  CAA GGC GGC TTT GGT AAT CAA  CAA GGC GGC TTT GGT AAT CAA
3              CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
4              CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
5              CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA  CAA CCA GGC ATG GGT GGG CGA
6              CAA CCA GGC TGG GGT GGA CAA  CAA CCA GGC TGG GGT GGA CAA  CAA CCA GGC TGG GGT GGA CAA
7              CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGA GGG CGA  CAA CCA GGT GTG GGA GGG CGA
8              CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT
9              CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA
10             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
11             CAA CCA GGC TGG GGT AAT      CAA CCA GGC TGG GGT AAT      CAA CCA GGA GTG GGT GGG CGA
12             CAA CCC GGT GTG GGT GGG CGA  CAA CCC GGT GTG GGT GGG CGA  CAA CCA GGC TGG GGT AAT
13             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGA GTG GGT GGG CGA  CAA CCC GGT GTT GGT GGG CGA
14             CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGC TGG GGT AAT      CAA CCA GGG ATG GGT GGA CAA
15             CAA CCA GGC TTT GGT AAT      CAA CCC GGT GTG GGT GGG CGA  CAA CCA GGA GTG GGT GGG CAA
16             CAA CCC GGC ATG GGT GGA CAA  CAA CCA GGA GTG GGT GGG CGA  CAA CCA GGC TGG GGT AAT
17             CAA CCA GGT GTG GGA GGG CAA  CAA CCA GGC TTT GGT AAT      CAA CCC GGT GTT GGT GGG CGA
18             CAA CCA GGC TGG GGT AAT      CAA CCC GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
19             CAA CCC GGT GTG GGT GGG CGA  CAA CCA GGT GTG GGT GGA CGA  CAA CCA GGA GTG GGT GGG CGA
20             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT
21             CAA CCA GGT GTG GGT GGA CGA  CAA CCC GGT GTG GGT GGG CGA  CAA CCA GGT GTT GGT GGG CGA
22             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA
23             CAA CCA GGT GTG GGT GGA CGA  CAA CCA GGC TGG GGT AAT      CAA CCT GGT GTG GGT GGG CGA
24             CAA CCA GGC TTT GGT AAT      CAA CCC GGT GTG GGT AGG CGA  CAA CCA GGC ATG GGT GGA CAA
25             CAG CCA GGT GTG GGT GGA CAA  CAA CCA GGT GTG GGT GGA CGA  CAA CCA GGT GTG GGT GGA CGA
26             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC TTT GGT AAT
27             CAA CCA GGT GTG GGA GGG CGA  CAG CCA GGT GTG GGT GGA CGA  CAA CCA GGT GTG GGT GGA CGA
28             CAA CCA GGC TTT GGT AAT      CAA CCA GGC ATG GGT GGA CAA  CAA CCC GGC ATG GGT GGG CAA
29             CAA CCA GGT GTG GGT GGG CGA  CAA CCA GGT GTG GGA GGG CGA  CAA CCT GGC GTT GGA GGG CGA
30             CAA CCA GGC ATG GGT GGC CAG  CAA CCA GGC TTT GGT AAT      CAA CCA GGC TTT GGT AAT
31                                          CAA CCA GGT GTG GGT GGG CGA  CAA CCA GGT GTG GGT GGG CGA
32                                          CAA CCA GGC ATG GGT GGC CAG  CAA CCA GGC ATG GGT GGC CAG


Appendix 1 (Continued)

SM50 Repeat #  S. franciscanus              S. nudus                     P. depressus
               GGC CAA                      GGC CAA                      CCG GGC ATG GGA
1              CAA CCA GGC ATG GGA          CAA CCA GGC ATG GGT GGA      CCA GGC ATG GGT GGT CAA
2              CCG GGC ATG GGC ATG GGA      CAA CCA GGC ATG GGA GGG CGA  CAA CCA GGC ATG GGT GGG CGA
3              CCG GGC GGT GGT GGT CGA      CAA CCA GGC ATG GGC GGA CAA  CAA CCA GGC ATG GGT GGG CAA
4              CAA CCA GGC TTT GGG CAA      CAA CCA GGT TTT GGA GGG CGA  CAA CCA GGC ATG GGT GGG CAA
5              CAA CCA GGC ATG GGA GGG CGA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGG CGA
6              CAA CAA GGC ACG GGT GGG TGG  CAA CCA GGT TTT GGA GGG CGA  CAA CCA GGC ATG GGT GGA CAA
7              CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGT ATG GGT GGG CGA
8              CAA CCA GGC ATG GGA GGG CGA  CAG CCA GGT ATG GGA GGG CGA  CAA CCA GGC ATG GGT GGA CAA
9              CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGT GGA CAG  CAA CCA GGT ATG GGA GGG CGA
10             CAA CCA GGC ATG GGA GGG CGA  CAA CCA GGC ATG GGT GGA      CAA CCA GGC ATG GGT GGA CAG
11             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ATG GGA          CAA CCA GGC ATG GGT GGA CGA
12             CAA CCA GGC ATG GGA GGG CGA  CAA CCA GGC ATG GGA          CAA CCA GGC ATG GGT GGA CAG
13             CAA CCA GGC ATG GGT GGA CAA  CAA CCA GGC ACT GGA          CAA CCA GGC ATG GGT GGA CAA
14             CAA CCA GGT ATG GGA GGG CGA  CAA CCA GGC ATG GGC GGG CGA  CAA CCA GGC ATG GGT GGA CGA
15             CAA CCA GGC ATG AGT GGA CAG  CAA CCA GGC ATG GGT GGA CAG  CAA CCA GGC ATG GGT GGG CAG
16             CAA CCA GGC ATG GGT GGA CGA  CAA CCA GGC ATG GGT GGA CAG
17             CAA CCG GGC ATG GGT GGA CAG  CAA CCA GGC ATG GGT GGA
18             CAA CCA GAC ATG GGT GGA CGA  CGA CCA GGC ATG GGT GGG CAG
19             CAA CCA GGC ATG GGT GGG CAG


Appendix 1 (Continued)

SM50 Repeat #  L. pictus                        L. variegatus
               GGT CAA                          GGT CAA
1              CAA CCT GGC TTC GGT GGG CAA      CAA CCT GGC ATC GGC GGG CAA
2              CAA CCT GGC TTC GGC GGA CGA      CAA CCT GGC TTC GGC GGG CAA
3              CAA CCT GGC TTC GGT GGG CAA      CAA CCT GGC GTC GGC GGA CGA
4              CAA CCT GGC TTC GGG CAA          CAA CCT GGC TTC GGT GGG CAA
5              CAA CCT GGC TTC GGG GGG CGA      CAA CCT GGC TTC GGC GGG CAA
6              CAA CCT GGC TTC GGG GGG CGA      CAA CCT GGC TTC GGC GGG CGA
7              CAA CCT GGC TTC GGC GGG CAA      CAA CCT GGC TTC GGT GGG CAA
8              CAA CCT GGC TTT GGG GGA CAA      CAA CCT GGC TTC GGC GGG CAG
9              CAA CCT GGC TTC GGC GGG CAA      CAA CCT GGC TTC GGT GGG CAA
10             CAA CCT GGC TTT GGC GGG CAG      CAA CCT GGC TTC GGC GGG CAA
11             CAG CCT GGC TTT GGG GGA CAA      CAA CCT GGC TTC GGT GGG CAA
12             CAA CCT GGC TTT GGC GGG CAG      CAA CCT GGC TTT GGG GGA CAA
13             CAG CCT GGC TTT GGG GGA CAA      CAA CCG GGT TTT GGT GGG GGA CCA
14             CAA CCG GGT TTT GGT GGG GGA CCA  CAA CGA CCT GGC ATG GGG GGA
15             CAA CGA CCT GGC ATG GGG GGA
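The codon usage frequencies mentioned in the Appendix 1 caption can be tallied directly from repeat units written as space-separated codons. The sketch below is illustrative only: the helper names `codon_counts` and `relative_usage` are hypothetical, the synonym table covers just glutamine, proline, and glycine, and the three sample units are S. purpuratus repeats 3-5 from Appendix 1. It is not the software or procedure used in the thesis.

```python
from collections import Counter

# Synonymous codon families for three residues common in the SM50 repeats
# (standard genetic code; other residues are omitted in this sketch).
SYNONYMS = {
    "Q": ["CAA", "CAG"],
    "P": ["CCA", "CCC", "CCG", "CCT"],
    "G": ["GGA", "GGC", "GGG", "GGT"],
}


def codon_counts(units):
    """Count every codon across a list of space-separated repeat units."""
    counts = Counter()
    for unit in units:
        counts.update(unit.split())
    return counts


def relative_usage(counts, amino_acid):
    """Fraction of each synonymous codon among all codons for one residue."""
    family = SYNONYMS[amino_acid]
    total = sum(counts[codon] for codon in family)
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in family}


# S. purpuratus repeats 3-5 as listed in Appendix 1.
sample_units = [
    "CAA CCA GGC ATG GGT GGG CGA",
    "CAA CCA GGC TTT GGT AAT",
    "CAA CCA GGA ATG GGT GGG CGA",
]
counts = codon_counts(sample_units)
glycine_usage = relative_usage(counts, "G")
print(glycine_usage)
```

Running the same tally over a full column of the table, rather than three units, would give the per-species frequencies; the three-unit sample is only there to keep the example short.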


Appendix 2: Amino acid sequences of the SM50 repeat array used in codon usage and dot plot analysis.

SM50 Repeat #  H. pulcherrimus    S. purpuratus      S. droebachiensis  A. fragilis        S. pallidus
               G Q                G Q                G Q                G Q                G Q
1              Q P G M G          Q P G M G          Q P G M G          Q P G M G          Q P G M G
2              Q G F G N Q        Q G G F G N Q      Q G G F G N Q      Q G G F G N Q      Q G G F G N Q
3              Q P G F G N        Q P G M G G R      Q P G M G G R      Q P G M G G R      Q P G M G G R
4              Q P G M G G R      Q P G F G N        Q P G F G N        Q P G F G N        Q P G F G N
5              Q P G F G N        Q P G M G G R      Q P G M G G R      Q P G M G G R      Q P G M G G R
6              Q P G M G G R      Q P G F G N        Q P G W G G Q      Q P G W G G Q      Q P G W G G Q
7              Q P G F G N        Q P G M G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R
8              Q P G M G G R      Q P G W G N        Q P G W G N        Q P G W G N        Q P G W G N
9              Q P G F G N        Q P G V G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R
10             Q P G V G G R      Q P G M G G Q      Q P G M G G Q      Q P G M G G Q      Q P G M G G Q
11             Q P G F G N        Q P G W G N        Q P G W G N        Q P G W G N        Q P G V G G R
12             Q P G M G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R      Q P G W G N
13             Q P G F G N        Q P G M G G        Q P G M G G Q      Q P G V G G R      Q P G V G G R
14             Q P G V G G R      Q P G V G G R      Q P G V G G R      Q P G W G N        Q P G M G G Q
15             Q P G F G N        Q P G F G N        Q P G F G N        Q P G V G G R      Q P G V G G Q
16             Q P G M G G Q      Q P G M G G Q      Q P G M G G Q      Q P G V G G R      Q P G W G N
17             Q P G V G G R      Q P G M G G Q      Q P G V G G Q      Q P G F G N        Q P G V G G R
18             Q P G F G N        Q P G W G N        Q P G W G N        Q P G M G G Q      Q P G M G G Q
19             Q P G M G G N      Q P G V G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R
20             Q P G M G G Q      Q P G M G G        Q P G M G G Q      Q P G M G G Q      Q P G F G N
21             Q P G M G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R
22             Q P G V G G R      Q P G V G G R      Q P G M G G Q      Q P G M G G Q      Q P G M G G Q
23             Q P G M G G Q      Q P G F G N        Q P G V G G R      Q P G W G N        Q P G V G G R
24             Q P G M G G R      Q P G V G G R      Q P G F G N        Q P G V G R R      Q P G M G G Q
25             Q P G M G G Q      Q P G M G G Q      Q P G V G G Q      Q P G V G G R      Q P G V G G R
26                                Q P G M G G        Q P G M G G Q      Q P G M G G Q      Q P G F G N
27                                Q P G V G G R      Q P G V G G R      Q P G V G G R      Q P G V G G R
28                                Q P G M G G R      Q P G F G N        Q P G M G G Q      Q P G M G G Q
29                                Q P G F G N        Q P G V G G R      Q P G V G G R      Q P G V G G R
30                                Q P G V G G R      Q P G M G G Q      Q P G F G N        Q P G F G N
31                                Q P G M G G Q                         Q P G V G G R      Q P G V G G R
32                                                                      Q P G M G G Q      Q P G M G G Q


Appendix 2 (Continued)

SM50 Repeat #  S. franciscanus    S. nudus           P. depressus       L. pictus          L. variegatus
               G Q                G Q                P G M G            G Q                G Q
1              Q P G M G          Q P G M G G        P G M G G Q        Q P G F G G Q      Q P G I G G Q
2              P G M G M G        Q P G M G G R      Q P G M G G R      Q P G F G G R      Q P G F G G Q
3              P G G G G R        Q P G M G G Q      Q P G M G G Q      Q P G F G G Q      Q P G V G G R
4              Q P G F G Q        Q P G F G G R      Q P G M G G Q      Q P G F G Q        Q P G F G G Q
5              Q P G M G G R      Q P G M G G Q      Q P G M G G R      Q P G F G G R      Q P G F G G Q
6              Q Q G T G G W      Q P G F G G R      Q P G M G G Q      Q P G F G G R      Q P G F G G R
7              Q P G M G G Q      Q P G M G G Q      Q P G M G G R      Q P G F G G Q      Q P G F G G Q
8              Q P G M G G R      Q P G M G G R      Q P G M G G Q      Q P G F G G Q      Q P G F G G Q
9              Q P G M G G Q      Q P G M G G Q      Q P G M G G R      Q P G F G G Q      Q P G F G G Q
10             Q P G M G G R      Q P G M G G        Q P G M G G Q      Q P G F G G Q      Q P G F G G Q
11             Q P G M G G Q      Q P G M G          Q P G M G G R      Q P G F G G Q      Q P G F G G Q
12             Q P G M G G R      Q P G M G          Q P G M G G Q      Q P G F G G Q      Q P G F G G Q
13             Q P G M G G Q      Q P G T G          Q P G M G G Q      Q P G F G G Q      Q P G F G G G P
14             Q P G M G G R      Q P G M G G R      Q P G M G G R      Q P G F G G G P    Q R P G M G G
15             Q P G M S G Q      Q P G M G G Q      Q P G M G G Q      Q R P G M G G
16             Q P G M G G R      Q P G M G G Q
17             Q P G M G G Q      Q P G M G G
18             Q P D M G G R      R P G M G G Q
19             Q P G M G G Q
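Because the amino acid repeats in Appendix 2 fall into a small number of unit types, a self-comparison dot plot at the level of whole repeat units makes the periodicity visible. The following is a simplified sketch under stated assumptions: the functions `dot_plot` and `render` are hypothetical helpers, matching is by exact identity of whole units rather than by a sliding window over residues or bases, and the sample is S. droebachiensis repeats 8-13. The dot plot settings actually used in the thesis are not given in this section.

```python
def dot_plot(units_a, units_b):
    """Boolean matrix: cell [i][j] is True when unit i of A equals unit j of B."""
    return [[a == b for b in units_b] for a in units_a]


def render(matrix):
    """Draw the matrix as text, '*' for a match and '.' for a mismatch."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# S. droebachiensis repeats 8-13 from Appendix 2, written without spaces.
units = ["QPGWGN", "QPGVGGR", "QPGMGGQ", "QPGWGN", "QPGVGGR", "QPGMGGQ"]
matrix = dot_plot(units, units)
print(render(matrix))
```

In this sample the off-diagonal matches sit three units away from the main diagonal, which reflects the three-unit period (W-type, V-type, M-type) in that stretch of the array.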


Appendix 3: Amino acid sequences of the SM50 repeat array used in concerted evolution model analysis.

SM50 Repeat #  S.droWAdmb30    S.droWAdmb29a   S.droWAdmb28a
1              Q P G M G       Q P G M G       Q P G M G
2              Q G G F G N Q   Q G G F G N Q   Q G G F G N Q
3              Q P G M G G R   Q P G M G G R   Q P G M G G R
4              Q P G F G N     Q P G F G N     Q P G F G N
5              Q P G M G G R   Q P G M G G R   Q P G M G G R
6              Q P G W G G Q   Q P G W G G Q   Q P G W G G Q
7              Q P G V G G R   Q P G V G G R   Q P G V G G R
8              Q P G W G N     Q P G W G N     Q P G W G N
9              Q P G V G G R   Q P G V G G R   Q P G V G G R
10             Q P G M G G Q   Q P G M G G Q   Q P G M G G Q
11             Q P G W G N     Q P G W G N     Q P G W G N
12             Q P G V G G R   Q P G V G G R   Q P G V G G R
13             Q P G M G G Q   Q P G M G G Q   Q P G M G G Q
14             Q P G V G G R   Q P G V G G R   Q P G V G G R
15             Q P G F G N     Q P G F G N     Q P G F G N
16             Q P G M G G Q   Q P G M G G Q   Q P G M G G Q
17             Q P G V G G Q   Q P G V G G R   Q P G V G G Q
18             Q P G W G N     Q P G W G N     Q P G W G N
19             Q P G V G G R   Q P G M G G Q   Q P G V G G R
20             Q P G M G G Q   Q P G V G G R   Q P G M G G Q
21             Q P G V G G R   Q P G M G G Q   Q P G V G G R
22             Q P G M G G Q   Q P G V G G R   Q P G F G N
23             Q P G V G G R   Q P G F G N     Q P G V G G Q
24             Q P G F G N     Q P G V G G Q   Q P G M G G Q
25             Q P G V G G Q   Q P G M G G Q   Q P G V G G R
26             Q P G M G G Q   Q P G V G G R   Q P G F G N
27             Q P G V G G R   Q P G F G N     Q P G V G G R
28             Q P G F G N     Q P G V G G R   Q P G M G G Q
29             Q P G V G G R   Q P G M G G Q
30             Q P G M G G Q


Appendix 3 (Continued)

SM50 Repeat #  S.palWAfd32     S.palWAfd24     S.palNor30      S.palNor27
1              Q P G M G       Q P G M G       Q P G M G G R   Q P G M G G R
2              Q G G F G N Q   Q G G F G N Q   Q P G W G G Q   Q P G W G G Q
3              Q P G M G G R   Q P G M G G R   Q P G V G G R   Q P G V G G R
4              Q P G F G N     Q P G F G N     Q P G W G N     Q P G W G N
5              Q P G M G G R   Q P G M G G R   Q P G V G G R   Q P G V G G R
6              Q P G W G G Q   Q P G W G G Q   Q P G M G G Q   Q P G M G G Q
7              Q P G V G G R   Q P G V G G R   H P G V G G R   Q P G V G G R
8              Q P G W G N     Q P G W G N     Q P G W G N     Q P G W G N
9              Q P G V G G R   Q P G V G G R   Q P G V G G R   Q P G V G G R
10             Q P G M G G Q   Q P G M G G Q   Q P G M G G Q   Q P G M G G Q
11             Q P G V G G R   Q P G V G G R   Q P G V G G R   Q P G V G G R
12             Q P G W G N     Q P G F G N     Q P G W G N     Q P G W G N
13             Q P G V G G R   Q P G V G G R   Q P G V G G Q   Q P G V G G Q
14             Q P G M G G Q   Q P G M G G Q   Q P G V G G R   Q P G V G G R
15             Q P G V G G Q   Q P G V G G R   Q P G W G N     Q P G W G N
16             Q P G W G N     Q P G M G G Q   Q P G V G G R   Q P G V G G R
17             Q P G V G G R   R P G V G G R   Q P G M G G Q   Q P G M G G Q
18             Q P G M G G Q   Q P G F G N     Q P G V G G R   Q P G V G G R
19             Q P G V G G R   Q P G V G G R   Q P G F G N     Q P G F G N
20             Q P G F G N     Q P G M G G R   Q P G V G G G   Q P G V G G R
21             Q P G V G G R   Q P G V G G R   Q P G M G G Q   Q P G M G G Q
22             Q P G M G G Q   Q P G F G N     Q P G V G G R   Q P G V G G R
23             Q P G V G G R   Q P G V G G R   Q P G M G G Q   Q P G M G G Q
24             Q P G M G G Q   Q P G M G G Q   Q P G V G G R   Q P G V G G R
25             Q P G V G G R                   Q P G M G G Q   Q P G M G G Q
26             Q P G F G N                     Q P G V G G R   Q P G M G G R
27             Q P G V G G R                   Q P G M G G Q   Q P G M G G Q
28             Q P G M G G Q                   Q P G M G G Q
29             Q P G V G G R                   Q P G M G G R
30             Q P G F G N                     Q P G M G G Q
31             Q P G V G G R
32             Q P G M G G Q


Appendix 3 (Continued)

SM50 Repeat #  S.nudJP18       S.nudJP17
1              Q P G M G G     Q P G M G G
2              Q P G M G G R   Q P G M G G R
3              Q P G M G G Q   Q P G M G G Q
4              Q P G F G G R   Q P G F G G R
5              Q P G M G G Q   Q P G M G G Q
6              Q P G F G G R   Q P G F G G R
7              Q P G M G G Q   Q P G M G G Q
8              Q P G M G G R   Q P G M G G R
9              Q P G M G G Q   Q P G M G G Q
10             Q P G M G G     Q P G M G G
11             Q P G M G       Q P G M G
12             Q P G M G       Q P G M G
13             Q P G T G       Q P G T G
14             Q P G M G G R   Q P G M G G R
15             Q P G M G G Q   Q P G M G G Q
16             Q P G M G G Q   Q P G M G G
17             Q P G M G G     R P G M G G Q
18             R P G M G G Q


Appendix 4: Summary of alleles used in study.

DNA sample names and collection locations are listed for samples sequenced during this study. When two different alleles were found in a single individual, the DNA sample name ends in a lowercase letter. Number of repeats indicates the number of 5-7 amino acid SM50 repeats present in the SM50 repeat array. In S. droebachiensis, alleles with different patterns of 6 and 7 amino acid SM50 repeats were found and classified differently. The final allele name is a combination of the species, location, and number of SM50 repeats.

Species            Name of DNA Sample           Location collected from     # of Repeats  Allele Name
H. pulcherrimus                                 GenBank # S48755            25
S. purpuratus      2, 3, 4, 5, 6, 7, 8, 10, 12  Pt. Arena, CA               31
                   13, 14, 15, 16               Ft. Bragg, CA               31
                   20, 21, 1E4, 6C4, 6G4, 6H4   Orange County, CA           31
                   11B4, 11D4,                  San Clemente                31
                   S. purp                      GenBank #M16231             31
S. droebachiensis  2E2                          FHL, WA. "Dead Man's Bay"   30            S.droWA30
                   4A1                          FHL, WA. "Dead Man's Bay"   29            S.droWA29a
                   2G2, 2H2s                    FHL, WA "Embryology class"  29            S.droWA29a
                   2E1                          Juneau, AK. "The Shrine"    29            S.droAK29a
                   4G1                          FHL, WA. "Dead Man's Bay"   28            S.droWA28a
                   3G1a                         Juneau, AK. "The Shrine"    28            S.droAK28a
                   2H2c                         FHL, WA "Embryology class"  28            S.droWA28b
                   3G1b                         Juneau, AK. "The Shrine"    28            S.droAK29b
A. fragilis                                     CA, Vacquier Lab"           32
S. pallidus        3B2c8, 2D2c1, 3E2            FHL, WA. "Ferry Dock"       32            S.palWA32
                   3B2c9                        FHL, WA. "Ferry Dock"       24            S.palWA24
                   4A2a                         Norway 0 Population"        27            S.palNor27
                   4A2b                         Norway 0 Population"        30            S.palNor30
S. franciscanus                                 Meeds et al, 2001           19
S. nudus           snu1N1, snu2N1a              Japan,                      18            S.nudJP18
                   snu2N1b                                                  17            S.nudJP17
P. depressus                                    Korea,                      15
L. pictus                                       GenBank # X59616.           15
L. variegatus                                   Meeds et al, 2001           14
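The allele naming rule stated in the Appendix 4 caption (species, location, number of SM50 repeats) can be illustrated with a minimal Python sketch. The function name, its parameters, and the reading of the trailing lowercase letter as a label for a distinct repeat pattern are assumptions made for illustration; they are not part of the thesis.

```python
def allele_name(genus_initial: str, species_abbr: str, location_code: str,
                n_repeats: int, pattern_suffix: str = "") -> str:
    """Compose an allele name in the style used in Appendix 4.

    All parameter names are hypothetical. The optional pattern_suffix is
    assumed to distinguish alleles with the same repeat count but a
    different pattern of repeat units (e.g. 'a' versus 'b').
    """
    if n_repeats < 0:
        raise ValueError("number of repeats cannot be negative")
    # Species part, then location code, then repeat count, then suffix.
    return f"{genus_initial}.{species_abbr}{location_code}{n_repeats}{pattern_suffix}"


# Names rebuilt from rows of the Appendix 4 table.
print(allele_name("S", "dro", "WA", 29, "a"))
print(allele_name("S", "pal", "Nor", 27))
print(allele_name("S", "nud", "JP", 18))
```

Keeping the repeat count as an integer until the final formatting step makes it easy to sort or group alleles by array length before the names are built.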


MARC Record

Leader:
nam Ka
Control Number (001):
001709521
Control Number Identifier (003):
fts
Date and Time of Latest Transaction (005):
20060614112154.0
Fixed-Length Data Elements, Additional Material Characteristics (006):
m||||e|||d||||||||
Physical Description Fixed Field (007):
cr mnu|||uuuuu
Fixed-Length Data Elements (008):
060516s2005 flu sbm s000 0 eng d
Other Standard Identifier (024):
E14-SFE0001401
System Control Number (035):
(OCoLC)68903378
SFE0001401
Cataloging Source (040):
FHM (transcribing agency: FHM)
Local Holdings (049):
FHMM
Call Number (090):
QH307.2 (Online)
Main Entry, Personal Name (100):
Hussain, Sofia.
Title Statement (245):
Concerted evolution in SM50, a gene with unusual repeat structure [electronic resource] / by Sofia Hussain.
Publication (260):
[Tampa, Fla.] : University of South Florida, 2005.
Dissertation Note (502):
Thesis (M.S.)--University of South Florida, 2005.
Bibliography Note (504):
Includes bibliographical references.
Type of Computer File or Data Note (516):
Text (Electronic thesis) in PDF format.
System Details Note (538):
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
General Note (500):
Title from PDF of title page.
Document formatted into pages; contains 124 pages.
Summary (520):
ABSTRACT: Genes present in multiple copies and genes that contain regions of repetitive sequences can undergo concerted evolution, which results in homogenization of the nucleotide sequence of the genes or repetitive regions. In regions of tandem repeats, this occurs through misalignment of repeat units followed by unequal crossover, which generates two products with differing numbers of repeat units. Gene conversion is thought to lead to one of these products becoming fixed in a species. The homogeneous sequence of previously studied genes that have been thought to undergo this process has made it difficult to determine the exact models involved. Here I examine concerted evolution in SM50, a sea urchin gene that encodes a protein involved in biomineralization. The repetitive region in the SM50 gene varies in length between species, and there is variability in each repeat unit as well. I examine the codon usage in SM50 in a variety of species, and discuss how purifying selection, substitutions, concerted evolution, and selection at the level of DNA sequence have played a role in the evolution of this gene. I also examine the structure and sequence of the repeat units, and propose models that have led to the evolution of the repeat pattern seen in the different species examined. Finally, I have found variation in the number of repeat units within several species. This has allowed us to deduce the specific models of unequal crossover that led to this variation. The unique variation in the repetitive region of SM50 has enabled us to describe a model of how substitutions affect the model of misalignment and unequal crossover.
Local Note (590):
Adviser: Brian T. Livingston.
Index Terms, Uncontrolled (653):
Molecular evolution.
Sea urchin.
DNA.
Spicule matrix genes.
Neutral evolution.
Local Subject (690):
Dissertations, Academic -- USF -- Biology -- Masters.
Host Item (773):
USF Electronic Theses and Dissertations.
Electronic Location and Access (856):
http://digital.lib.usf.edu/?e14.1401