Autosomal Microsatellite and mtDNA Genetic Analysis
in Sicily (Italy)
V. Romano1,a, F. Calì2, A. Ragalmuto2, R. P. D’Anna1,a, A. Flugy1,a, G. De Leo1,a, O. Giambalvo3,
A. Lisa7, O. Fiorani7, C. Di Gaetano5, A. Salerno1,b,4, R. Tamouza6, D. Charron6, G. Zei7,
G. Matullo5 and A. Piazza5,*
1,aDipartimento di Biopatologia e Metodologie Biomediche, Università di Palermo, Via Divisi 83 and 1,bCorso Tukory 211, Palermo,
Italy
2Istituto OASI (I.R.C.C.S.), Via Conte Ruggero 73, Troina (EN), Italy
3Dipartimento di Metodi Quantitativi per le Scienze Umane, Facoltà di Economia, Università di Palermo, Viale delle Scienze,
Palermo, Italy
4Istituto di Metodologie Diagnostiche CNR, Palermo, Italy
5Dipartimento di Genetica, Biologia e Biochimica, Università di Torino, Via Santena 19, Torino, Italy
6Laboratoire d’Immunologie et d’Histocompatibilité, AP-HP, IUH and INSERM U396, Hôpital Saint-Louis, Paris, France
7Istituto di Genetica Molecolare, CNR, Pavia, Italy
Summary
DNA samples from 465 blood donors living in 7 towns of Sicily, the largest island of Italy, have been collected
according to well defined criteria, and their genetic heterogeneity tested on the basis of 9 autosomal microsatellite
and mitochondrial DNA polymorphisms, for a total of 85 microsatellite allele and 10 mtDNA haplogroup frequencies.
A preliminary account of the results shows that: a) the samples are genetically heterogeneous; b) the first principal
coordinates of the samples are correlated more with their longitude than with their latitude, and this result is even
more remarkable when one outlier sample (Butera) is not considered; c) distances among samples calculated from
allele and haplogroup frequencies and from the isonymy matrix are weakly correlated (r = 0.43, P = 0.06) but such
correlation disappears (r = 0.16) if the mtDNA haplogroups alone are taken into account; d) mtDNA haplogroups
and microsatellite distances suggest settlements of people occurred at different times: divergence times inferred from
microsatellite data seem to describe a genetic composition of the town of Sciacca mainly derived from settlements
after the Roman conquest of Sicily (First Punic war, 246 BC), while all other divergence times take root from the
second to the first millennium BC, and therefore seem to backdate to the pre-Hellenistic period.
A more reliable association of these diachronic genetic strata with different historical populations (e.g. Sicani, Elymi,
Siculi), if possible, must await the analysis of more samples and, hopefully, of more informative uniparental
DNA markers such as the recently available DHPLC-SNP polymorphisms of the Y chromosome.
Introduction
The first evidence of human presence in Sicily, the largest of the Mediterranean islands (25,708 sq. km),
can be traced back to the Paleolithic (Tusa, 1983); since then the island has been settled by Neolithic farmers
from Anatolia and the Near East, by Italic peoples from the Italian peninsula, and by Phoenicians, Greeks,
Romans, Byzantines, Arabs, and Normans (Finley, 1968). Whether these invasions and settlements had a
real demographic impact on the structure of the population has been only a matter of speculation, mostly based
on the study of material culture and literary sources. Because of the complex history and prehistory of Sicily, its
genetic history has attracted the interest of some scholars (e.g., Piazza et al. 1988; Rickards et al. 1998), but still very
little is known.
∗Corresponding Author: Alberto Piazza, Dipartimento di Genetica, Biologia e Biochimica, Via Santena 19, 10126 Torino,
Italy. Tel: +39-011-6706650; Fax: +39-011-6706582. E-mail: alberto.piazza@unito.it
Annals of Human Genetics (2003) 67, 42–53. © University College London 2003
The present study surveys seven samples from Sicily
using molecular markers of the nuclear and mitochondrial
genomes: because of their higher resolution,
these markers should indeed be more reliable than “classical”
markers for comparing geographically (and genetically)
close populations. Surname data have also been
collected and analysed: their transmission and
differentiation, which mimic those of male-specific traits, can
be usefully compared with the maternally transmitted mitochondrial
types.
Materials and Methods
The Samples
Blood donors belonged to seven small towns from different
parts of Sicily (see Fig. 1), selected because they
share historical, ethnic and archaeological interest. The
geographical part of Sicily, the towns (with latitude and
longitude), the provinces which they belong to and the
sample sizes are as follows:
North-eastern: Troina (37.49N, 14.36E), province of Enna, 111 individuals.
South-western: Sciacca (37.31N, 14.03E), Agrigento, 89.
North-western: Castellammare del Golfo (38.01N, 12.40E), Trapani, 64; Caccamo (37.56N, 13.40E), Palermo, 52.
Central: Piazza Armerina (37.23N, 14.22E), Enna, 44.
Central-south: Butera (37.11N, 14.11E), Caltanissetta, 47.
South-eastern: Ragusa (36.55N, 14.36E), Ragusa, 58.
Figure 1 Geographical map of Sicily (scale bar: 50 km) showing the location of
the seven samples analysed in this paper and the provinces they
belong to (in capital letters).
The criterion by which each blood donor was selected
for this study was that the birthplaces of his or
her maternal and paternal grandparents were the same
as that of the donor. All the donors were informed about
the aims of this study and signed a consent form.
Knowledge of the possible prehistoric and historical
settlers of the places where the samples were collected
is summarized in Table 1. A special case is that of
Caccamo. Speculations on its origin have been based
on possible etymologies of its name: from Greek
κακκαβη and Latin caccabus (meaning “pot”), or from
Carthaginian caccabe (meaning “head of horse”). It is
likely, however, that it was inhabited earlier than these
sources may suggest. The Greek and Arab presence in
this town is documented by many toponyms. The Normans
gave the town its present structure in 1093,
when Count Roger put it under the jurisdiction of the
Agrigento church.
DNA Analysis
The following nine autosomal microsatellite polymorphisms
were analysed: TH01 (Polymeropoulos et al.
1991a), vWA31/A (Kimpton et al. 1992), FES/FPS
(Polymeropoulos et al. 1991b), F13A01 (Polymeropoulos
et al. 1991b), TPOX (Anker et al. 1992), FGA
(Mills et al. 1992), CSF1PO (Hammond et al. 1994),
PAH-STR (Goltsov et al. 1993), and LIPOL (Zuliani &
Hobbs, 1990). Detailed information on these polymorphisms
is given in Table 2.
DNA was extracted from peripheral blood as previously
described (Calì et al. 1997). The STR polymorphism
of the PAH locus was typed as described
by Zschocke et al. (1994). PCR analysis of the
loci HUMvWA31/A, HUMFES/FPS,
HUMTH01, F13A01, TPOX, FGA and CSF1PO was
performed as described in the ABI PRISM™ STR
Primer Set protocol (Perkin Elmer, USA). The PCR reaction
for LIPOL-STR was performed in 50 μl containing:
5 ng of genomic DNA; 1 U Taq DNA polymerase
(Perkin Elmer, USA); 5 μl of 10X reaction buffer (20 mM
Tris-HCl pH 8, 100 mM KCl, 0.1 mM EDTA, 1 mM
DTT, 50% glycerol, 0.5% Tween 20, 0.5% Nonidet
P40); 1.5 mM MgCl2; 0.2 mM of each dNTP; 0.2 μM
of each primer. One primer was modified by the addition
of a dye label (6-FAM: 6-carboxyfluorescein) at the
5′ end. Primer sequences used for PCR of LIPOL STR
are as described in Zuliani & Hobbs (1990). Conditions
used for PCR (LIPOL) were as follows: 95 °C for 2 min,
followed by 28 cycles of 95 °C for 45 s, 63 °C for 30 s
and 72 °C for 30 s. At the end of the 28 cycles samples
were kept at 72 °C for 10 min. 1 μl
of the PCR products was diluted in 12 μl of deionised
formamide plus 1 μl of GeneScan 350 Rox (molecular
weight DNA marker), denatured at 95 °C for 3 min,
cooled on melting ice, and loaded on an ABI PRISM 310
Genetic Analyzer (Perkin Elmer, USA). The fragment
sizes were analysed by the GeneScan software (Perkin
Elmer, USA). To make allele typing easier and to establish
the exact number of tetranucleotide repeat units,
at least two alleles for each locus were sequenced by
conventional techniques.

Table 1 Some data on the sampling places
Sampling place           Current      Altitude      Earliest documented settlement:              Historical settlements
                         inhabitants  (meters asl)  when (who)
Butera                   6,300        402           Early Bronze Age                             Greeks from Crete; Arabs (854 AD); Normans (1089 AD); Lombardi from North-Italy (1161 AD)
Caccamo                  9,000        521           ?                                            Greeks; Arabs; Normans (1093 AD)
Castellammare del Golfo  15,000       26            Mesolithic – Neolithic (Elymi, Phoenicians)  Greeks
Piazza Armerina          22,000       697           Early Bronze Age (Siculi)                    Greeks; Romans; Byzantines; Arabs; Lombardi from North-Italy (1161 AD, presence of Gallo-Italic dialect)
Ragusa                   67,000       502           Early Bronze Age (Siculi)                    Greeks; Arabs (868 AD); Normans (1091 AD)
Sciacca                  40,000       60            Early Neolithic                              Greeks; Romans; Arabs (814 AD)
Troina                   10,000       1,120         Early Bronze Age (Siculi)                    Normans (XI century AD)

Table 2 Investigated polymorphic STR loci
Polymorphism  Gene symbol  Definition                               Chromosomal location  Repeat sequence  Number of alleles
FGA           FGA          Human fibrinogen alpha chain gene        4q28                  AAAG             15
F13A01        F13A1        Human coagulation factor XIII            6p24-p25              AAAG             15
vWA31/A       VWF          Human von Willebrand factor gene         12p12-qter            AGAT             9
TH01          TH1          Human tyrosine hydroxylase gene          11p15.5               AATG             6
FES-FPS       FES          Human c-fes/fps proto-oncogene           15q25-qter            ATTT             7
TPOX          TPO          Human thyroid peroxidase                 2p13                  AATG             8
CSF1PO        CSF1R        Human c-fms proto-oncogene               5q33.5-p34            AGTA             8
LIPOL         LPL          Human lipoprotein lipase                 8p22                  AATG             7
PAH-STR       PAH          Human phenylalanine hydroxylase gene     12q22-q24.2           TCTA             9
Typing of the 10 mtDNA haplogroups (H, V, T, J, U,
K, X, I, M, L1/L2) which characterize most European
populations is described by Torroni et al. (1996, 1998)
and references therein.
Data and Data Analysis
Genetic Data
Input data for all the analyses which follow are allele
frequencies estimated by gene counting in each sample.
Depending on the number of tested samples and polymorphisms
the data sets are slightly different. The most
complete data set is formed by the seven samples described
above, and 85 microsatellite allele frequencies,
and it will be referred to by the acronym SIC0785. As all
samples with the exception of Butera were tested for the
presence of 10 mtDNA haplogroups, the data set which
includes the frequencies of those haplogroups will be
referred to as SIC0695. To place Sicily genetically
within the Mediterranean we added original
data from Algeria (43 individuals), Egypt (43 individuals)
and Turkey (33 individuals), which were typed for the
85 microsatellite alleles: the resulting dataset will be referred
to as MED1085. These original data are available
on request.
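Gene counting, as used above, simply tallies the two gene copies carried by each diploid individual. The following is a minimal sketch of that tally; the function name `allele_frequencies` and the four TH01 genotypes are hypothetical and serve only as an illustration, not as data from this study.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Estimate allele frequencies at one locus by gene counting.

    `genotypes` is a list of (allele_a, allele_b) pairs, one per diploid
    individual; every individual contributes two gene copies.
    """
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())  # equals 2 * number of individuals
    return {allele: n / total for allele, n in sorted(counts.items())}

# Hypothetical genotypes (repeat numbers) for four donors at one STR locus.
sample = [(6, 9), (6, 6), (7, 9), (9, 9)]
freqs = allele_frequencies(sample)
print(freqs)
```

Applied to each locus and each town in turn, a tally of this kind yields the frequency vectors that make up datasets such as SIC0785.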
Isonymy
Isonymy matrices, whose elements are the probability
that two individuals belonging to different samples
(towns) have the same surname, are based on the surnames
of telephone directories of the year 1993, after
commercial and company surnames were eliminated.
The numbers of ascertained surnames, of different surnames,
and the percentage of individuals carrying a surname
of possible Greek origin (collected in Rohlfs,
1984), over the total of ascertained surnames have been
found as follows: Troina (2995 ascertained surnames
of which 476 are different and 11% of possible Greek
origin), Sciacca (12147, 1774, 7%), Castellammare del
Golfo (5178, 1079, 8%), Piazza Armerina (6802, 1501,
10%), Butera (1732, 457, 7%), Caccamo (2351, 515,
8%) and Ragusa (24379, 2921, 10%).
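The isonymy between two towns, as defined above, can be computed as the sum over shared surnames of the product of their relative frequencies in the two lists. The sketch below assumes this standard estimator; the function name `isonymy` and the short surname lists are hypothetical.

```python
from collections import Counter

def isonymy(surnames_a, surnames_b):
    """Probability that one person drawn at random from each list
    carries the same surname: sum over surnames of p_a * p_b."""
    freq_a = Counter(surnames_a)
    freq_b = Counter(surnames_b)
    n_a, n_b = len(surnames_a), len(surnames_b)
    shared = freq_a.keys() & freq_b.keys()  # surnames present in both towns
    return sum(freq_a[s] / n_a * freq_b[s] / n_b for s in shared)

# Hypothetical directory extracts for two towns.
town_a = ["Russo", "Russo", "Greco", "Amato"]
town_b = ["Russo", "Greco", "Greco", "Leone", "Leone"]
value = isonymy(town_a, town_b)
print(value)
```

Restricting both lists to surnames of possible Greek origin before calling the function would give the second kind of entry considered in the isonymy analysis.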
Analysis
Very simply stated, the general goal of our analysis is
to test whether the Sicilian samples we have surveyed
are genetically heterogeneous and to look for possible
reasons for this. A deeper understanding of genetic data
involves a series of tests of hypotheses, estimates of genetic
parameters and graphic displays, which form the
output of many computer packages currently available.
The references of those used, and to what purpose, are
as follows:
1. Tests of Hardy-Weinberg equilibrium and estimates
of parameters of genetic structure were performed
by using the GDA (Genetic Data Analysis) computer
program written by Lewis & Zaykin (2001).
2. A (multivariate) analysis of molecular variance
(AMOVA) to take into account the inter-individual
variability within samples was performed using the
ARLEQUIN computer package (Schneider et al.
2000).
3. The Mantel (1967) test to compare matrices of (genetic,
isonymy, etc.) distances between samples was performed
by using the NTSYSpc v. 2.02h computer
package (Exeter Software).
4. Phylogenetic trees reconstructed by the maximum
likelihood method developed by Felsenstein (1988),
statistical bootstrap tests (Efron & Tibshirani, 1993)
and displays of the trees were performed using the
PHYLIP computer package (Felsenstein, 2000).
5. Gene and genotypic differentiation for microsatellites
from all samples were calculated by Genepop
v3.1d, a software package designed by Michel
Raymond and François Rousset. The latest version
is available from the web site http://www.cefe.cnrsmop.fr/.
6. Genetic distances (δμ)² for microsatellite data according
to Goldstein et al. (1995) were calculated by
the computer code Microsat 2 (written by E. Minch,
available at the web site: www://hpgl.stanford.edu).
Results
STR Allele Frequencies and Hardy-Weinberg
Equilibrium
Our samples have been typed for 9 autosomal STR
markers, accounting for a total of 85 alleles and for the
10 mtDNA haplogroups H, V, T, J, U, K, X, I, M,
L1/L2 and others (pooled in a “blank” haplogroup).
Allele and haplogroup frequencies were calculated by
simple gene counting for each of the 7 (or 6 if mtDNA is
also included) Sicilian samples, and are available from the
web site of the journal. Exact tests of Hardy-Weinberg
equilibrium have been calculated using the permutation
method of Guo & Thompson (1992), which gives
a valid probability for a test of Hardy-Weinberg equilibrium
when rare alleles at a locus produce small expected
numbers. Among the 63 tests (7 samples and
9 loci), three showed a significant probability (less than
0.01) and four were between 0.01 and 0.05; three (loci
TH01, F13A01, FES/FPS) are from the Butera sample,
two (TH01, F13A01) from Ragusa, one (LIPOL) from
Caccamo and one (FES/FPS) from Troina.
Genetic Variation Within and Between
Samples
Genetic variation within and between our Sicilian samples
is conveniently quantified by the F statistics of
Wright (1951). Three basic quantities can be described
when diploid individuals are sampled from a series of
populations as follows: the overall inbreeding coefficient,
FIT, which reflects the variability of alleles within
individuals over all populations; the coancestry coefficient,
FST (or θ), which reflects the variability of alleles
of different individuals between populations; and
the coefficient of inbreeding, FIS (or f ), which reflects
the variability of alleles within individuals within populations.
These three quantities (related by the relationship
(1− FIT)=(1− FST)(1− FIS)) calculated from the
dataset SIC0785 are displayed in Table 3.
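The relationship among the three F statistics can be checked directly against the overall estimates reported in Table 3. The short sketch below does so; the function name `fit_from` is hypothetical, and the three constants are the "Overall" values of Table 3.

```python
def fit_from(fis, fst):
    """Overall inbreeding coefficient implied by (1 - FIT) = (1 - FST)(1 - FIS)."""
    return 1.0 - (1.0 - fst) * (1.0 - fis)

# Overall estimates reported in Table 3 for the SIC0785 dataset.
F_IS, F_ST, F_IT = 0.048671, 0.003699, 0.052190
implied_fit = fit_from(F_IS, F_ST)
print(round(implied_fit, 6), F_IT)
```

The implied value agrees with the tabulated FIT to the sixth decimal place, as expected if the three quantities were estimated jointly.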
The confidence interval of the overall FST has been
estimated by bootstrapping (over loci) 1000 replicates of
the data. The hypothesis that FST has a zero value (no
genetic differentiation among samples) can be rejected at
the 5% significance level, as the confidence interval was
estimated to be 0.002178–0.005442. This also holds
when considering only the mitochondrial data, for which
FST = 0.032, one order of magnitude higher and
comparable to the values compiled by Seielstad et al.
(1998) from European data.

Table 3 Genetic differentiation indexes for the tested microsatellite polymorphisms
Polymorphism   FIT         FIS         FST
PAH-STR        −0.001110   −0.003645   0.002525
LIPOL          −0.013048   −0.020358   0.007164
vWA31/A         0.023533    0.019608   0.004003
TH01            0.093374    0.091590   0.001964
F13A01          0.096790    0.096136   0.000723
FES/FPS         0.039104    0.031637   0.007712
TPOX           −0.016879   −0.023907   0.006864
FGA             0.141815    0.139472   0.002723
CSF1PO          0.083441    0.082789   0.000711
Overall         0.052190    0.048671   0.003699
The Markov chain based exact test implemented in
the Genepop computer package (referenced above) to
assess the contribution of single polymorphisms to the
whole genetic heterogeneity showed that microsatellite
polymorphisms PAH-STR, LIPOL, TH01, FES/FPS
and FGA provide the statistically most significant contributions
to the genetic differentiation within Sicily
at the genotypic, as well as at the gene, level. Such
a degree of genetic structure is still observed when
the samples are grouped according to their geography,
into Eastern (Troina, Butera, Ragusa and Piazza Armerina)
and Western (Caccamo, Sciacca, Castellammare
del Golfo) Sicily. Five microsatellite polymorphisms –
LIPOL, TH01, F13A01, FES/FPS and FGA – contribute
to this genetic heterogeneity between Eastern
and Western Sicily at a statistically significant probability
level, and for three of them (LIPOL, TH01, and
FES/FPS) both gene and genotype frequencies differ
significantly.
Genetic Distances and Isonymy
The parameter FST when estimated for the pair of samples
i, j is called the “coancestry” coefficient and its
transformation D(i, j ) = −ln(1 − FST(i, j )) is proportional
to the time of differentiation when only genetic
drift is causing the genetic differentiation between the
two samples; for this reason it is appropriate to interpret
it as a genetic distance. We calculated the coancestry coefficients
by taking into account the sample sizes (“unbiased”
estimates) as described by Reynolds et al. (1983).
The resulting distance matrices for microsatellites and
for mtDNA haplogroups are shown in Table 4.
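As a small illustration of the transformation D = −ln(1 − FST), the sketch below applies it to the two overall FST values quoted earlier (0.003699 for STRs and 0.032 for mtDNA). The function name `coancestry_distance` is hypothetical, and clipping negative estimates to zero is an assumption of this sketch that mirrors the treatment of negative distances described in the text.

```python
import math

def coancestry_distance(fst):
    """D = -ln(1 - FST); negative FST estimates are treated as zero."""
    return -math.log(1.0 - max(fst, 0.0))

d_str = coancestry_distance(0.003699)  # overall STR FST from Table 3
d_mt = coancestry_distance(0.032)      # mtDNA FST quoted in the text
print(d_str, d_mt)
```

For values as small as those observed here, D and FST differ only from the third significant digit onwards, so the two can be read almost interchangeably.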
Theory based on the island model of migration
indicates that the coancestry coefficient is equal to
1/(1+4Nν) for diploid systems (such as STR data), and
1/(1+Nν) for haploid systems (such as mitochondrial data),
where N is the effective population size and ν is the
sum of migration and mutation rates. As the mutation
rates for the two systems are different, but substantially
lower than any estimates of the human migration rates,
for equal N the quantity Nν can be considered proportional
to the migration rate between the two samples.
In order to compare diploid STR data with haploid
mitochondrial genetic differences possibly due to this
migration, the quantity Nν is also shown in Table 4.
Negative distances are due to the negative contribution
of the sampling error: this means that the sampling variance
is larger than the variance determined by genetic drift,
and therefore such distances are assumed to be zero. A
formal test of the hypothesis of no correlation between
STR and mitochondrial genetic distances confirms what
a simple inspection of the data suggests: no evidence of
correlation.

Table 4 Coancestry distances and Nν estimates (in parentheses) among seven Sicilian samples estimated from 85 STR polymorphisms
(rows marked STR) and from mtDNA haplogroup frequencies (rows marked mtDNA; in italic in the original). na, not applicable as
sampling errors are larger than distances; nt, not typed (Butera was not tested for mtDNA haplogroups)
Sample (marker)                   Troina           Sciacca          Castellammare    Piazza           Butera           Caccamo
                                                                    del Golfo        Armerina
Sciacca (STR)                     0.003276 (76)
Sciacca (mtDNA)                   0.0369 (26)
Castellammare del Golfo (STR)     0.004224 (59)    −0.00025 (na)
Castellammare del Golfo (mtDNA)   −0.000687 (na)   0.011495 (86)
Piazza Armerina (STR)             −0.001466 (na)   0.00354 (70)     0.003441 (73)
Piazza Armerina (mtDNA)           0.075726 (49)    −0.002572 (na)   0.037799 (25)
Butera (STR)                      0.004947 (50)    −0.000939 (na)   0.000586 (426)   0.004839 (51)
Caccamo (STR)                     0.002996 (83)    0.003099 (80)    0.001805 (138)   0.002061 (121)   0.005873 (42)
Caccamo (mtDNA)                   0.121408 (7)     0.024078 (41)    0.082663 (11)    −0.011608 (na)   nt
Ragusa (STR)                      0.005790 (43)    0.009366 (26)    0.007755 (32)    −0.003656 (na)   0.00977 (25)     0.00941 (26)
Ragusa (mtDNA)                    0.083151 (11)    0.004249 (234)   0.048364 (20)    −0.013139 (na)   nt               0.004647 (214)

Table 5 Isonymy matrix among seven Sicilian samples (rows marked all: all surnames; rows marked Greek: values ×10, surnames of
probable Greek origin alone, in italic in the original)
Sample (surnames)                 Troina     Sciacca    Castellammare  Piazza     Butera     Caccamo
                                                        del Golfo      Armerina
Sciacca (all)                     0.000377
Sciacca (Greek)                   0.000514
Castellammare del Golfo (all)     0.000406   0.000579
Castellammare del Golfo (Greek)   0.000698   0.000517
Piazza Armerina (all)             0.000596   0.000545   0.000528
Piazza Armerina (Greek)           0.001324   0.001203   0.000615
Butera (all)                      0.000401   0.000453   0.000559       0.000563
Butera (Greek)                    0.000542   0.000583   0.000405       0.000904
Caccamo (all)                     0.000216   0.000620   0.000385       0.000336   0.000157
Caccamo (Greek)                   0.000770   0.000167   0.000554       0.000589   0.000153
Ragusa (all)                      0.000321   0.000300   0.000390       0.000298   0.000384   0.000282
Ragusa (Greek)                    0.000238   0.000476   0.001725       0.000457   0.000797   0.000110
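The island-model relations FST = 1/(1+4Nν) and FST = 1/(1+Nν) can be inverted to recover Nν from a coancestry estimate. The sketch below does this for the Troina–Sciacca pair of Table 4; the function name `n_nu` is hypothetical, and the tabulated distances are used directly in place of FST, an approximation that is reasonable only because the values are small.

```python
def n_nu(coancestry, ploidy="diploid"):
    """Invert FST = 1 / (1 + k*N*nu), with k = 4 (diploid) or k = 1 (haploid).

    Returns None for non-positive estimates, which Table 4 reports as "na".
    """
    if coancestry <= 0:
        return None
    k = 4 if ploidy == "diploid" else 1
    return (1.0 / coancestry - 1.0) / k

str_nnu = n_nu(0.003276)             # Troina-Sciacca, STR data
mt_nnu = n_nu(0.0369, "haploid")     # Troina-Sciacca, mtDNA data
print(round(str_nnu), round(mt_nnu))
```

Both rounded values match the figures given in parentheses for this pair in Table 4.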
The genetic coancestry distances shown in Table 4
have been correlated with the geographic distance matrix
(calculated from the actual roads) and the isonymy
matrix shown in Table 5, where the surnames of
Greek origin were also considered. The Mantel statistics
(Mantel, 1967) have been used to test whether their
correlation is different from zero. No statistically significant
correlation between geographic and isonymy matrices
has been found. Correlations between the matrices
of genetic distances and isonymy were calculated: for
the STR and the mtDNA haplogroup polymorphisms
we obtained r(SIC0695) = 0.43 with a probability of
no correlation at a borderline significance level of 6%;
for the mtDNA haplogroups alone, however, a much
lower correlation (0.14), statistically not significant, was
obtained. A higher but statistically not significant correlation
between genetic and geographic distances was
found (r = 0.39, P = 0.10).
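The logic of the Mantel test used above can be sketched in a few lines: the correlation between the off-diagonal elements of two distance matrices is compared with its distribution when the sample labels of one matrix are permuted. The code below is an illustrative sketch under that textbook definition, not a reproduction of the NTSYSpc implementation; the names `pearson` and `mantel`, the number of permutations and the toy matrices are all assumptions.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Mantel test for two symmetric distance matrices (lists of lists).

    Returns the observed correlation and a one-sided permutation P value,
    obtained by relabelling the rows and columns of dist_b at random.
    """
    n = len(dist_a)
    pairs = list(combinations(range(n), 2))
    a = [dist_a[i][j] for i, j in pairs]
    r_obs = pearson(a, [dist_b[i][j] for i, j in pairs])
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        b = [dist_b[order[i]][order[j]] for i, j in pairs]
        if pearson(a, b) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)

# Hypothetical example: five samples on a line, two proportional matrices.
positions = [0.0, 1.0, 3.0, 7.0, 12.0]
geo = [[abs(p - q) for q in positions] for p in positions]
gen = [[0.01 * abs(p - q) for q in positions] for p in positions]
r_obs, p_value = mantel(geo, gen, permutations=199)
print(r_obs, p_value)
```

With only seven samples, as in this study, the number of distinct relabellings is limited (7! = 5040), which is one reason why borderline probabilities such as P = 0.06 should be read with caution.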
In order to explore the order of magnitude of genetic
divergence times among our samples, another set
of distances, the so-called (δμ)² distances proposed by
Goldstein et al. (1995) as the most appropriate for microsatellite
data, are shown in Table 6. As discussed
in several papers (e.g. Zhivotovsky & Feldman 1995;
Cooper et al. 1999) the use of these distances has
advantages and disadvantages. Their basic assumptions
of a single-step mutation process generating the number
of repeats and of the constancy of mutation rates
among loci are difficult to test, and the large variance of (δμ)²
is likely to make this distance less robust for describing recently
diverged populations. On the other hand, the expected
value of (δμ)² is twice the product of the microsatellite
mutation rate, the variance in size of mutational
jumps and the divergence time (in number of generations):
to a first approximation one can assume the
constancy of the first two factors, so that the expected
value of (δμ)² is proportional to time and independent
of sample size, which makes it especially attractive.
Bearing in mind these limitations, Table 6 shows the
mean times of genetic divergence for each pair of samples.
They are expressed in years before present (YBP)
by making the assumption of 25 years per generation
and of a mutation rate for all microsatellites equal to
2.8 × 10⁻⁴ (Chakraborty et al. 1997). The median
value (in YBP) is 2321 and the interquartile range is
1652. Inferred divergence times ranged from 759 (between
Troina and Ragusa) to 4776 (between Butera and
Castellammare) YBP. For 85 polymorphisms the relative
error of these estimates can be calculated to be about
15–20% (Zhivotovsky & Feldman, 1995), which does
not change the order of magnitude of these very qualitative
findings.

Table 6 (δμ)² distances among seven Sicilian samples estimated by 85 STR polymorphisms, with divergence times (years BP; in
italics in the original) from the equation E[(δμ)²] = 2ωβt given in Goldstein et al. (1995), assuming a mutation rate
β = 2.8 × 10⁻⁴ (Chakraborty et al. 1997), a constant variance ω in the size of mutational jumps and a generation time of 25 years
Sample (row)                        Troina   Sciacca  Castellammare  Piazza    Butera   Caccamo
                                                      del Golfo      Armerina
Sciacca ((δμ)²)                     0.040
Sciacca (years BP)                  1785
Castellammare del Golfo ((δμ)²)     0.072    0.037
Castellammare del Golfo (years BP)  3214     1652
Piazza Armerina ((δμ)²)             0.052    0.042    0.078
Piazza Armerina (years BP)          2321     1875     3482
Butera ((δμ)²)                      0.077    0.030    0.107          0.064
Butera (years BP)                   3437     1339     4776           2857
Caccamo ((δμ)²)                     0.044    0.037    0.103          0.071     0.035
Caccamo (years BP)                  1964     1652     4598           3169      1562
Ragusa ((δμ)²)                      0.017    0.044    0.082          0.032     0.074    0.058
Ragusa (years BP)                   759      1964     3660           1428      3303     2589
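The conversion from a (δμ)² distance to a divergence time follows from solving E[(δμ)²] = 2ωβt for t. The sketch below applies it to two entries of Table 6; the function name `divergence_years` is hypothetical, and the value ω = 1 is an assumption of this sketch, chosen because it reproduces the tabulated times.

```python
def divergence_years(delta_mu_sq, mutation_rate=2.8e-4,
                     jump_variance=1.0, generation_years=25):
    """Solve E[(delta mu)^2] = 2 * omega * beta * t for t (in generations)
    and convert the result to years."""
    generations = delta_mu_sq / (2.0 * jump_variance * mutation_rate)
    return generations * generation_years

troina_sciacca = divergence_years(0.040)
troina_ragusa = divergence_years(0.017)
print(troina_sciacca, troina_ragusa)
```

The two results, about 1786 and 759 years, correspond to the Troina–Sciacca and Troina–Ragusa entries of Table 6 (the former appears there truncated to 1785).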
Principal Component and Tree Analysis
The datasets SIC0785 and SIC0695 have been summarised
by principal component analysis, mainly to test
whether single principal components could suggest specific
hypotheses for the genetic relationship among the
Sicilian samples. As the variables are gene frequencies
whose sum is equal to 1 for each locus considered, the
data do not fill the full space because they are not independent:
they are called “compositional” data and there
are appropriate methods to deal with them (Aitchison,
1986; Reyment & Savazzi, 1999). The first three principal
component coordinates computed according to
these methods by a computer code, kindly provided to
us by Prof. Reyment, give results similar to those calculated
by the traditional method, and explain 26%, 19%
and 18% of the total genetic variance for the SIC0785
dataset, and 26%, 22% and 20% for the SIC0695 dataset.
No graphical display is given as the tree representation
of the same data (see below) is more informative by
also incorporating lower principal components. An interesting
result of the analysis is that the first principal
component coordinates of the seven samples show a
correlation with their longitude which is greater than
that with their latitude (Pearson r = 0.36 versus 0.18),
even if not statistically significant. A simple inspection of
the scatter-plot (Fig. 2) indicates that one point (Butera)
has considerable leverage on the linear regression between
the two variables. In fact, the exclusion of this
sample results in an almost perfect and statistically significant
linearity between the first principal component
coordinates of the remaining six samples and their longitude
(r = 0.98, P = 0.0004), while the same does not
hold with their latitude (r = 0.72, P = 0.11). The same
analyses applied to the other principal component coordinates
always give lower and not statistically significant
correlations.
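One standard way of handling compositional data of the kind mentioned above is Aitchison's centred log-ratio (clr) transform, after which principal components can be computed in the usual way. Whether the code provided by Prof. Reyment follows exactly this route is not stated here, so the sketch below should be read only as an illustration of the general idea; the function name `clr`, the pseudocount and the example frequencies are assumptions.

```python
import math

def clr(frequencies, pseudocount=1e-3):
    """Centred log-ratio transform of one composition (Aitchison, 1986).

    A small pseudocount (an assumption of this sketch) replaces zero
    frequencies, which have no logarithm; the result sums to zero.
    """
    adjusted = [f + pseudocount for f in frequencies]
    total = sum(adjusted)
    logs = [math.log(a / total) for a in adjusted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical allele frequencies at one locus in one sample.
transformed = clr([0.50, 0.30, 0.20, 0.00])
print(transformed)
```

The transformed values preserve the ordering of the original frequencies while removing the unit-sum constraint within the locus; the choice of pseudocount affects rare alleles most and would need to be checked in any real application.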
Figure 2 First principal coordinates of the seven samples
(ordinate, first principal component scores) calculated on 85 STR gene frequencies as a function of
the longitudes of the samples (abscissa).
Maximum likelihood trees have been estimated for
three datasets: SIC0785 based on 85 STR markers;
SIC0695 based on 85 STR markers and 10 mtDNA
haplogroups; and MED1085 based on 85 STR markers
where three additional Mediterranean samples (Turkey,
Egypt and Algeria) were added. They are represented in
Figs. 3 a,b,c respectively. The robustness of these trees
has been tested by the bootstrap technique: the variables
(gene frequencies) of each dataset have been randomly
resampled 1000 times with replacement, and the percentage
of times each splitting is shared among the 1000
resampled trees (called “bootstrap value”) is indicated on
the relevant branch. Any percentage higher than 50% is
considered to give a “robust” splitting: the justification
of this threshold percentage is that if there is one tree
with all branchings having bootstrap values higher than
50%, there is no other tree with this property.
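The resampling step of the bootstrap described above can be sketched without a tree-building program. In the illustration below the variables (columns of allele frequencies) are resampled with replacement, and the support for a grouping is the share of replicates in which two samples are mutual nearest neighbours. Squared Euclidean distance and the mutual-nearest-neighbour criterion are simple stand-ins for the maximum likelihood tree of PHYLIP, not the method used in this paper; all names and data are hypothetical.

```python
import random

def nearest(dist_row, self_index):
    """Index of the closest other sample in one row of a distance matrix."""
    return min((d, k) for k, d in enumerate(dist_row) if k != self_index)[1]

def pair_support(freqs, i, j, replicates=1000, seed=0):
    """Share of bootstrap replicates in which samples i and j are mutual
    nearest neighbours, resampling variables (columns) with replacement."""
    rng = random.Random(seed)
    n_samples, n_vars = len(freqs), len(freqs[0])
    supported = 0
    for _ in range(replicates):
        cols = [rng.randrange(n_vars) for _ in range(n_vars)]
        dist = [[sum((freqs[a][c] - freqs[b][c]) ** 2 for c in cols)
                 for b in range(n_samples)] for a in range(n_samples)]
        if nearest(dist[i], i) == j and nearest(dist[j], j) == i:
            supported += 1
    return supported / replicates

# Hypothetical frequencies: samples 0 and 1 are close, sample 2 is distant.
toy = [
    [0.10, 0.40, 0.30, 0.20],
    [0.12, 0.38, 0.31, 0.19],
    [0.40, 0.10, 0.15, 0.35],
]
support = pair_support(toy, 0, 1, replicates=200)
print(support)
```

In this toy example samples 0 and 1 are closer to each other than to sample 2 on every variable, so the grouping is recovered in all replicates; real data give intermediate values such as the 86% and 70% reported for the Sicilian trees.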
Fig. 3a shows two clear clusters: Castellammare, Sciacca
and Butera, which group together 86% of the times,
and the remaining samples, whose structure is, however,
less defined. Ragusa and Piazza Armerina are associated
70% of the times, but the evolutionary model suggested
by the tree is probably not valid for Troina and
Caccamo, which join the tree with bootstrap values of
less than 50%. The addition of the mtDNA haplogroup
frequencies (Fig. 3b) does not help to resolve the matter:
Troina joins Castellammare and Sciacca 62% of times,
but Caccamo joins Troina, Castellammare and Sciacca
48% of times, and other combinations of samples in
lower percentages.
Figure 3 Maximum likelihood trees: the numbers on the
branches are the bootstrap percentages testing the robustness
of the different partitions of the trees (see text). a) Dataset
SIC0785. b) Dataset SIC0695. c) Dataset MED1085.
The maximum likelihood tree obtained by adding
three samples from North-Africa (Algeria and Egypt)
and the Middle East (Turkey) provides further information:
54% of times Castellammare, Butera and Sciacca
are associated with the Middle East sample, while the
remaining samples (Troina, Caccamo, Piazza Armerina
and Ragusa) are associated with the two samples from
North-Africa.
Discussion
Two important and probably related questions deserve
special attention in the reconstruction of the genetic history
of Sicily: (i) to what extent genetic differentiation
within the island really exists, and why; (ii) to what extent modern
Sicilian samples are genetically related to other
Mediterranean populations. In this study we estimated
85 allele frequencies for 9 STR polymorphisms and 10
mtDNA haplogroup frequencies, to investigate internal
genetic differentiation within Sicily and to provide data
for future comparisons.
The first general result from the present analysis is that
Sicily is genetically heterogeneous to a degree which is
statistically significant. The complex history of Sicily,
made up of different settlements since its first human
colonisation, rather than selective effects, may help to
explain this heterogeneity. In fact the alleles of the genes
listed in Table 2 reflect non-coding polymorphisms: although
in principle it cannot be excluded that some of
the STR alleles may be in linkage disequilibrium with
selectively non-neutral coding mutations, the general
consistency of the FSTs from STR data with those from
SNP data (Barbujani et al. 1997) provides additional evidence
that selection is not likely to be a major factor
causing genetic heterogeneity in Sicily. Migration and
genetic drift seem to have played a more effective role.
The quantities Nν in Table 4 may reflect an intensive
history of migrations in our Sicilian samples, and the observation
that the autosomal STR Nν values are mostly higher
than the mitochondrial ones may indicate that male migration
was higher than female migration in this history.
According to the classical Greek historian Thucydides,
who lived in the second half of the fifth century
BC (The Peloponnesian War, Book VI), “it is said that
the earliest inhabitants [of Sicily] were the Cyclopes: I
cannot say what kind of people these were or where
they came from. . . . The next settlers after them seem
to have been the Sicanians. . . . After the fall of Troy, some
of the Trojans escaped from the Achaeans and came in
ships to Sicily, where they settled next to the Sicanians
and were called by the name of Elymi. . . . The Sicels
(Latin Siculi) crossed over to Sicily from Italy, where they
lived previously and from which they were driven by
the Opicans. . .”. Even a broad outline of pre-Roman
and post-Roman Sicilian demography is useful here, as archaeological
and linguistic evidence today provides a more
accurate and modern assessment of the major demographic
shifts in Sicily than such classical foundation
myths. In fact it is known that Sicily had a flourishing
population in the late Upper Palaeolithic (Martini,
1997) and in Neolithic times (Tusa, 2000). Also the
presence of Sicanians (associated today with the Thapsos
culture in the middle Bronze Age, 1300 BC, and
with the Pantalica culture from 1250 to 850 BC) is documented.
At the end of the Pantalica culture (early
Iron Age) the Sicanians were pushed towards the middle
and the south of the island by the Sicels (coming
from continental Italy) from the east and, to a lesser extent,
by the Elymi from the west, where they founded
Eryx and Segesta (IX–VIII century BC: the language of
the graffiti found in Segesta seems to suggest an Anatolian
root). Starting in the eighth century BC the coastal
areas of Sicily and southern Italy were massively settled
by Greek colonizers, from whom many historical
and archaeological records remain. The Phoenician
and Carthaginian colonization took place at a similar
time but had a lesser impact, as these colonies did not survive
Greek power except in the westernmost corner of
Sicily, where they pushed the Elymi inland. Despite several
later conquests (by the Romans in the third century
BC, by the Arabs in the eighth and ninth centuries AD,
and by the Normans in the eleventh and twelfth centuries
AD) the Greek demographic and cultural influence
remained strong in many ways: even today
surnames of possible Greek origin (according
to Rohlfs, 1984) reach a prevalence
of 7 to 11% in our samples.
Establishing a one-to-one correspondence between
the genetic (gene and genotypic) heterogeneity of Sicily
observed today and a presumed genetic composition of
its pre-Roman settlers is a very dangerous exercise until
one has typed ancient DNA from pre-Roman Sicilian
fossils in the relevant archaeological areas, but some
tentative elements for discussion may be offered, at least
as cautious working hypotheses for further testing. The
peopling of Sicily, as very briefly described above, should
have caused genetic differentiation on the west-east axis
of the island: old classical genetic markers (Piazza et al.
1988), surnames (Guglielmino et al. 1991), and dialect
isoglosses (Ruffino, 1997) agree by showing this differentiation.
The genetic analysis by Rickards et al. (1998)
failed to find this geographical pattern, but our results in
Fig. 2 show that at least the fraction of genetic variability
(26%) summarized by the most important principal component
of our data is correlated with
longitude much more than with latitude. There is no simple
explanation for why Butera deviates from the pattern of the other six
samples, partly because, unfortunately,
Butera was not typed for the mitochondrial
markers. The microsatellite data (Table 6) show,
however, that Butera is among the samples with the
oldest divergence times.

Table 7 MtDNA haplogroup frequencies in 6 Sicilian samples and their age ranges in Europe according to Richards et al. (2000), Table 1

Sample            H        V        T        J        U        K        X        I        M        L1/L2    Others
Troina            0.61905  0.01905  0.06667  0.04762  0.07619  0.05714  0.03810  0.00952  0.00000  0.00000  0.06667
Sciacca           0.38372  0.02326  0.10465  0.09302  0.10465  0.05814  0.03488  0.02326  0.08140  0.02326  0.06977
Castellammare     0.53913  0.04348  0.09565  0.06087  0.06957  0.01739  0.03478  0.01739  0.02609  0.00870  0.08696
Piazza Armerina   0.30769  0.02564  0.15384  0.12820  0.17948  0.05128  0.00000  0.07692  0.00000  0.00000  0.07692
Caccamo           0.25862  0.00000  0.12069  0.15517  0.27586  0.06896  0.00000  0.08621  0.00000  0.00000  0.03448
Ragusa            0.28571  0.01786  0.14286  0.05357  0.16071  0.12500  0.00000  0.07143  0.01786  0.00000  0.12500
Age range (YBP)
  from            19,200   11,100   33,100   22,200   53,600   12,900   17,000   27,200   (1)      (1)
  to              21,400   16,900   40,200   27,400   58,900   18,300   30,000   40,500

(1) Haplogroup not common in Europe

In a recent paper Richards et al. (2000) developed a
“founder analysis” to identify and date migrations from
the Near East into Europe, by picking out founder sequences
in mtDNA HVS-I types. Table 7 shows the
mtDNA haplogroup frequencies obtained for our six Sicilian
samples, supplemented by the corresponding age
ranges in Europe according to the estimates of Richards
et al. (2000).
The tree analysis of our Sicilian samples shows that
the samples of Caccamo and Troina cannot be reliably
placed in any of the trees: the relevant bootstrap values
are less than, or about, 50%. This instability is probably
due to the tree model of evolution, which does not allow
admixture of the tree branches once split, a very
unrealistic hypothesis in the case of Sicily whose history
is composed of a stratification of different settlements,
each probably originating and developing with different
demographic parameters. In Table 7, where
only the mitochondrial history is represented,
Caccamo and Troina have, respectively, the
minimum and maximum frequencies of haplogroup
H (0.259 and 0.619), and the maximum and minimum
frequencies of haplogroups I (0.086 and 0.010) and
J (0.155 and 0.048); for haplogroup U, Caccamo has the
maximum (0.276), while Troina (0.076) is second lowest
after Castellammare (0.070). This remarkable
combination of extreme values may suggest
that the spatial genetic differentiation of Sicily may
also be due to settlements stratified at different times, as
exemplified by the much-debated settlements by the
Sicani and Elymi in Central and Western Sicily, and after
that by the Siculi in Eastern Sicily (Tusa, 1997). Two
haplogroups not common in Europe are present: haplogroup
M, which spread from Eastern Africa to Western
Asia and Eurasia about 50,000 years ago (Quintana-Murci
et al. 1999), has been found in Sciacca (8%),
Castellammare (3%) and Ragusa (2%); and haplogroup
L1/L2 originating from Africa (Watson et al. 1997) has
been found in Sciacca (2%) and Castellammare (less
than 1%).
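
The comparison of extreme frequencies above can be checked directly against Table 7. The following is a minimal sketch, not part of the original analysis: the dictionary copies the H, U, I and J columns of Table 7, and the names TABLE7 and extremes are illustrative. Run on these values, it places the H minimum and the U, I and J maxima in Caccamo, and the H maximum and the I and J minima in Troina; for U the lowest value falls in Castellammare (0.070), with Troina (0.076) next.

```python
# Frequencies of four haplogroups copied from Table 7 (six Sicilian samples).
# TABLE7 and extremes() are illustrative names, not taken from the paper.
TABLE7 = {
    "Troina":          {"H": 0.61905, "U": 0.07619, "I": 0.00952, "J": 0.04762},
    "Sciacca":         {"H": 0.38372, "U": 0.10465, "I": 0.02326, "J": 0.09302},
    "Castellammare":   {"H": 0.53913, "U": 0.06957, "I": 0.01739, "J": 0.06087},
    "Piazza Armerina": {"H": 0.30769, "U": 0.17948, "I": 0.07692, "J": 0.12820},
    "Caccamo":         {"H": 0.25862, "U": 0.27586, "I": 0.08621, "J": 0.15517},
    "Ragusa":          {"H": 0.28571, "U": 0.16071, "I": 0.07143, "J": 0.05357},
}


def extremes(haplogroup, table=TABLE7):
    """Return (sample with lowest frequency, sample with highest frequency)."""
    lowest = min(table, key=lambda sample: table[sample][haplogroup])
    highest = max(table, key=lambda sample: table[sample][haplogroup])
    return lowest, highest


if __name__ == "__main__":
    for hg in ("H", "U", "I", "J"):
        low, high = extremes(hg)
        print(f"{hg}: min {low} ({TABLE7[low][hg]:.3f}), "
              f"max {high} ({TABLE7[high][hg]:.3f})")
```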
Divergence times computed from microsatellite data
provide a more recent time perspective which is more
comparable with historical records. It is interesting to
note that in Sciacca today there are microsatellite types
present which diverged more recently (in the Christian
era) than in all other samples: this may suggest a
genetic composition of Sciacca mainly derived from
settlements after the Roman conquest of Sicily (First
Punic War, 264–241 BC). All other divergence times inferred
from microsatellites date from the second to first
millennium BC: they seem to go back to the
pre-Hellenistic period. It must be pointed out,
however, that such time ranges represent only rough
orders of magnitude, partly because the divergence model
assumes a treelike splitting of the ancestral gene pool
without subsequent admixture, which almost certainly
does not apply to our samples.
Finally it is interesting to note that in these samples
the isonymy data are poorly correlated with the total
set of genetic data (r = 0.43, P = 0.06) and not correlated
at all with the subset of mitochondrial types. A
possible reason for this lack of clear correlation is that
the surname distribution we used is probably too recent
(1993) to reflect the long-lasting memory
of the genetic traits and that, especially in recent
times, male and female migration patterns within Sicily
contributed differently to erasing genetic differences:
in fact the analysis by Guglielmino et al. (1991), referring
to surnames from consanguineous marriages of
about a century ago, seems to show much more congruence
with the genetic data. Unfortunately the geographic
distribution of such a collection of data in 16
dioceses does not allow further subdivisions into smaller
units, and therefore makes a more quantitative statement
impossible.
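
A correlation between two distance matrices, such as the one between isonymy and genetic distances discussed above, is commonly assessed with a permutation test of the kind introduced by Mantel (1967). The following is a minimal sketch under stated assumptions: it is a generic one-sided Mantel-type test, not necessarily the exact procedure used here, and the two 4 x 4 matrices are hypothetical toy values, not data from this study. The function and variable names are illustrative.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(a, b, permutations=999, seed=0):
    """One-sided Mantel-type test: returns (r, permutation P value).

    Rows and columns of b are permuted jointly, so that the dependence
    among distances sharing a sample is preserved under the null.
    """
    n = len(a)
    xa = upper_triangle(a)
    observed = pearson(xa, upper_triangle(b))
    rng = random.Random(seed)
    order = list(range(n))
    at_least = 0
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        # Small tolerance guards against floating-point ties.
        if pearson(xa, upper_triangle(permuted)) >= observed - 1e-12:
            at_least += 1
    return observed, (at_least + 1) / (permutations + 1)


# Hypothetical toy distance matrices (NOT data from this study).
TOY_GENETIC = [[0.00, 0.02, 0.05, 0.04],
               [0.02, 0.00, 0.03, 0.06],
               [0.05, 0.03, 0.00, 0.01],
               [0.04, 0.06, 0.01, 0.00]]
TOY_ISONYMY = [[0.0, 1.1, 2.3, 1.9],
               [1.1, 0.0, 1.2, 2.8],
               [2.3, 1.2, 0.0, 0.9],
               [1.9, 2.8, 0.9, 0.0]]

if __name__ == "__main__":
    r, p = mantel(TOY_GENETIC, TOY_ISONYMY)
    print(f"r = {r:.2f}, P = {p:.3f}")
```

With only a handful of samples the number of distinct permutations is small, so the attainable P values are coarse; this is one reason why a correlation of moderate size may fail to reach conventional significance.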
In conclusion, even if it is difficult to resist the temptation
to associate provisional time depths to the previous
data, our work shows at least the interest of studying
the genetic history of Sicily, the largest Mediterranean
island. More samples and more markers, possibly from
non-recombining DNA regions, such as the SNP markers
of the Y chromosome detected by DHPLC and already
tested in Europe by Semino et al. (2000), will give more
resolving power: hopefully Sicily-specific DNA mutations
will be found to dissect different settlements, migrations,
bottlenecks, and to ascribe more accurate time
ranges to them.
Acknowledgements
The DNA samples from individuals of Turkish origin were
kindly provided by Prof. T. Coskun, Hacettepe University,
Department of Pediatrics, Unit of Metabolism, Ankara,
Turkey. The DNA samples from Egyptian individuals were
kindly provided by Prof. Nemat Hashem. The expert technical
assistance of Giuseppina Barrancotto and Pietro Schinocca
is acknowledged. The authors wish to thank Peter Forster
for valuable suggestions in the preparation of this manuscript.
This work was supported by Progetto Finalizzato of Ministry
of Health “Genetica di popolazione degli alleli PAH in Sicilia:
paragone con altri polimorfismi del DNA”; Progetto Finalizzato
C.N.R. Beni Culturali (“Cultural Heritage”) and Cofinanziamento
MURST ex40% (Italy) 1999.
References
Aitchison, J. (1986) The statistical analysis of compositional data.
London: Chapman and Hall.
Anker, R., Steinbrueck, T. & Donis-Keller, H. (1992)
Tetranucleotide repeat polymorphism at the human thyroid
peroxidase (hTPO) locus. Hum Mol Genet 1,
137.
Barbujani, G., Magagni, A., Minch, E. & Cavalli-Sforza, L.L.
(1997) An apportionment of human DNA diversity. Proc Natl
Acad Sci USA 94, 4516–4519.
Calì, F., Dianzani, I., Desviat, L.R., Perez, B., Ugarte,
M., Ozguc, M., Seyrantepe, V., Shiloh, Y., Giannattasio,
S., Carducci, C., Bosco, P., De Leo, G., Piazza, A. &
Romano, V. (1997) The STR252 – IVS10nt546 –
VNTR 7 phenylalanine hydroxylase minihaplotype in
five Mediterranean samples. Hum Genet 100, 350–
355.
Chakraborty, R., Kimmel, M., Stivers, D.N., Davison, L.J. &
Deka, R. (1997) Relative mutation rates at di-, tri-, and
tetranucleotide microsatellite loci. Proc Natl Acad Sci USA
94, 1041–1046.
Cooper, G., Amos, W., Bellamy, R., Siddiqui, M.R., Frodsham,
A., Hill, A.V.S. & Rubinsztein, D.C. (1999) An Empirical
Exploration of the (δμ)2 Genetic Distance for 213
Human Microsatellite Markers. Am J Hum Genet 65, 1125–
1133.
Efron, B. & Tibshirani, R. (1993) An introduction to the bootstrap.
New York: Chapman and Hall.
Felsenstein, J. (1988) Phylogenies and quantitative characters.
Annual Review of Ecology and Systematics 19, 445–471.
Felsenstein, J. (2000) PHYLIP (Phylogeny Inference Package) version
3.6a. Distributed by the author. Department of Genetics,
University of Washington, Seattle.
Finley, M.I. (1968) A History of Sicily: Ancient Sicily to the Arab
Conquest. London: Viking Press.
Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L. & Feldman,
M.W. (1995) An evaluation of genetic distances for
use with microsatellite loci. Genetics 139, 463–471.
Goltsov, A.A., Eisensmith, R.C., Naughton, E.R., Jin, L.,
Chakraborty, R. & Woo, S.L.C. (1993) A single polymorphic
STR system in the human phenylalanine hydroxylase
gene permits rapid prenatal diagnosis and carrier
screening for phenylketonuria. Hum Mol Genet 2, 577–
581.
Guglielmino, C.R., Zei, G. & Cavalli-Sforza, L.L. (1991) Genetic
and Cultural Transmission in Sicily as Revealed by
Names and Surnames. Hum Biol 63, 607–627.
Guo, S.W. & Thompson, E.A. (1992) Performing the Exact
Test of Hardy-Weinberg Proportion for Multiple Alleles.
Biometrics 48, 361–372.
Hammond, H.A., Jin, L., Zhong, Y., Caskey, C.T. &
Chakraborty, R. (1994) Evaluation of 13 short tandem
repeat loci for use in personal identification applications.
Am J Hum Genet 55, 175–189.
Kimpton, C.P., Walton, A. & Gill, P. (1992) A further tetranucleotide
repeat polymorphism in the vWF gene. Hum Mol
Genet 1, 287.
Lewis, P.O. & Zaykin, D. (2001) Genetic Data Analysis: Computer
program for the analysis of allelic data. Version 1.0 (d16c).
Free program distributed by the authors over the internet
from http://lewis.eeb.uconn.edu/lewishome/software.html.
Mantel, N. (1967) The detection of disease clustering and a
generalized regression approach. Cancer Res 27, 209–220.
Martini, F. (1997) Il Paleolitico Superiore in Sicilia. In: Prima
Sicilia (ed. S. Tusa), pp. 111–124. Palermo: Ediprint.
Mills, K.A., Even, D. & Murray, J.C. (1992) Tetranucleotide
repeat polymorphism at the human alpha fibrinogen locus
(FGA). Hum Mol Genet 1(9), 779.
Piazza, A., Cappello, N., Olivetti, E. & Rendine, S. (1988) A
genetic history of Italy. Ann Hum Genet 52, 203–213.
Polymeropoulos, M.H., Rath, D.S., Xiao, H. & Merrill, C.R.
(1991a) Tetranucleotide repeat polymorphism at the human
c-fes/fps proto-oncogene (FES). Nucl Acids Res 19,
3753.
Polymeropoulos, M.H., Xiao, H., Rath, D.S. & Merrill, C.R.
(1991b) Tetranucleotide repeat polymorphism at the human
tyrosine hydroxylase gene. Nucl Acids Res 19, 4018.
Quintana-Murci, L., Semino, O., Bandelt, H.-J., Passarino,
G., McElreavey, K. & Santachiara-Benerecetti, A.S. (1999)
Genetic evidence for an early exit of Homo sapiens sapiens
from Africa through eastern Africa. Nat Genet 23, 437–
441.
Reyment, R.A. & Savazzi, E. (1999) Aspects of Multivariate
Statistical Analysis in Geology. Amsterdam: Elsevier Science
B.V.
Reynolds, J., Weir, B.S. & Cockerham, C.C. (1983) Estimation
of the coancestry coefficient: basis for a short-term
genetic distance. Genetics 105, 767–779.
Richards, M., Macaulay, V., Hickey, E., Vega, E., Sykes, B.,
Guida, V., Rengo, C., Sellitto, D., Cruciani, F., Kivisild,
T., Villems, R., Thomas, M., Rychkov, S., Rychkov, O.,
Rychkov, Y., Golge, M., Dimitrov, D., Hill, E., Bradley,
D., Romano, R., Cali, F., Vona, G., Demaine, A., Papiha,
S., Triantaphyllidis, C., Stefanescu, G., Hatina, J., Belledi,
M., Di Rienzo, A., Novelletto, A., Oppenheim, A., Nørby,
S., Al-Zaheri, N., Santachiara-Benerecetti, S., Scozzari,
R., Torroni, A. & Bandelt, H.-J. (2000) Tracing European
Founder Lineages in the Near Eastern mtDNA Pool. Am J
Hum Genet 67, 1251–1276.
Rickards, O., Martinez-Labarga, C., Scano, G., De Stefano,
G.F., Biondi, G., Capaci, M. & Walter, H. (1998) Genetic
history of the population of Sicily. Hum Biol 70, 699–714.
Rohlfs, G. (1984) Dizionario storico dei cognomi nella Sicilia Orientale.
Palermo: Centro di Studi Filologici e Linguistici Siciliani.
Ruffino, G. (1997) Sicily. In: The dialects of Italy (eds. M. Maiden
& M. Parry), pp. 365–375. London: Routledge.
Schneider, S., Roessli, D. & Excoffier, L. (2000) Arlequin
vers. 2.000. A software for population genetic data
analysis. Free program distributed by the authors from
http://anthro.unige.ch/arlequin.
Seielstad, M.T., Minch, E. & Cavalli-Sforza, L.L. (1998) Genetic
evidence for a higher female migration rate in humans.
Nat Genet 20, 278–280.
Semino, O., Passarino, G., Oefner, P.J., Lin, A.A., Arbuzova,
S., Beckman, L.E., De Benedictis, G., Francalacci,
P., Kouvatsi, A., Limborska, S., Marcikiae, M., Mika,
A., Mika, B., Primorac, D., Santachiara-Benerecetti, A.S.,
Cavalli-Sforza, L.L. & Underhill, P.A. (2000) The genetic
legacy of Paleolithic Homo sapiens sapiens in extant Europeans:
a Y chromosome perspective. Science 290, 1155–
1159.
Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral,
P., Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L.,
Bonné-Tamir, B. & Scozzari, R. (1998) mtDNA analysis
reveals a major late Paleolithic population expansion from
southwestern to northeastern Europe. Am J Hum Genet 62,
1137–1152.
Torroni, A., Huoponen, K., Francalacci, P., Petrozzi, M.,
Morelli, L., Scozzari, R., Obinu, D., Savontaus, M.L. &
Wallace, D.C. (1996) Classification of European mtDNAs
from an analysis of three European populations. Genetics
144, 1835–1850.
Tusa, S. (1983) La Sicilia nella preistoria. Palermo: Sellerio.
pp 53–111 (a); pp. 121–181 (b).
Tusa, S. (1997) Prima Sicilia. Alle origini della società siciliana.
(ed. S. Tusa). Palermo: Ediprint.
Tusa, S. (2000) Ethnic dynamics and proto-history of Sicily.
Journal of Cultural Heritage 1, Supplement 2, pp. 17–28.
Watson, E., Forster, P., Richards, M. & Bandelt, H.J. (1997)
Mitochondrial footprints of human expansions in Africa.
Am J Hum Genet 61, 691–704.
Wright, S. (1951) The genetical structure of populations. Annals
of Eugenics 15, 323–354.
Zhivotovsky, L.A. & Feldman, M.W. (1995) Microsatellite
variability and genetic distances. Proc Natl Acad Sci USA
92, 11549–11552.
Zschocke, J., Graham, C.A., McKnight, J.J. & Nevin, N.C.
(1994) The STR system in the human phenylalanine hydroxylase
gene: true fragment length obtained with fluorescent
labelled PCR primers. Acta Paediatr Supplement 407,
41–42.
Zuliani, G. & Hobbs, H.H. (1990) Tetranucleotide repeat
polymorphism in the LPL gene. Nucleic Acids Res 18,
4958.
Received: 19 February 2002
Accepted: 31 July 2002
© University College London 2003. Annals of Human Genetics (2003) 67, 42–53