Statistics I INTRODUCTION Statistics, branch of mathematics that deals with the collection, organization, and analysis of numerical data and with such problems as experiment design and decision making. II HISTORY Domesday Book Compiled in 1086 under the direction of William the Conqueror, the Domesday Book was a meticulous survey of feudal estates in England. Simple forms of statistics have been used since the beginning of civilization, when pictorial representations or other symbols were used to record numbers of people, animals, and inanimate objects on skins, slabs, or sticks of wood and the walls of caves. Before 3000 BC the Babylonians used small clay tablets to record tabulations of agricultural yields and of commodities bartered or sold. The Egyptians analyzed the population and material wealth of their country before beginning to build the pyramids in the 31st century BC. The biblical books of Numbers and 1 Chronicles are primarily statistical works, the former containing two separate censuses of the Israelites and the latter describing the material wealth of various Jewish tribes. Similar numerical records existed in China before 2000 BC. The ancient Greeks held censuses to be used as bases for taxation as early as 594 BC. See Census. The Roman Empire was the first government to gather extensive data about the population, area, and wealth of the territories that it controlled. During the Middle Ages in Europe few comprehensive censuses were made. The Carolingian kings Pepin the Short and Charlemagne ordered surveys of ecclesiastical holdings: Pepin in 758 and Charlemagne in 762. Following the Norman Conquest of England in 1066, William I, king of England, ordered a census to be taken; the information gathered in this census, conducted in 1086, was recorded in the Domesday Book.
Registration of deaths and births was begun in England in the early 16th century, and in 1662 the first noteworthy statistical study of population, Observations on the London Bills of Mortality, was written. A similar study of mortality made in Breslau, Germany, in 1691 was used by the English astronomer Edmond Halley as a basis for the earliest mortality table. In the 19th century, with the application of the scientific method to all phenomena in the natural and social sciences, investigators recognized the need to reduce information to numerical values to avoid the ambiguity of verbal description. At present, statistics is a reliable means of describing accurately the values of economic, political, social, psychological, biological, and physical data and serves as a tool to correlate and analyze such data. The work of the statistician is no longer confined to gathering and tabulating data, but is chiefly a process of interpreting the information. The development of the theory of probability increased the scope of statistical applications. Much data can be approximated accurately by certain probability distributions, and the results of probability distributions can be used in analyzing statistical data. Probability can be used to test the reliability of statistical inferences and to indicate the kind and amount of data required for a particular problem. III STATISTICAL METHODS How Polls Predict Professional pollsters typically conduct their surveys among sample populations of 1,000 people. Statistical measurements show that reductions in the margin of error flatten out considerably after the sample size reaches 1,000. The raw materials of statistics are sets of numbers obtained from enumerations or measurements. In collecting statistical data, adequate precautions must be taken to secure complete and accurate information. The first problem of the statistician is to determine what and how much data to collect.
Actually, the problem of the census taker in obtaining an accurate and complete count of the population, like the problem of the physicist who wishes to count the number of molecule collisions per second in a given volume of gas under given conditions, is to decide the precise nature of the items to be counted. The statistician faces a complex problem when, for example, he or she wishes to take a sample poll or straw vote. It is no simple matter to gauge the size and constitution of the sample that will yield reasonably accurate predictions concerning the action of the total population. In protracted studies to establish a physical, biological, or social law, the statistician may start with one set of data and gradually modify it in light of experience. For example, in early studies of the growth of populations, future change in size of population was predicted by calculating the excess of births over deaths in any given period. Population statisticians soon recognized that rate of increase ultimately depends on the number of births, regardless of the number of deaths, so they began to calculate future population growth on the basis of the number of births each year per 1000 population. When predictions based on this method yielded inaccurate results, statisticians realized that other limiting factors exist in population growth. Because the number of births possible depends on the number of women rather than the total population, and because women bear children during only part of their total lifetime, the basic datum used to calculate future population size is now the number of live births per 1000 females of childbearing age. The predictive value of this basic datum can be further refined by combining it with other data on the percentage of women who remain childless because of choice or circumstance, sterility, contraception, death before the end of the childbearing period, and other limiting factors. 
The excess of births over deaths, therefore, is meaningful only as an indication of gross population growth over a definite period in the past; the number of births per 1000 population is meaningful only as an expression of the proportion of increase during a similar period; and the number of live births per 1000 women of childbearing age is meaningful for predicting future size of populations. IV TABULATION AND PRESENTATION OF DATA Frequency-Distribution Table A frequency-distribution table summarizes data. For example, there were 1200 grades received on 4 examinations by 10 sections of 30 students each. The first column lists the ten intervals into which the grades were grouped. The second column lists the midpoints of these intervals. The third column lists the number of grades in each interval, that is, their frequency. (There were 20 grades between 0 and 10.) The fourth column lists the proportion of grades in each interval, that is, their relative frequency. (0.017 of the 1200 grades were between 0 and 10.) The fifth column lists the number of grades in an interval and all intervals below it, that is, their cumulative frequency. (35 grades were in or below the interval between 10 and 20.) The sixth column lists the proportion of grades in or below an interval, that is, their relative cumulative frequency. (0.029 of the 1200 grades were in or below the interval 10 to 20.) The collected data must be arranged, tabulated, and presented to permit ready and meaningful analysis and interpretation. To study and interpret the examination-grade distribution in a class of 30 pupils, for instance, the grades are arranged in ascending order: 30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75, 76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100. This progression shows at a glance that the maximum is 100, the minimum 30, and the range, or difference, between the maximum and minimum is 70.
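The ordering just described can be sketched in a few lines of Python, using the grade list given above:

```python
# The 30 examination grades from the text, arranged in ascending order.
grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

maximum = max(grades)            # 100
minimum = min(grades)            # 30
value_range = maximum - minimum  # 70, the range of the distribution

print(maximum, minimum, value_range)
```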
In a cumulative-frequency graph, such as Fig. 1, the grades are marked on the horizontal axis and double marked on the vertical axis with the cumulative number of the grades on the left and the corresponding percentage of the total number on the right. Each dot represents the accumulated number of students who have attained a particular grade or less. For example, the dot A corresponds to the second 72; reading on the vertical axis, it is evident that there are 12, or 40 percent, of the grades equal to or less than 72. In analyzing the grades received by 10 sections of 30 pupils each on four examinations, a total of 1200 grades, the amount of data is too large to be exhibited conveniently as in Fig. 1. The statistician separates the data into suitably chosen groups, or intervals. For example, ten intervals might be used to tabulate the 1200 grades, as in column (a) of the accompanying frequency-distribution table; the actual number in an interval, called the frequency of the interval, is entered in column (c). The numbers that define the interval range are called the interval boundaries. It is convenient to choose the interval boundaries so that the interval ranges are equal to each other and the interval midpoints (half the sum of the interval boundaries) are simple numbers, because the midpoints are used in many calculations. A grade such as 87 will be tallied in the 80-90 interval; a boundary grade such as 90 should be tallied consistently throughout the groups, in either the lower or the upper interval. The relative frequency, column (d), is the ratio of the frequency of an interval to the total count; the relative frequency is multiplied by 100 to obtain the percent relative frequency.
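A minimal sketch of such a tabulation, applied to the 30 grades listed earlier (interval width 10; boundary grades are tallied into the upper interval, and the top interval is closed so that a grade of 100 is counted):

```python
# The 30 examination grades from the text.
grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

width = 10
rows = []
cumulative = 0
for lower in range(0, 100, width):
    upper = lower + width
    # Count grades falling in [lower, upper); the last interval is
    # [90, 100] so that the boundary grade 100 is tallied once.
    if upper == 100:
        freq = sum(lower <= g <= upper for g in grades)
    else:
        freq = sum(lower <= g < upper for g in grades)
    cumulative += freq
    rows.append((lower, upper, freq,
                 freq / len(grades),         # relative frequency
                 cumulative,                 # cumulative frequency
                 cumulative / len(grades)))  # relative cumulative frequency

for row in rows:
    print(row)
```

Multiplying the relative-frequency column by 100 gives the percent relative frequency described in the text.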
The cumulative frequency, column (e), represents the number of students receiving grades equal to or less than the upper boundary of each succeeding interval; thus, the number of students with grades of 30 or less is obtained by adding the frequencies in column (c) for the first three intervals, which total 53. The cumulative relative frequency, column (f), is the ratio of the cumulative frequency to the total number of grades. The data of a frequency-distribution table can be presented graphically in a frequency histogram, as in Fig. 2, or a cumulative-frequency polygon, as in Fig. 3. The histogram is a series of rectangles with bases equal to the interval ranges and areas proportional to the frequencies. The polygon in Fig. 3 is drawn by connecting with straight lines the interval midpoints of a cumulative frequency histogram. Newspapers and other printed media frequently present statistical data pictorially by using different lengths or sizes of various symbols to indicate different values. V MEASURES OF CENTRAL TENDENCY After data have been collected and tabulated, analysis begins with the calculation of a single number, which will summarize or represent all the data. Because data often exhibit a cluster or central point, this number is called a measure of central tendency. Let x1, x2, ..., xn be the n tabulated (but ungrouped) numbers of some statistic; the most frequently used measure is the simple arithmetic average, or mean, written x̄, which is the sum of the numbers divided by n: x̄ = (x1 + x2 + ... + xn)/n. If the x's are grouped into k intervals, with midpoints m1, m2, ..., mk and frequencies f1, f2, ..., fk, respectively, the simple arithmetic average is given by x̄ = Σ fimi / Σ fi, summed over i = 1, 2, ..., k. The median and the mode are two other measures of central tendency. Let the x's be arranged in numerical order; if n is odd, the median is the middle x; if n is even, the median is the average of the two middle x's. The mode is the x that occurs most frequently.
If two or more distinct x's occur with equal frequencies, but none with greater frequency, the set of x's may be said not to have a mode or to be bimodal, with modes at the two most frequent x's, or trimodal, with modes at the three most frequent x's. VI MEASURES OF VARIABILITY The investigator frequently is concerned with the variability of the distribution, that is, whether the measurements are clustered tightly around the mean or spread over the range. One measure of this variability is the difference between two percentiles, usually the 25th and the 75th percentiles. The pth percentile is a number such that p percent of the measurements are less than or equal to it; in particular, the 25th and the 75th percentiles are called the lower and upper quartiles, respectively. The pth percentile is readily found from the cumulative-frequency graph (Fig. 1) by running a horizontal line through the p percent mark on the vertical axis, then a vertical line from this point on the graph to the horizontal axis; the abscissa of the intersection is the value of the pth percentile. The standard deviation is a measure of variability that is more convenient than percentile differences for further investigation and analysis of statistical data. The standard deviation of a set of measurements x1, x2, ..., xn, with the mean x̄, is defined as the square root of the mean of the squares of the deviations; it is usually designated by the Greek letter sigma (σ). In symbols, σ = √[((x1 - x̄)² + (x2 - x̄)² + ... + (xn - x̄)²)/n]. The square, σ², of the standard deviation is called the variance. If the standard deviation is small, the measurements are tightly clustered around the mean; if it is large, they are widely scattered.
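Python's standard statistics module computes all of these measures directly; a sketch using the 30 grades from earlier (pstdev and pvariance implement the population formulas given above, dividing by n):

```python
import statistics

# The 30 examination grades from the text.
grades = [30, 35, 43, 52, 61, 65, 65, 65, 68, 70, 72, 72, 73, 75, 75,
          76, 77, 78, 78, 80, 83, 85, 88, 88, 90, 91, 96, 97, 100, 100]

mean = statistics.fmean(grades)     # arithmetic average
median = statistics.median(grades)  # n is even, so the average of the two middle x's
mode = statistics.mode(grades)      # the most frequent grade
sigma = statistics.pstdev(grades)   # population standard deviation (divides by n)
variance = statistics.pvariance(grades)  # sigma squared

print(mean, median, mode, sigma, variance)
```

Here the median is 75.5 (the average of 75 and 76) and the mode is 65, which occurs three times.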
VII CORRELATION When two social, physical, or biological phenomena increase or decrease proportionately and simultaneously because of identical external factors, the phenomena are correlated positively; under the same conditions, if one increases in the same proportion that the other decreases, the two phenomena are negatively correlated. Investigators calculate the degree of correlation by applying a coefficient of correlation to data concerning the two phenomena. The most common correlation coefficient is expressed as r = Σxy/(Nσxσy), in which x is the deviation of one variable from its mean, y is the corresponding deviation of the other variable from its mean, σx and σy are the standard deviations of the two variables, and N is the total number of cases in the series. A perfect positive correlation between the two variables results in a coefficient of +1, a perfect negative correlation in a coefficient of -1, and a total absence of correlation in a coefficient of 0. Intermediate values between +1 and 0 or -1 are interpreted by degree of correlation. Thus, 0.89 indicates high positive correlation, -0.76 high negative correlation, and 0.13 low positive correlation. VIII MATHEMATICAL MODELS Distribution of IQ Scores The distribution of scores (commonly called IQ scores) on the Wechsler Adult Intelligence Scale follows an approximately normal curve, an average distribution of values. The test is regularly adjusted so that the median score is 100; that is, so that half of the scores fall above 100 and half fall below. A mathematical model is a mathematical idealization in the form of a system, proposition, formula, or equation of a physical, biological, or social phenomenon. Thus, a theoretical, perfectly balanced die that can be tossed in a purely random fashion is a mathematical model for an actual physical die.
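The correlation coefficient of Section VII can be sketched directly from its definition; the data pairs in the usage lines are invented purely for illustration:

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r = sum(xy) / (N * sigma_x * sigma_y),
    where x and y are deviations of each variable from its mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)

# A perfectly proportional pair of series gives +1; reversing one gives -1.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (up to rounding)
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (up to rounding)
```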
The probability that in n throws of a mathematical die a throw of 6 will occur k times is P(k) = C(n, k)(1/6)^k(5/6)^(n-k), in which C(n, k) = n!/[k!(n - k)!] is the symbol for the binomial coefficient. The statistician confronted with a real physical die will devise an experiment, such as tossing the die n times repeatedly, for a total of Nn tosses, and then determine from the observed throws the likelihood that the die is balanced and that it was thrown in a random way. In a related but more involved example of a mathematical model, many sets of measurements have been found to have the same type of frequency distribution. For example, let x1, x2, ..., xN be the number of 6's cast in the N respective runs of n tosses of a die and assume N to be moderately large. Let y1, y2, ..., yN be the weights, correct to the nearest 1/100 g, of N lima beans chosen haphazardly from a 100-kg bag of lima beans. Let z1, z2, ..., zN be the barometric pressures recorded to the nearest 1/1000 cm by N students in succession, reading the same barometer. It will be observed that the x's, y's, and z's, once each set is standardized to mean 0 and standard deviation 1, have amazingly similar frequency patterns. The statistician adopts a model that is a mathematical prototype or idealization of all these patterns or distributions. One form of the mathematical model is an equation for the frequency distribution, in which N is assumed to be infinite: y = (1/√(2π))e^(-x²/2), in which e (approximately 2.718) is the base for natural logarithms (see Logarithm). The graph of this equation (Fig. 4) is the bell-shaped curve called the normal, or Gaussian, probability curve. If a variate x is normally distributed, the probability that its value lies between a and b is given by the area under the curve between them: P(a ≤ x ≤ b) = (1/√(2π)) ∫ from a to b of e^(-t²/2) dt. The mean of the x's is 0, and the standard deviation is 1. In practice, if N is large, the error is exceedingly small. IX TESTS OF RELIABILITY The statistician is often called upon to decide whether an assumed hypothesis for some phenomenon is valid or not.
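A sketch of the two computations this decision rests on: the binomial probability for the balanced die of Section VIII, and the chi-square statistic that the following paragraph uses to judge deviations of observed values from expected ones (the sample observed and expected values are those of the example discussed in this section):

```python
import math

def prob_k_sixes(n, k, p=1/6):
    """Binomial model: probability that a 6 occurs exactly k times
    in n throws of a perfectly balanced die."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The probabilities over all possible k must sum to 1.
print(sum(prob_k_sixes(10, k) for k in range(11)))

# Deviations of observed values from those the model predicts.
print(chi_square([12, 16, 21], [10, 15, 25]))
```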
The assumed hypothesis leads to a mathematical model; the model, in turn, yields certain predicted or expected values, for example, 10, 15, 25. The corresponding actually observed values are 12, 16, 21. To determine whether the hypothesis is to be kept or rejected, these deviations must be judged as normal fluctuations caused by sampling techniques or as significant discrepancies. Statisticians have devised several tests for the significance or reliability of data. One is the chi-square (χ²) test. The deviations (observed values minus expected values) are squared, divided by the expected values, and summed: χ² = Σ (observed - expected)²/expected. For the values above, χ² = (12 - 10)²/10 + (16 - 15)²/15 + (21 - 25)²/25 ≈ 1.11. The value of χ² is then compared with values in a statistical table to determine the significance of the deviations. X HIGHER STATISTICS The statistical methods described above are the simpler, more commonly used methods in the physical, biological, and social sciences. More advanced methods, often involving advanced mathematics, are used in further statistical studies, such as sampling theory, inference and estimation theory, and design of experiments. Contributed By: James Singer Microsoft ® Encarta ® 2009. © 1993-2008 Microsoft Corporation. All rights reserved.