Why data need transforming, and how to transform them

Latest: 10-17


Transformations: an introduction

Reasons for using transformations

There are many reasons for transformation. The list here is not comprehensive.

1. Convenience

2. Reducing skewness

3. Equal spreads

4. Linear relationships

5. Additive relationships

If you are looking at just one variable, 1, 2 and 3 are relevant, while if you are looking at two or more variables, 4 and 5 are more important. However, transformations that achieve 4 and 5 very often achieve 2 and 3.

1. Convenience

A transformed scale may be as natural as the original scale and more convenient for a specific purpose (e.g. percentages rather than original data, sines rather than degrees).

One important example is standardisation, whereby values are adjusted for differing level and spread. In general

standardised value = (value - level) / spread.

Standardised values have level 0 and spread 1 and have no units: hence standardisation is useful for comparing variables expressed in different units. Most commonly a standard score is calculated using the mean and standard deviation (sd) of a variable:

z = (x - mean of x) / (sd of x).

Standardisation makes no difference to the shape of a distribution.
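As a minimal sketch of that last point, the following do-file fragment standardises a variable and checks that its skewness is unchanged. It assumes the auto data set shipped with Stata; the variable name zprice and the scalar name skew_raw are made up for illustration.

```stata
* Standardise price and confirm that only level and spread change.
sysuse auto, clear

egen double zprice = std(price)          // (price - mean) / sd

quietly summarize price, detail
scalar skew_raw = r(skewness)            // skewness on the original scale

quietly summarize zprice, detail
display "mean of z = " %9.6f r(mean) "   sd of z = " %9.6f r(sd)

* A linear change of scale leaves skewness (shape) untouched.
assert reldif(r(skewness), skew_raw) < 1e-6
```

The check uses reldif() rather than exact equality because the two skewness values are computed from differently scaled numbers and may differ in the last few digits.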

2. Reducing skewness

A transformation may be used to reduce skewness. A distribution that is symmetric or nearly so is often easier to handle and interpret than a skewed distribution. More specifically, a normal or Gaussian distribution is often regarded as ideal as it is assumed by many statistical methods.

To reduce right skewness, take roots or logarithms or reciprocals (roots are weakest). This is the commonest problem in practice.

To reduce left skewness, take squares or cubes or higher powers.
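One way to see the relative strength of the transformations for right skewness is to tabulate skewness after each. This is only a sketch: it assumes the auto data set shipped with Stata and uses price as a stand-in for any positive, right-skewed variable; the new variable names are illustrative.

```stata
* Compare skewness of a right-skewed variable under stronger and stronger transformations.
sysuse auto, clear

generate double sqrt_price = sqrt(price)
generate double ln_price   = ln(price)
generate double nrec_price = -1 / price      // negative reciprocal keeps the original order

foreach v of varlist price sqrt_price ln_price nrec_price {
    quietly summarize `v', detail
    display %-12s "`v'" "  skewness = " %7.3f r(skewness)
}
```

The printed skewness would usually move towards zero (and may overshoot into negative values) as the transformation gets stronger, but how far depends on the data.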

3. Equal spreads

A transformation may be used to produce approximately equal spreads, despite marked variations in level, which again makes data easier to handle and interpret. Each data set or subset having about the same spread or variability is a condition called homoscedasticity: its opposite is called heteroscedasticity. (The spelling -sked- rather than -sced- is also used.)

4. Linear relationships

When looking at relationships between variables, it is often far easier to think about patterns that are approximately linear than about patterns that are highly curved. This is vitally important when using linear regression, which amounts to fitting such patterns to data. (In Stata, regress is the basic command for regression.)

For example, a plot of logarithms of a series of values against time has the property that periods with constant rates of change (growth or decline) plot as straight lines.

5. Additive relationships

Relationships are often easier to analyse when additive rather than (say) multiplicative. So

y = a + bx

in which two terms a and bx are added is easier to deal with than

y = ax^b

in which two terms a and x^b are multiplied. Additivity is a vital issue in analysis of variance (in Stata, anova, oneway, etc.).

In practice, a transformation often works, serendipitously, to do several of these at once, particularly to reduce skewness, to produce nearly equal spreads and to produce a nearly linear or additive relationship. But this is not guaranteed.

Review of most common transformations

The most useful transformations in introductory data analysis are the reciprocal, logarithm, cube root, square root, and square. In what follows, even when it is not emphasised, it is supposed that transformations are used only over ranges on which they yield (finite) real numbers as results.

Reciprocal

The reciprocal, x to 1/x, with its sibling the negative reciprocal, x to -1/x, is a very strong transformation with a drastic effect on distribution shape. It cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all values are positive. The reciprocal of a ratio may often be interpreted as easily as the ratio itself: e.g.

population density (people per unit area) becomes area per person;

persons per doctor becomes doctors per person;

rates of erosion become time to erode a unit depth.

(In practice, we might want to multiply or divide the results of taking the reciprocal by some constant, such as 1000 or 10000, to get numbers that are easy to manage, but that itself has no effect on skewness or linearity.)

The reciprocal reverses order among values of the same sign: largest becomes smallest, etc. The negative reciprocal preserves order among values of the same sign.

Logarithm

The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2 of x, is a strong transformation with a major effect on distribution shape. It is commonly used for reducing right skewness and is often appropriate for measured variables. It cannot be applied to zero or negative values. One unit on a logarithmic scale means a multiplication by the base of logarithms being used. Exponential growth or decline

y = a exp(bx)

is made linear by

ln y = ln a + bx

so that the response variable y should be logged. (Here exp() means raising e, approximately 2.71828, the base of natural logarithms, to the power given in the parentheses.)
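A small simulated check of this linearisation, as a sketch only: the values a = 3 and b = 0.2 and the variable names are made up, and the data are noise-free so that the regression recovers the constants essentially exactly.

```stata
* Exponential growth y = a exp(bx) becomes a straight line after logging y.
clear
set obs 20
generate double x = _n - 1
generate double y = 3 * exp(0.2 * x)     // a = 3, b = 0.2 (made-up values)

generate double ln_y = ln(y)
regress ln_y x

* The slope estimates b; the exponentiated intercept estimates a.
display "b = " _b[x] "   a = " exp(_b[_cons])
assert abs(_b[x] - 0.2) < 1e-8
assert abs(exp(_b[_cons]) - 3) < 1e-8
```

With real data the fit would not be exact, and back-transforming the intercept by exp() gives an estimate on the original scale that deserves the usual caution.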

An aside on this exponential growth or decline equation: put x = 0, and

y = a exp(0) = a,

so that a is the amount or count when x = 0. If a > 0 and b > 0, then y grows at a faster and faster rate (e.g. compound interest or unchecked population growth), whereas if a > 0 and b < 0, y declines at a slower and slower rate (e.g. radioactive decay).

Power functions

y = ax^b

are made linear by

log y = log a + b log x

so that both variables y and x should be logged.

An aside on such power functions: put x = 0, and for b > 0,

y = ax^b = 0,

so the power function for positive b goes through the origin, which often makes physical or biological or economic sense. Think: does zero for x imply zero for y? This kind of power function is a shape that fits many data sets rather well.
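The same kind of check works for a power function, this time logging both variables. Again this is a sketch with made-up constants (a = 2, b = 1.5), illustrative variable names and noise-free data.

```stata
* A power function y = a x^b becomes a straight line after logging both y and x.
clear
set obs 15
generate double x = _n                   // strictly positive, so ln(x) is defined
generate double y = 2 * x^1.5            // a = 2, b = 1.5 (made-up values)

generate double ln_x = ln(x)
generate double ln_y = ln(y)
regress ln_y ln_x

* The slope estimates the power b; the exponentiated intercept estimates a.
assert abs(_b[ln_x] - 1.5) < 1e-8
assert abs(exp(_b[_cons]) - 2) < 1e-8
```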

Consider ratios y = p / q where p and q are both positive in practice. Examples are

males / females; dependants / workers; downstream length / downvalley length.

Then y is somewhere between 0 and infinity, or in the last case, between 1 and infinity. If p = q, then y = 1. Such definitions often lead to skewed data, because there is a clear lower limit and no clear upper limit. The logarithm, however, namely

log y = log (p / q) = log p - log q,

is somewhere between -infinity and infinity and p = q means that log y = 0. Hence the logarithm of such a ratio is likely to be more symmetrically distributed.
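A worked example may make the symmetry concrete. Take one case in which p is twice q and another in which q is twice p (the factor 2 is just an illustration). On the ratio scale the values 2 and 1/2 lie at unequal distances from 1, but their logarithms lie at equal distances from 0:

```latex
\[
\frac{p}{q} = 2 \;\Rightarrow\; \log_{10}\frac{p}{q} = \log_{10} 2 \approx 0.301,
\qquad
\frac{p}{q} = \frac{1}{2} \;\Rightarrow\; \log_{10}\frac{p}{q} = -\log_{10} 2 \approx -0.301 .
\]
```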

Cube root

The cube root, x to x^(1/3), is a fairly strong transformation with a substantial effect on distribution shape: it is weaker than the logarithm. It is also used for reducing right skewness, and has the advantage that it can be applied to zero and negative values. Note that the cube root of a volume has the units of a length. It is commonly applied to rainfall data.

Applicability to negative values requires a special note. Consider (2)(2)(2) = 8 and (-2)(-2)(-2) = -8. These examples show that the cube root of a negative number has negative sign and the same absolute value as the cube root of the equivalent positive number. A similar property is possessed by any other root whose power is the reciprocal of an odd positive integer (powers 1/3, 1/5, 1/7, etc.).

This property is a little delicate. For example, change the power just a smidgen from 1/3, and we can no longer define the result as a product of precisely three terms. However, the property is there to be exploited if useful.

Square root

The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.

Square

The square, x to x^2, has a moderate effect on distribution shape and it could be used to reduce left skewness. In practice, the main reason for using it is to fit a response by a quadratic function y = a + b x + c x^2. Quadratics have a turning point, either a maximum or a minimum, although the turning point in a function fitted to data might be far beyond the limits of the observations. The distance of a body from an origin is a quadratic if that body is moving under constant acceleration, which gives a very clear physical justification for using a quadratic. Otherwise quadratics are typically used solely because they can mimic a relationship within the data region. Outside that region they may behave very poorly, because they take on arbitrarily large values for extreme values of x, and unless the intercept a is constrained to be 0, they may behave unrealistically close to the origin.

Squaring usually makes sense only if the variable concerned is zero or positive, given that (-x)^2 and x^2 are identical.

Which transformation?

The main criterion in choosing a transformation is: what works with the data? As examples above indicate, it is important to consider as well two questions.

What makes physical (biological, economic, whatever) sense, for example in terms of limiting behaviour as values get very small or very large? This question often leads to the use of logarithms.

Can we keep dimensions and units simple and convenient? If possible, we prefer measurement scales that are easy to think about. The cube root of a volume and the square root of an area both have the dimensions of length, so far from complicating matters, such transformations may simplify them. Reciprocals usually have simple units, as mentioned earlier. Often, however, somewhat complicated units are a sacrifice that has to be made.

Psychological comments - for the puzzled

The main motive for transformation is greater ease of description. Although transformed scales may seem less natural, this is largely a psychological objection. Greater experience with transformation tends to reduce this feeling, simply because transformation so often works so well. In fact, many familiar measured scales are really transformed scales: decibels, pH and the Richter scale of earthquake magnitude are all logarithmic.

However, transformations cause debate even among experienced data analysts. Some use them routinely, others much less. Various views, extreme or not so extreme, are slightly caricatured here to stimulate reflection or discussion. For what it is worth, I consider all these views defensible, or at least understandable.

"This seems like a kind of cheating. You don't like how the data are, so you decide to change them."

"I see that this is a clever trick that works nicely. But how do I know when this trick will work with some other data, or if another trick is needed, or if no transformation is needed?"

"Transformations are needed because there is no guarantee that the world works on the scales it happens to be measured on."

"Transformations are most appropriate when they match a scientific view of how a variable behaves."

Often it helps to transform results back again, using the reverse or inverse transformation:

reciprocal    t = 1 / x             reciprocal        x = 1 / t

log base 10   t = log_10 x          10 to the power   x = 10^t

log base e    t = log_e x = ln x    e to the power    x = exp(t)

log base 2    t = log_2 x           2 to the power    x = 2^t

cube root     t = x^(1/3)           cube              x = t^3

square root   t = x^(1/2)           square            x = t^2
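The pairs of transformation and inverse listed above can be checked by a round trip. This is a minimal sketch on a handful of made-up positive values; the variable names are illustrative.

```stata
* Apply each transformation, then its inverse, and compare with the original.
clear
input double x
0.5
1
2
10
250
end

generate double back_recip = 1 / (1 / x)
generate double back_log10 = 10^(log10(x))
generate double back_ln    = exp(ln(x))
generate double back_curt  = (x^(1/3))^3
generate double back_sqrt  = sqrt(x)^2

* Each round trip should return x, up to rounding error.
foreach v of varlist back_* {
    assert reldif(`v', x) < 1e-12
}
```

The comparison allows for rounding error because, for example, x^(1/3) cubed will not always reproduce x to the last binary digit.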

How to do transformations in Stata

Basic first steps

1. Draw a graph of the data to see how far patterns in data match the simplest ideal patterns. Try dotplot or scatter as appropriate.

2. See what range the data cover. Transformations will have little effect if the range is small.

3. Think carefully about data sets including zero or negative values. Some transformations are not defined mathematically for some values, and often they make little or no scientific sense. For example, I would never transform temperatures in degrees Celsius or Fahrenheit for these reasons (unless to Kelvin).

Standard scores (mean 0 and sd 1) in a new variable are obtained by

. egen stdpopi = std(popi)

whereas the basic transformations can all be put in new variables by generate:

. gen recener = 1/energy
. gen logeener = ln(energy)
. gen l10ener = log10(energy)
. gen curtener = energy^(1/3)
. gen sqrtener = sqrt(energy)
. gen sqener = energy^2

. gen logitp = logit(p)                      (if p is a proportion)
. gen logitp = logit(p / 100)                (if p is a percent)
. gen frootp = sqrt(p) - sqrt(1-p)           (if p is a proportion)
. gen frootp = sqrt(p) - sqrt(100-p)         (if p is a percent)

The remarks in parentheses say which form to use; they are not part of the commands.

Cube roots of negative numbers require special care. Stata uses a general routine to calculate powers and does not look for special cases of powers. Whenever negative values are present, a more general recipe for cube roots is sign(x) * (abs(x)^(1/3)). Similar comments apply to fifth roots, seventh roots, etc.
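The difference between the two recipes shows up on a few made-up values. In this sketch the variable names naive_curt and safe_curt are illustrative.

```stata
* Naive power versus the sign-and-absolute-value recipe for cube roots.
clear
input double x
-8
-1
0
1
8
end

generate double naive_curt = x^(1/3)                  // missing wherever x < 0
generate double safe_curt  = sign(x) * abs(x)^(1/3)   // defined for every x

list

assert missing(naive_curt) if x < 0
assert reldif(safe_curt^3, x) < 1e-12
```

The first generate should report that missing values were generated, which is exactly the kind of message worth reading carefully.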

Note any messages about missing values carefully: unless you had missing values in the original variable, they indicate an attempt to apply a transformation when it is not defined. (Do you have zero or negative values, for example?)

It is not always necessary to create a transformed variable before working with it. In particular, many graph commands allow the options yscale(log) and xscale(log). This is very useful because the graph is labelled using the original values, but it does not leave behind a log-transformed variable in memory.
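A minimal sketch of those options, assuming the auto data set shipped with Stata; the choice of price and weight is arbitrary, and the local macro k_before is just a device to show that no variable is added.

```stata
* Plot on logarithmic scales without creating logged variables.
sysuse auto, clear
local k_before = c(k)                     // number of variables before graphing

scatter price weight, yscale(log) xscale(log)

* The data set still holds the same number of variables.
assert c(k) == `k_before'
```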

Other commands

Stata offers various other commands designed to help you choose a transformation. ladder, gladder and qladder try several transformations of a variable with the aim of showing how far they produce a more nearly normal (Gaussian) distribution. In practice such commands can be helpful, or they can be confusing at an introductory level: for example, they can suggest a transform at odds with what your scientific knowledge would indicate. boxcox and lnskew0 are more advanced commands that should be used only after studying textbook explanations of what they do. Box and Cox (1964) is the key original reference.
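For orientation only, the simplest calls look like this. The sketch assumes the auto data set shipped with Stata and picks mpg arbitrarily; what the output suggests should still be weighed against what makes scientific sense.

```stata
* Try the ladder of powers on one variable.
sysuse auto, clear

ladder mpg        // table of normality tests for each transformation
gladder mpg       // histograms of each transformation with normal curves
```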

For some statistical people any debate about transformation is largely side-stepped by the advent of generalised linear models. In such models, estimation is carried out on a transformed scale using a specified link function, but results are reported on the original scale of the response. The Stata command is glm.
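A sketch of the idea, assuming the auto data set shipped with Stata. The Gaussian family with a log link is chosen purely for illustration, not as a recommendation for these data, and price_hat is a made-up name.

```stata
* Fit on the log scale through the link; predict on the original scale.
sysuse auto, clear

glm price weight, family(gaussian) link(log)

predict double price_hat, mu      // fitted values in the units of price

* With a log link the fitted values cannot be zero or negative.
assert price_hat > 0
```

Note that the response itself is never logged here: only the mean of price is modelled on a logarithmic scale.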

Transformations for proportions and percents (more advanced)

Data that are proportions (between 0 and 1) or percents (between 0 and 100) often benefit from special transformations. The most common is the logit (or logistic) transformation, which is

logit p = log (p / (1 - p)) for proportions

OR logit p = log (p / (100 - p)) for percents

where p is a proportion or percent.

This transformation treats very small and very large values symmetrically, pulling out the tails and pulling in the middle around 0.5 or 50%. The plot of p against logit p is thus a flattened S-shape. Strictly, logit p cannot be determined for the extreme values of 0 and 1 (100%): if they occur in data, there needs to be some adjustment.

One justification for this logit transformation might be sketched in terms of a diffusion process such as the spread of literacy. The push from zero to a few percent might take a fair time; once literacy starts spreading its increase becomes more rapid and then in turn slows; and finally the last few percent may be very slow in converting to literacy, as we are left with the isolated and the awkward, who are the slowest to pick up any new thing. The resulting curve is thus a flattened S-shape against time, which in turn is made more nearly linear by taking logits of literacy. More formally, the same idea might be justified by imagining that adoption (infection, whatever) is proportional to the number of contacts between those who do and those who do not, which will rise and then fall quadratically. More generally, there are many relationships in which predicted values cannot logically be less than 0 or more than 1 (100%). Using logits is one way of ensuring this: otherwise models may produce absurd predictions.

The logit (looking only at the case of proportions)

logit p = log (p / (1 - p))

can be rewritten

logit p = log p - log (1 - p)

and in this form it can be seen as a member of a set of folded transformations

transform of p = something done to p - something done to (1 - p).

This way of writing it brings out the symmetrical way in which very high and very low values are treated. (If p is small, 1 - p is large, and vice versa.) The logit is occasionally called the folded log. The simplest other such transformation is the folded root (that means square root)

folded root of p = root of p - root of (1 - p).

As with square roots generally, the folded root has the advantage that it can be applied without adjustment to data values of 0 and 1 (100%). The folded root is a weaker transformation than the logit. In practice it is used far less frequently.
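The behaviour of the logit and the folded root at the extremes, and the symmetry of the logit about 0.5, can be seen on a few made-up proportions. The variable names in this sketch are illustrative.

```stata
* Logit and folded root on proportions, including the extremes 0 and 1.
clear
input double p
0
0.1
0.5
0.9
1
end

generate double logit_p = logit(p)                  // missing at p = 0 and p = 1
generate double froot_p = sqrt(p) - sqrt(1 - p)     // defined at both extremes

list

assert missing(logit_p) if inlist(p, 0, 1)
assert !missing(froot_p)

* Symmetry: logit(0.1) and logit(0.9) differ only in sign.
assert abs(logit_p[2] + logit_p[4]) < 1e-12
```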

Two other transformations for proportions and percents met in the older literature (and still used occasionally) are the angular and the probit. The angular is

arcsin(root of p)

or the angle whose sine is the square root of p. In practice, it behaves very like

p^0.41 - (1 - p)^0.41,

which in turn is close to

p^0.5 - (1 - p)^0.5,

which is another way of writing the folded root (Tukey 1960). The probit is a transformation with a mathematical connection to the normal (Gaussian) distribution, which is not only very similar in behaviour to the logit, but also more awkward to work with. As a result, it is now less seen, except in more advanced applications, where it retains several advantages.
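"Behaves very like" here can be read as "is very nearly a linear function of". One rough way to check that reading is to correlate the transformations over a grid of proportions. This is a sketch: the grid from 0.01 to 0.99 and the variable names are my own choices.

```stata
* How nearly linear is the angular transformation in the folded powers?
clear
set obs 99
generate double p = _n / 100                 // 0.01, 0.02, ..., 0.99

generate double angular = asin(sqrt(p))
generate double fpow41  = p^0.41 - (1 - p)^0.41
generate double froot   = sqrt(p) - sqrt(1 - p)

correlate angular fpow41 froot

* Angular versus the folded power 0.41: correlation very close to 1.
quietly correlate angular fpow41
assert r(rho) > 0.999
```

A correlation near 1 says only that the two scales are nearly linearly related on this grid; it says nothing about which is preferable for any given data.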

Transformations as a family (more advanced)

The main transformations mentioned previously, with the exception of the logarithm, namely the reciprocal, cube root, square root and square, are all powers. The powers concerned are

reciprocal     -1
cube root      1/3
square root    1/2
square         2

Note that the sequence of explanation was not capricious, but in numerical order of power. Therefore, these transformations are all members of a family. In addition, contrary to what may appear at first sight, the logarithm really belongs in the family too. Knowing this is important to appreciating that the transformations used in practice are not just a bag of tricks, but a series of tools of different sizes or strengths, like a set of screwdrivers or drill bits. We could thus fill out this sequence, the ladder of transformations as it is sometimes known, with more powers, as for example in

reciprocal square   -2
reciprocal          -1
(yields one)        0
cube root           1/3
square root         1/2
identity            1
square              2
cube                3
fourth power        4

Among the additions here, the identity transformation, say x^1 = x, is the transformation that is, in a sense, no transformation. The graph of x against x is naturally a straight line and so the power of 1 divides transformations whose graph is convex upwards (powers less than 1) from transformations whose graph is concave upwards (powers greater than 1). Powers less than 1 squeeze high values together and stretch low values apart, and powers more than 1 do the opposite.

The transformation x^0, on the other hand, is degenerate, as it always yields 1 as a result. However, we will now see that in a strong sense log x (meaning, strictly, the natural logarithm or ln x) really belongs in the family at the position of power 0.

If you know calculus, you will know that the sequence of powers

..., x^-3, x^-2, x^-1, x^0, x^1, x^2, ...

has as integrals, apart from additive constants,

..., -x^-2 / 2, -x^-1, ln x, x, x^2 / 2, x^3 / 3, ...

and the mapping can be reversed by differentiation. So integrating x^(p - 1) yields x^p / p, unless p is 0, in which case it yields ln x. Thus we can define a family

t_p(x) = x^p     if p != 0,
       = ln x    if p == 0.

The notion of choosing from a family when we choose a power or logarithm is a key idea. It follows that we can usually choose a different member of the family if the transformation turns out to be too weak, or too strong, for our purpose and our data.

Many discussions of transformations focus on slightly different families, for a variety of mathematical and statistical reasons. The canonical reference here is Box and Cox (1964), although note also earlier work by Tukey (1957). Most commonly, the definition is changed to

t_p(x) = (x^p - 1) / p    if p != 0,
       = ln x             if p == 0.

This t_p(x) has various properties which point up family resemblances.

1. ln x is the limit as p -> 0 of (x^p - 1) / p.

2. At x = 1, t_p(x) = 0, for all p.

3. The first derivative (rate of change) of t_p(x) is x^(p - 1) if p != 0 and 1 / x if p == 0. At x = 1, this is always 1.

4. The second derivative of t_p(x) is (p - 1) x^(p - 2) if p != 0 and -1 / x^2 if p == 0. At x = 1, this is always (p - 1).
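For readers who want to see why the first property holds, here is one standard argument for x > 0, using the series for the exponential function:

```latex
\begin{align*}
x^p &= \exp(p \ln x) = 1 + p \ln x + \frac{(p \ln x)^2}{2!} + \frac{(p \ln x)^3}{3!} + \cdots , \\
\frac{x^p - 1}{p} &= \ln x + \frac{p (\ln x)^2}{2!} + \frac{p^2 (\ln x)^3}{3!} + \cdots
\;\longrightarrow\; \ln x \quad \text{as } p \to 0 .
\end{align*}
```

Every term after the first carries a factor of p, so those terms vanish in the limit and only ln x remains.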

Another small change of definition has some similar consequences, but also some other advantages. Consider

t_p(x) = [(x + 1)^p - 1] / p    if p != 0,
       = ln(x + 1)              if p == 0.

This t_p(x) has various properties which also point up family resemblances.

1. If p = 1, t_p(x) = x.

2. At x = 0, t_p(x) = 0, for all p. So all curves start at the origin.

3. The first derivative (rate of change) of t_p(x) is (x + 1)^(p - 1) if p != 0 and 1 / (x + 1) if p == 0. At x = 0, this is always 1. So the curves have the same slope at the origin.

4. The second derivative of t_p(x) is (p - 1) (x + 1)^(p - 2) if p != 0 and -1 / (x + 1)^2 if p == 0. At x = 0, this is always (p - 1).
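The first two properties can be checked numerically for a few powers. In this sketch the powers tried, the grid of x values and the variable names t1 to t5 are arbitrary choices.

```stata
* Shifted family: t_p(x) = [(x + 1)^p - 1] / p, or ln(x + 1) when p is 0.
clear
set obs 5
generate double x = _n - 1                    // 0, 1, 2, 3, 4

local i = 0
foreach p in -1 0 0.5 1 2 {
    local ++i
    if `p' == 0 {
        generate double t`i' = ln(x + 1)
    }
    else {
        generate double t`i' = ((x + 1)^(`p') - 1) / (`p')
    }
    * Property 2: every curve passes through the origin.
    assert abs(t`i') < 1e-12 if x == 0
}

* Property 1: the fourth power tried is p = 1, which returns x itself.
assert reldif(t4, x) < 1e-12

list
```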

The most useful consequence, however, is that this definition can be extended more easily to variables that can be both positive and negative, as will now be seen.
