PARETO DISTRIBUTIONS
P. Trehin
trehinp@aol.com
Abstract :
This paper is a
short presentation of a powerful method for the analysis of statistical
distribution of specific populations, according to some pertinent size
measurements of the individual entities that compose them.
This type of
distribution is remarkably stable over time and across geographical
areas. It is known as the "Pareto distribution", after
Vilfredo Pareto, the famous Italian economist/sociologist who worked in
Switzerland. He discovered that above a certain size, the cumulative income
distribution, when plotted on LogLog scales, forms an almost straight line.
The theory has been refined since, but the method remains valid.
The LogLog scale
is commonly used to analyse distributions relating the number of people
to their income level, but it is also used for analysing the number of networks,
enterprises, PABXs[1],
etc. in relation to their respective sizes (number of network end points,
number of employees, number of
telephone extensions, etc.). These distributions generally follow what is
called a "Pareto distribution".
We will propose
a variation to the classical Pareto distribution analysis technique which gives
more precise results over a larger size domain than the pure Pareto method.
Background
Pareto
discovered the distribution that now bears his name while studying family income
distributions in Switzerland. It was common sense knowledge that there are more
families with a low income than families with a large income. What was less
common knowledge was that these distributions followed a very smooth pattern.
Pareto
was analyzing income distributions among families[2].
Given the broad range of incomes on the one side and the even broader range of
numbers of families within the various income level classes (hundreds of
thousands in the low income classes, a handful in the very high income classes),
Pareto decided to use LogLog scale graph paper in order to be able to
represent them all on a single chart.
He
then remarked that not only was the common sense knowledge verified that there
are more small income families than large income families, but in addition the
distribution followed a straight line on the LogLog paper. Further
empirical studies of family income distributions for other time periods and
other countries came to the stunning result that they all followed the same
pattern.
[Figure: LogLog plot of the number of families with income X (Y axis, log
scale from .0001 to 1000) against income (X axis, log scale), with the minimum
income X0 marked on the X axis.]
The straight line part of the graph lends
itself to easy mathematical calculations.
Y = A X + B
Where X is the logarithm of the
income level x
Where Y is the logarithm of
the number of people having an income x
Returning to the arithmetic scale, with y the number of people having an
income x and b = 10^B (for base 10 logarithms), the equation is
y = b x^A
This allows
interpolation of missing values, estimation of values for different size class
boundaries, etc. The Pareto distribution, however, fails to give a good fit
at the lower end of the size spectrum, where a straight line does not fit the
observed data. In fact Pareto himself had stated that his observation was valid
only above a minimum size X0.
Empirical
analysis led statisticians to use a more sophisticated distribution curve
instead of a Pareto distribution: the "lognormal"
distribution. Once transformed to a LogLog scale, the curve fitting technique
uses a parabolic equation instead of a linear equation.
Y = A X² + B X + C
Where X is the logarithm of the
income level x
Where Y is the logarithm of
the number of people having an income x
Empirical
analysis shows that this second degree regression curve
gives very good results and that above a certain size it is statistically
indistinguishable from the original Pareto distribution. A simple
calculation shows that after returning to an arithmetic scale, the
second degree equation becomes that of the LogNormal distribution. In the
rest of this paper I'll use the term Pareto distribution as a generic name since
it is commonly known as such, keeping in mind that the true distribution name is
LogNormal.
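The link between the second degree equation and the LogNormal distribution can be written out explicitly. The derivation below is only a sketch: it assumes natural logarithms and the usual lognormal parameters mu and sigma, symbols which do not appear in the paper itself. With base 10 logarithms the coefficients are rescaled, but the equation remains of the second degree.

```latex
% Lognormal density with parameters \mu and \sigma (assumed notation)
f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)

% Take logarithms and write X = \ln x, Y = \ln f(x)
Y = -\frac{1}{2\sigma^2}\,X^2
    + \left(\frac{\mu}{\sigma^2} - 1\right) X
    - \frac{\mu^2}{2\sigma^2} - \ln\!\left(\sigma\sqrt{2\pi}\right)

% Identification with Y = A X^2 + B X + C
A = -\frac{1}{2\sigma^2}, \qquad
B = \frac{\mu}{\sigma^2} - 1, \qquad
C = -\frac{\mu^2}{2\sigma^2} - \ln\!\left(\sigma\sqrt{2\pi}\right)
```

Since A is negative, the parabola opens downward. Counting N entities instead of using a density only adds ln N to C.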
Domains of applicability of Pareto Distributions
Later analyses of statistical distributions
have demonstrated that Pareto distributions are indeed very common in various
fields:
· Enterprises distribution by employee size
· Enterprises distribution by Yearly Revenue size
· Establishments distribution by employee size
· PABX distribution by Extension size
· Computers distribution by Price size
· Computers distribution by Memory size
· Computers distribution by Installed MIPS size
· Computers distribution by Installed Terminals
· etc.
The remarkable stability of all these
distributions, through time
and geographical area, makes them a
powerful instrument of analysis and forecasting.

When should one look for a Pareto distribution?
Pareto
distributions all have the following characteristics in common:
They count a number of entities versus one of their sizes, expressed in a
variety of measurements.
· Number of enterprises belonging to a certain employee size class
· Number of people having a certain level of assets
· Number of networks having so many endpoints
The size has an open ended upper limit.
· Number of employees can, at least in theory, extend indefinitely
· MIPS in a computer, likewise
· Number of end points in a network too
The size measurement is homogeneous.
· The unit that measures size is the same across the population
· Computer size is measured in MIPS throughout the analysis
· Number of end points is the size measure of a network all along
The populations analyzed are large.
There
is obviously a certain degree of insight involved in this decision to use a
Pareto distribution, based on experience of having faced many cases.
Testing for Pareto Distribution fitting
The
first step is to plot the data on a LogLog scale to verify visually that the
curve follows the nice smooth parabolic pattern. This is a very simple plotting
exercise which can be done using either LogLog scale paper or, more easily
nowadays, by setting both the X and Y axes of a computer plot to a logarithmic
scale.
Data
generally come already grouped by size classes. The size class boundaries are
arbitrary and rarely provide equal intervals. One must therefore use the
fundamental method of histogram plotting, that is, normalized plotting for
statistical distributions.
* The height of
the bar is calculated by dividing the total quantity in the size class by the
width of that size class.
The
plotted dot should take into account the fact that the average size in each size
class is skewed towards the low end. Whenever possible, use the actual average
size. When this actual average is not available, empirical analysis has shown
that one can use the geometric average of the boundaries as a fair
approximation.
* The center of
the size class is not the arithmetic mean of the extremes but the geometric
mean, i.e. the square root of the product of the lower bound by the higher bound.
[Figure: LogLog plot of the number of enterprises belonging to size class X
(Y axis, log scale) against the number of employees (X axis, log scale), with
the minimum size X0 marked on the X axis.]
The
visual test will immediately confirm or refute the hypothesis that the
distribution is indeed a Pareto (Lognormal) distribution. The slightest glitch
on the curve[3] indicates that the data
do not follow a Pareto distribution. We have to remember that we are looking
at a LogLog scale and that small variations on the graph represent ratios and
not absolute differences.
For
example, a point that sits a factor of two away from the curve means that we
have either twice or half the quantity given by a theoretical Pareto
distribution.
Such
a difference may be genuine, i.e. in the distribution that we observe, there is
a specific condition at that precise point of the curve. We should look
for such a possibility.
More
often, we have an artifact in our methodology that causes the distribution to
look "strange". This could be due to a sampling bias (undetected, of
course), to an error in the extrapolation approach, or to any other calculation
error.
Further
testing methods can be employed to confirm our eye test with formulas. A chi
square test could be an appropriate one. I will not expand upon the testing
methods here.
Step by step Pareto Distribution Plotting
Let's
take for example the distribution of establishments by employee size classes in
the USA in 1970. (Source: County Business Patterns, US Department of Commerce.)
We
will calculate successively the width of each size class, the average value of
each employee size class and the height of each corresponding histogram bar.
Size class Width
For
each size class, just subtract the lower boundary value from the higher
boundary value and add 1, since both boundaries belong to the class (remember
those nasty interval problems?).
For example: (3 + 1) - 1 = 3
(7 + 1) - 4 = 4
Low      High     Size class
bound    bound    width
1        3        3
4        7        4
8        19       12
20       49       30
50       99       50
100      249      150
250      499      250
500      999      500
1000     1499     500
1500     2499     1000
2500     4999     2500
5000     10000    5001
Average size by size class
For each class, just multiply the lower
size boundary by the higher size boundary and take the square root. Rounding
to one decimal place is in general sufficient.
For example: 1 x 3 = 3, and the square root of 3 is 1.732
Low      High     Geometric
bound    bound    average
1        3        1.7
4        7        5.3
8        19       12.3
20       49       31.3
50       99       70.4
100      249      157.8
250      499      353.2
500      999      706.8
1000     1499     1224.3
1500     2499     1936.1
2500     4999     3535.2
5000     10000    7071.1
Heights of each size class histogram bar
For each class, just divide the number of
units observed by the width of the size class.
For example: 1762340 : 3 = 587446.7
Low      High     Size class   Observed     Histogram bar
bound    bound    width        quantities   heights
1        3        3            1762340      587446.6667
4        7        4            723019       180754.75
8        19       12           593038       49419.83333
20       49       30           272635       9087.833333
50       99       50           90103        1802.06
100      249      150          51566        343.7733333
250      499      250          16597        66.388
500      999      500          7233         14.466
1000     1499     500          2077         4.154
1500     2499     1000         1250         1.25
2500     4999     2500         743          0.2972
5000     10000    5001         329          0.06578684263
Plotting:
We
can now plot the values of the histogram heights versus the geometric average
size class on LogLog paper or use a plot program on a computer selecting the
LOGLOG option for both the X and the Y axis.
X axis       Y axis
Geometric    Histogram bar
averages     heights
1.7          587446.66
5.3          180754.75
12.3         49419.83
31.3         9087.83
70.4         1802.06
157.8        343.773
353.2        66.388
706.8        14.466
1224.3       4.154
1936.1       1.25
3535.2       0.2972
7071.1       0.06578
Second example:
Network
distribution according to the number of attached endpoints, from estimates
coming from an early analysis of a survey conducted in Spain.
Networks size classes   Geometric   Size class   Number of   Histogram bar
Low       High          average     width        networks    heights
2         4             2.8         3            22322       7440.6
5         14            8.4         10           18356       1835.6
15        49            27.1        35           8484        242.4
50        149           86.3        100          2708        27.08
150       499           273.6       350          954         2.725
500       1499          865.7       1000         484         0.484
1500      4999          2738.3      3500         32          0.00914
5000      14999         8660        10000        10          0.001
15000     49999         27385.9     35000        2           0.0000571
The irregular aspect of this curve,
contrary to the smooth shape expected from a Pareto / LogNormal distribution,
should attract our attention and lead us to further analysis of the "accident"
on this curve. It could be a genuine effect linked to a specific situation
(regulatory or technical threshold). In most cases, however, it is likely to be
due to some bias in our sampling technique (as was the case here) or to an error
of calculation.
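One possible way to locate such an "accident" numerically, as a complement to the eye test, is to look at the slope between successive points in the LogLog space and to see where it changes most abruptly. The sketch below is an illustration and not a method described in this paper. It uses the Spanish network figures of the table above, and the function names are invented for the example.

```python
import math

# (low bound, high bound, number of networks) from the Spanish survey table.
NETWORKS = [
    (2, 4, 22322), (5, 14, 18356), (15, 49, 8484), (50, 149, 2708),
    (150, 499, 954), (500, 1499, 484), (1500, 4999, 32),
    (5000, 14999, 10), (15000, 49999, 2),
]

def loglog_points(classes):
    """Log10 of (geometric average, histogram bar height) for each class."""
    points = []
    for low, high, count in classes:
        average = math.sqrt(low * high)
        height = count / (high - low + 1)
        points.append((math.log10(average), math.log10(height)))
    return points

def local_slopes(points):
    """Slope of the segment joining each pair of successive points."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

slopes = local_slopes(loglog_points(NETWORKS))
# Change of slope at each interior point; a smooth parabola gives small,
# regular changes, a glitch gives one large change.
changes = [abs(b - a) for a, b in zip(slopes, slopes[1:])]
worst = changes.index(max(changes))
low, high, _ = NETWORKS[worst + 1]

for slope in slopes:
    print(f"{slope:8.3f}")
print(f"largest change of slope at size class {low}-{high}")
```

The size class reported is the point around which the curve bends most sharply. Whether the problem lies in that class or in its neighbour still has to be judged by the analyst.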
Further uses of Pareto distributions:
So
far we have only used the Pareto distribution to test the shape of a
distribution that our knowledge of the environment would make us assume to be of
"Pareto type".
Interpolation to new size classes:
We
can use the stability properties of these distributions to infer information not
currently available. For example, we may want to compare two distributions for
which the size class boundaries don't match. We can reverse the previous flow of
calculations in order to get, from the shape of a curve, the value of the
quantities corresponding to a hypothetical new size class.
Before
doing that reverse calculation we first have to calculate the coefficients of
the equation of the curve. For that we use a regression analysis program, in
this case a second degree regression. This program must be applied to the Log of
the variables, as it is only in a LogLog space that the curve fits a second
degree polynomial equation.
REG2DEG (Log Num. of Entities) VS (Log of Aver. Size) ==> COEF.
(REG2DEG stands for the second degree regression program: use the one
available in your statistical package.)
COEF. are the A, B and C of the equation:
Y = A X² + B X + C
Where
X = Log of the average size of the size classes
Y = Log of the number of entities (histogram bar heights)
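For readers without a statistical package, the second degree regression can be sketched in plain Python. This is a minimal illustration and not the REG2DEG program itself: it solves the least-squares normal equations by Cramer's rule, and the name `fit_second_degree` is invented for the example. The data are the rounded values of the plotting table of the first example.

```python
import math

def fit_second_degree(xs, ys):
    """Least-squares fit of Y = A*X**2 + B*X + C; returns (A, B, C)."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    # Normal equations of the least-squares problem, unknowns (A, B, C).
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(matrix)
    coefficients = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coefficients.append(det3(replaced) / d)  # Cramer's rule
    return tuple(coefficients)

# (geometric average, histogram bar height) pairs, US establishments 1970.
POINTS = [
    (1.7, 587446.66), (5.3, 180754.75), (12.3, 49419.83), (31.3, 9087.83),
    (70.4, 1802.06), (157.8, 343.773), (353.2, 66.388), (706.8, 14.466),
    (1224.3, 4.154), (1936.1, 1.25), (3535.2, 0.2972), (7071.1, 0.06578),
]

# The regression is applied to the logarithms, not to the raw values.
log_x = [math.log10(x) for x, _ in POINTS]
log_y = [math.log10(h) for _, h in POINTS]
A, B, C = fit_second_degree(log_x, log_y)
print(f"A = {A:.4f}  B = {B:.4f}  C = {C:.4f}")
```

A negative A corresponds to the downward-opening parabola seen on the LogLog plot.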
Once we have the coefficients, we can use
them in the second degree equation to calculate the Log of the theoretical
histogram bar heights corresponding to the Log of the
geometric averages of the new size classes.
POLYNOM (Log of new Aver. Size) VS COEF.
(POLYNOM stands for the program that calculates new values using the previous
coefficients A, B and C in the equation, with the new average values X'
corresponding to the new size classes.)
Y' = A X'² + B X' + C
Where
X' = Log of average of new size classes
Y' = Log of heights of histogram bars for new size classes
The real
value of the histogram bar heights can then be obtained by raising the
logarithm base, 10 in general, to the power calculated above.
HH = 10 to the power Y'
(new absolute histogram heights)
At this point
remember to multiply the theoretical curve value by the new size class width to
obtain the absolute value.
NN = HH x new size class width
(new count of entities in new size classes[4])
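The reverse flow of calculations can be sketched as follows. This is an illustration only: the coefficient values A, B and C are hypothetical numbers chosen to make the example run, not the result of a regression on real data, and the function names are invented. The last step applies the adjustment mentioned in note [4], bringing the calculated values to a given total.

```python
import math

def predicted_count(low, high, a, b, c):
    """Theoretical number of entities in the size class low..high."""
    x_new = math.log10(math.sqrt(low * high))   # X' = Log of geometric average
    y_new = a * x_new ** 2 + b * x_new + c      # Y' = Log of bar height
    height = 10 ** y_new                        # HH = 10 to the power Y'
    width = high - low + 1                      # new size class width
    return height * width                       # NN = HH x width

def rescale_to_total(counts, total):
    """Apply one common ratio so that the counts add up to the given total."""
    factor = total / sum(counts)
    return [n * factor for n in counts]

# Hypothetical coefficients, for illustration only.
A, B, C = -0.2, -1.0, 6.0

# New size classes whose boundaries differ from the original ones.
NEW_CLASSES = [(1, 9), (10, 99), (100, 999)]

counts = [predicted_count(low, high, A, B, C) for low, high in NEW_CLASSES]
scaled = rescale_to_total(counts, 1_000_000)

for (low, high), n in zip(NEW_CLASSES, scaled):
    print(f"{low:>4} - {high:<4} {n:12.0f}")
```

In practice A, B and C would come from the second degree regression on the original size classes, and the total from the original population.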
Extrapolation to new environments:
The
Pareto distribution is extremely frequent and shows a high degree of robustness.
It can be used to predict with rather high confidence the evolution of some
populations. The change in the slope of the curves is very stable and its
variation, if any, is very slow and monotonic.
If
we have little information about one country, we may use the distribution
available for the same population in another one that has better statistics, and
infer, with relatively good confidence, a size distribution for the country for
which we didn't have it.
Conclusion
In
summary, Pareto distribution analysis is a powerful tool in so far as it
allows the analyst to test the general validity of an observed distribution and
feel confident about the observations made. When one finds a variation in the
curve, it is an important piece of information which points us towards an area
to investigate: either a genuine reason for the odd shape or a
cause of error that had been overlooked in the study.
Paul Trehin
[1] PABX: Private Branch Exchange, a category of telephone line switching
systems.
[2] At the time Pareto used a cumulative distribution curve. Empirical research
has shown that the characteristics of the distribution remain the same in
non cumulative distributions. The rest of this paper will concentrate on
that last situation.
[3] See example 2 for an illustration of this.
[4] It may be necessary to ratio the calculated theoretical values in order to
arrive at the exact same total as the original population.