PARETO DISTRIBUTIONS

 P. Trehin

trehinp@aol.com

 Abstract :

This paper is a short presentation of a powerful method for analysing the statistical distribution of specific populations according to a pertinent size measurement of the individual entities that compose them. It is based upon my experience in the forecasting department of a large international company, building models of industrial market structures.

This type of distribution is remarkably stable over time and across geographical areas. It is known as the "Pareto distribution", after Vilfredo Pareto, the Italian economist and sociologist who taught in Lausanne, Switzerland. He discovered that, above a certain size, the cumulative income distribution forms an almost straight line when plotted on Log-Log scales. The theory has been refined since, but the method remains valid.

A Log-Log scale is commonly used in the analysis of distributions relating the number of people to their income level, but it is also used for analysing the number of networks, enterprises, PABXs[1], etc. in relation to their respective sizes (number of network end points, number of employees, number of telephone extensions, etc.). These distributions generally follow what is called a "Pareto distribution".

We will propose a variation to the classical Pareto distribution analysis technique which gives more precise results over a larger size domain than the pure Pareto method.

Background

Pareto discovered the distribution that now bears his name while studying family income distributions in Switzerland. It was common sense knowledge that there are more families with a low income than families with a large income. What was less common knowledge was that these distributions followed a very smooth pattern.

Pareto was analyzing income distributions among families[2]. Given the broad range of incomes on the one hand and the even broader range of numbers of families within the various income classes (hundreds of thousands in the low income classes, a handful in the very high income classes), Pareto decided to use Log-Log scale graph paper in order to be able to represent them all on a single chart.

He then observed that, not only was the common sense knowledge confirmed that there are more small income families than large income families, but in addition the distribution followed a straight line on the Log-Log paper. Further empirical studies, for other time periods and for family income distributions in other countries, came to the striking result that they all followed the same pattern.

Log of Number of families

with Income X

Y  A

 1000 !

      !

      !  *    *

 100  !           *

      !              *

      !              ' *

 10   !              '   *

      !              '     *

      !              '       *

 1    !              '         *

      !              '           *

      !              '             *

 .1   !              '               *

      !              '                 *

      !              '                   *

 .01  !              '                     *

      !              '                       *

      !              '                         *

 .001 !              '                           *

      !              '                             *

      !              '                               *

 .0001!              '                                 *

      ------------------------------------------------------> X

                    X0                               Log of Income

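The straight line part of the graph lends itself to easy mathematical calculations.

              Y = A X  +  B

Where X is the logarithm of the income level x

Where Y is the logarithm of the number of people having an income x

Returning to the arithmetic scale, with base 10 logarithms, the equation is

              y = b x^A     with  b = 10^B

Where y is the number of people having an income x. The slope A is negative, since the number of people falls as income rises.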

This allows interpolation of missing values, estimation of values for different size class boundaries, etc. The Pareto distribution, however, fails to give a good fit at the lower end of the size spectrum, where a straight line does not fit the observed data. In fact Pareto himself stated that his observation was valid only above a minimum size X0.
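As a minimal sketch of the straight line calculation, the Python below fits Y = A X + B by least squares to points in Log-Log space and converts the result back to the arithmetic scale. The function names and the three sample points are illustrative assumptions, not data from Pareto's studies; the points are constructed to lie exactly on y = 1000 / x².

```python
import math

def fit_loglog_line(sizes, densities):
    """Least-squares fit of Y = A*X + B, with X = log10(size), Y = log10(density)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(d) for d in densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

def power_law(x, a, b):
    """Back on the arithmetic scale: y = 10**B * x**A."""
    return 10 ** b * x ** a

# Hypothetical points lying exactly on y = 1000 * x**-2.
sizes = [10, 100, 1000]
densities = [1000.0 * s ** -2 for s in sizes]

a, b = fit_loglog_line(sizes, densities)
print(f"A = {a:.3f}, B = {b:.3f}")
print(f"interpolated value at x = 50: {power_law(50, a, b):.4f}")
```

Only sizes above X0 should be passed to such a fit, since the straight line does not hold below that threshold.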

Empirical analysis led statisticians to use a more sophisticated distribution curve instead of the Pareto distribution: the "log-normal" distribution. Once the data are transformed to a log-log scale, the curve fitting technique uses a parabolic equation instead of a linear equation.

           Y = A X²  +  B X + C

Where X is the logarithm of the income level x

Where Y is the logarithm of the number of people having an income x

Empirical analysis shows that this second degree regression curve gives very good results and that above a certain size it is statistically indistinguishable from the original Pareto distribution. A simple calculation shows that, after returning to an arithmetic scale, the second degree equation becomes that of the Log-Normal distribution. In the rest of this paper I'll use the term Pareto distribution as a generic name since it is commonly known as such, keeping in mind that the true distribution name is Log-Normal.
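The link between the parabola and the Log-Normal distribution can be sketched as follows. This derivation is an addition for illustration: it uses natural logarithms and the standard Log-Normal density with parameters μ and σ, symbols that do not appear elsewhere in this paper.

```latex
% Log-normal density of a size x > 0, with parameters \mu and \sigma
f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)

% Take logarithms and write X = \ln x, Y = \ln f(x)
Y = -\frac{1}{2\sigma^2}\,X^2
    + \left(\frac{\mu}{\sigma^2} - 1\right) X
    - \frac{\mu^2}{2\sigma^2} - \ln\!\left(\sigma\sqrt{2\pi}\right)

% Identify with Y = A X^2 + B X + C
A = -\frac{1}{2\sigma^2}, \qquad
B = \frac{\mu}{\sigma^2} - 1, \qquad
C = -\frac{\mu^2}{2\sigma^2} - \ln\!\left(\sigma\sqrt{2\pi}\right)
```

Working in base 10 logarithms rescales the coefficients but keeps the second degree form, and multiplying the density by the size of the population only adds a constant to C. The identification requires A to be negative, that is, a curve that bends downwards on the Log-Log chart.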

Domains of applicability of Pareto Distributions

Later analyses of statistical distributions have shown that Pareto distributions are indeed very common in various fields:

·      Enterprises distribution by employee size

·      Enterprises distribution by Yearly Revenue size

·      Establishments distribution by employee size

·      PABX distribution by Extension size

·      Computers distribution by Price size

·      Computers distribution by Memory size

·      Computers distribution by Installed MIPS size

·      Computers distribution by Installed Terminals

·      etc...

The remarkable stability of all these distributions, through time and across geographical areas, makes them a powerful instrument of analysis and forecasting.

 


When should one look for a Pareto distribution?

Pareto distributions all have in common the following characteristics:

They count a number of entities against a measure of their size, which can be expressed in a variety of units.

·       Number of enterprises belonging to a certain employee size class

·       Number of people having a certain level of assets

·       Number of networks having so many end-points

The size has an open ended upper limit

·       Number of employees, at least in theory, can extend indefinitely

·       MIPS in a computer, likewise

·       Number of end points in a network too

The size measurement is homogeneous

·       The unit that measures size is the same across the population

·       Computer size is measured in MIPS throughout the analysis

·       Number of end points is the size measure of a network all along.

Populations analyzed are large

There is obviously a certain degree of insight involved in this decision to use a Pareto distribution, based on experience of having faced many cases.

Testing for Pareto Distribution fitting

The first step is to plot the data on a Log-Log scale to verify visually that the curve follows the smooth parabolic pattern. This is a very simple plotting exercise which can be done either on Log-Log scale paper or, more easily nowadays, by setting both the X and Y axes of a computer chart to a logarithmic scale.

Data generally come already grouped by size classes. The size class boundaries are arbitrary and rarely provide equal intervals. One must therefore use the fundamental method of histogram plotting, that is, normalized plotting for statistical distributions.

* The height of the bar is calculated by dividing the total quantity in the size class by the width of that size class.

The plotted dot should take into account the fact that the average size in each size class is skewed towards the low end. Whenever possible, use the actual average size. When the actual average is not available, empirical analysis has shown that one can use the geometric average of the boundaries as a fair approximation.

* The center of the size class is not the arithmetic mean of the extremes but the geometric mean, i.e. the square root of the product of the lower bound and the higher bound.

Log Number of Enterprises

Belonging to Size class X

   Y  A

      !

      !-------*-------

      !  *        *  '

      !              *

      !              ' *

      !              '---*----

      !              '     * '

      !              '       *

      !              '       ' *

      !              '       '   *

      !              '       '     *

      !              '       '-------*--------

      !              '       '         *     '

      !              '       '           *   '

      !              '       '             * '

      !              '       '               *

      !              '       '               ' *

      !              '       '               '   *

      !              '       '               '-----*------

      !              '       '               '       *   '

      !              '       '               '         * '

      ------------------------------------------------------> X

                    X0                       Log of Number of Employees

The visual test will immediately confirm or refute the hypothesis that the distribution is indeed a Pareto (Log-normal) distribution. The slightest glitch on the curve[3] indicates that the data does not follow a Pareto distribution. We have to remember that we are looking at a Log-Log scale and that small variations on the graph represent ratios and not absolute differences.

For example, a point that lies a factor of two away from the curve, a gap of only about 0.3 of a decade on a base 10 log axis, means that we have either twice or half the quantity predicted by a theoretical Pareto distribution.

Such a difference may be genuine, i.e. in the distribution that we observe, a specific condition applies at that precise point of the curve. We should look for such a possibility.

More often, we have an artifact in our methodology that causes the distribution to look "strange". This could be due to a sampling bias (undetected, of course), to an error in the extrapolation approach, or to any other calculation error.

Further testing methods can be used to confirm the eye test with formulas. A chi-square test could be appropriate. I will not expand upon the testing methods here.

Step by step Pareto Distribution Plotting

Let's take for example the distribution of establishments by employee size classes in the USA in 1970 (source: County Business Patterns, U.S. Department of Commerce).

We will calculate successively the width of each size class, the average value of each employee size class and the heights of each corresponding histogram bar.

Size class Width

For each size class, subtract the lower boundary value from the higher boundary value and add 1, since both boundaries belong to the class.

       For example: (3+1) - 1 = 3

                    (7+1) - 4 = 4

             Low     high   Size class

             bound   bound  width

                 1       3      3

                 4       7      4

                 8      19     12

                20      49     30

                50      99     50

               100     249    150

               250     499    250

               500     999    500

              1000    1499    500

              1500    2499   1000

              2500    4999   2500

              5000   10000   5001
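The width calculation can be sketched in a few lines of Python. The function name class_width is an illustrative assumption; the boundaries are those of the table above.

```python
def class_width(low, high):
    """Number of integer sizes in the class, both boundaries included."""
    return high - low + 1

# (low bound, high bound) of each employee size class, from the table above.
bounds = [(1, 3), (4, 7), (8, 19), (20, 49), (50, 99), (100, 249),
          (250, 499), (500, 999), (1000, 1499), (1500, 2499),
          (2500, 4999), (5000, 10000)]

widths = [class_width(low, high) for low, high in bounds]

for (low, high), width in zip(bounds, widths):
    print(f"{low:6d} {high:6d} {width:6d}")
```

A quick consistency check is that the widths of contiguous classes add up to the full range covered, here 10000 sizes from 1 to 10000.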

Average size by size class

For each class, just multiply the lower size boundary by the higher size boundary and take the square root. Rounding to one decimal place is in general sufficient.

For example: 1 x 3 = 3

             square root of 3 = 1.732

              Low     high   Geometric

             bound   bound  average

                 1      3      1.7

                 4      7      5.3

                 8     19     12.3

                20     49     31.3

                50     99     70.4

               100    249    157.8

               250    499    353.2

               500    999    706.8

              1000   1499   1224.3

              1500   2499   1936.1

              2500   4999   3535.2

              5000  10000   7071.1
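The same column can be reproduced with a short Python sketch. The function name geometric_average is an illustrative assumption; the boundaries are those of the table above.

```python
import math

def geometric_average(low, high):
    """Square root of the product of the two class boundaries."""
    return math.sqrt(low * high)

# (low bound, high bound) of each employee size class, from the table above.
bounds = [(1, 3), (4, 7), (8, 19), (20, 49), (50, 99), (100, 249),
          (250, 499), (500, 999), (1000, 1499), (1500, 2499),
          (2500, 4999), (5000, 10000)]

# One decimal place is in general sufficient.
averages = [round(geometric_average(low, high), 1) for low, high in bounds]

for (low, high), average in zip(bounds, averages):
    print(f"{low:6d} {high:6d} {average:9.1f}")
```

The geometric average always falls below the arithmetic midpoint of the class, which reflects the skew of each class towards its low end.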

Heights of each size class histogram bar

For each class, just divide the number of units observed by the width of the size class.

             For example:    1762340 : 3  = 587446.7

 

        Low     high   Size class   observed      Histogram Bar

        bound   bound  width        quantities    heights

            1       3      3         1762340       587446.6667

            4       7      4          723019       180754.75

            8      19     12          593038        49419.83333

           20      49     30          272635         9087.833333

           50      99     50           90103         1802.06

          100     249    150           51566          343.7733333

          250     499    250           16597           66.388

          500     999    500            7233           14.466

         1000    1499    500            2077            4.154

         1500    2499   1000            1250            1.25

         2500    4999   2500             743            0.2972

         5000   10000   5001             329            0.06578684263
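As a sketch, the histogram bar heights of the table above can be recomputed as follows. The variable names are illustrative assumptions; the boundaries and observed quantities are those of the table.

```python
# (low bound, high bound, observed quantity) for each size class, from the table above.
classes = [
    (1, 3, 1762340), (4, 7, 723019), (8, 19, 593038), (20, 49, 272635),
    (50, 99, 90103), (100, 249, 51566), (250, 499, 16597), (500, 999, 7233),
    (1000, 1499, 2077), (1500, 2499, 1250), (2500, 4999, 743),
    (5000, 10000, 329),
]

# Height of each bar: observed quantity divided by the width of the class.
heights = [count / (high - low + 1) for low, high, count in classes]

for (low, high, count), height in zip(classes, heights):
    print(f"{low:6d} {high:6d} {count:9d} {height:16.6f}")
```

Dividing by the width is what makes classes of unequal width comparable: each height is a number of establishments per unit of size.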

Plotting:

We can now plot the values of the histogram heights versus the geometric average size class on Log-Log paper or use a plot program on a computer selecting the LOG-LOG option for both the X and the Y axis.

                    X axis        Y axis

                  Geometric     Histogram Bar

                  averages      heights

                     1.7        587446.66

                     5.3        180754.75

                    12.3         49419.83

                    31.3          9087.83

                    70.4          1802.06

                   157.8           343.773

                   353.2            66.388

                   706.8            14.466

                  1224.3             4.154

                  1936.1             1.25

                  3535.2             0.2972

                  7071.1             0.06578
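When no Log-Log chart option is at hand, the same picture is obtained by taking base 10 logarithms of both columns and plotting them on ordinary linear axes. The sketch below only prepares those coordinates; the variable names are illustrative assumptions and the values are those of the table above.

```python
import math

# X axis and Y axis values from the plotting table above.
geometric_averages = [1.7, 5.3, 12.3, 31.3, 70.4, 157.8,
                      353.2, 706.8, 1224.3, 1936.1, 3535.2, 7071.1]
bar_heights = [587446.66, 180754.75, 49419.83, 9087.83, 1802.06, 343.773,
               66.388, 14.466, 4.154, 1.25, 0.2972, 0.06578]

# Coordinates of each point in Log-Log space.
log_points = [(math.log10(x), math.log10(y))
              for x, y in zip(geometric_averages, bar_heights)]

for log_x, log_y in log_points:
    print(f"{log_x:8.4f} {log_y:8.4f}")
```

These logarithms are also the values to which a regression has to be applied when fitting a curve in Log-Log space.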

Second example:

Network distribution according to the number of attached end-points, from estimates coming from an early analysis of a survey conducted in Spain.

     NETWORKS     Geometric  Size class    NUMBER OF     Histogram Bar

    SIZE CLASSES  Average    width         NETWORKS      heights

       2       4       2.8         3         22322        7440.6

       5      14       8.4        10         18356        1835.6

      15      49      27.1        35          8484         242.4

      50     149      86.3       100          2708          27.08

     150     499     273.6       350           954           2.725

     500    1499     865.7      1000           484           0.484

    1500    4999    2738.3      3500            32           0.00914

    5000   14999    8660       10000            10           0.001

   15000   49999   27385.9     35000             2           0.0000571

The irregular shape of this curve, contrary to the smooth pattern expected from a Pareto / Log-Normal distribution, should attract our attention and lead us to further analysis of the "accident" on this curve. It could be a genuine effect linked to a specific situation (a regulatory or technical threshold). In most cases, however, it is likely to be due to some bias in our sampling technique (as was the case here) or to an error of calculation.
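One way to locate such an "accident" numerically is to look at the slope of each segment between consecutive points in Log-Log space. This check is a heuristic added here for illustration, not a method described in this paper: it assumes that on a smooth, downward-bending curve each segment is steeper than the previous one, and flags the segments that break this rule. The data are those of the table above.

```python
import math

# (low bound, high bound, number of networks) from the table above.
classes = [
    (2, 4, 22322), (5, 14, 18356), (15, 49, 8484),
    (50, 149, 2708), (150, 499, 954), (500, 1499, 484),
    (1500, 4999, 32), (5000, 14999, 10), (15000, 49999, 2),
]

# Log-Log coordinates: geometric average size versus histogram bar height.
points = [
    (math.log10(math.sqrt(low * high)), math.log10(count / (high - low + 1)))
    for low, high, count in classes
]

# Slope of each segment joining two consecutive points.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]

# Segment i joins class i and class i + 1. Flag every segment that is
# flatter than the one before it.
flagged = [i for i in range(1, len(slopes)) if slopes[i] > slopes[i - 1]]

for i in flagged:
    low, high, _ = classes[i + 1]
    print(f"segment ending at class {low}-{high}: "
          f"slope {slopes[i]:.2f} after {slopes[i - 1]:.2f}")
```

A flagged segment only says where to look. Whether the cause is genuine or an artifact still has to be established by examining the data.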

Further uses of Pareto distributions:

So far we have only used the Pareto distribution to test the shape of a distribution that our knowledge of the environment would make us assume to be of "Pareto type".

Interpolation to new size classes:

We can use the stability properties of these distributions to infer information not currently available. For example, we may want to compare two distributions for which the size class boundaries don't match. We can reverse the previous flow of calculations in order to get, from the shape of a curve, the quantities corresponding to a hypothetical new size class.

Before doing that reverse calculation we first have to calculate the coefficients of the equation of the curve. For that we use a regression analysis program, in this case a second degree regression. This program must be applied to the Log of the variables, as it is only in Log-Log space that the curve fits a second degree polynomial equation.

        REG2DEG (Log of Histogram Bar Heights) VS (Log of Aver. Size) ==> COEF.

   REG2DEG is a placeholder program name: use the second degree regression available in your statistical package.

   COEF. are the A, B and C of the equation:

                Y = A X²  +  B X + C

   Where   X = Log of the average size of each size class

           Y = Log of the histogram bar height of each size class
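As a sketch of the second degree regression step, the Python below stands in for the regression program of a statistical package. The function name fit_quadratic and the small normal-equation solver are illustrative assumptions; a statistical package or numerical library would normally do this work. The data are the geometric averages and histogram bar heights of the first example.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    # Augmented 3x4 matrix of the normal equations for the columns x**2, x, 1.
    rows = [[x * x, x, 1.0] for x in xs]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

# Geometric averages and histogram bar heights of the first example.
sizes = [1.7, 5.3, 12.3, 31.3, 70.4, 157.8,
         353.2, 706.8, 1224.3, 1936.1, 3535.2, 7071.1]
heights = [587446.66, 180754.75, 49419.83, 9087.83, 1802.06, 343.773,
           66.388, 14.466, 4.154, 1.25, 0.2972, 0.06578]

# The regression is applied to the logarithms, not to the raw values.
X = [math.log10(s) for s in sizes]
Y = [math.log10(h) for h in heights]

A, B, C = fit_quadratic(X, Y)
print(f"A = {A:.4f}, B = {B:.4f}, C = {C:.4f}")
```

A negative A corresponds to the downward-bending parabola expected in Log-Log space.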

Once we have the coefficients we can use them in the second degree equation to calculate the Log of the theoretical histogram bar heights values, corresponding to the log of new values of geometric averages of the new size classes.

        POLYNOM (Log of new Aver. Size) VS COEF.

   POLYNOM is a placeholder name for the program that calculates new values using the previous coefficients A, B and C, in the equation with the new average values X' that correspond to the new size classes:

              Y' = A X'²  +  B X' + C

   Where   X' = Log of average of new size classes

           Y' = Log of heights of histogram bars for new size classes

The real value of the histogram bar heights can then be obtained by raising the logarithm base, 10 in general, to the power of the value calculated above.

             HH = 10 to the power Y'

   Where   HH = New absolute histogram heights

At this point remember to multiply the theoretical curve value by the new size class width to obtain the absolute value.

             NN  = HH x New Size classes width

   Where   NN = New count of entities in new size classes[4]
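The reverse calculation can be sketched as follows. The function names, the coefficients and the new size classes are hypothetical values chosen for illustration; real coefficients would come from the second degree regression. The last function applies the proportional adjustment mentioned in note [4].

```python
import math

def predict_counts(coef, new_classes):
    """Theoretical count of entities in each new (low, high) size class."""
    a, b, c = coef
    counts = []
    for low, high in new_classes:
        x_new = math.log10(math.sqrt(low * high))  # X': log of geometric average
        y_new = a * x_new ** 2 + b * x_new + c     # Y': log of bar height
        height = 10 ** y_new                       # HH: bar height
        width = high - low + 1                     # new size class width
        counts.append(height * width)              # NN: count in the class
    return counts

def rescale_to_total(counts, total):
    """Scale the counts proportionally so that they add up to a known total."""
    factor = total / sum(counts)
    return [count * factor for count in counts]

# Hypothetical coefficients A, B, C and hypothetical new size classes.
coef = (-0.15, -1.2, 6.0)
new_classes = [(1, 9), (10, 99), (100, 999)]

counts = predict_counts(coef, new_classes)
adjusted = rescale_to_total(counts, total=3_000_000)

for (low, high), count in zip(new_classes, adjusted):
    print(f"{low:5d} {high:5d} {count:14.1f}")
```

Evaluating the curve only at the geometric average of each new class is an approximation. It follows the plotting convention used earlier in this paper.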

Extrapolation to new environments:

The Pareto distribution is extremely frequent and shows a high degree of robustness. It can be used to predict, with rather high confidence, the evolution of some populations. The slope of the curves is very stable and its variation, if any, is very slow and monotonic.

If we have little information about one country, we may use the distribution available for the same population in another country that has better statistics, and infer, with relatively good confidence, a size distribution for the country that lacks it.

Conclusion

In summary, Pareto distribution analysis is a powerful tool in so far as it allows the analyst to test the general validity of an observed distribution and feel confident about the observations made. A variation in the curve is an important piece of information: it points us towards an area for investigation, either of a genuine reason for the odd shape or of a cause of error that had been overlooked in the study.

    Paul Trehin



[1] PABX: Private Automatic Branch Exchange, a category of telephone line switching systems.

[2] At the time Pareto used a cumulative distribution curve. Empirical research has shown that the characteristics of the distribution remained the same in non cumulative distributions. The rest of this paper will concentrate on that last situation.

[3] See example 2 for an illustration of this

[4] It may be necessary to scale the calculated theoretical values proportionally in order to arrive at exactly the same total as the original population.