where rev rearranges the vector of median profits sorted from smallest to largest. Of course, we can replace the median function with mean or whatever is appropriate in the call to tapply. In our situation, mean is not a good choice, because the distributions of profits or sales are naturally skewed. Simple graphical tools for the inspection of distributions are introduced in the next section.
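The effect of swapping median for mean in tapply can be sketched on simulated data. The data frame `companies`, its columns, and the category names below are hypothetical stand-ins, not the data set used in the text; log-normal values are assumed only because they are right-skewed.

```r
# Hypothetical toy data standing in for the company data discussed in the text
set.seed(1)
companies <- data.frame(
  category = rep(c("Banking", "Retail", "Utilities"), each = 50),
  # log-normal draws are right-skewed, as profit figures typically are
  profits = c(rlnorm(50, meanlog = 0),
              rlnorm(50, meanlog = 0.5),
              rlnorm(50, meanlog = 1))
)

# Summary per category; na.rm = TRUE is passed on to median() and mean()
med_profits  <- tapply(companies$profits, companies$category, median, na.rm = TRUE)
mean_profits <- tapply(companies$profits, companies$category, mean, na.rm = TRUE)

# sort() orders from smallest to largest; rev() flips this to largest first
rev(sort(med_profits))

# For right-skewed data the mean tends to be pulled above the median
mean_profits - med_profits
```

Comparing the two summaries side by side is one quick way to see how strongly a few large values influence the mean.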

# 1.7.1 Simple Graphics

The degree of skewness of a distribution can be investigated by constructing histograms using the hist function. (More sophisticated alternatives such as smooth density estimates will be considered in Chapter ??.) For example, the code for producing Figure 1.1 first divides the plot region into two equally spaced rows (the layout function) and then plots the histogram of the raw market values in the upper part using the hist function. The lower part of the figure depicts the histogram of the log-transformed market values, whose distribution appears to be more symmetric.
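The two-row layout described above might look roughly as follows. This is a sketch on simulated values, not the code behind Figure 1.1: the vector `marketvalue` is generated here from a log-normal distribution purely as an assumption.

```r
# Simulated right-skewed values; a stand-in for the market values in the text
set.seed(2)
marketvalue <- rlnorm(2000, meanlog = 1.5, sdlog = 1.2)

# Split the plot region into two equally sized rows, one plot per row
layout(matrix(1:2, nrow = 2))

# hist() draws the plot and invisibly returns the binning it used
h_raw <- hist(marketvalue, main = "Raw values")
h_log <- hist(log(marketvalue), main = "Log-transformed values")

# Return to a single plot region
layout(1)
```

Keeping the objects returned by hist makes it possible to inspect the break points and counts afterwards, for example via `h_log$counts`.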

Bivariate relationships of two continuous variables are usually depicted as scatterplots. In R, regression relationships are specified by so-called model formulae which, in a simple bivariate case, may look like

R> fm <- marketvalue ~ sales

R> class(fm)

[1] "formula"

with the dependent variable on the left-hand side and the independent variable on the right-hand side. The tilde separates the left-hand and right-hand sides. Such a model formula can be passed to a model function (for example to the linear model function as explained in Chapter ??). The plot generic function implements a formula method as well. Because the distributions of both market value and sales are skewed, we choose to depict their logarithms. A raw scatterplot of 2000 data points (Figure 1.2) is rather uninformative due to areas with very high density. This problem can be avoided by choosing a transparent color for the dots (currently only possible with the PDF graphics device) as shown in Figure 1.3.
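A minimal sketch of the formula interface for plotting, again on simulated data: the data frame `companies` and the way its columns are generated are assumptions made for illustration, and the alpha level of 0.1 is an arbitrary choice.

```r
# Simulated skewed variables; hypothetical stand-ins for sales and market value
set.seed(3)
sales <- rlnorm(2000, meanlog = 1, sdlog = 1)
marketvalue <- sales * rlnorm(2000, meanlog = 0.5, sdlog = 0.8)
companies <- data.frame(sales, marketvalue)

# Transformations can be written directly inside the formula
fm_log <- log(marketvalue) ~ log(sales)

# Raw scatterplot: dense regions merge into a solid blob
plot(fm_log, data = companies, pch = ".")

# Semi-transparent filled dots; the alpha argument of rgb() sets the opacity
plot(fm_log, data = companies, pch = 16, col = rgb(0, 0, 0, alpha = 0.1))

# The same formula object can be handed to a model function such as lm()
fit <- lm(fm_log, data = companies)
coef(fit)
```

Storing the formula in an object means the plot and the model are guaranteed to use the same specification.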

If the independent variable is a factor, a boxplot representation is a natural choice. For four selected countries, the distributions of the logarithms of the market value may be visually compared in Figure 1.4. Here, the widths of the boxes are proportional to the square root of the number of companies for each country, and extremely large or small market values are depicted by single points.
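Box widths proportional to the square root of the group sizes are what the `varwidth` argument of boxplot provides. The sketch below uses simulated values; the country names and the group sizes in `n_companies` are hypothetical and are not taken from Figure 1.4.

```r
# Hypothetical countries and group sizes, chosen only for illustration
set.seed(4)
countries <- c("Germany", "India", "Turkey", "United Kingdom")
n_companies <- c(65, 25, 10, 140)

companies <- data.frame(
  country = factor(rep(countries, times = n_companies)),
  marketvalue = rlnorm(sum(n_companies), meanlog = 1.5, sdlog = 1)
)

# varwidth = TRUE scales each box width with the square root of its group size;
# points beyond the whiskers are drawn individually
bp <- boxplot(log(marketvalue) ~ country, data = companies,
              ylab = "log(marketvalue)", varwidth = TRUE)

# Number of observations behind each box
bp$n
```

The object returned by boxplot also holds the five-number summaries in `bp$stats`, which can be useful for checking what the figure displays.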

# 1.8 Organising an Analysis

Although it is possible to perform an analysis by typing all commands directly at the R prompt, it is much more comfortable to maintain a separate text file collecting all steps necessary to perform a certain data analysis task. Such