Sampling from a Discrete Distribution
A colleague of mine asked how he could sample from a discrete distribution. Imagine, he said, a six-sided die that is "loaded." That is, when thrown, the sides do not show with equal probabilities. Consider the discrete distribution in the following table. The first column contains the number showing on the face of the die, while the second column states the probability of observing that face when the thrown die comes to rest.
A Six-Sided Die

| x | p |
|---|---|
| 1 | 0.1 |
| 2 | 0.1 |
| 3 | 0.1 |
| 4 | 0.1 |
| 5 | 0.2 |
| 6 | 0.4 |
Thus, for example, the probability of throwing a 1 is 0.1, a 10% chance. The probability of throwing a 6 is 0.4, a 40% chance.
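Before sampling, it can help to store the table as two R vectors and confirm that the probabilities form a valid distribution. The following is a minimal sketch; the names x and p mirror the table columns, while is_valid and p_six are names chosen here for illustration only.

```r
# Faces of the die and their probabilities, taken from the table above.
x <- 1:6
p <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)

# A valid discrete distribution has non-negative probabilities that sum to 1.
# all.equal() compares with a small tolerance, which sidesteps floating-point
# rounding in the sum.
is_valid <- all(p >= 0) && isTRUE(all.equal(sum(p), 1))
print(is_valid)

# Look up the probability of a single face, here a 6.
p_six <- p[x == 6]
print(p_six)
```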
Drawing a Random Sample from the Discrete Distribution
We will now draw from the set (1, 2, 3, 4, 5, 6) with replacement. That is, we draw a number from the set, record the result, and then place the number back in the set so that it is available on the next draw.
We must also supply the numbers (0.1, 0.1, 0.1, 0.1, 0.2, 0.4), which give the probability of selecting the number in the corresponding position of the list (1, 2, 3, 4, 5, 6).
This task is easily accomplished in R. We will use the sample command, which has the following general syntax.
> sample(x, size, replace = FALSE, prob = NULL)
Here is a description of the arguments to the sample function:
- x - a vector containing the elements from which to sample
- size - the number of elements we wish to choose from the set x
- replace - if TRUE, each number drawn is returned to the set x before the next draw
- prob - a vector of probabilities, one for each entry in the vector x
Readers can obtain additional help on the sample command by entering the following command at the R prompt:
> ?sample
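For intuition about what sampling with unequal probabilities involves, here is a hedged sketch of one textbook approach, inverting the cumulative probabilities. It is not a description of how the sample command works internally. The helper draw_faces and the seed value are assumptions made for this illustration.

```r
x <- 1:6
p <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)

# Cumulative probabilities: 0.1 0.2 0.3 0.4 0.6 1.0 (up to rounding).
cum_p <- cumsum(p)

# Hypothetical helper: draw n faces by locating uniform random numbers
# among the cumulative cutoffs.
draw_faces <- function(n) {
  u <- runif(n)  # uniform random numbers between 0 and 1
  # findInterval() counts how many cutoffs each u has reached;
  # adding 1 gives the index of the face whose interval contains u.
  idx <- findInterval(u, cum_p) + 1
  x[pmin(idx, length(x))]  # guard against floating-point edge cases
}

set.seed(1)  # arbitrary seed, used only so the run can be repeated
draws <- draw_faces(1000)
print(table(draws) / length(draws))
```

A face with a wider slice of the interval from 0 to 1 (such as face 6, which owns the stretch from 0.6 to 1.0) is hit more often, which is all that "loading" the die amounts to.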
Let's begin by drawing 100 numbers from this discrete distribution. After you type the first of the following lines and press the Enter key, R starts a new line with a plus sign (+). This is R's continuation prompt, which the system displays whenever the command entered so far is incomplete; you do not type it yourself. Continue typing the remaining lines of code. On the last line, the final closing parenthesis completes the command, and R then executes it and stores the result in the variable v.
> v=sample(c(1, 2, 3, 4, 5, 6),
+ 100,
+ replace=TRUE,
+ prob=c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4))
Well, that was easy! And fast! The variable v now contains 100 random numbers drawn from the discrete distribution with the assigned probabilities. You can view these numbers by typing v at the R prompt. The result is shown below.
> v
  [1] 1 6 4 6 1 1 3 2 6 3 6 3 6 5 2 5 4 6 6 4 6 5 6 1 5 6 6 6 6 2 6 5 6
 [34] 5 6 1 2 6 6 1 2 5 6 5 4 5 5 3 6 1 5 6 5 3 4 6 6 6 5 4 4 5 6 3 6 3
 [67] 6 6 2 6 5 4 6 2 6 2 5 5 5 6 4 6 6 3 6 5 6 6 5 4 4 5 5 3 3 5 3 3 2
[100] 5
You can test the length of this vector with the length command.
> length(v)
[1] 100
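Because the draws are random, the numbers you obtain will almost certainly differ from those shown above. If you want a run that can be repeated exactly, one option is to set the random seed first. The sketch below assumes an arbitrary seed value of 2024, which is not part of the original activity; v1 and v2 are illustrative names.

```r
x <- c(1, 2, 3, 4, 5, 6)
p <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)

set.seed(2024)  # arbitrary seed value, chosen only for illustration
v1 <- sample(x, 100, replace = TRUE, prob = p)

set.seed(2024)  # resetting the same seed replays the same random draws
v2 <- sample(x, 100, replace = TRUE, prob = p)

# Both samples were generated from the same seed, so they match exactly.
print(identical(v1, v2))
```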
Using R's Table Command
Now that we have 100 numbers drawn from our discrete distribution, how do we go about visualizing the result? One idea would be to sort the numbers, then count the number of times each number occurs. This is most easily accomplished with R's table command.
> table(v)
v
 1  2  3  4  5  6
 7  9 12 11 24 37
Note that the number 1 was selected 7 times, the number 2 nine times, etc.
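One caveat worth knowing: table lists only the values that actually occur, so in a small sample a face that was never drawn simply disappears from the output. A possible workaround, sketched below with a made-up eight-throw sample named v_small, is to convert the sample to a factor with all six levels before tabulating.

```r
# A small hypothetical sample in which faces 1 and 4 never occur.
v_small <- c(6, 6, 5, 2, 6, 3, 5, 6)

# table() on the raw vector lists only the faces that were observed.
print(table(v_small))

# Converting to a factor with all six levels keeps the zero counts.
counts <- table(factor(v_small, levels = 1:6))
print(counts)
```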
Crafting a Barplot of the Count
A barplot is one of the best methods for visualizing the results of our sample. The following command is used to create the barplot shown in Figure 1.
> barplot(table(v))

Figure 1. A barplot of the sample drawn from the discrete distribution that models the "loaded" die.
We can add appropriate axis labels and a title with the following lines of code, which are used to produce the image in Figure 2.
> barplot(table(v),
+ xlab="Die Face Number",
+ ylab="Count",
+ main="Modeling a Loaded Die")

Figure 2. Adding a title and axis labels to the plot from Figure 1.
Crafting a Barplot of the Frequency
Let's create a barplot that shows the frequency of occurrence for each side of the "loaded die," that is, the proportion of throws on which that side appeared (often called the relative frequency). This number should approximate the actual probability of occurrence, an approximation that should improve with larger and larger sample sizes.
The command length(v) returned the length of the vector v, or more informally, the number of elements in our sample. Recall that we drew a sample of size 100.
> length(v)
[1] 100
We can turn the count of each die face in our table into a frequency of occurrence by dividing the counts in our table by the size of the sample. This is simple to do in R.
> table(v)/length(v)
v
   1    2    3    4    5    6
0.07 0.09 0.12 0.11 0.24 0.37
To explain this result, first recall the counts of each die face.
> table(v)
v
 1  2  3  4  5  6
 7  9 12 11 24 37
Because the size of the sample is 100 and the number 1 was drawn 7 times, its frequency is 7/100, or equivalently, 0.07. In like manner, the number 2 was drawn 9 times, so its frequency of occurrence is 9/100, or equivalently, 0.09. These results appear in the first of the two tables above.
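The arithmetic can be checked without the original random sample. The sketch below rebuilds a vector with the same counts as the table above (the name v_demo is an illustrative assumption) and also shows prop.table, a base R helper that performs the same division.

```r
# Rebuild a vector with the same counts as the table above:
# 7 ones, 9 twos, 12 threes, 11 fours, 24 fives, 37 sixes.
v_demo <- rep(1:6, times = c(7, 9, 12, 11, 24, 37))

freq_manual <- table(v_demo) / length(v_demo)  # counts divided by sample size
freq_prop <- prop.table(table(v_demo))         # same division via prop.table()

print(freq_manual)

# Check that the two approaches agree.
print(isTRUE(all.equal(as.numeric(freq_manual), as.numeric(freq_prop))))
```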
Let's create another barplot, only this time we will scale the vertical axis with the frequency of occurrence instead of the count, again an easy task in R. The following lines are used to create the barplot shown in Figure 3.
> barplot(table(v)/length(v),
+ xlab="Die Face Number",
+ ylab="Frequency",
+ main="Modeling a Loaded Die")

Figure 3. Displaying the frequency of occurrence of each die face.
Important Notes: Two observations about Figure 3:
- The height of each bar in the barplot in Figure 3 now represents the frequency of occurrence of the number in our sample of 100 numbers drawn from the discrete distribution that models our "loaded die."
- Note that the height of the first bar, the one representing the frequency of throwing a "one" with our simulated "loaded" die, is 0.07. In similar fashion, the height of each bar represents the frequency of occurrence of each face on our simulated die.
The following table contains the number on each face, the frequency of occurrence in our sample of 100 simulated "throws", and the expected probability of occurrence as defined by our discrete distribution model at the start of this activity.
A Six-Sided Die

| x | f | p |
|---|---|---|
| 1 | 0.07 | 0.1 |
| 2 | 0.09 | 0.1 |
| 3 | 0.12 | 0.1 |
| 4 | 0.11 | 0.1 |
| 5 | 0.24 | 0.2 |
| 6 | 0.37 | 0.4 |
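The comparison in this table can also be made inside R. The following is a small sketch that copies the f and p columns from the table by hand; the names comparison, diff, and max_gap are chosen here for illustration.

```r
x <- 1:6
f <- c(0.07, 0.09, 0.12, 0.11, 0.24, 0.37)  # observed frequencies from the table
p <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)        # model probabilities

# One row per die face, with the gap between observed and expected.
comparison <- data.frame(x = x, f = f, p = p, diff = f - p)
print(comparison)

# Largest absolute gap between observed frequency and model probability.
max_gap <- max(abs(comparison$diff))
print(max_gap)
```

For this sample of 100 throws the largest gap is 0.04, at face 5.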
Increasing the Sample Size
Note that the observed frequency of occurrence approximates the expected probability of occurrence. One would expect this approximation to improve if we increased the sample size. This is precisely the sort of thing that R handles with ease. Let's increase the sample size to 10,000, then reproduce the table and barplot of the frequencies of occurrence.
First we draw a sample of size 10,000 from our discrete distribution.
> v=sample(c(1, 2, 3, 4, 5, 6),
+ 10000,
+ replace=TRUE,
+ prob=c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4))
We again produce a table of frequencies.
> table(v)/length(v)
v
1 2 3 4 5 6
0.0993 0.1009 0.0903 0.0948 0.1999 0.4148
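Note that these frequencies are closer to the model probabilities than those from the sample of 100: the largest gap is now about 0.015 (for face 6), compared with 0.04 before. This is the expected effect of increasing the sample size.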
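To see the effect of sample size more systematically, one could repeat the experiment for several sizes and record the largest gap each time. This is a sketch under stated assumptions: the helper max_gap_for, the list of sizes, and the seed value are all choices made here, and because the draws are random the gaps need not shrink at every single step.

```r
x <- 1:6
p <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)

# Hypothetical helper: largest absolute gap between the sample frequencies
# and the model probabilities for one sample of size n.
max_gap_for <- function(n) {
  v <- sample(x, n, replace = TRUE, prob = p)
  f <- table(factor(v, levels = x)) / n  # keep all six faces, even if unseen
  max(abs(as.numeric(f) - p))
}

set.seed(42)  # arbitrary seed, used only so the run can be repeated
sizes <- c(100, 1000, 10000, 100000)
gaps <- sapply(sizes, max_gap_for)
print(data.frame(n = sizes, max_gap = gaps))
```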
The following sequence of lines produces the barplot shown in Figure 4.
> barplot(table(v)/length(v),
+ xlab="Die Face Number",
+ ylab="Frequency",
+ main="Modeling a Loaded Die")

Figure 4. Increasing the sample size narrows the gap between the observed frequency and the expected probability.
Note that the height of each bar measures the frequency of occurrence and is a close approximation of the expected probability of each outcome on the "loaded die."
Enjoy!
We hope you enjoyed this introduction to the R system, which provides a powerful interactive environment for exploring statistics.
We encourage you to explore further. Use the command ?sample to learn more about what you can do with the sample command.
