EM: categorical indicators

Author

DL Oberski

EM for categorical indicators

Below you will find code to simulate data from a mixture of conditionally independent binary indicators, and to estimate that mixture (LCA model) using the EM algorithm. The code is intended to be easy to understand and as simple as possible, while still doing the job. You can copy code into your R environment by clicking the copy icon in the top right of each code block.

Exercises

  1. Read the simulation code. Do you have any questions?
  2. Read the EM loop. Do you understand all steps?
  3. Code understanding check:
    1. In the simulation code, explain why the indices 1 and 2 are used, respectively, in P_Y.X_true[[1]][2, X].
    2. Which values would you change if you wanted to implement random starts?
    3. Suppose the model said that the variables do not come from a binomial (Bernoulli) distribution, but from some other distribution (for example, a Beta distribution). Which lines would you need to change?
  4. Try reversing the starting values, so that Y1 = c(0.4, 0.6) becomes Y1 = c(0.6, 0.4), and similarly for the other two variables. What happens to the estimates? How do they compare to the true values now?
  5. Set the number of iterations of the EM algorithm to a large number, such as maxiter = 200. What happens to the estimates?
  6. Try out different values of the prior, the profile probabilities, and the sample size. Report any interesting observations.
  7. Fit the same model using poLCA (or otherwise). (Remember that you can directly analyze df_samp.) Do the results agree with our own EM implementation?
  8. BONUS: Calculate the log-likelihood of the model.
  9. BONUS: Investigate the entropy \(R^2\) of the posterior classification as a function of (a) the profile probabilities and (b) the prior.

Simulate data

set.seed(202303)

n <- 1000L # Sample size

P_X_true <- 0.4 # True pi (will be class size of X=2)

# Sample class memberships:
X <- sample(1:2, size = n, replace = TRUE, 
            prob = c(1 - P_X_true, P_X_true))

# True profiles:
P_Y.X_true <- list(
  Y1 = matrix(c(0.9, 0.2,
                0.1, 0.8), byrow = TRUE, nrow = 2),
  Y2 = matrix(c(0.7, 0.2,
                0.3, 0.8), byrow = TRUE, nrow = 2),
  Y3 = matrix(c(0.95, 0.4,
                0.05, 0.6), byrow = TRUE, nrow = 2)
)

The conditional (profile) probabilities \(P(Y_j | X)\) are:

print(P_Y.X_true)
$Y1
     [,1] [,2]
[1,]  0.9  0.2
[2,]  0.1  0.8

$Y2
     [,1] [,2]
[1,]  0.7  0.2
[2,]  0.3  0.8

$Y3
     [,1] [,2]
[1,] 0.95  0.4
[2,] 0.05  0.6

We now sample some data using the conditional probabilities and the values of X.

# Sample observed indicators from binomials (Bernoullis):
Y1 <- rbinom(n, size = 1, prob = P_Y.X_true[[1]][2, X])
Y2 <- rbinom(n, size = 1, prob = P_Y.X_true[[2]][2, X])
Y3 <- rbinom(n, size = 1, prob = P_Y.X_true[[3]][2, X])

df_samp <- data.frame(Y1, Y2, Y3) # For other analyses

Fitting the model with our very own EM algorithm

We will take as parameters the probabilities of a “1” response on each of the three indicators, given \(X=1\) or \(X=2\), respectively, and of course the class size \(\pi = P(X = 2)\). So there are 7 parameters in total. Since there are \(2^3 = 8\) observed response patterns, but only 7 independent pattern proportions (they must sum to one), the model has zero degrees of freedom.
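
To make this counting explicit, here is a small check in R (a sketch using the simulated data from above; it is not needed for the EM algorithm itself):

# Sketch: verify the degrees-of-freedom reasoning
n_patterns <- 2^3           # possible response patterns for 3 binary indicators
n_params   <- 3 * 2 + 1     # one P("1" | X) per indicator per class (6), plus pi
(n_patterns - 1) - n_params # degrees of freedom: 0, the model is just-identified
table(Y1, Y2, Y3)           # observed counts of the 8 response patterns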

# As usual, we start by guessing parameter values
guess_PY.X <- list(
  Y1 = c(0.4, 0.6),
  Y2 = c(0.4, 0.6),
  Y3 = c(0.4, 0.6)
)
# We will take PX (pi) to be P(X = 2)
guess_PX <- 0.5

# Number of EM iterations
maxiter <- 15

# Start the EM algorithm!
for(it in 1:maxiter) {
  # Just some output 
  if(it == 1) # A trick to make Quarto output this line correctly
    cat("It:\t(X=2)\tY1|X=1\tY1|X=2\tY2|X=1\tY2|X=2\tY3|X=1\tY3|X=2\n")
  cat(sprintf("%03d\t%1.3f\t%1.3f\t%1.3f\t%1.3f\t%1.3f\t%1.3f\t%1.3f\n", it,
      guess_PX, 
      guess_PY.X$Y1[1], guess_PY.X$Y1[2],
      guess_PY.X$Y2[1], guess_PY.X$Y2[2],
      guess_PY.X$Y3[1], guess_PY.X$Y3[2]))

  # E-step
  # ------------------
  # For clarity, I am not using a loop over the three variables. 
  #  In practice, you will probably want to do that. 

  # The probability of each observed value, given X = 1 and given X = 2
  
  # Here we use the assumption that each value is ~ Bernoulli
  #. In practice you would work with logs, but here we ignore that
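  #  (Bernoulli pmf: P(Y = y | X = x) = p^y * (1 - p)^(1 - y),
  #   with p the class-specific probability of a "1")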
  P_Y1.X1 <- dbinom(Y1, size = 1, prob = guess_PY.X$Y1[1])
  P_Y1.X2 <- dbinom(Y1, size = 1, prob = guess_PY.X$Y1[2])
  
  P_Y2.X1 <- dbinom(Y2, size = 1, prob = guess_PY.X$Y2[1])
  P_Y2.X2 <- dbinom(Y2, size = 1, prob = guess_PY.X$Y2[2])
  
  P_Y3.X1 <- dbinom(Y3, size = 1, prob = guess_PY.X$Y3[1])
  P_Y3.X2 <- dbinom(Y3, size = 1, prob = guess_PY.X$Y3[2])
  
  # Now we use the conditional independence assumption 
  #.  to get the probability of the whole pattern (df_samp[i, ])
  # (In practice you will want to take a sum of logs instead)
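  #   P(y1, y2, y3 | X) = P(y1 | X) * P(y2 | X) * P(y3 | X)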
  P_Y_X1 <-  P_Y1.X1 * P_Y2.X1 * P_Y3.X1
  P_Y_X2 <-  P_Y1.X2 * P_Y2.X2 * P_Y3.X2
  
  # Now we use the mixture assumption to get the marginal probability of the pattern:
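  #   P(y) = (1 - pi) * P(y | X = 1) + pi * P(y | X = 2),  where pi = P(X = 2)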
  P_Y <- (1 - guess_PX)*P_Y_X1 + guess_PX*P_Y_X2
  
  # Finally we are ready to apply Bayes rule to get the posterior 
  #. P(X = 2 | Y = y)
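  #   P(X = 2 | y) = pi * P(y | X = 2) / P(y)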
  post_X2 <- guess_PX*P_Y_X2 / P_Y
  
  # M-step 
  # ------------------
  # Now that we have the posterior, it is easy to calculate the probabilities we need
  
  # M-step for 'priors' / class size of X=2
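  #   (the updated pi is the average posterior membership probability of X = 2)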
  guess_PX <- mean(post_X2)
  
  # M-step for profiles
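  #   weighted.mean(Y1, w = post_X2) equals sum(post_X2 * Y1) / sum(post_X2),
  #   the expected proportion of "1" responses within class X = 2
  #   (and analogously for class X = 1 with weights 1 - post_X2)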
  guess_PY.X$Y1[1] <- weighted.mean(Y1, w = (1 - post_X2))
  guess_PY.X$Y1[2] <- weighted.mean(Y1, w = post_X2)
  
  guess_PY.X$Y2[1] <- weighted.mean(Y2, w = (1 - post_X2))
  guess_PY.X$Y2[2] <- weighted.mean(Y2, w = post_X2)
  
  guess_PY.X$Y3[1] <- weighted.mean(Y3, w = (1 - post_X2))
  guess_PY.X$Y3[2] <- weighted.mean(Y3, w = post_X2)
}
It: (X=2)   Y1|X=1  Y1|X=2  Y2|X=1  Y2|X=2  Y3|X=1  Y3|X=2
001 0.500   0.400   0.600   0.400   0.600   0.400   0.600
002 0.426   0.228   0.538   0.341   0.647   0.145   0.420
003 0.423   0.167   0.623   0.289   0.720   0.092   0.495
004 0.415   0.125   0.692   0.256   0.775   0.057   0.552
005 0.406   0.106   0.732   0.247   0.798   0.042   0.584
006 0.399   0.099   0.753   0.250   0.804   0.037   0.601
007 0.393   0.098   0.765   0.255   0.804   0.036   0.611
008 0.389   0.098   0.773   0.259   0.804   0.035   0.618
009 0.385   0.099   0.778   0.262   0.805   0.036   0.624
010 0.381   0.100   0.782   0.264   0.806   0.037   0.628
011 0.378   0.101   0.785   0.266   0.808   0.037   0.631
012 0.376   0.102   0.788   0.267   0.810   0.038   0.633
013 0.374   0.103   0.790   0.268   0.811   0.039   0.636
014 0.371   0.105   0.792   0.269   0.813   0.040   0.638
015 0.370   0.106   0.794   0.270   0.814   0.041   0.640

Results

Here are the resulting estimates of the conditional probabilities (profiles).

guess_PY.X |> 
  lapply(function(x) rbind(1-x, x)) |>
  print(digits = 3)
$Y1
   [,1]  [,2]
  0.894 0.204
x 0.106 0.796

$Y2
  [,1]  [,2]
  0.73 0.184
x 0.27 0.816

$Y3
    [,1]  [,2]
  0.9588 0.359
x 0.0412 0.641

Compare these to the true profiles, which were:

P_Y.X_true
$Y1
     [,1] [,2]
[1,]  0.9  0.2
[2,]  0.1  0.8

$Y2
     [,1] [,2]
[1,]  0.7  0.2
[2,]  0.3  0.8

$Y3
     [,1] [,2]
[1,] 0.95  0.4
[2,] 0.05  0.6

The estimated class sizes are:

c(1 - guess_PX, guess_PX) |> 
  print(digits = 3)
[1] 0.632 0.368

Compare these to the “priors” (class sizes), which were:

c(1 - P_X_true, P_X_true) |> 
  print(digits = 3)
[1] 0.6 0.4
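
If you would like to check these results against an off-the-shelf implementation (exercise 7), something along the following lines should work. This is only a sketch, assuming the poLCA package is installed; poLCA expects the manifest variables to be coded 1, 2, ... rather than 0/1 (hence the + 1), and its class labels may come out in the opposite order from our X = 1 and X = 2.

# Sketch for exercise 7: fit the same 2-class LCA with poLCA
library(poLCA)
f <- cbind(Y1, Y2, Y3) ~ 1                # three manifest variables, no covariates
fit_polca <- poLCA(f, data = df_samp + 1, # shift 0/1 coding to 1/2 for poLCA
                   nclass = 2, verbose = FALSE)
fit_polca$P                               # estimated class sizes
fit_polca$probs                           # conditional response probabilities per class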