---
title: "Loglinear LCA"
editor: visual
author: "DL Oberski"
format:
html:
code-fold: false
code-summary: "Show the code"
code-tools: true
code-link: true
theme: zephyr
toc: true
execute:
cache: true
---
# Loglinear formulation of LCA
The loglinear formulation of LCA is very powerful. It only works for categorical data, but, in addition to the "plain vanilla" kind of locally independent LCA, it allows for:
- Built-in handling of missing data (everything is a missing data problem);
- NMAR ("unseen data-dependent") models;
- Multiple latent class variables, unlocking arbitrary modelling with latent variables, i.e. "categorical SEM";
- "Linear by linear" and "linear by nominal" effects, making for more parsimonious models, unlocking:
- Discretized, "nonparametric" versions of IRT models ("DFactor models" in Latent Gold)
- Covariates/concomitant variables, including "measurement invariance" models;
- Facilities for complex sampling designs, i.e. weighting, clustering, and stratification;
- and more.
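For reference, here is the loglinear form of the basic model used throughout (standard notation; a sketch, not a quotation from any one source). With latent class $X$ and indicators $Y_1, \dots, Y_J$, locally independent LCA corresponds to a loglinear model for the expected counts $\mu$ in the complete-data table that contains all class-item interactions but no item-item interactions:

$$
\log \mu_{x y_1 \cdots y_J} = \lambda + \lambda^{X}_{x} + \sum_{j=1}^{J} \left( \lambda^{Y_j}_{y_j} + \lambda^{X Y_j}_{x y_j} \right).
$$

This is exactly the `~ X * (A + B + ...)` formula we pass to `cvam` below; adding $\lambda^{Y_j Y_k}_{y_j y_k}$ terms gives the local-dependence models.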
The largest drawback is that, as implemented in most software (including the excellent `cvam` package in `R`), it is extremely memory-intensive. This means that many models that run easily in other software will likely refuse to run on your laptop, giving an "out of memory" error.
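To see why (an illustrative calculation, not from the original text): the complete-data table needs one cell for every combination of item responses and latent classes, so it grows exponentially in the number of indicators.

```{r}
# Cells in the complete-data table for J dichotomous indicators
# and a 2-class latent variable: 2^J response patterns x 2 classes.
J <- c(7, 10, 20, 30)
data.frame(J = J, cells = 2^J * 2)
# At J = 30 this is already over two billion cells.
```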
The loglinear formulation of LCA is the basis of the powerful Latent Gold software, as well as its predecessor, the still-popular LEM program (https://jeroenvermunt.nl/). Latent Gold uses many optimizations, which mostly allow it to circumvent the "small models only" problem. Here, we will use `cvam`, whose implementation is very similar to LEM.
To demonstrate the uses of loglinear LCA, we use the `carcinoma` data from the `poLCA` package.
## Simple analysis using `poLCA`
```{r}
#| message: false
library(tidyverse)
library(poLCA)
library(poLCA.extras)
set.seed(202303)
data(carcinoma)
head(carcinoma %>% sample_n(10))
```
```{r}
f_carcinoma <- cbind(A, B, C, D, E, F, G) ~ 1
fit_polca <- poLCA(f_carcinoma,
nclass = 2,
data = carcinoma,
nrep = 10,
verbose = FALSE)
fit_polca
```
From the poLCA model, using the bivariate residual (BVR) functionality in the `poLCA.extras` package, we can see that there are some considerable local dependencies.
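(For orientation: the BVR for an item pair $(j, k)$ is essentially a Pearson-type statistic on the two-way table of that pair, comparing the observed counts $n_{ab}$ with the counts $\hat\mu_{ab}$ expected under the fitted latent class model,

$$
\mathrm{BVR}_{jk} = \sum_{a,b} \frac{(n_{ab} - \hat\mu_{ab})^2}{\hat\mu_{ab}},
$$

up to a normalization that varies across implementations; large values flag pairs whose association the model underfits.)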
```{r}
bvr(fit_polca) |> round(1)
bootstrap_bvr_pvals(f_carcinoma,
fit_polca = fit_polca,
data = carcinoma, R = 200)
```
## Simple and locally dependent LCA using `cvam`
```{r}
library(cvam)
options(contrasts = rep("contr.sum", 2))
dat <- carcinoma %>% mutate_all(as.factor)
df_freq <- table(dat, useNA = "ifany") |> as.data.frame()
# Create a new variable that is completely missing (and has 2 cats)
df_freq$X <- latentFactor(NROW(df_freq), 2)
# Create formula
formula <-
paste(names(df_freq)[!names(df_freq) %in% c("Freq", "X")],
collapse = " + ") %>%
sprintf("~ X * (%s)", .) %>%
as.formula()
system.time(
fit_cvam <- cvam(formula, data = df_freq,
freq = Freq,
control = list(startValJitter = 0.1)
))
summary(fit_cvam)
what <- paste0("~", LETTERS[1:7], "|X") |> sapply(as.formula)
est_indep <- cvamEstimate(c(~X, what), fit_cvam)
est_indep
```
(Note that the Hessian-based SEs cannot be trusted here: several estimates are at or near the boundary, as `cvam` itself warns. We can instead look at the estimated SEs in the marginal probability tables above, or use `cvam`'s MCMC option to obtain standard errors for the loglinear parameters.)
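A minimal sketch of that MCMC route (not run here; `method = "MCMC"` follows the `cvam` documentation, but check `?cvam` and `?cvamControl` for the run-length settings in your version):

```{r}
#| eval: false
# Continue the fitted EM run as MCMC to get simulation-based SEs
# for the loglinear coefficients (settings assumed; see ?cvamControl).
fit_mcmc <- cvam(fit_cvam, method = "MCMC")
summary(fit_mcmc)
```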
Now we update the model with the local dependencies, held equal across classes (i.e., we add two-way item-item terms but no three-way terms involving `X`).
```{r}
# formula_ld <- update(formula, ~ . + D:F + A:G + B:E)
formula_ld <- update(formula, ~ . + (A+B+D+E+F+G)^2)
system.time(
fit_cvam_ld <-
cvam(formula_ld, data = df_freq, freq = Freq,
control = list(startValJitter = 0.05, iterMaxEM = 1000),
prior = cvam::cvamPrior(ridge = 0.1)
))
summary(fit_cvam_ld)
est_locdep <- cvamEstimate(c(~X, what), fit_cvam_ld)
est_locdep
anova(fit_cvam, fit_cvam_ld, method = "BIC")
```
By raw likelihood the local-dependence model fits much better, though note that BIC penalizes its 15 extra parameters heavily; in our run the BIC comparison actually ranks the more parsimonious independence model first.
## Extensions to the basic LC model using loglinear formulation
### Including covariates
### Multiple latent variables
Example from Oberski (2016), available at <https://doi.org/10.1007/s11634-015-0211-0> (open access).
```{r}
library(readr)
library(cvam)
options(contrasts = rep("contr.sum", 2))
```
Data are from the LISS panel. Panel members were asked five times whether they had voted in the most recent parliamentary election, and these reports pertain to two separate elections (2006 and 2010). The substantively inspired model is shown below.
```{mermaid}
flowchart TD
X1((Voted 2006)) ---> A[2008]
X1 ---> B[2009]
X1 ---> C[2010]
X2((Voted 2010)) ---> D[2011]
X2 ---> E[2012]
X1 --> X2
```
Read in the data.
```{r}
#| message: false
path <- "https://daob.nl/files/lca/vote-LISS.dat.gz"
vote <- readr::read_table(path, na = ".")
rmarkdown::paged_table(vote)
```
We transform the data into frequencies (including the missing values). We also rename the variables to A-E to keep the formulas short.
```{r}
df_freq <- table(vote[, -1], useNA = "ifany") |> as.data.frame()
names(df_freq)[1:5] <- LETTERS[1:5]
```
We now formulate the model. It includes two different latent variables, `X1` and `X2`, which are meant to represent having actually voted in 2006 and 2010, respectively.
```{r}
# Create a new variable that is completely missing (and has 2 cats)
df_freq$X1 <- latentFactor(NROW(df_freq), 2)
df_freq$X2 <- latentFactor(NROW(df_freq), 2)
# Create formula
formula <- ~ X1 * (A + B + C) + X2 * (D + E) + X1:X2
```
We use `cvam` to fit the model.
```{r}
fit_cvam <- cvam(formula, data = df_freq, freq = Freq,
control = list(startValJitter = 0.1))
```
Some results of interest.
```{r}
summary(fit_cvam)
what <- paste0("~", LETTERS[1:3], "|X1") |> sapply(as.formula)
what <- what %>% c(paste0("~", LETTERS[4:5], "|X2") |> sapply(as.formula))
what <- c(~X1, ~X2, ~X2 | X1, what)
est_indep <- cvamEstimate(what, fit_cvam)
est_indep
```
As in the paper, we can also include two local dependencies by updating the formula and refitting.
```{r}
# Update formula
formula_ld <- update(formula, ~ . + A:B + A:E)
fit_cvam_ld <- cvam(formula_ld, data = df_freq, freq = Freq,
control = list(startValJitter = 0.1))
```
The models are nested, so we can use the LRT to compare them.
```{r}
anova(fit_cvam, fit_cvam_ld)
```
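The deviance change can be referred to a $\chi^2$ distribution with degrees of freedom equal to the number of added parameters (two here). A quick check of the change reported in our run (67.86 on 2 df; the value will vary slightly between fits):

```{r}
#| eval: false
# p-value for the LRT change reported by anova() above
pchisq(67.86, df = 2, lower.tail = FALSE)
```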
The actual estimates are not too different.
```{r}
est_locdep <- cvamEstimate(what, fit_cvam_ld)
est_locdep
```
### Ordinal indicators
First consider the four-category indicators as nominal.
```{r}
data("election")
df_freq <- election[, 1:5] %>% table() %>% as.data.frame()
df_freq$X <- latentFactor(NROW(df_freq), 2)
# Create formula
formula <- ~ X * (MORALG + CARESG + KNOWG + LEADG + DISHONG)
fit_cvam <- cvam(formula, data = df_freq, freq = Freq,
control = list(startValJitter = 0.1))
summary(fit_cvam)
what <- c(~X, ~MORALG | X, ~CARESG | X, ~KNOWG | X,
~LEADG | X, ~DISHONG | X)
cvamEstimate(what, fit_cvam)
```
Now consider them "ordinal", using an adjacent-category logit model, a.k.a. a "nominal-by-linear" model in Goodman's loglinear terminology.
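Concretely (a sketch in standard notation): with fixed, equally spaced scores $\nu_y \in [0, 1]$ for the categories of item $Y_j$, the nominal class-item interaction is replaced by a term that is linear in the score,

$$
\lambda^{X Y_j}_{x y} = \phi_j \, x \, \nu_y ,
$$

so each four-category item contributes a single slope $\phi_j$ instead of three free interaction parameters. With five items this is exactly the saving of 10 parameters noted below.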
```{r}
scale_01 <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
factor_to_numeric_01 <- function(x) {
x_num <- seq_along(levels(x))[x]
scale_01(x_num)
}
X <- fit_cvam$modelMatrix
X <- X %>% as_tibble() %>%
# Remove nominal interaction terms
select_if(!grepl("X.:.+", names(.)))
# Now include linear interaction terms by hand
X$`X1:MORALG` <- X$X1 * factor_to_numeric_01(fit_cvam$mfTrue$MORALG)
X$`X1:CARESG` <- X$X1 * factor_to_numeric_01(fit_cvam$mfTrue$CARESG)
X$`X1:KNOWG` <- X$X1 * factor_to_numeric_01(fit_cvam$mfTrue$KNOWG)
X$`X1:LEADG` <- X$X1 * factor_to_numeric_01(fit_cvam$mfTrue$LEADG)
X$`X1:DISHONG` <- X$X1 * factor_to_numeric_01(fit_cvam$mfTrue$DISHONG)
rmarkdown::paged_table(X)
X <- as.matrix(X)
```
We can now fit the model with ordinal indicators by passing our custom design matrix `X` to `cvam`.
```{r}
fit_cvam_ord <- cvam(formula, data = df_freq,
freq = Freq, modelMatrix = X,
control = list(startValJitter = 0.1))
summary(fit_cvam_ord)
cvamEstimate(what, fit_cvam_ord)
```
The ordinal model is more parsimonious, using `r fit_cvam_ord$degreesOfFreedom - fit_cvam$degreesOfFreedom` fewer parameters.
```{r}
anova(fit_cvam_ord, fit_cvam)
```
```{r}
anova(fit_cvam_ord, fit_cvam, method = "BIC")
```
### Linear effects among observed and latent variables (DFactor models)
### Missing-not-at-random (MNAR) / Unseen data-dependent models
## Doing everything by hand

The same model can be fit "by hand" with EM: the M-step is a weighted Poisson regression on the table expanded to one row per (response pattern, class) combination, and the E-step updates the posterior class-membership weights.
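In symbols (a sketch of the standard EM scheme the code below implements): given fitted complete-data means $\mu_{xy}$ from the Poisson model, the E-step computes the posterior class weights

$$
P(X = x \mid Y = y) = \frac{\mu_{x y}}{\sum_{x'} \mu_{x' y}},
$$

which become the `post` weights in the next weighted `glm()` fit (the M-step).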
```{r}
#| message: false
options(contrasts = rep("contr.sum", 2))
varnames <- LETTERS[1:5]
sum_vars <- paste(varnames, collapse = "+")
formula <- sprintf("Freq ~ X * (%s)", sum_vars)
# formula <- sprintf("Freq ~ X * (%s) + B:E + G:A + D:F + B:G + G:D", sum_vars)
df_freq <- carcinoma[, varnames] |> table() |> as.data.frame()
n <- nrow(df_freq)
df_freq$patnum <- 1:n
df_expanded <- rbind(df_freq, df_freq)
df_expanded$X <- rep(0:1, each = n)
post_start <- runif(n)
df_expanded$post <- c(post_start, 1-post_start)
suppressWarnings( # LCAs often have boundary values within classes
for(i in 1:200) {
# M-step
# Loglinear model
fit_glm <- glm(formula,
data = df_expanded, weights = post,
family = "poisson")
# E-step
eta.X <- predict(fit_glm) # Linear predictor given X
eta.X <- matrix(eta.X, nrow = n) |> t()
n.X <- sum(exp(eta.X)) # Total expected count (normalizing constant)
P_YX <- exp(eta.X)/n.X # Probability of each pattern, joint (X, Y)
P_Y <- colSums(P_YX)
P_X.Y <- t(P_YX) / P_Y # Posterior of X given Y
df_expanded$post <- as.vector(P_X.Y)
})
summary(fit_glm)
P_X <- rowSums(P_YX) # Prior
P_Y.X <- P_YX/P_X # Conditional of Y given X
P_Y.X_marginal <- sapply(varnames, function(v) {
apply(P_Y.X, 1, function(x) tapply(x, df_freq[, v], sum))
})
P_Y.X_marginal |> round(3)
# fit_polca$probs |> lapply(round, 3)
N <- sum(df_freq$Freq)
expected <- P_Y * N
observed <- df_freq$Freq
loglik <- sum(observed * log(P_Y))
deviance <- -2*loglik
p <- length(coef(fit_glm))
bic_deviance <- deviance + p*log(N)
bic_deviance
X2 <- sum((observed - expected)^2/expected)
G2 <- 2*sum(observed * log(observed/expected), na.rm = TRUE)
# Caution: fit_polca used all seven indicators (A-G), while the hand-rolled
# model above uses only A-E, so these statistics are not directly comparable.
data.frame(Chisq = c(X2, fit_polca$Chisq),
           Gsq = c(G2, fit_polca$Gsq),
           method = c("loglinear", "poLCA"))
```
### Alternative R implementation
The `gllm` package also appears to work, although it requires more hand-holding than `cvam`.
```{r}
library(gllm)
options(contrasts = rep("contr.treatment", 2))
K <- 3  # number of latent classes
# Stack K copies of the frequency table, one per latent class
df_expanded <- Reduce(rbind, replicate(K, df_freq, simplify = FALSE))
df_expanded$X <- factor(rep(seq(K), each = n))
formula_gllm <- ~ X * (A + B + C + D + E)
X <- model.matrix(formula_gllm, data = df_expanded)
# Scatter indices: map each cell of the full (class x pattern) table
# to the observed response pattern it contributes to
s <- replicate(K, 1:nrow(df_freq))
res <- emgllm(y = df_freq$Freq, s = s, X = X)
res
# Pearson X^2 and likelihood-ratio G^2 against the observed table
with(res, sum((fitted.values - observed.values)^2/fitted.values))
with(res, 2*sum(observed.values * log(observed.values/fitted.values), na.rm = TRUE))
```