Analyzing Coffee Yields

statistics
Published

October 19, 2025

library(tidyverse)
library(lme4)
library(rstanarm)
options(mc.cores = 4)

This post demonstrates working with generalized linear mixed models in the context of coffee bean yield data. Each row in the following dataset is an observation of coffee bean yield from a single farm over the past year. Note that this dataset comes from cluster sampling. First, a random sample of supply units were chosen. Then, within each supply unit, a random sample of farms in the same cluster is taken. The number of farms sampled from each sampling unit is proportional to the size of the sampling unit, making sample means unbiased estimates of population means.

data <- read_csv("coffee.csv")
data |> select(ipm_methods, region, supply_unit, obs_shade)
# A tibble: 1,168 × 4
   ipm_methods                                      region supply_unit obs_shade
   <chr>                                            <chr>  <chr>       <chr>    
 1 replanting maintain_field_hygiene biological_co… sabana sabana_nor… light_sh…
 2 replanting maintain_field_hygiene biological_co… sabana sabana_nor… light_sh…
 3 maintain_field_hygiene                           sabana sabana_nor… medium_s…
 4 replanting maintain_field_hygiene                sabana sabana_nor… medium_s…
 5 maintain_field_hygiene                           sabana sabana_nor… medium_s…
 6 maintain_field_hygiene                           sabana sabana_nor… light_sh…
 7 maintain_field_hygiene                           sabana sabana_nor… medium_s…
 8 maintain_field_hygiene                           sabana sabana_nor… medium_s…
 9 replanting maintain_field_hygiene                sabana sabana_nor… medium_s…
10 maintain_field_hygiene                           sabana sabana_nor… minimal_…
# ℹ 1,158 more rows

We’ll need to do a little pre-processing first. The column listing pest management techniques (ipm_methods) can contain multiple treatment names, separated by a space. We’ll expand these into binary categories.

extract_categories <- function(data, s) {
  data |> separate_longer_delim(cols = {{ s }}, delim=" ") |> mutate(value = 1) |>
  pivot_wider(
    names_from = {{ s }},
    values_from = value,
    values_fill = 0
  )
}

We’ll also convert the obs_shade and last_replanted columns to factors.

data <- data |>
  mutate(
    obs_shade=strtoi(str_match(obs_shade, "_(\\d+)_percent")[,2]),
    last_replanted=factor(last_replanted,
                          levels=c("more_than_5_years_ago", "2_5_years_ago", "within_the_last_two_years"),
                          ordered=TRUE),
    obs_shade=factor(obs_shade, ordered = TRUE)) |>
  extract_categories(ipm_methods)

Now, on to the main question: does shade decrease yield? The following plot certainly suggests it does.

data |> ggplot(aes(obs_shade, yield_kg_ha_green)) + geom_boxplot()

But there’s likely confounders here. Higher altitude farms might both get less shade and have soils less amenable to high yields.

data |> ggplot(aes(scale(farm_altitude_meters), log(yield_kg_ha_green))) + geom_point() + facet_wrap(~obs_shade) + geom_smooth()

Let’s try modeling this more carefully.

fit <- lmer(log(yield_kg_ha_green) ~ obs_shade + (1|region/supply_unit) + scale(farm_altitude_meters), data=data)
tdat <- tibble(resid=residuals(fit))
tdat |> ggplot(aes(resid)) + geom_histogram()

A log linear model seems a little skewed.

ggplot(tdat,aes(sample=resid)) + stat_qq() + stat_qq_line()

A GLM with a Gamma family works better here.

fit2 <- glmer(yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) + (1|region/supply_unit), data=data, family=Gamma(link="log"))
plot(fit2)

tibble(resid=residuals(fit2)) |> ggplot(aes(sample=resid)) + stat_qq() + stat_qq_line()

summary(fit2)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: Gamma  ( log )
Formula: yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) +  
    (1 | region/supply_unit)
   Data: data

      AIC       BIC    logLik -2*log(L)  df.resid 
  17386.2   17431.8   -8684.1   17368.2      1159 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.4288 -0.7563 -0.2222  0.5141  5.3585 

Random effects:
 Groups             Name        Variance Std.Dev.
 supply_unit:region (Intercept) 0.05902  0.2429  
 region             (Intercept) 0.03234  0.1798  
Number of obs: 1168, groups:  supply_unit:region, 20; region, 8

Fixed effects:
                             Estimate Std. Error t value Pr(>|z|)    
(Intercept)                  6.485595   0.092594  70.043   <2e-16 ***
obs_shade.L                 -0.233903   0.106634  -2.194   0.0283 *  
obs_shade.Q                  0.020627   0.088580   0.233   0.8159    
obs_shade.C                  0.045673   0.059994   0.761   0.4465    
obs_shade^4                  0.009596   0.039745   0.241   0.8092    
scale(farm_altitude_meters) -0.037419   0.022113  -1.692   0.0906 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) obs_.L obs_.Q obs_.C obs_^4
obs_shade.L  0.135                            
obs_shade.Q  0.289  0.415                     
obs_shade.C  0.122  0.692  0.362              
obs_shade^4  0.143  0.162  0.403  0.230       
scl(frm_l_)  0.002  0.146 -0.033  0.028  0.019

We see a pretty clear negative linear relationship between shade coverage and yield. To decrease the variance of our estimate, we can try controlling for other factors. Let’s factor in fertilizer use.

data |> ggplot(aes(scale(average_fertilizer_per_hectare), log(yield_kg_ha_green))) + geom_point() + geom_smooth(span=0.2)

It seems like fertilizer has a highly nonlinear relationship with yield. Let’s just look at quantiles.

data <- data |> mutate(fertilizer_quantile = ntile(average_fertilizer_per_hectare, 4))
data |> ggplot(aes(fertilizer_quantile, log(yield_kg_ha_green))) + geom_bar(stat="identity")

fit3 <- glmer(yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) + fertilizer_quantile + (1|region/supply_unit), data=data, family=Gamma(link="log"))
summary(fit3)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: Gamma  ( log )
Formula: yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) +  
    fertilizer_quantile + (1 | region/supply_unit)
   Data: data

      AIC       BIC    logLik -2*log(L)  df.resid 
  17220.0   17270.6   -8600.0   17200.0      1158 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.4805 -0.6998 -0.1895  0.4592  6.5325 

Random effects:
 Groups             Name        Variance Std.Dev.
 supply_unit:region (Intercept) 0.05688  0.2385  
 region             (Intercept) 0.01657  0.1287  
Number of obs: 1168, groups:  supply_unit:region, 20; region, 8

Fixed effects:
                             Estimate Std. Error t value Pr(>|z|)    
(Intercept)                  5.886119   0.088345  66.627   <2e-16 ***
obs_shade.L                 -0.248051   0.099910  -2.483   0.0130 *  
obs_shade.Q                  0.001488   0.083406   0.018   0.9858    
obs_shade.C                  0.015202   0.056538   0.269   0.7880    
obs_shade^4                  0.022442   0.037406   0.600   0.5485    
scale(farm_altitude_meters) -0.034823   0.020595  -1.691   0.0909 .  
fertilizer_quantile          0.233742   0.017419  13.418   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) obs_.L obs_.Q obs_.C obs_^4 sc(__)
obs_shade.L  0.136                                   
obs_shade.Q  0.293  0.414                            
obs_shade.C  0.140  0.691  0.364                     
obs_shade^4  0.127  0.161  0.402  0.227              
scl(frm_l_) -0.010  0.151 -0.043  0.013  0.019       
frtlzr_qntl -0.483 -0.008 -0.020 -0.039  0.027  0.010

The AIC and residual variance decreased, and the standard error for shade’s negative linear effect shrunk even smaller.

plot(fit3)

Finally, we can add in factors for the pest control efforts we extracted initially.

fit4 <- glmer(yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) + fertilizer_quantile + use_insect_traps + biological_control + (1|region/supply_unit), data=data, family=Gamma(link="log"))
summary(fit4)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: Gamma  ( log )
Formula: yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) +  
    fertilizer_quantile + use_insect_traps + biological_control +  
    (1 | region/supply_unit)
   Data: data

      AIC       BIC    logLik -2*log(L)  df.resid 
  17217.0   17277.8   -8596.5   17193.0      1156 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.4889 -0.6999 -0.1899  0.4686  6.6756 

Random effects:
 Groups             Name        Variance Std.Dev.
 supply_unit:region (Intercept) 0.05524  0.2350  
 region             (Intercept) 0.01718  0.1311  
Number of obs: 1168, groups:  supply_unit:region, 20; region, 8

Fixed effects:
                            Estimate Std. Error t value Pr(>|z|)    
(Intercept)                  5.87748    0.08827  66.586   <2e-16 ***
obs_shade.L                 -0.25083    0.09969  -2.516   0.0119 *  
obs_shade.Q                  0.00545    0.08338   0.065   0.9479    
obs_shade.C                  0.02022    0.05647   0.358   0.7203    
obs_shade^4                  0.02129    0.03738   0.570   0.5689    
scale(farm_altitude_meters) -0.03168    0.02057  -1.540   0.1236    
fertilizer_quantile          0.23140    0.01745  13.261   <2e-16 ***
use_insect_traps             0.18656    0.09333   1.999   0.0456 *  
biological_control           0.12198    0.08625   1.414   0.1573    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) obs_.L obs_.Q obs_.C obs_^4 sc(__) frtlz_ us_ns_
obs_shade.L  0.135                                                 
obs_shade.Q  0.294  0.414                                          
obs_shade.C  0.140  0.690  0.365                                   
obs_shade^4  0.128  0.163  0.404  0.228                            
scl(frm_l_) -0.012  0.151 -0.041  0.015  0.020                     
frtlzr_qntl -0.476 -0.003 -0.016 -0.039  0.032  0.006              
us_nsct_trp -0.012  0.012  0.053  0.045  0.027  0.040  0.003       
blgcl_cntrl -0.039 -0.033 -0.045 -0.009 -0.052  0.035 -0.084 -0.099

Interestingly, controlling for pest mitigation methods makes the effect of altitude seem much less significant. An analysis of deviance lets us compare the nested models.

anova(fit2, fit3, fit4)
Data: data
Models:
fit2: yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) + (1 | region/supply_unit)
fit3: yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) + fertilizer_quantile + (1 | region/supply_unit)
fit4: yield_kg_ha_green ~ obs_shade + scale(farm_altitude_meters) + fertilizer_quantile + use_insect_traps + biological_control + (1 | region/supply_unit)
     npar   AIC   BIC  logLik -2*log(L)    Chisq Df Pr(>Chisq)    
fit2    9 17386 17432 -8684.1     17368                           
fit3   10 17220 17271 -8600.0     17200 168.2306  1     <2e-16 ***
fit4   12 17217 17278 -8596.5     17193   6.9538  2     0.0309 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference in deviances (likelihood ratio) for model 4 is still significant.