Maximum likelihood estimation with tensorflow probability and pystan (and now rstan too)

I've made some edits to this post based on comments on the stan forums and from Bob Carpenter below (many thanks to all):

y ~ bernoulli_logit(alpha + x * beta); -> y ~ bernoulli_logit_glm(x, alpha, beta); in the stan code (actually nvm this doesn't work with the currently released pystan or rstan 2.18, I get this error)
Added rstan example

I'm quite excited about tensorflow 2 and have been using the alpha quite a lot recently. I'm also really enjoying a lot of the functionallity in tensorflow probability which I show a little of here. I installed both of these via:

pip install tensorflow-gpu==2.0.0-alpha tfp-nightly

In this example I use LBFGS to find maximum likelihood estimates in a fairly big logistic regression with 1 million observations and 250 predictors. It's really easy to do in tensorflow and in stan, the only difference here is the execution time, and the gap in this (contrived) example is pretty large. Perhaps when stan on the GPU becomes available it will close the gap. Less the GPU difference, a few other differences worth mentioning are:

The LBFGS implementation and stopping criteria may be different in stan and tensorflow-probability
The tensorflow version uses 32 bit floats and stan 64 bit floats (I'm not aware of any way to change this in stan)
Stan only runs on a single core

The latter two points are limitations of stan, I think. I should note that stan does seem to support some multithreading (see pystan threading support), although it looks quite complicated to use. In tensorflow taking advantage of multiple CPUs or the GPUs requires little to no effort from the user.

Anyway, here's the data generation and the tensorflow version:

import tensorflow_probability as tfp
import tensorflow as tf
import time as tm

tfd = tfp.distributions

N = 1_000_000
P = 250


def loss(par, x, y):
    alpha = par[0]
    beta = tf.expand_dims(par[1:], 1)
    dist = tfd.Bernoulli(alpha + x @ beta)
    return -tf.reduce_sum(dist.log_prob(y))


@tf.function
def loss_and_gradient(par, x, y):
    return tfp.math.value_and_gradient(lambda par: loss(par, x, y), par)


def fit(x, y):
    init = tf.zeros(tf.shape(x)[1] + 1)
    opt = tfp.optimizer.lbfgs_minimize(
        lambda par: loss_and_gradient(par, x, y), init, max_iterations=1000
    )
    return opt


tf.random.set_seed(123)

alpha_true = tfd.Normal(0.666, 1.0).sample()
beta_true = tfd.Normal(0.0, 3.14).sample([P, 1])

x = tfd.Normal(0.0, 1.0).sample([N, P])
y = tfd.Bernoulli(alpha_true + x @ beta_true).sample()

start = tm.time()
mle = fit(x, y)
end = tm.time()

print(f"tensorflow took: {end - start:.4f} seconds")
print(f"converged: {mle.converged}")
print(f"iterations: {mle.num_iterations}")

tensorflow took: 7.3069 seconds
converged: True
iterations: 27

Here's the stan version:

data {
    int<lower = 0> N;
    int<lower = 0> P;
    matrix[N, P] x;
    int<lower = 0, upper = 1> y[N];
}

parameters {
    real alpha;
    vector[P] beta;
}

model {
    y ~ bernoulli_logit(alpha + x * beta);
}

import pystan

stan_model = pystan.StanModel(stan_file, extra_compile_args=["-O3", "-march=native"])
stan_data = {"N": N, "P": P, "x": x.numpy(), "y": y[:, 0].numpy()}

start = tm.time()
stan_mle = stan_model.optimizing(stan_data, init="0")
end = tm.time()

print(f"pystan took: {end - start:.4f} seconds")

pystan took: 97.4500 seconds

Let's also try rstan (unfortunately requires reading/writing to file so I can gather the results in python):

import numpy as np

np.savetxt("misc/x.csv", stan_data["x"], delimiter=",")
np.savetxt("misc/y.csv", stan_data["y"], delimiter=",")

library(rstan)
library(data.table)
library(tictoc)

x <- as.matrix(fread("misc/x.csv", header = FALSE))
y <- fread("misc/y.csv", header = FALSE)[[1]]

stan_data <- list(N = nrow(x), P = ncol(x), x = x, y = y)

model <- stan_model(stan_file)

tic("rstan")
fit <- optimizing(model, data = stan_data, hessian = FALSE)
toc()

fwrite(data.table(fit$par), "misc/rstan.csv", col.names = FALSE)

rstan: 30.54 sec elapsed

So in this example rstan is quite different to pystan!

Finally, let's check the results:

import pandas as pd

true_vector = np.concatenate(
    [np.expand_dims(alpha_true.numpy(), 0), np.squeeze(beta_true.numpy())], 0
)

pystan_vector = np.concatenate(
    [np.expand_dims(stan_mle["alpha"], 0), stan_mle["beta"]], 0
)

rstan_vector = np.genfromtxt("misc/rstan.csv", delimiter=",")

results = pd.DataFrame(
    {
        "true": true_vector,
        "tf2": mle.position,
        "pystan": pystan_vector,
        "rstan": rstan_vector,
    },
    index=["alpha", *[f"beta[{i}]" for i in range(P)]],
)

results.head(10).round(4)

	true	tf2	pystan	rstan
alpha	-0.232084	-0.227752	-0.22775	-0.227754
beta[0]	1.06369	1.06159	1.06159	1.0616
beta[1]	1.08326	1.09081	1.09081	1.09082
beta[2]	-2.07422	-2.06481	-2.06481	-2.06483
beta[3]	-0.896468	-0.908332	-0.908334	-0.908344
beta[4]	1.37697	1.35268	1.35268	1.3527
beta[5]	2.60261	2.62698	2.62698	2.62701
beta[6]	-1.68278	-1.66754	-1.66755	-1.66756
beta[7]	-1.68099	-1.68325	-1.68325	-1.68327
beta[8]	-3.24181	-3.2511	-3.2511	-3.25115

So all looks good! I did get quite a few odd tensorflow warnings doing this, so will investigate further… I'm also not sure about where to place @tf.function, putting it on the fit function actually makes exectution time nearly double!

My CPU details:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Stepping:            10
CPU MHz:             3791.861
CPU max MHz:         4100.0000
CPU min MHz:         800.0000
BogoMIPS:            4416.00
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            9216K
NUMA node0 CPU(s):   0-11

and GPU details:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6078 MBytes (6373572608 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = GeForce GTX 1060
Result = PASS