I've made some edits to this post based on comments on the stan forums and from Bob Carpenter below (many thanks to all):

1. y ~ bernoulli_logit(alpha + x * beta); -> y ~ bernoulli_logit_glm(x, alpha, beta); in the stan code (actually nvm this doesn't work with the currently released pystan or rstan 2.18, I get this error)

I'm quite excited about tensorflow 2 and have been using the alpha quite a lot recently. I'm also really enjoying a lot of the functionallity in tensorflow probability which I show a little of here. I installed both of these via:

pip install tensorflow-gpu==2.0.0-alpha tfp-nightly


In this example I use LBFGS to find maximum likelihood estimates in a fairly big logistic regression with 1 million observations and 250 predictors. It's really easy to do in tensorflow and in stan, the only difference here is the execution time, and the gap in this (contrived) example is pretty large. Perhaps when stan on the GPU becomes available it will close the gap. Less the GPU difference, a few other differences worth mentioning are:

1. The LBFGS implementation and stopping criteria may be different in stan and tensorflow-probability
2. The tensorflow version uses 32 bit floats and stan 64 bit floats (I'm not aware of any way to change this in stan)
3. Stan only runs on a single core

The latter two points are limitations of stan, I think. I should note that stan does seem to support some multithreading (see pystan threading support), although it looks quite complicated to use. In tensorflow taking advantage of multiple CPUs or the GPUs requires little to no effort from the user.

Anyway, here's the data generation and the tensorflow version:

import tensorflow_probability as tfp
import tensorflow as tf
import time as tm

tfd = tfp.distributions

N = 1_000_000
P = 250

def loss(par, x, y):
alpha = par
beta = tf.expand_dims(par[1:], 1)
dist = tfd.Bernoulli(alpha + x @ beta)
return -tf.reduce_sum(dist.log_prob(y))

@tf.function
return tfp.math.value_and_gradient(lambda par: loss(par, x, y), par)

def fit(x, y):
init = tf.zeros(tf.shape(x) + 1)
opt = tfp.optimizer.lbfgs_minimize(
lambda par: loss_and_gradient(par, x, y), init, max_iterations=1000
)
return opt

tf.random.set_seed(123)

alpha_true = tfd.Normal(0.666, 1.0).sample()
beta_true = tfd.Normal(0.0, 3.14).sample([P, 1])

x = tfd.Normal(0.0, 1.0).sample([N, P])
y = tfd.Bernoulli(alpha_true + x @ beta_true).sample()

start = tm.time()
mle = fit(x, y)
end = tm.time()

print(f"tensorflow took: {end - start:.4f} seconds")
print(f"converged: {mle.converged}")
print(f"iterations: {mle.num_iterations}")

tensorflow took: 7.3069 seconds
converged: True
iterations: 27



Here's the stan version:

data {
int<lower = 0> N;
int<lower = 0> P;
matrix[N, P] x;
int<lower = 0, upper = 1> y[N];
}

parameters {
real alpha;
vector[P] beta;
}

model {
y ~ bernoulli_logit(alpha + x * beta);
}

import pystan

stan_model = pystan.StanModel(stan_file, extra_compile_args=["-O3", "-march=native"])
stan_data = {"N": N, "P": P, "x": x.numpy(), "y": y[:, 0].numpy()}

start = tm.time()
stan_mle = stan_model.optimizing(stan_data, init="0")
end = tm.time()

print(f"pystan took: {end - start:.4f} seconds")

pystan took: 97.4500 seconds



Let's also try rstan (unfortunately requires reading/writing to file so I can gather the results in python):

import numpy as np

np.savetxt("misc/x.csv", stan_data["x"], delimiter=",")
np.savetxt("misc/y.csv", stan_data["y"], delimiter=",")

library(rstan)
library(data.table)
library(tictoc)

stan_data <- list(N = nrow(x), P = ncol(x), x = x, y = y)

model <- stan_model(stan_file)

tic("rstan")
fit <- optimizing(model, data = stan_data, hessian = FALSE)
toc()

fwrite(data.table(fit\$par), "misc/rstan.csv", col.names = FALSE)

rstan: 30.54 sec elapsed



So in this example rstan is quite different to pystan!

Finally, let's check the results:

import pandas as pd

true_vector = np.concatenate(
[np.expand_dims(alpha_true.numpy(), 0), np.squeeze(beta_true.numpy())], 0
)

pystan_vector = np.concatenate(
[np.expand_dims(stan_mle["alpha"], 0), stan_mle["beta"]], 0
)

rstan_vector = np.genfromtxt("misc/rstan.csv", delimiter=",")

results = pd.DataFrame(
{
"true": true_vector,
"tf2": mle.position,
"pystan": pystan_vector,
"rstan": rstan_vector,
},
index=["alpha", *[f"beta[{i}]" for i in range(P)]],
)


true tf2 pystan rstan
alpha -0.232084 -0.227752 -0.22775 -0.227754
beta 1.06369 1.06159 1.06159 1.0616
beta 1.08326 1.09081 1.09081 1.09082
beta -2.07422 -2.06481 -2.06481 -2.06483
beta -0.896468 -0.908332 -0.908334 -0.908344
beta 1.37697 1.35268 1.35268 1.3527
beta 2.60261 2.62698 2.62698 2.62701
beta -1.68278 -1.66754 -1.66755 -1.66756
beta -1.68099 -1.68325 -1.68325 -1.68327
beta -3.24181 -3.2511 -3.2511 -3.25115

So all looks good! I did get quite a few odd tensorflow warnings doing this, so will investigate further… I'm also not sure about where to place @tf.function, putting it on the fit function actually makes exectution time nearly double!

My CPU details:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Stepping:            10
CPU MHz:             3791.861
CPU max MHz:         4100.0000
CPU min MHz:         800.0000
BogoMIPS:            4416.00
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            9216K
NUMA node0 CPU(s):   0-11


and GPU details:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060"
CUDA Driver Version / Runtime Version          10.0 / 10.0
CUDA Capability Major/Minor version number:    6.1
Total amount of global memory:                 6078 MBytes (6373572608 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
GPU Max Clock rate:                            1671 MHz (1.67 GHz)
Memory Clock rate:                             4004 Mhz
Memory Bus Width:                              192-bit
L2 Cache Size:                                 1572864 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     Yes
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
Device supports Unified Addressing (UVA):      Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = GeForce GTX 1060
Result = PASS