99+ TensorFlow threads after running StarNet multiple times, slow TensorFlow performance

pfile

I have more than a month of uptime on my current PixInsight (PI) instance and have run StarNet multiple times. While debugging some performance issues I noticed that PI has 99 threads that look like the following:

Code:
    2383 Thread_19218631
    + 2383 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff7856240d]
    +   2383 _pthread_start  (in libsystem_pthread.dylib) + 66  [0x7fff78566249]
    +     2383 _pthread_body  (in libsystem_pthread.dylib) + 126  [0x7fff785632eb]
    +       2383 tensorflow::(anonymous namespace)::PThread::ThreadFn(void*)  (in libtensorflow_framework.2.dylib) + 104  [0x19f6049a8]
    +         2383 tensorflow::thread::EigenEnvironment::CreateThread(std::__1::function<void ()>)::'lambda'()::operator()() const  (in libtensorflow_framework.2.dylib) + 66  [0x19f6146b2]
    +           2383 Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)  (in libtensorflow_framework.2.dylib) + 589  [0x19f6149fd]
    +             2383 Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tensorflow::thread::EigenEnvironment::Task*)  (in libtensorflow_framework.2.dylib) + 870  [0x19f6152a6]
    +               2383 Eigen::EventCount::CommitWait(Eigen::EventCount::Waiter*)  (in libtensorflow_framework.2.dylib) + 229  [0x19f6155b5]
    +                 2383 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)  (in libc++.1.dylib) + 18  [0x7fff755a0a0a]
    +                   2383 _pthread_cond_wait  (in libsystem_pthread.dylib) + 722  [0x7fff7856656e]
    +                     2383 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff784a7866]

Is this indicative of StarNet not cleaning up after it finishes with TensorFlow?
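
For context, assuming StarNet drives libtensorflow through the TensorFlow C API (the libtensorflow_framework frames above suggest that, but I haven't checked the module source), "cleaning up" would look roughly like the sketch below; whether it actually tears down the Eigen workers depends on how the thread pools were created (per-session vs. process-global):

Code:
    // Rough sketch only, assuming the C API is used by StarNet.
    #include <tensorflow/c/c_api.h>

    void teardown_tf_session(TF_Session* session, TF_Graph* graph) {
      TF_Status* status = TF_NewStatus();
      TF_CloseSession(session, status);   // stop accepting new Run() calls
      TF_DeleteSession(session, status);  // free the session and its resources
      TF_DeleteGraph(graph);              // release the loaded model graph
      TF_DeleteStatus(status);
    }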

It seems like these are worker threads waiting for work, so it is probably OK for them to just be sitting there. But since my machine has 32 logical processors, shouldn't there be far fewer than 99 of them hanging around? I suppose it is possible that libtensorflow can't respect PI's max thread count without guidance.
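
If it is just a matter of guidance, the C API does let the caller cap both pools when the session options are built, by handing TF_SetConfig a serialized ConfigProto. A minimal sketch, hand-encoding the two varint fields so no protobuf dependency is needed (this is not necessarily what StarNet actually does):

Code:
    // Sketch: cap intra-op (ConfigProto field 2) and inter-op (field 5)
    // parallelism so TensorFlow doesn't default to one worker per logical core.
    // Values must stay below 128 for the single-byte varint encoding used here.
    #include <stdint.h>
    #include <tensorflow/c/c_api.h>

    TF_SessionOptions* capped_session_options(uint8_t n_threads) {
      uint8_t config[4] = { 0x10, n_threads,    // intra_op_parallelism_threads
                            0x28, n_threads };  // inter_op_parallelism_threads
      TF_Status* status = TF_NewStatus();
      TF_SessionOptions* opts = TF_NewSessionOptions();
      TF_SetConfig(opts, config, sizeof(config), status);
      // if TF_GetCode(status) != TF_OK, opts silently keeps library defaults
      TF_DeleteStatus(status);
      return opts;  // pass to TF_NewSession / TF_LoadSessionFromSavedModel
    }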

I find that while StarNet is running, the CPU is nowhere near 100% utilization, which makes me wonder whether TensorFlow has spawned far too many worker threads and they are thrashing on context switches.

rob
 
While running StarNet I see 96 WorkerLoop threads, which is coincidentally 3x the number of logical processors I have. This can't be right...
 
Thanks, yeah, I had been looking at their issues database. It also seems like they could be miscomputing the number of threads to run and/or not respecting the options that have been set with respect to the maximum thread count.

Some of these bugs have been open for years, so I guess Google just does not care. Oh well.
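
In the meantime, a possible workaround, assuming the bundled libtensorflow is new enough to honor these environment variables (I have not verified which version StarNet ships), is to cap the pools via the environment before the first session is created:

Code:
    // Sketch: recent TensorFlow builds consult these environment variables
    // when sizing their thread pools; they must be set before the first
    // session exists (or exported in the shell before launching PI).
    #include <stdlib.h>

    void cap_tf_threads_via_env(void) {
      setenv("TF_NUM_INTRAOP_THREADS", "8", 1);  // per-op Eigen worker pool
      setenv("TF_NUM_INTEROP_THREADS", "2", 1);  // op-scheduling pool
      setenv("OMP_NUM_THREADS", "8", 1);         // only relevant for MKL builds
    }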

rob
 