99+ TensorFlow threads after running StarNet multiple times, slow TensorFlow performance

pfile

I have more than a month of uptime on my current PixInsight (PI) instance and have run StarNet multiple times. While debugging some performance issues I noticed that PI has 99 threads that look like the following:

Code:
    2383 Thread_19218631
    + 2383 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff7856240d]
    +   2383 _pthread_start  (in libsystem_pthread.dylib) + 66  [0x7fff78566249]
    +     2383 _pthread_body  (in libsystem_pthread.dylib) + 126  [0x7fff785632eb]
    +       2383 tensorflow::(anonymous namespace)::PThread::ThreadFn(void*)  (in libtensorflow_framework.2.dylib) + 104  [0x19f6049a8]
    +         2383 tensorflow::thread::EigenEnvironment::CreateThread(std::__1::function<void ()>)::'lambda'()::operator()() const  (in libtensorflow_framework.2.dylib) + 66  [0x19f6146b2]
    +           2383 Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)  (in libtensorflow_framework.2.dylib) + 589  [0x19f6149fd]
    +             2383 Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tensorflow::thread::EigenEnvironment::Task*)  (in libtensorflow_framework.2.dylib) + 870  [0x19f6152a6]
    +               2383 Eigen::EventCount::CommitWait(Eigen::EventCount::Waiter*)  (in libtensorflow_framework.2.dylib) + 229  [0x19f6155b5]
    +                 2383 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)  (in libc++.1.dylib) + 18  [0x7fff755a0a0a]
    +                   2383 _pthread_cond_wait  (in libsystem_pthread.dylib) + 722  [0x7fff7856656e]
    +                     2383 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff784a7866]

Is this indicative of StarNet not cleaning up after it finishes with TensorFlow?
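
For context, assuming StarNet drives libtensorflow through the TensorFlow C API (the libtensorflow_framework frames above suggest that, but I haven't checked the module source), "cleaning up" would look roughly like the sketch below; whether it actually tears down the Eigen workers depends on how the thread pools were created (per-session vs. process-global):

Code:
    // Rough sketch only, assuming the C API is used by StarNet.
    #include <tensorflow/c/c_api.h>

    void teardown_tf_session(TF_Session* session, TF_Graph* graph) {
      TF_Status* status = TF_NewStatus();
      TF_CloseSession(session, status);   // stop accepting new Run() calls
      TF_DeleteSession(session, status);  // free the session and its resources
      TF_DeleteGraph(graph);              // release the loaded model graph
      TF_DeleteStatus(status);
    }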

It seems like these are worker threads waiting for work, so it is probably OK for them to just be sitting there. But since my machine has 32 logical processors, shouldn't there be far fewer than 99 of them hanging around? I suppose it is possible that libtensorflow can't respect PI's max thread count without guidance.
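
If it is just a matter of guidance, the C API does let the caller cap both pools when the session options are built, by handing TF_SetConfig a serialized ConfigProto. A minimal sketch, hand-encoding the two varint fields so no protobuf dependency is needed (this is not necessarily what StarNet actually does):

Code:
    // Sketch: cap intra-op (ConfigProto field 2) and inter-op (field 5)
    // parallelism so TensorFlow doesn't default to one worker per logical core.
    // Values must stay below 128 for the single-byte varint encoding used here.
    #include <stdint.h>
    #include <tensorflow/c/c_api.h>

    TF_SessionOptions* capped_session_options(uint8_t n_threads) {
      uint8_t config[4] = { 0x10, n_threads,    // intra_op_parallelism_threads
                            0x28, n_threads };  // inter_op_parallelism_threads
      TF_Status* status = TF_NewStatus();
      TF_SessionOptions* opts = TF_NewSessionOptions();
      TF_SetConfig(opts, config, sizeof(config), status);
      // if TF_GetCode(status) != TF_OK, opts silently keeps library defaults
      TF_DeleteStatus(status);
      return opts;  // pass to TF_NewSession / TF_LoadSessionFromSavedModel
    }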

I find that while StarNet is running, the CPU is nowhere near 100% utilization, which makes me wonder whether TensorFlow has spawned far too many worker threads and they are thrashing on context switches.

rob
 
While running StarNet I see 96 WorkerLoop threads, which is coincidentally 3x the number of logical processors I have. This can't be right...
 
Thanks, yeah, I had been looking at their issues database. It also seems like they could be miscomputing the number of threads to run and/or not respecting the options that have been set with respect to the maximum thread count.

Some of these bugs have been open for years, so I guess Google just does not care. Oh well.
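
In the meantime, a possible workaround, assuming the bundled libtensorflow is new enough to honor these environment variables (I have not verified which version StarNet ships), is to cap the pools via the environment before the first session is created:

Code:
    // Sketch: recent TensorFlow builds consult these environment variables
    // when sizing their thread pools; they must be set before the first
    // session exists (or exported in the shell before launching PI).
    #include <stdlib.h>

    void cap_tf_threads_via_env(void) {
      setenv("TF_NUM_INTRAOP_THREADS", "8", 1);  // per-op Eigen worker pool
      setenv("TF_NUM_INTEROP_THREADS", "2", 1);  // op-scheduling pool
      setenv("OMP_NUM_THREADS", "8", 1);         // only relevant for MKL builds
    }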

rob
 