i have more than a month of uptime on my current PI instance and have run starnet multiple times. i was debugging some performance issues and i noticed that PI has 99 threads that look like the following:
Code:
2383 Thread_19218631
+ 2383 thread_start (in libsystem_pthread.dylib) + 13 [0x7fff7856240d]
+ 2383 _pthread_start (in libsystem_pthread.dylib) + 66 [0x7fff78566249]
+ 2383 _pthread_body (in libsystem_pthread.dylib) + 126 [0x7fff785632eb]
+ 2383 tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) (in libtensorflow_framework.2.dylib) + 104 [0x19f6049a8]
+ 2383 tensorflow::thread::EigenEnvironment::CreateThread(std::__1::function<void ()>)::'lambda'()::operator()() const (in libtensorflow_framework.2.dylib) + 66 [0x19f6146b2]
+ 2383 Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) (in libtensorflow_framework.2.dylib) + 589 [0x19f6149fd]
+ 2383 Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tensorflow::thread::EigenEnvironment::Task*) (in libtensorflow_framework.2.dylib) + 870 [0x19f6152a6]
+ 2383 Eigen::EventCount::CommitWait(Eigen::EventCount::Waiter*) (in libtensorflow_framework.2.dylib) + 229 [0x19f6155b5]
+ 2383 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) (in libc++.1.dylib) + 18 [0x7fff755a0a0a]
+ 2383 _pthread_cond_wait (in libsystem_pthread.dylib) + 722 [0x7fff7856656e]
+ 2383 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff784a7866]
is this indicative of starnet not cleaning up after finishing with tensorflow?
these look like worker threads waiting for work, so it is probably OK for them to just be sitting there. but since my machine has 32 logical processors, shouldn't there be fewer than 99 of these hanging around? i suppose it is possible that libtensorflow doesn't respect PI's max thread count without explicit guidance.
i find that when starnet is running the CPU is nowhere near 100% utilization, which makes me wonder if there are just way too many worker threads for tensorflow and they are thrashing like crazy on context switches.
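for what it's worth, the parked-worker pattern in that stack (WaitForWork → CommitWait → condition_variable::wait) really does cost almost nothing while idle: the thread sleeps in the kernel until it is signaled. here is a hedged stdlib sketch of that pattern, not tensorflow's actual Eigen pool (ToyPool and its sizes are made up for illustration):

```python
import queue
import threading

# Minimal worker pool: idle threads block inside Queue.get(), which
# parks them on a condition variable, analogous to Eigen's WaitForWork.
class ToyPool:
    def __init__(self, num_workers):
        self.tasks = queue.Queue()
        self.workers = [
            threading.Thread(target=self._worker_loop, daemon=True)
            for _ in range(num_workers)
        ]
        for w in self.workers:
            w.start()

    def _worker_loop(self):
        while True:
            task = self.tasks.get()   # sleeps here when there is no work
            if task is None:          # sentinel: shut this worker down
                break
            task()

    def submit(self, fn):
        self.tasks.put(fn)

    def shutdown(self):
        for _ in self.workers:       # one sentinel per worker
            self.tasks.put(None)
        for w in self.workers:
            w.join()

pool = ToyPool(num_workers=99)       # 99 parked threads, near-zero CPU
done = threading.Event()
pool.submit(done.set)                # exactly one worker wakes to run this
done.wait(timeout=5)
pool.shutdown()
print(done.is_set())
```

the 99 sleeping threads here mostly cost address space for their stacks, not CPU time; the real question is how many of them are ever runnable at once.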
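if oversubscription is the problem, recent libtensorflow builds read a few environment variables at startup that cap the pool sizes. hedged sketch: the variable names below are what i believe tensorflow 2.x honors (verify against the version starnet ships), and since PI loads the library, they would need to be set before PixInsight launches:

```python
import os

# Hedged sketch: libtensorflow reads these when it initializes, so
# they must be in the environment before the process loads the library.
# Values assume a 32-logical-core machine.
os.environ["TF_NUM_INTRAOP_THREADS"] = "32"  # per-op parallelism (Eigen pool)
os.environ["TF_NUM_INTEROP_THREADS"] = "2"   # concurrent independent ops
os.environ["OMP_NUM_THREADS"] = "32"         # only relevant for MKL builds
```

in practice this means exporting the variables in the shell (or launchd plist) that starts PixInsight, since a script run inside PI would set them too late.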
rob