__thread fastpath not being taken

Chris Metcalf cmetcalf at tilera.com
Wed Dec 9 19:09:38 UTC 2009

The __tls_get_addr() routine is called by the compiler for __thread
accesses (barring optimizations, IE mode, etc).  But I'm confused by how
the "generation" code flow is supposed to work.

The dlopen() and dlclose() routines, for example, both increment the
global _dl_tls_generation counter after loading/unloading a shared
object with TLS entries, and __tls_get_addr() responds to this by having
each thread call into _dl_update_slotinfo(), passing the index for the
particular module from which they are getting a __thread variable.  But
if that module itself is older, i.e. its "gen" count is less than
_dl_tls_generation, it may well also be less than the dtv[] generation
counter (i.e. sufficiently up-to-date), and so _dl_update_slotinfo()
short-cuts out and returns.  But this just means that the next time we
reference a __thread variable from that module, we will have to take the
slow path again, and again and again, until finally the thread
references a __thread variable from a module with a newer generation
stamp, at which point it will bump up its dtv[] generation count, and
then no longer need to call into _dl_update_slotinfo().  This seems
pretty bad.

This is made much more obvious for just normal operation with no dlopen
calls since the ld.so initialization bumps up _dl_tls_generation in
_dl_get_ready_to_run(), since that's the first time we call init_tls()
and therefore the was_tls_init_tp_called variable is false.  So we set
_dl_tls_generation == 1, but all the actual slotinfo generation counts
are zero, and we never stop using the slowpath on __tls_get_addr!  This
is a pretty harsh performance penalty on "__thread" variables.

My proposed fix is something like the following, in dl-tls.c.  (I'm
running code patched up from the old SVN NPTL branch, but looking at the
latest GIT NPTL it seems pretty similar here.)  Obviously a real patch
would re-indent all the code contained within the scope for which I
removed the "if" test.

This code is all pretty subtle, so it's hard to be sure I haven't missed
something.  In any case, does this seem plausible?

@@ -715,23 +715,25 @@

   while (idx >= listp->len)
       idx -= listp->len;
       listp = listp->next;

-  if (dtv[0].counter < listp->slotinfo[idx].gen)
-      /* The generation counter for the slot is higher than what the
-        current dtv implements.  We have to update the whole dtv but
-        only those entries with a generation counter <= the one for
-        the entry we need.  */
-      size_t new_gen = listp->slotinfo[idx].gen;
+      /* Update the whole dtv, but capture the current generation
+         counter value to compare against, so we can ignore entries
+         that are being created as we walk through the list. */
+      size_t new_gen = _dl_tls_generation;
       size_t total = 0;

+      /* Barrier to make sure the compiler has read _dl_tls_generation.
+         FIXME: there should be a portable macro to do this.  */
+      __asm__ __volatile__("": : :"memory");
       /* We have to look through the entire dtv slotinfo list.  */
       listp =  _dl_tls_dtv_slotinfo_list;
          size_t cnt;

          for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)

I may also have missed something with the need for a compiler barrier,
but it seems required by the spirit of "ignore incompletely set up entries".

Chris Metcalf, Tilera Corp.

More information about the uClibc mailing list