[Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

Lorenzo Bianconi lorenzo at kernel.org
Fri Sep 27 15:06:53 UTC 2024


On Sep 27, Arthur Fabre wrote:
> On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
> > "Arthur Fabre" <afabre at cloudflare.com> writes:
> >
> > >> >> The nice thing about an API like this, though, is that it's extensible,
> > >> >> and the kernel itself can be just another consumer of it for the
> > >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> > >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> > >> >> functions as above from within the kernel to set and get those values;
> > >> >> using the registry, there could even be an option to turn those off if
> > >> >> an application wants more space for its own usage. Or, alternatively, we
> > >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> > >> >> just use the getter/setter functions as the BPF API for accessing them.
> > >> >
> > >> > That's exactly what I'm thinking of too, a simple API like:
> > >> >
> > >> > get(u8 key, u8 len, void *val);
> > >> > set(u8 key, u8 len, void *val);
> > >> >
> > >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> > >> >
> > >> > If a NIC doesn't support a certain well-known metadata, the key
> > >> > wouldn't be set, and get() would return ENOENT.
> > >> >
> > >> > I think this also lets us avoid having to "register" keys or bits of
> > >> > metadata with the kernel.
> > >> > We'd reserve some number of keys for hardware metadata.
> > >>
> > >> Right, but how do you allocate space/offset for each key without an
> > >> explicit allocation step? You'd basically have to encode the list of IDs
> > >> in the metadata area itself, which implies a TLV format that you have to
> > >> walk on every access? The registry idea in my example above was
> > >> basically to avoid that...
> > >
> > > I've been playing around with having a small fixed header at the front
> > > of the metadata itself, that lets you access values without walking them
> > > all.
> > >
> > > Still WIP, and maybe this is too restrictive, but it lets you encode 64
> > > 2, 4, or 8 byte KVs with a single 16 byte header:
> >
> > Ohh, that's clever, I like it! :)
> >
> > It's also extensible in the sense that the internal representation can
> > change without impacting the API, so if we end up needing more bits we
> > can always add those.
> >
> > Maybe it would be a good idea to make the 'key' parameter a larger
> > integer type (function parameters are always 64-bit anyway, so might as
> > well go all the way up to u64)? That way we can use higher values for
> > the kernel-reserved types instead of reserving part of the already-small
> > key space for applications (assuming the kernel-internal data is stored
> > somewhere else, like in a static struct as in Lorenzo's patch)?
> 
> Good idea! That makes it more extensible too if we ever support more
> keys or bigger lengths.
> 
> I'm not sure where the kernel-reserved types should live. Putting them
> in here uses up some the of KV IDs, but it uses less head room space than 
> always reserving a static struct for them.
> Maybe it doesn't matter much, as long as we use the same API to access
> them, we could internally switch between one and the other.
> 
> >
> > [...]
> >
> > >> > The remaining keys would be up to users. They'd have to allocate keys
> > >> > to services, and configure services to use those keys.
> > >> > This is similar to the way listening on a certain port works: only one
> > >> > service can use port 80 or 443, and that can typically beconfigured in
> > >> > a service's config file.
> > >>
> > >> Right, well, port numbers *do* actually have an out of band service
> > >> registry (IANA), which I thought was what we wanted to avoid? ;)
> > >
> > > Depends how you think about it ;)
> > >
> > > I think we should avoid a global registry. But having a registry per
> > > deployment / server doesn't seem awful. Services that want to use a
> > > field would have a config file setting to set which index it corresponds
> > > to.
> > > Admins would just have to pick a free one on their system, and set it in
> > > the config file of the service.
> > >
> > > This is similar to what we do for non-IANA registered ports internally.
> > > For example each service needs a port on an internal interface to expose
> > > metrics, and we just track which ports are taken in our config
> > > management.
> >
> > Right, this would work, but it assumes that applications are
> > well-behaved and do this correctly. Which they probably do in a
> > centrally-managed system like yours, but for random applications shipped
> > by distros, I'm not sure if it's going to work.
> >
> > In fact, it's more or less the situation we have with skb->mark today,
> > isn't it? I.e., applications can co-exist as long as they don't use the
> > same bits, so they have to coordinate on which bits to use. Sure, with
> > this scheme there will be more total bits to use, which can lessen the
> > pressure somewhat, but the basic problem remains. In other words, I
> > worry that in practice we will end up with another github repository
> > serving as a registry for metadata keys...
> 
> That's true. If applications hardcode keys, we'll be back to having
> conflicts. If someone creates a registry on github I'll be very sad.
> 
> (Maybe we can make the verifier enforce that the key passed to get() and
> set() isn't a constant? - only joking)
> 
> Applications don't tend to do this for ports though, I think most can be
> configured to listen on any port. Is that just because it's been
> instilled as "good practice" over time? Could we try to do the same with
> some very stern documentation and examples?
> 
> Thinking about it more, my only relectance for a registration API is how
> to communicate the ID back to other consumers (our discussion below).
> 
> >
> > > Dynamically registering fields means you have to share the returned ID
> > > with whoever is interested, which sounds tricky.
> > > If an XDP program sets a field like packet_id, every tracing
> > > program that looks at it, and userspace service, would need to know what
> > > the ID of that field is.
> > > Is there a way to easily share that ID with all of them?
> >
> > Right, so I'll admit this was one of the handwavy bits of my original
> > proposal, but I don't think it's unsolvable. You could do something like
> > (once, on application initialisation):
> >
> > __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> > bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >
> > and then just get the key out of that map from all programs that want to
> > use it?
> 
> Passing it out of band works (whether it's via a pinned map like you
> described, or through other means like a unix socket or some other
> API), it's just more complicated.
> 
> Every consumer also needs to know about that API. That won't work with
> standard tools. For example if we set a PACKET_ID KV, maybe we could
> give it to pwru so it could track packets using it?
> Without registering keys, we could pass it as a cli flag. With
> registration, we'd have to have some helper to get the KV ID.
> 
> It also introduces ordering dependencies between the services on
> startup, eg packet tracing hooks could only be attached once our XDP
> service has registered a PACKET_ID KV, and they could query it's API.
> 
> >
> > We could combine such a registration API with your header format, so
> > that the registration just becomes a way of allocating one of the keys
> > from 0-63 (and the registry just becomes a global copy of the header).
> > This would basically amount to moving the "service config file" into the
> > kernel, since that seems to be the only common denominator we can rely
> > on between BPF applications (as all attempts to write a common daemon
> > for BPF management have shown).
> 
> That sounds reasonable. And I guess we'd have set() check the global
> registry to enforce that the key has been registered beforehand?
> 
> >
> > -Toke
> 
> Thanks for all the feedback!

I like this 'fast' KV approach but I guess we should really evaluate its
impact on performances (especially for xdp) since, based on the kfunc calls
order in the ebpf program, we can have one or multiple memmove/memcpy for
each packet, right?

Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
so suitable for nic hw metadata since:
- it grows backward 
- it is probably in a different cacheline with respect to xdp_frame
- nic hw metadata will not start at fixed and immutable address, but it depends
  on the running ebpf program

What about having something like:
- fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
  of the metadata area :)). Here he can reuse the same KV approach if it is fast
- user defined metadata: in the metadata area of the xdp_frame/xdp_buff

Regards,
Lorenzo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: not available
URL: <http://lists.osuosl.org/pipermail/intel-wired-lan/attachments/20240927/10f84a42/attachment-0001.asc>


More information about the Intel-wired-lan mailing list