[Intel-wired-lan] [PATCH] fm10k: Update ITR scaling mechanism based on PCIe link speed
Alexander Duyck
alexander.h.duyck at redhat.com
Thu Apr 30 02:20:55 UTC 2015
On 04/29/2015 05:10 PM, Jacob Keller wrote:
> Red Rock Canyon's interrupt throttle timers are based on the PCIe link
> speed. Because of this, the value being programmed into the ITR
> registers must be scaled.
>
> For the PF, this is as simple as reading the PCIe link speed and storing
> the result. However, in the case of SR-IOV, the VF's interrupt throttle
> timers are based on the link speed of the PF. However, the VF is unable
> to get the link speed information from its configuration space, so the
> PF must inform it of what scale to use.
>
> Rather than pass this scale via mailbox message, take advantage of
> unused bits in the TDLEN register to pass the scale. It is the
> responsibility of the PF to program this for the VF while setting up the
> VF queues and the responsibility of the VF to get the information
> accordingly. This is preferable because it allows the VF to set up the
> interrupts properly during initialization and matches how the MAC
> address is passed in the TDBAL/TDBAH registers.
>
> These changes also resolve an issue with the existing ITR scale being
> too restrictive. It would throttle down to 1300ints/s at 1538 byte
> frames. This patch modifies the algorithm to allow for a higher range of
> interrupts per second, from roughly 25000int/s up to 100000int/s
> depending on our detection of the workload.
>
> This patch fixes a single receiving TCP_STREAM in netperf:
> - Before: 450 Mbps
> - After: 20,000 Mbps
>
> Signed-off-by: Jacob Keller <jacob.e.keller at intel.com>
> Discovered-by: Matthew Vick <matthew.vick at intel.com>
Should probably be Reported-by, since I don't think Discovered-by is a
standard convention.
Also I don't think I see any of the changes I suggested. It seems like
this could still divide the interrupt throttle rate all the way down to 0.
> ---
> drivers/net/ethernet/intel/fm10k/fm10k.h | 4 ++++
> drivers/net/ethernet/intel/fm10k/fm10k_main.c | 17 +++++++++++++----
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 12 ++++++++++--
> drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 18 +++++++++++++++++-
> drivers/net/ethernet/intel/fm10k/fm10k_type.h | 9 +++++++++
> drivers/net/ethernet/intel/fm10k/fm10k_vf.c | 15 +++++++++++++--
> 6 files changed, 66 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k.h b/drivers/net/ethernet/intel/fm10k/fm10k.h
> index c8c8c5baefda..f8ff8b409df8 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k.h
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k.h
> @@ -131,6 +131,7 @@ struct fm10k_ring {
> * different for DCB and RSS modes
> */
> u8 qos_pc; /* priority class of queue */
> + u8 itr_scale; /* throttle scaler based on PCI speed */
> u16 vid; /* default vlan ID of queue */
> u16 count; /* amount of descriptors */
>
This really belongs in the fm10k_ring_container rather than in the ring
itself.
> @@ -164,6 +165,9 @@ struct fm10k_ring_container {
> #define FM10K_ITR_10K 100 /* 100us */
> #define FM10K_ITR_20K 50 /* 50us */
> #define FM10K_ITR_ADAPTIVE 0x8000 /* adaptive interrupt moderation flag */
> +#define FM10K_ITR_SCALE_SMALL 60 /* constant factor for small frames */
> +#define FM10K_ITR_SCALE_MEDIUM 50 /* constant factor for medium frames */
> +#define FM10K_ITR_SCALE_LARGE 40 /* constant factor for large frames */
>
> #define FM10K_ITR_ENABLE (FM10K_ITR_AUTOMASK | FM10K_ITR_MASK_CLEAR)
>
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
> index 4eafdad318b4..b8d43964c171 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
> @@ -1369,11 +1369,20 @@ static void fm10k_update_itr(struct fm10k_ring_container *ring_container)
> if (avg_wire_size > 3000)
> avg_wire_size = 3000;
>
> - /* Give a little boost to mid-size frames */
> - if ((avg_wire_size > 300) && (avg_wire_size < 1200))
> - avg_wire_size /= 3;
> + /* Simple throttle rate management based on average wire size,
> + * providing boost to small and medium packet loads. Divide the
> + * average wire size by a constant to calculate the minimum time
> + * until the next interrupt in microseconds.
> + */
> + if (avg_wire_size < 300)
> + avg_wire_size /= FM10K_ITR_SCALE_SMALL;
> + else if ((avg_wire_size >= 300) && (avg_wire_size < 1200))
> + avg_wire_size /= FM10K_ITR_SCALE_MEDIUM;
> else
> - avg_wire_size /= 2;
> + avg_wire_size /= FM10K_ITR_SCALE_LARGE;
> +
I was never really a fan of these divides. Perhaps you could look into
replacing them with a multiply/shift combo. Then you could reduce it by
a fractional amount, and it should be faster than doing the divide. For
example you could probably do a multiply followed by a shift by 8 so
that you could do fractions of 1/256th. Then you could use the values
4/256, 5/256, and 6/256 for large, medium, and small.
> + /* Scale for various PCIe link speeds */
> + avg_wire_size /= ring_container->ring->itr_scale;
>
This should probably be a shift instead of a divide. Otherwise you can
pile up delays fairly quickly since a divide can take up to 17 cycles.
If you combined it with the math above you could just shift by itr_scale
+ 7.
To address the fact that the value can go all the way down to zero you
might want to add ((1 << (itr_scale + 7)) - 1) before you do that last
shift. That way you would always be rounding the usecs value up to at
least 1 as long as there is any value in avg_wire_size.
Also moving itr_scale into the ring container gets rid of the need for
itr_scale.
> /* write back value and retain adaptive flag */
> ring_container->itr = avg_wire_size | FM10K_ITR_ADAPTIVE;
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> index df9fda38bdd1..62e2d7203846 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> @@ -532,6 +532,9 @@ static void fm10k_configure_tx_ring(struct fm10k_intfc *interface,
> /* store tail pointer */
> ring->tail = &interface->uc_addr[FM10K_TDT(reg_idx)];
>
> + /* store ITR scale */
> + ring->itr_scale = hw->mac.itr_scale;
> +
> /* reset ntu and ntc to place SW in sync with hardwdare */
> ring->next_to_clean = 0;
> ring->next_to_use = 0;
> @@ -638,6 +641,9 @@ static void fm10k_configure_rx_ring(struct fm10k_intfc *interface,
> /* store tail pointer */
> ring->tail = &interface->uc_addr[FM10K_RDT(reg_idx)];
>
> + /* store ITR scale */
> + ring->itr_scale = hw->mac.itr_scale;
> +
> /* reset ntu and ntc to place SW in sync with hardwdare */
> ring->next_to_clean = 0;
> ring->next_to_use = 0;
The itr_scale should probably be a part of the ring_container rather
than the ring. Then you can track one per itr, one for Tx and one for Rx.
> @@ -824,7 +830,8 @@ static irqreturn_t fm10k_msix_mbx_vf(int __always_unused irq, void *data)
>
> /* re-enable mailbox interrupt and indicate 20us delay */
> fm10k_write_reg(hw, FM10K_VFITR(FM10K_MBX_VECTOR),
> - FM10K_ITR_ENABLE | FM10K_MBX_INT_DELAY);
> + FM10K_ITR_ENABLE | (FM10K_MBX_INT_DELAY /
> + hw->mac.itr_scale));
>
> /* service upstream mailbox */
> if (fm10k_mbx_trylock(interface)) {
> @@ -1029,7 +1036,8 @@ static irqreturn_t fm10k_msix_mbx_pf(int __always_unused irq, void *data)
>
> /* re-enable mailbox interrupt and indicate 20us delay */
> fm10k_write_reg(hw, FM10K_ITR(FM10K_MBX_VECTOR),
> - FM10K_ITR_ENABLE | FM10K_MBX_INT_DELAY);
> + FM10K_ITR_ENABLE | (FM10K_MBX_INT_DELAY /
> + hw->mac.itr_scale));
>
> return IRQ_HANDLED;
> }
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
> index 891e21874b2a..5d67321bd8a4 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
> @@ -144,19 +144,26 @@ static s32 fm10k_init_hw_pf(struct fm10k_hw *hw)
> FM10K_TPH_RXCTRL_HDR_WROEN);
> }
>
> - /* set max hold interval to align with 1.024 usec in all modes */
> + /* set max hold interval to align with 1.024 usec in all modes and
> + * store ITR scale
> + */
> switch (hw->bus.speed) {
> case fm10k_bus_speed_2500:
> dma_ctrl = FM10K_DMA_CTRL_MAX_HOLD_1US_GEN1;
> + hw->mac.itr_scale = FM10K_TDLEN_ITR_SCALE_GEN1;
> break;
> case fm10k_bus_speed_5000:
> dma_ctrl = FM10K_DMA_CTRL_MAX_HOLD_1US_GEN2;
> + hw->mac.itr_scale = FM10K_TDLEN_ITR_SCALE_GEN2;
> break;
> case fm10k_bus_speed_8000:
> dma_ctrl = FM10K_DMA_CTRL_MAX_HOLD_1US_GEN3;
> + hw->mac.itr_scale = FM10K_TDLEN_ITR_SCALE_GEN3;
> break;
> default:
> dma_ctrl = 0;
> + /* just in case, assume Gen3 ITR scale */
> + hw->mac.itr_scale = FM10K_TDLEN_ITR_SCALE_GEN3;
> break;
> }
>
> @@ -910,6 +917,12 @@ static s32 fm10k_iov_assign_default_mac_vlan_pf(struct fm10k_hw *hw,
> fm10k_write_reg(hw, FM10K_TDBAL(vf_q_idx), tdbal);
> fm10k_write_reg(hw, FM10K_TDBAH(vf_q_idx), tdbah);
>
> + /* Provide the VF the ITR scale, using software-defined fields in TDLEN
> + * to pass the information during VF initialization
> + */
> + fm10k_write_reg(hw, FM10K_TDLEN(vf_q_idx), hw->mac.itr_scale <<
> + FM10K_TDLEN_ITR_SCALE_SHIFT);
> +
> err_out:
> /* configure Queue control register */
> txqctl = ((u32)vf_vid << FM10K_TXQCTL_VID_SHIFT) &
> @@ -1042,6 +1055,9 @@ static s32 fm10k_iov_reset_resources_pf(struct fm10k_hw *hw,
> for (i = queues_per_pool; i--;) {
> fm10k_write_reg(hw, FM10K_TDBAL(vf_q_idx + i), tdbal);
> fm10k_write_reg(hw, FM10K_TDBAH(vf_q_idx + i), tdbah);
> + fm10k_write_reg(hw, FM10K_TDLEN(vf_q_idx + i),
> + hw->mac.itr_scale <<
> + FM10K_TDLEN_ITR_SCALE_SHIFT);
> fm10k_write_reg(hw, FM10K_TQMAP(qmap_idx + i), vf_q_idx + i);
> fm10k_write_reg(hw, FM10K_RQMAP(qmap_idx + i), vf_q_idx + i);
> }
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_type.h b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
> index 4af96686c584..b8a2e5142041 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_type.h
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
> @@ -271,6 +271,14 @@ struct fm10k_hw;
> #define FM10K_TDBAL(_n) ((0x40 * (_n)) + 0x8000)
> #define FM10K_TDBAH(_n) ((0x40 * (_n)) + 0x8001)
> #define FM10K_TDLEN(_n) ((0x40 * (_n)) + 0x8002)
> +/* Reuse some empty space prior to VF initialization in order for the VF to
> + * have enough knowledge to correctly program interrupt throttle rates.
> + */
> +#define FM10K_TDLEN_ITR_SCALE_SHIFT 9
> +#define FM10K_TDLEN_ITR_SCALE_MASK 0x00000E00
> +#define FM10K_TDLEN_ITR_SCALE_GEN1 4
> +#define FM10K_TDLEN_ITR_SCALE_GEN2 2
> +#define FM10K_TDLEN_ITR_SCALE_GEN3 1
> #define FM10K_TPH_TXCTRL(_n) ((0x40 * (_n)) + 0x8003)
> #define FM10K_TPH_TXCTRL_DESC_TPHEN 0x00000020
> #define FM10K_TPH_TXCTRL_DESC_RROEN 0x00000200
You would be better off reporting this as a shift value. So Gen3 would
be 0, Gen 2 would be 1, and Gen 1 would be 2.
> @@ -560,6 +568,7 @@ struct fm10k_mac_info {
> bool get_host_state;
> bool tx_ready;
> u32 dglort_map;
> + u8 itr_scale;
> };
>
> struct fm10k_swapi_table_info {
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_vf.c b/drivers/net/ethernet/intel/fm10k/fm10k_vf.c
> index 94f0f6a146d9..c39eea58b124 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_vf.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_vf.c
> @@ -28,7 +28,7 @@
> static s32 fm10k_stop_hw_vf(struct fm10k_hw *hw)
> {
> u8 *perm_addr = hw->mac.perm_addr;
> - u32 bal = 0, bah = 0;
> + u32 bal = 0, bah = 0, tdlen;
> s32 err;
> u16 i;
>
> @@ -48,6 +48,9 @@ static s32 fm10k_stop_hw_vf(struct fm10k_hw *hw)
> ((u32)perm_addr[2]);
> }
>
> + /* Restore default itr_scale for next VF initialization */
> + tdlen = hw->mac.itr_scale << FM10K_TDLEN_ITR_SCALE_SHIFT;
> +
> /* The queues have already been disabled so we just need to
> * update their base address registers
> */
> @@ -56,6 +59,7 @@ static s32 fm10k_stop_hw_vf(struct fm10k_hw *hw)
> fm10k_write_reg(hw, FM10K_TDBAH(i), bah);
> fm10k_write_reg(hw, FM10K_RDBAL(i), bal);
> fm10k_write_reg(hw, FM10K_RDBAH(i), bah);
> + fm10k_write_reg(hw, FM10K_TDLEN(i), tdlen);
> }
>
> return 0;
> @@ -124,9 +128,16 @@ static s32 fm10k_init_hw_vf(struct fm10k_hw *hw)
> /* record maximum queue count */
> hw->mac.max_queues = i;
>
> - /* fetch default VLAN */
> + /* fetch default VLAN and ITR scale */
> hw->mac.default_vid = (fm10k_read_reg(hw, FM10K_TXQCTL(0)) &
> FM10K_TXQCTL_VID_MASK) >> FM10K_TXQCTL_VID_SHIFT;
> + hw->mac.itr_scale = (fm10k_read_reg(hw, FM10K_TDLEN(0)) &
> + FM10K_TDLEN_ITR_SCALE_MASK) >>
> + FM10K_TDLEN_ITR_SCALE_SHIFT;
> +
> + /* ensure a non-zero ITR scale, assume Gen3 scale if unknown */
> + if (!hw->mac.itr_scale)
> + hw->mac.itr_scale = FM10K_TDLEN_ITR_SCALE_GEN3;
>
> return 0;
> }
>
By replacing the previous values with a shift you would then
automatically default to Gen3 if the itr_scale value is 0. It would
also make it easy to verify the value is valid as well since anything 3
or more would be invalid.
More information about the Intel-wired-lan
mailing list