[Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

tndave tushar.n.dave at oracle.com
Wed Dec 28 21:55:20 UTC 2016



On 12/27/2016 04:40 PM, maowenan wrote:
>
>
>> -----Original Message-----
>> From: tndave [mailto:tushar.n.dave at oracle.com]
>> Sent: Wednesday, December 28, 2016 6:28 AM
>> To: maowenan; jeffrey.t.kirsher at intel.com; intel-wired-lan at lists.osuosl.org
>> Cc: netdev at vger.kernel.org; weiyongjun (A); Dingtianhong
>> Subject: Re: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
>>
>>
>>
>> On 12/26/2016 03:39 AM, maowenan wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: netdev-owner at vger.kernel.org
>>>> [mailto:netdev-owner at vger.kernel.org]
>>>> On Behalf Of Tushar Dave
>>>> Sent: Tuesday, December 06, 2016 1:07 AM
>>>> To: jeffrey.t.kirsher at intel.com; intel-wired-lan at lists.osuosl.org
>>>> Cc: netdev at vger.kernel.org
>>>> Subject: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
>>>>
>>>> Unlike previous generation NIC (e.g. ixgbe) i40e doesn't seem to have
>>>> standard CSR where PCIe relaxed ordering can be set. Without PCIe
>>>> relax ordering enabled, i40e performance is significantly low on SPARC.
>>>>
>>> [Mao Wenan]Hi Tushar, you have referred to i40e doesn't seem to have
>>> standard CSR to set PCIe relaxed ordering, this CSR like TX&Rx DCA Control
>> Register in 82599, right?
>> Yes.
>> i40e datasheet mentions some CSR that can be used to enable/disable PCIe
>> relaxed ordering in device; however I don't see the exact definition of those
>> register in datasheet.
>> (https://www.mail-archive.com/netdev@vger.kernel.org/msg117219.html).
>>
>>> Is DMA_ATTR_WEAK_ORDERING the same as TX&RX control register in
>> 82599?
>> No.
>> DMA_ATTR_WEAK_ORDERING applies to the PCIe root complex of the system.
>>
>> -Tushar
>
> I understand that the PCIe Root Complex is the Host Bridge in the CPU that
> connects the CPU and memory to the PCIe architecture. So this attribute
> DMA_ATTR_WEAK_ORDERING is only applied on CPU side(the SPARC in you
> system), it can't apply on i40e, is it right?
Yes.
> And it is not the same as 82599 DCA control register's relax ordering bits.
It is not same as 82599 DCA control register's relax ordering bits.

-Tushar

> -Mao Wenan
>
>>>
>>> And to enable relax ordering mode in 82599 for SPARC using below codes:
>>> s32 ixgbe_start_hw_gen2(struct ixgbe_hw *hw) {
>>> 	u32 i;
>>>
>>> 	/* Clear the rate limiters */
>>> 	for (i = 0; i < hw->mac.max_tx_queues; i++) {
>>> 		IXGBE_WRITE_REG(hw, IXGBE_RTTDQSEL, i);
>>> 		IXGBE_WRITE_REG(hw, IXGBE_RTTBCNRC, 0);
>>> 	}
>>> 	IXGBE_WRITE_FLUSH(hw);
>>>
>>> #ifndef CONFIG_SPARC
>>> 	/* Disable relaxed ordering */
>>> 	for (i = 0; i < hw->mac.max_tx_queues; i++) {
>>> 		u32 regval;
>>>
>>> 		regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL_82599(i));
>>> 		regval &= ~IXGBE_DCA_TXCTRL_DESC_WRO_EN;
>>> 		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL_82599(i), regval);
>>> 	}
>>>
>>> 	for (i = 0; i < hw->mac.max_rx_queues; i++) {
>>> 		u32 regval;
>>>
>>> 		regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
>>> 		regval &= ~(IXGBE_DCA_RXCTRL_DATA_WRO_EN |
>>> 			    IXGBE_DCA_RXCTRL_HEAD_WRO_EN);
>>> 		IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
>>> 	}
>>> #endif
>>> 	return 0;
>>> }
>>>
>>>
>>>
>>>> This patch sets PCIe relax ordering for SPARC arch by setting dma
>>>> attr DMA_ATTR_WEAK_ORDERING for every tx and rx DMA map/unmap.
>>>> This has shown 10x increase in performance numbers.
>>>>
>>>> e.g.
>>>> iperf TCP test with 10 threads on SPARC S7
>>>>
>>>> Test 1: Without this patch
>>>>
>>>> [root at brm-snt1-03 net]# iperf -s
>>>> ------------------------------------------------------------
>>>> Server listening on TCP port 5001
>>>> TCP window size: 85.3 KByte (default)
>>>> ------------------------------------------------------------
>>>> [  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40926 [
>>>> 5] local
>>>> 16.0.0.7 port 5001 connected with 16.0.0.1 port 40934 [  6] local
>>>> 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40930 [  7] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40928 [  8] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40922 [  9] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40932 [ 10] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40920 [ 11] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40924 [ 14] local
>>>> 16.0.0.7 port 5001 connected with 16.0.0.1 port 40982 [ 12] local
>>>> 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 40980
>>>> [ ID] Interval       Transfer     Bandwidth
>>>> [  4]  0.0-20.0 sec   566 MBytes   237 Mbits/sec
>>>> [  5]  0.0-20.0 sec   532 MBytes   223 Mbits/sec
>>>> [  6]  0.0-20.0 sec   537 MBytes   225 Mbits/sec
>>>> [  8]  0.0-20.0 sec   546 MBytes   229 Mbits/sec
>>>> [ 11]  0.0-20.0 sec   592 MBytes   248 Mbits/sec
>>>> [  7]  0.0-20.0 sec   539 MBytes   226 Mbits/sec
>>>> [  9]  0.0-20.0 sec   572 MBytes   240 Mbits/sec
>>>> [ 10]  0.0-20.0 sec   604 MBytes   253 Mbits/sec
>>>> [ 14]  0.0-20.0 sec   567 MBytes   238 Mbits/sec
>>>> [ 12]  0.0-20.0 sec   511 MBytes   214 Mbits/sec
>>>> [SUM]  0.0-20.0 sec  5.44 GBytes  2.33 Gbits/sec
>>>>
>>>> Test 2: with this patch:
>>>>
>>>> [root at brm-snt1-03 net]# iperf -s
>>>> ------------------------------------------------------------
>>>> Server listening on TCP port 5001
>>>> TCP window size: 85.3 KByte (default)
>>>> ------------------------------------------------------------
>>>> TCP: request_sock_TCP: Possible SYN flooding on port 5001. Sending
>> cookies.
>>>> Check SNMP counters.
>>>> [  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46876 [
>>>> 5] local
>>>> 16.0.0.7 port 5001 connected with 16.0.0.1 port 46874 [  6] local
>>>> 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46872 [  7] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46880 [  8] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46878 [  9] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46884 [ 10] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46886 [ 11] local 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46890 [ 12] local
>>>> 16.0.0.7 port 5001 connected with 16.0.0.1 port 46888 [ 13] local
>>>> 16.0.0.7 port
>>>> 5001 connected with 16.0.0.1 port 46882
>>>> [ ID] Interval       Transfer     Bandwidth
>>>> [  4]  0.0-20.0 sec  7.45 GBytes  3.19 Gbits/sec [  5]  0.0-20.0 sec
>>>> 7.48 GBytes  3.21 Gbits/sec [  7]  0.0-20.0 sec  7.34 GBytes  3.15
>>>> Gbits/sec [  8]  0.0-20.0 sec  7.42 GBytes  3.18 Gbits/sec [  9]
>>>> 0.0-20.0 sec  7.24 GBytes  3.11 Gbits/sec [ 10]  0.0-20.0 sec  7.40
>>>> GBytes  3.17 Gbits/sec [ 12]  0.0-20.0 sec  7.49 GBytes  3.21
>>>> Gbits/sec [  6]  0.0-20.0 sec  7.30 GBytes  3.13 Gbits/sec [ 11]
>>>> 0.0-20.0 sec  7.44 GBytes  3.19 Gbits/sec [ 13]  0.0-20.0 sec  7.22
>>>> GBytes  3.10 Gbits/sec [SUM]  0.0-20.0 sec  73.8 GBytes  31.6
>>>> Gbits/sec
>>>>
>>>> NOTE: In my testing, this patch does _not_ show any harm to i40e
>>>> performance numbers on x86.
>>>>
>>>> Signed-off-by: Tushar Dave <tushar.n.dave at oracle.com>
>>>> ---
>>>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 69
>>>> ++++++++++++++++++++---------
>>>> ++++++++++++++++++++drivers/net/ethernet/intel/i40e/i40e_txrx.h |
>>>> 1 +
>>>>  2 files changed, 49 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>>> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>>> index 6287bf6..800dca7 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>>> @@ -551,15 +551,17 @@ static void
>>>> i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
>>>>  		else
>>>>  			dev_kfree_skb_any(tx_buffer->skb);
>>>>  		if (dma_unmap_len(tx_buffer, len))
>>>> -			dma_unmap_single(ring->dev,
>>>> -					 dma_unmap_addr(tx_buffer, dma),
>>>> -					 dma_unmap_len(tx_buffer, len),
>>>> -					 DMA_TO_DEVICE);
>>>> +			dma_unmap_single_attrs(ring->dev,
>>>> +					       dma_unmap_addr(tx_buffer, dma),
>>>> +					       dma_unmap_len(tx_buffer, len),
>>>> +					       DMA_TO_DEVICE,
>>>> +					       ring->dma_attrs);
>>>>  	} else if (dma_unmap_len(tx_buffer, len)) {
>>>> -		dma_unmap_page(ring->dev,
>>>> -			       dma_unmap_addr(tx_buffer, dma),
>>>> -			       dma_unmap_len(tx_buffer, len),
>>>> -			       DMA_TO_DEVICE);
>>>> +		dma_unmap_single_attrs(ring->dev,
>>>> +				       dma_unmap_addr(tx_buffer, dma),
>>>> +				       dma_unmap_len(tx_buffer, len),
>>>> +				       DMA_TO_DEVICE,
>>>> +				       ring->dma_attrs);
>>>>  	}
>>>>
>>>>  	tx_buffer->next_to_watch = NULL;
>>>> @@ -662,6 +664,8 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>>>>  	struct i40e_tx_buffer *tx_buf;
>>>>  	struct i40e_tx_desc *tx_head;
>>>>  	struct i40e_tx_desc *tx_desc;
>>>> +	dma_addr_t addr;
>>>> +	size_t size;
>>>>  	unsigned int total_bytes = 0, total_packets = 0;
>>>>  	unsigned int budget = vsi->work_limit;
>>>>
>>>> @@ -696,10 +700,11 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>>>>  		napi_consume_skb(tx_buf->skb, napi_budget);
>>>>
>>>>  		/* unmap skb header data */
>>>> -		dma_unmap_single(tx_ring->dev,
>>>> -				 dma_unmap_addr(tx_buf, dma),
>>>> -				 dma_unmap_len(tx_buf, len),
>>>> -				 DMA_TO_DEVICE);
>>>> +		dma_unmap_single_attrs(tx_ring->dev,
>>>> +				       dma_unmap_addr(tx_buf, dma),
>>>> +				       dma_unmap_len(tx_buf, len),
>>>> +				       DMA_TO_DEVICE,
>>>> +				       tx_ring->dma_attrs);
>>>>
>>>>  		/* clear tx_buffer data */
>>>>  		tx_buf->skb = NULL;
>>>> @@ -717,12 +722,15 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>>>>  				tx_desc = I40E_TX_DESC(tx_ring, 0);
>>>>  			}
>>>>
>>>> +			addr = dma_unmap_addr(tx_buf, dma);
>>>> +			size = dma_unmap_len(tx_buf, len);
>>>>  			/* unmap any remaining paged data */
>>>>  			if (dma_unmap_len(tx_buf, len)) {
>>>> -				dma_unmap_page(tx_ring->dev,
>>>> -					       dma_unmap_addr(tx_buf, dma),
>>>> -					       dma_unmap_len(tx_buf, len),
>>>> -					       DMA_TO_DEVICE);
>>>> +				dma_unmap_single_attrs(tx_ring->dev,
>>>> +						       addr,
>>>> +						       size,
>>>> +						       DMA_TO_DEVICE,
>>>> +						       tx_ring->dma_attrs);
>>>>  				dma_unmap_len_set(tx_buf, len, 0);
>>>>  			}
>>>>  		}
>>>> @@ -1010,6 +1018,11 @@ int i40e_setup_tx_descriptors(struct i40e_ring
>>>> *tx_ring)
>>>>  	 */
>>>>  	tx_ring->size += sizeof(u32);
>>>>  	tx_ring->size = ALIGN(tx_ring->size, 4096);
>>>> +#ifdef CONFIG_SPARC
>>>> +	tx_ring->dma_attrs = DMA_ATTR_WEAK_ORDERING; #else
>>>> +	tx_ring->dma_attrs = 0;
>>>> +#endif
>>>>  	tx_ring->desc = dma_alloc_coherent(dev, tx_ring->size,
>>>>  					   &tx_ring->dma, GFP_KERNEL);
>>>>  	if (!tx_ring->desc) {
>>>> @@ -1053,7 +1066,11 @@ void i40e_clean_rx_ring(struct i40e_ring
>> *rx_ring)
>>>>  		if (!rx_bi->page)
>>>>  			continue;
>>>>
>>>> -		dma_unmap_page(dev, rx_bi->dma, PAGE_SIZE,
>>>> DMA_FROM_DEVICE);
>>>> +		dma_unmap_single_attrs(dev,
>>>> +				       rx_bi->dma,
>>>> +				       PAGE_SIZE,
>>>> +				       DMA_FROM_DEVICE,
>>>> +				       rx_ring->dma_attrs);
>>>>  		__free_pages(rx_bi->page, 0);
>>>>
>>>>  		rx_bi->page = NULL;
>>>> @@ -1113,6 +1130,11 @@ int i40e_setup_rx_descriptors(struct i40e_ring
>>>> *rx_ring)
>>>>  	/* Round up to nearest 4K */
>>>>  	rx_ring->size = rx_ring->count * sizeof(union i40e_32byte_rx_desc);
>>>>  	rx_ring->size = ALIGN(rx_ring->size, 4096);
>>>> +#ifdef CONFIG_SPARC
>>>> +	rx_ring->dma_attrs = DMA_ATTR_WEAK_ORDERING; #else
>>>> +	rx_ring->dma_attrs = 0;
>>>> +#endif
>>>>  	rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
>>>>  					   &rx_ring->dma, GFP_KERNEL);
>>>>
>>>> @@ -1182,7 +1204,8 @@ static bool i40e_alloc_mapped_page(struct
>>>> i40e_ring *rx_ring,
>>>>  	}
>>>>
>>>>  	/* map page for use */
>>>> -	dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE,
>>>> DMA_FROM_DEVICE);
>>>> +	dma = dma_map_single_attrs(rx_ring->dev, page_address(page),
>>>> PAGE_SIZE,
>>>> +				   DMA_FROM_DEVICE, rx_ring->dma_attrs);
>>>>
>>>>  	/* if mapping failed free memory back to system since
>>>>  	 * there isn't much point in holding memory we can't use @@
>> -1695,8
>>>> +1718,11 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring
>>>> +*rx_ring,
>>>>  		rx_ring->rx_stats.page_reuse_count++;
>>>>  	} else {
>>>>  		/* we are not reusing the buffer so unmap it */
>>>> -		dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
>>>> -			       DMA_FROM_DEVICE);
>>>> +		dma_unmap_single_attrs(rx_ring->dev,
>>>> +				       rx_buffer->dma,
>>>> +				       PAGE_SIZE,
>>>> +				       DMA_FROM_DEVICE,
>>>> +				       rx_ring->dma_attrs);
>>>>  	}
>>>>
>>>>  	/* clear contents of buffer_info */ @@ -2737,7 +2763,8 @@ static
>>>> inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff
>>>> *skb,
>>>>  	first->skb = skb;
>>>>  	first->tx_flags = tx_flags;
>>>>
>>>> -	dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);
>>>> +	dma = dma_map_single_attrs(tx_ring->dev, skb->data, size,
>>>> +				   DMA_TO_DEVICE, tx_ring->dma_attrs);
>>>>
>>>>  	tx_desc = I40E_TX_DESC(tx_ring, i);
>>>>  	tx_bi = first;
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>>> b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>>> index 5088405..9a86212 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>>> @@ -327,6 +327,7 @@ struct i40e_ring {
>>>>
>>>>  	unsigned int size;		/* length of descriptor ring in bytes */
>>>>  	dma_addr_t dma;			/* physical address of ring */
>>>> +	unsigned long dma_attrs;	/* DMA attributes */
>>>>
>>>>  	struct i40e_vsi *vsi;		/* Backreference to associated VSI */
>>>>  	struct i40e_q_vector *q_vector;	/* Backreference to associated
>> vector
>>>> */
>>>> --
>>>> 1.9.1
>>>
>>>
>


More information about the Intel-wired-lan mailing list