Re: [PATCH v6 03/24] rtw89: add core and trx files

From: Arnd Bergmann <arnd@arndb.de>
Date: 2021-10-06 07:33:24

On Wed, Oct 6, 2021 at 3:35 AM Pkshih [off-list ref] wrote:

quoted

Comparing the object code side by side, the two builds are almost the same
except for a few instructions. I think this is because the inline function
I marked __always_inline contains only a single simple statement.

Ok. Did you check the output for the configuration that showed the
problem as well, after adding __always_inline? There are certain
compile-time options that could cause the code to become unoptimized,
e.g. KASAN, in addition to the OPTIMIZE_FOR_SIZE.

Summary of object code size for each combination:

ccflag              default           -Os
======              =======           ===
inline              0x1AF             X
always_inline       0x1AA             0x1A4

With the default ccflags, the only difference between inline and
always_inline is a je/jne instruction for 'if (!desc_info->en_wd_info)'.
always_inline doesn't affect the part that uses RTW89_SET_TXWD().

Comparing across the always_inline row, the default-ccflags build uses
movzbl (4 bytes), while the -Os build uses mov (3 bytes).

From these results, -Os affects the object code size. always_inline doesn't
change the code itself, only the nearby je/jne instruction.

Those are the known-good cases, yes.

I use an Ubuntu kernel that doesn't enable KASAN.
# CONFIG_KASAN is not set

Ah, so you test using the driver backports package on a distro
kernel? While this may be a good option for your development
needs, I think it is generally a good idea to also be able to test
your patches against the latest mainline or linux-next kernel
directly, if only to ensure that there are no obvious regressions.

quoted

+#define RTW89_SET_TXWD_BODY_WP_OFFSET(txdesc, val) \
+ RTW89_SET_TXWD(txdesc, val, 0x00, GENMASK(31, 24))
+#define RTW89_SET_TXWD_BODY_MORE_DATA(txdesc, val) \
+ RTW89_SET_TXWD(txdesc, val, 0x00, BIT(23))
+#define RTW89_SET_TXWD_BODY_WD_INFO_EN(txdesc, val) \
+ RTW89_SET_TXWD(txdesc, val, 0x00, BIT(22))
+#define RTW89_SET_TXWD_BODY_FW_DL(txdesc, val) \
+ RTW89_SET_TXWD(txdesc, val, 0x00, BIT(20))

I would personally write this without the wrappers, instead defining the
bitmask macros as the masks and then open-coding the
le32p_replace_bits() calls instead, which I would find more
intuitive while it avoids the problem with the bitmasks.

Using these macros lets me address offsets and bit fields quickly.
How about I use a macro instead of an inline function? Like,

#define RTW89_SET_TXWD(txdesc, val, offset, mask) \
do { \
        u32 *txd32 = (u32 *)(txdesc); \
        le32p_replace_bits((__le32 *)(txd32 + (offset)), val, mask); \
} while (0)

That would obviously address the immediate bug, but I think
using le32p_replace_bits() directly here would actually be
more readable, after you define the descriptor layout using
a structure with named __le32 members to replace the offset.

I will remove the wrapper and use le32p_replace_bits() directly.
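
As a rough illustration of the open-coded style agreed on above, here is a userspace sketch. The mask names (TXWD_BODY0_*) and fill_txwd_body0() are hypothetical; only the bit positions come from the quoted patch. BIT(), GENMASK(), cpu_to_le32(), le32_to_cpu() and le32p_replace_bits() are simplified stand-ins written for this sketch so that it compiles outside the kernel, not the kernel's own definitions.

```c
#include <stdint.h>

/* Userspace stand-ins for <linux/bits.h>; not the kernel definitions. */
#define BIT(n)        (UINT32_C(1) << (n))
#define GENMASK(h, l) ((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))

/* Hypothetical mask names; bit positions are from the quoted patch. */
#define TXWD_BODY0_WP_OFFSET   GENMASK(31, 24)
#define TXWD_BODY0_MORE_DATA   BIT(23)
#define TXWD_BODY0_WD_INFO_EN  BIT(22)
#define TXWD_BODY0_FW_DL       BIT(20)

typedef uint32_t le32;  /* stand-in for __le32 */

/* Byte-wise conversions, so the sketch behaves the same on any host. */
static uint32_t le32_to_cpu(le32 v)
{
        const uint8_t *b = (const uint8_t *)&v;

        return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static le32 cpu_to_le32(uint32_t v)
{
        le32 out;
        uint8_t *b = (uint8_t *)&out;

        b[0] = (uint8_t)v;
        b[1] = (uint8_t)(v >> 8);
        b[2] = (uint8_t)(v >> 16);
        b[3] = (uint8_t)(v >> 24);
        return out;
}

/* Stand-in for le32p_replace_bits(): read the word, clear the field,
 * insert val at the field's position, and write the word back. */
static void le32p_replace_bits(le32 *p, uint32_t val, uint32_t mask)
{
        uint32_t lsb = mask & (~mask + 1);  /* lowest set bit of the mask */
        uint32_t word = le32_to_cpu(*p);

        word = (word & ~mask) | ((val * lsb) & mask);
        *p = cpu_to_le32(word);
}

/* Open-coded call sites: the mask name alone documents the field. */
static void fill_txwd_body0(le32 *dword0, uint32_t wp_offset,
                            uint32_t more_data, uint32_t wd_info_en,
                            uint32_t fw_dl)
{
        le32p_replace_bits(dword0, wp_offset, TXWD_BODY0_WP_OFFSET);
        le32p_replace_bits(dword0, more_data, TXWD_BODY0_MORE_DATA);
        le32p_replace_bits(dword0, wd_info_en, TXWD_BODY0_WD_INFO_EN);
        le32p_replace_bits(dword0, fw_dl, TXWD_BODY0_FW_DL);
}
```

Note that every call here is still a read-modify-write of the descriptor word, which is the cost discussed later in this thread.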

I don't plan to use a structure, because this data contains bit-fields.
I would then need to maintain both little- and big-endian layouts, like

struct foo {
#if BIG_ENDIAN
        __le32 msb:1;
        __le32 rsvd:30;
        __le32 lsb:1;
#else
        __le32 lsb:1;
        __le32 rsvd:30;
        __le32 msb:1;
#endif
};

Right, bitfields would not work well here, as they are generally not
portable. Using an "#ifdef __BIG_ENDIAN_BITFIELD" check can
work, but as you say this is really ugly.

What I was trying to suggest instead is a structure like

struct descriptor {
     __le32 word0;
     __le32 word1;
     __le32 word2;
     __le32 word3;
};

And then build the descriptor like (with proper naming of the fields of course)

void fill_descriptor(struct my_device *dev, struct sk_buff *skb,
                     volatile struct descriptor *d)
{
        d->word0 = build_desc_word0(fieldA, fieldB, fieldC, fieldD);
        d->word1 = build_desc_word1(fieldE, fieldF);
        ...
}

where the build_desc_word0() functions are the ones that encode the
actual layout, e.g. using the linux/bitfield.h helpers like

static inline __le32 build_desc_word0(u32 fieldA, u32 fieldB,
                                      u32 fieldC, u32 fieldD)
{
        u32 word = FIELD_PREP(REG_FIELD_A, fieldA) |
                   FIELD_PREP(REG_FIELD_B, fieldB) |
                   FIELD_PREP(REG_FIELD_C, fieldC) |
                   FIELD_PREP(REG_FIELD_D, fieldD);

        return cpu_to_le32(word);
}

Doing it this way has the advantage of keeping the assignment
separate, which makes sure you don't accidentally introduce
a read-modify-write cycle on the descriptor. This should work
well on all architectures using dma_alloc_coherent() buffers.
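
To make the pattern concrete, the following userspace sketch applies it to the four fields from the quoted patch. The names TXWD_W0_*, struct txwd_body, build_txwd_word0() and fill_txwd() are hypothetical, and BIT(), GENMASK(), FIELD_PREP() and cpu_to_le32() are simplified stand-ins for the kernel helpers (the real FIELD_PREP() also does compile-time checks that are left out here). The point it tries to show is that the word is assembled in a local variable and the descriptor is only ever written, never read.

```c
#include <stdint.h>

/* Userspace stand-ins for <linux/bits.h> and <linux/bitfield.h>. */
#define BIT(n)        (UINT32_C(1) << (n))
#define GENMASK(h, l) ((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))
/* Move val up to the mask's lowest set bit and clip it to the mask. */
#define FIELD_PREP(mask, val) \
        ((((uint32_t)(val)) * ((mask) & (~(mask) + 1))) & (mask))

typedef uint32_t le32;  /* stand-in for __le32 */

/* Byte-wise conversion, so the result is little-endian on any host. */
static le32 cpu_to_le32(uint32_t v)
{
        le32 out;
        uint8_t *b = (uint8_t *)&out;

        b[0] = (uint8_t)v;
        b[1] = (uint8_t)(v >> 8);
        b[2] = (uint8_t)(v >> 16);
        b[3] = (uint8_t)(v >> 24);
        return out;
}

/* Hypothetical field names; bit positions follow the quoted patch. */
#define TXWD_W0_WP_OFFSET   GENMASK(31, 24)
#define TXWD_W0_MORE_DATA   BIT(23)
#define TXWD_W0_WD_INFO_EN  BIT(22)
#define TXWD_W0_FW_DL       BIT(20)

struct txwd_body {
        le32 word0;
        le32 word1;  /* further words omitted in this sketch */
};

/* The layout of word0 is encoded in exactly one place. */
static le32 build_txwd_word0(uint32_t wp_offset, uint32_t more_data,
                             uint32_t wd_info_en, uint32_t fw_dl)
{
        uint32_t word = FIELD_PREP(TXWD_W0_WP_OFFSET, wp_offset) |
                        FIELD_PREP(TXWD_W0_MORE_DATA, more_data) |
                        FIELD_PREP(TXWD_W0_WD_INFO_EN, wd_info_en) |
                        FIELD_PREP(TXWD_W0_FW_DL, fw_dl);

        return cpu_to_le32(word);
}

static void fill_txwd(volatile struct txwd_body *d, uint32_t wp_offset,
                      uint32_t more_data, uint32_t wd_info_en,
                      uint32_t fw_dl)
{
        /* One store per word; the descriptor itself is never read. */
        d->word0 = build_txwd_word0(wp_offset, more_data, wd_info_en, fw_dl);
}
```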

quoted

Going back one more step, I see that rtw89_core_fill_txdesc()
manipulates the descriptor fields in-memory, which also seems
like a bad idea: The descriptor is mapped as cache-coherent,
so on machines with no coherent DMA (i.e. most ARM or MIPS
machines), that is uncached memory, and writing the descriptor
using a series of read-modify-write cycles on uncached memory
will be awfully slow. Maybe the answer is to just completely
replace the descriptor access.

I'll consider whether we can use cached memory with
dma_map_single()/dma_unmap_single() for the descriptor. That would
improve performance.

Using dma_unmap_single() with its cache flush may not work
correctly if the descriptor fields have to be written in a particular
order. Usually the last field in a descriptor contains a 'valid'
bit that must not be observed by the hardware before the rest
is visible. The cache flush however would not guarantee the
order of the update.

Is it possible to flush the cache twice? That is, write the fields other
than the 'valid' bit, then do wmb() and the first flush; then set the
'valid' bit and do the second flush.

This could work, but it would be really expensive, since the
dma-mapping API is based on ownership state transitions, so
you'd have to go through dma_sync_single_for_device(),
dma_sync_single_for_cpu(), and another
dma_sync_single_for_device(). On machines using swiotlb,
those would in turn translate into copy operations.

quoted

It would also likely be slower than dma_alloc_coherent() on
machines that have cache-coherent PCI, such as most x86.

The best way is usually to construct the descriptor one word
at a time in registers, and write that word using WRITE_ONCE(),
with an explicit dma_wmb() before the final write that makes
the descriptor valid.
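
The ordering described here can be sketched in userspace as follows. This is only an analogy: DESC_VALID, struct descriptor and publish_descriptor() are hypothetical, the volatile stores stand in for WRITE_ONCE(), and the C11 release fence stands in for dma_wmb(), which in the kernel is an architecture-specific barrier for device-visible memory. Endianness conversion (cpu_to_le32() in a driver) is left out to keep the focus on the order of the stores.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 'valid' bit, assumed to live in the last word. */
#define DESC_VALID (UINT32_C(1) << 31)

struct descriptor {
        uint32_t word0;
        uint32_t word1;
        uint32_t word2;
        uint32_t word3;
};

/*
 * Each word is computed by the caller in registers and stored exactly
 * once. The volatile qualifier keeps the compiler from merging,
 * splitting or reordering these stores relative to each other.
 */
static void publish_descriptor(volatile struct descriptor *d,
                               uint32_t w0, uint32_t w1, uint32_t w2,
                               uint32_t w3_fields)
{
        d->word0 = w0;
        d->word1 = w1;
        d->word2 = w2;

        /* Stand-in for dma_wmb(): earlier stores become visible first. */
        atomic_thread_fence(memory_order_release);

        /* The final store carries the valid bit and hands over the slot. */
        d->word3 = w3_fields | DESC_VALID;
}
```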

Thanks for the guideline.

Fortunately, the descriptors of this hardware use a circular ring buffer with
read/write indexes instead of a 'valid' bit. To issue a packet to the
hardware, we fill in the descriptor along with the address of the skb, and
then update the write index (a register) to trigger the hardware to start
DMA for this packet. So, I think it is possible to use dma_map_single().
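
A userspace sketch of such an index-based ring is shown below, under several assumptions: RING_SIZE, struct tx_desc, struct tx_ring and tx_ring_post() are hypothetical names, the write-index register is modelled as a plain volatile variable, and rd_idx is a field that a real driver would refresh from the hardware. The C11 fence marks the place where the descriptor contents must be visible before the index update. In a driver the index update would be an MMIO write, and the exact barrier or sync call needed before it depends on whether the descriptors live in coherent or streaming memory.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4u  /* hypothetical; a power of two so the index wraps with a mask */

struct tx_desc {
        uint32_t word0;     /* control fields */
        uint32_t buf_addr;  /* DMA address of the packet buffer */
};

struct tx_ring {
        struct tx_desc desc[RING_SIZE];
        uint32_t wr_idx;                  /* next slot the driver fills */
        uint32_t rd_idx;                  /* next slot the hardware consumes */
        volatile uint32_t hw_wr_idx_reg;  /* stand-in for the write-index register */
};

/* Returns 0 on success, -1 if the ring is full. */
static int tx_ring_post(struct tx_ring *r, uint32_t word0, uint32_t buf_addr)
{
        uint32_t next = (r->wr_idx + 1) & (RING_SIZE - 1);
        struct tx_desc *d;

        /* One slot stays empty so that full and empty can be told apart. */
        if (next == r->rd_idx)
                return -1;

        d = &r->desc[r->wr_idx];
        d->word0 = word0;
        d->buf_addr = buf_addr;

        /* The descriptor must be complete before the index moves. */
        atomic_thread_fence(memory_order_release);

        r->wr_idx = next;
        r->hw_wr_idx_reg = next;  /* this is what triggers the hardware */
        return 0;
}
```

Because the hardware only looks at slots below the write index, no per-descriptor 'valid' bit is needed; the index update is the single hand-over point.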

Anyway, I will try both methods later.

If you end up with the streaming mapping, I would suggest using a
single dma_alloc_noncoherent(), followed by dma_sync_single_*
later on, rather than multiple map/unmap calls that would need to
reprogram the IOMMU. The coherent API as I explained above
should be more efficient though, unless you need to do a lot of
reads from the descriptors.

        Arnd
