Re: [RFC v2 0/7] pch_can/c_can: fix races and add PCH support to c_can

From: Wolfgang Grandegger <hidden>
Date: 2012-12-06 14:31:24

On 12/06/2012 02:38 PM, Alexander Stein wrote:

Hello Wolfgang,

On Wednesday 05 December 2012 18:35:25, Wolfgang Grandegger wrote:

quoted

- The messages look still ok (not currupted, I mean)?

The received frames all look good (despite wrong counter sometimes due to 
wrong order or lost frames).

quoted

Even worse, if I use the following patch to check if PCI writes were 
successfully, I notices that some writes (or the consecutive read) don't 
succeed. And I also get lots of I2C timeouts waiting for a xfer complete.

Be careful, there might be some registers changing their values after
writing. Can you show the value read after writing and the register
offset? The influence on the I2C bus looks more like an overload or
hardware problem. What is your CAN interrupt rate?

I get about 33 interrupts per second on i2c. On a successful run I get 366886 
interrupts for 500000 messages with the c_can driver.

In what time? Is the CAN bus highly loaded.

Busload is about 14-18% according to canbusload with 1MBit. A complete sucessful run takes about 5 minutes.

quoted

On the second line you can see that the register isn't written at all (or the 
read failed for some reason).

I assume the latter. Could you please retry reading the register until
the correct value shows up. With some timeout, of course.

I notices having all error events printed on serial console using pch_uart driver has negative effect (I guess one problem causes another one), so I setup 'dmesg -n1' to reduce serial load before the test.
Running the test with just one heartbeat triggered LED set using I2C the test runs without errors. I only see
Lots of 'c_can_pci 0000:02:0c.3: can0: write 0x0 to offset 0x2c failed. got: 0x2000' at the beginning. It seems this (reserved?) bit is always 1 no matter what we write.

There are reserved bit! From the manual:

 "Reserved: This bit is reserved for future expansion. Only “0” is
  accepted as the write data to the reserved bit. When “1” is
  written, the operation is not guaranteed."

  When reading the reserved bit of the IFmMASK2 register (m = 1), a “1”
  is read. Write a “1” for write.

Well, I don't fully understand but it's clear that we always read "1!.

But things start to break when running the test while running 'watch sensors' (sensors queries several temperature sensors via I2C) in a seconds ssh session.
First off the driver error messages (omitting the messages as written above):
[  384.466217] c_can_pci 0000:02:0c.3: can0: write 0x73 to offset 0x24 failed. got: 0xbc


Hm...

[  384.466630] c_can_pci 0000:02:0c.3: can0: write 0x73 to offset 0x24 failed. got: 0xb8

...

[  700.646608] c_can_pci 0000:02:0c.3: can0: write 0xe to offset 0x0 failed. got: 0x28
[  700.647000] c_can_pci 0000:02:0c.3: can0: write 0x0 to offset 0x0 failed. got: 0x88

As you can see, sometimes it needs several write retries to succeed. Nevertheless my test application detects also some problems:

OK, obviously the real value needs some time to show up when there is
other activity on the bus. This hardware is really special. We need to
contact the hardware developers.

quoted

Error on MSG ID 0x252. Got counter 96480 and expected 96466
Error on MSG ID 0x251. Got counter 96480 and expected 96466
Error on MSG ID 0x252. Got counter 101706 and expected 101696
Error on MSG ID 0x251. Got counter 101706 and expected 101696

Here were messages missed/dropped for both CAN-IDs.

It is interesting that the same number of messages are missing for both
IDs... which come from different CAN nodes, right?

Do you see dropped messages with "ip -d -s link show canX"?
Did you check if the protocol layer did drop messages. You can
use "candump -d" to find that out.

quoted

Error on MSG ID 0x251. Got counter 108673 and expected 108672
Error on MSG ID 0x251. Got counter 108672 and expected 108674
Error on MSG ID 0x251. Got counter 108674 and expected 108673

                                     ^^^^^^
Here you can see the CAN frame with counter 108673 is read before 108672.

You could add some offset/mask to the counter data of MSG ID 0x252 to
see if they get mixed up.

quoted hunk ↗ jump to hunk

quoted

Error on MSG ID 0x251. Got counter 117488 and expected 117487
Error on MSG ID 0x251. Got counter 117487 and expected 117489
Error on MSG ID 0x251. Got counter 117489 and expected 117488

                                     ^^^^^^
Here you can see the CAN frame with counter 117488 is read before 117487.

quoted

Error on MSG ID 0x251. Got counter 142297 and expected 142296
Error on MSG ID 0x251. Got counter 142296 and expected 142298
Error on MSG ID 0x251. Got counter 142298 and expected 142297

Same again.

quoted

Error on MSG ID 0x251. Got counter 147932 and expected 147924
Error on MSG ID 0x251. Got counter 147936 and expected 147933
Error on MSG ID 0x252. Got counter 147936 and expected 147924
Error on MSG ID 0x251. Got counter 165268 and expected 165267
Error on MSG ID 0x251. Got counter 218560 and expected 218558
Error on MSG ID 0x252. Got counter 218560 and expected 218558
Error on MSG ID 0x252. Got counter 231076 and expected 231075
Error on MSG ID 0x251. Got counter 231077 and expected 231076
Error on MSG ID 0x251. Got counter 241959 and expected 241958

Messages missed/dropped again.

Below is the patch for c_can.

Best regards,
Alexander

diff --git a/drivers/net/can/c_can/c_can.c b/drivers/net/can/c_can/c_can.c
index d63aaa3..da9bbc0 100644
--- a/drivers/net/can/c_can/c_can.c
+++ b/drivers/net/can/c_can/c_can.c

@@ -1186,6 +1186,7 @@ struct net_device *alloc_c_can_dev(void)
                                        CAN_CTRLMODE_BERR_REPORTING;
 
        spin_lock_init(&priv->lock);
+       spin_lock_init(&priv->testlock);
 
        return dev;
 }

diff --git a/drivers/net/can/c_can/c_can.h b/drivers/net/can/c_can/c_can.h
index 3487d5e..2b58b75 100644
--- a/drivers/net/can/c_can/c_can.h
+++ b/drivers/net/can/c_can/c_can.h

@@ -173,6 +173,7 @@ struct c_can_priv {
        unsigned int instance;
        void (*init) (const struct c_can_priv *priv, bool enable);
        spinlock_t lock;        /* to protect tx and rx message objects */
+       spinlock_t testlock;    /* to protect tx and rx message objects */
 };
 
 struct net_device *alloc_c_can_dev(void);

diff --git a/drivers/net/can/c_can/c_can_pci.c b/drivers/net/can/c_can/c_can_pci.c
index 2516ea9..0ac4d43 100644
--- a/drivers/net/can/c_can/c_can_pci.c
+++ b/drivers/net/can/c_can/c_can_pci.c

@@ -74,13 +74,37 @@ static void c_can_pci_write_reg_aligned_to_32bit(struct c_can_priv *priv,
 static u16 c_can_pci_read_reg_32bit(struct c_can_priv *priv,
                                    enum reg index)
 {
-       return (u16)ioread32(priv->base + 2 * priv->regs[index]);
+       unsigned long flags;
+       u16 reg;
+
+       spin_lock_irqsave(&priv->testlock, flags);
+       reg = (u16)ioread32(priv->base + 2 * priv->regs[index]);
+       spin_unlock_irqrestore(&priv->testlock, flags);
+
+       return reg;
 }
 
 static void c_can_pci_write_reg_32bit(struct c_can_priv *priv,
                                      enum reg index, u16 val)
 {
-       iowrite32((u32)val, priv->base + 2 * priv->regs[index]);
+       u16 reg;
+       unsigned long flags;
+       int retries;
+
+       retries = 0;
+
+       spin_lock_irqsave(&priv->testlock, flags);
+
+       do
+       {
+               iowrite32((u32)val, priv->base + 2 * priv->regs[index]);

I think it's enough to write the message once.

+               reg = (u16)ioread32(priv->base + 2 * priv->regs[index]);
+               if (reg != val)
+               {
+                       netdev_err(priv->dev, "write 0x%x to offset 0x%x failed. got: 0x%x\n", val, 2 * priv->regs[index], reg);
+               }
+       } while ((reg != val) && (retries++ < 20));
+       spin_unlock_irqrestore(&priv->testlock, flags);
 }
 
 static void c_can_pci_reset_pch(const struct c_can_priv *priv, bool enable)

Wolfgang.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help