Re: [RFC v2 0/7] pch_can/c_can: fix races and add PCH support to c_can
From: Wolfgang Grandegger <hidden>
Date: 2012-12-06 14:31:24
On 12/06/2012 02:38 PM, Alexander Stein wrote:
Hello Wolfgang, On Wednesday 05 December 2012 18:35:25, Wolfgang Grandegger wrote:quoted
quoted
quoted
- The messages look still ok (not currupted, I mean)?The received frames all look good (despite wrong counter sometimes due to wrong order or lost frames).quoted
quoted
Even worse, if I use the following patch to check if PCI writes were successfully, I notices that some writes (or the consecutive read) don't succeed. And I also get lots of I2C timeouts waiting for a xfer complete.Be careful, there might be some registers changing their values after writing. Can you show the value read after writing and the register offset? The influence on the I2C bus looks more like an overload or hardware problem. What is your CAN interrupt rate?I get about 33 interrupts per second on i2c. On a successful run I get 366886 interrupts for 500000 messages with the c_can driver.In what time? Is the CAN bus highly loaded.Busload is about 14-18% according to canbusload with 1MBit. A complete sucessful run takes about 5 minutes.quoted
quoted
On the second line you can see that the register isn't written at all (or the read failed for some reason).I assume the latter. Could you please retry reading the register until the correct value shows up. With some timeout, of course.I notices having all error events printed on serial console using pch_uart driver has negative effect (I guess one problem causes another one), so I setup 'dmesg -n1' to reduce serial load before the test. Running the test with just one heartbeat triggered LED set using I2C the test runs without errors. I only see Lots of 'c_can_pci 0000:02:0c.3: can0: write 0x0 to offset 0x2c failed. got: 0x2000' at the beginning. It seems this (reserved?) bit is always 1 no matter what we write.
There are reserved bit! From the manual: "Reserved: This bit is reserved for future expansion. Only “0” is accepted as the write data to the reserved bit. When “1” is written, the operation is not guaranteed." When reading the reserved bit of the IFmMASK2 register (m = 1), a “1” is read. Write a “1” for write. Well, I don't fully understand but it's clear that we always read "1!.
But things start to break when running the test while running 'watch sensors' (sensors queries several temperature sensors via I2C) in a seconds ssh session. First off the driver error messages (omitting the messages as written above): [ 384.466217] c_can_pci 0000:02:0c.3: can0: write 0x73 to offset 0x24 failed. got: 0xbc
Hm...
[ 384.466630] c_can_pci 0000:02:0c.3: can0: write 0x73 to offset 0x24 failed. got: 0xb8
...
[ 700.646608] c_can_pci 0000:02:0c.3: can0: write 0xe to offset 0x0 failed. got: 0x28 [ 700.647000] c_can_pci 0000:02:0c.3: can0: write 0x0 to offset 0x0 failed. got: 0x88 As you can see, sometimes it needs several write retries to succeed. Nevertheless my test application detects also some problems:
OK, obviously the real value needs some time to show up when there is other activity on the bus. This hardware is really special. We need to contact the hardware developers.
quoted
Error on MSG ID 0x252. Got counter 96480 and expected 96466 Error on MSG ID 0x251. Got counter 96480 and expected 96466 Error on MSG ID 0x252. Got counter 101706 and expected 101696 Error on MSG ID 0x251. Got counter 101706 and expected 101696Here were messages missed/dropped for both CAN-IDs.
It is interesting that the same number of messages are missing for both IDs... which come from different CAN nodes, right? Do you see dropped messages with "ip -d -s link show canX"? Did you check if the protocol layer did drop messages. You can use "candump -d" to find that out.
quoted
Error on MSG ID 0x251. Got counter 108673 and expected 108672 Error on MSG ID 0x251. Got counter 108672 and expected 108674 Error on MSG ID 0x251. Got counter 108674 and expected 108673^^^^^^ Here you can see the CAN frame with counter 108673 is read before 108672.
You could add some offset/mask to the counter data of MSG ID 0x252 to see if they get mixed up.
quoted hunk ↗ jump to hunk
quoted
Error on MSG ID 0x251. Got counter 117488 and expected 117487 Error on MSG ID 0x251. Got counter 117487 and expected 117489 Error on MSG ID 0x251. Got counter 117489 and expected 117488^^^^^^ Here you can see the CAN frame with counter 117488 is read before 117487.quoted
Error on MSG ID 0x251. Got counter 142297 and expected 142296 Error on MSG ID 0x251. Got counter 142296 and expected 142298 Error on MSG ID 0x251. Got counter 142298 and expected 142297Same again.quoted
Error on MSG ID 0x251. Got counter 147932 and expected 147924 Error on MSG ID 0x251. Got counter 147936 and expected 147933 Error on MSG ID 0x252. Got counter 147936 and expected 147924 Error on MSG ID 0x251. Got counter 165268 and expected 165267 Error on MSG ID 0x251. Got counter 218560 and expected 218558 Error on MSG ID 0x252. Got counter 218560 and expected 218558 Error on MSG ID 0x252. Got counter 231076 and expected 231075 Error on MSG ID 0x251. Got counter 231077 and expected 231076 Error on MSG ID 0x251. Got counter 241959 and expected 241958Messages missed/dropped again. Below is the patch for c_can. Best regards, Alexanderdiff --git a/drivers/net/can/c_can/c_can.c b/drivers/net/can/c_can/c_can.c index d63aaa3..da9bbc0 100644 --- a/drivers/net/can/c_can/c_can.c +++ b/drivers/net/can/c_can/c_can.c@@ -1186,6 +1186,7 @@ struct net_device *alloc_c_can_dev(void) CAN_CTRLMODE_BERR_REPORTING; spin_lock_init(&priv->lock); + spin_lock_init(&priv->testlock); return dev; }diff --git a/drivers/net/can/c_can/c_can.h b/drivers/net/can/c_can/c_can.h index 3487d5e..2b58b75 100644 --- a/drivers/net/can/c_can/c_can.h +++ b/drivers/net/can/c_can/c_can.h@@ -173,6 +173,7 @@ struct c_can_priv { unsigned int instance; void (*init) (const struct c_can_priv *priv, bool enable); spinlock_t lock; /* to protect tx and rx message objects */ + spinlock_t testlock; /* to protect tx and rx message objects */ }; struct net_device *alloc_c_can_dev(void);diff --git a/drivers/net/can/c_can/c_can_pci.c b/drivers/net/can/c_can/c_can_pci.c index 2516ea9..0ac4d43 100644 --- a/drivers/net/can/c_can/c_can_pci.c +++ b/drivers/net/can/c_can/c_can_pci.c@@ -74,13 +74,37 @@ static void c_can_pci_write_reg_aligned_to_32bit(struct c_can_priv *priv, static u16 c_can_pci_read_reg_32bit(struct c_can_priv *priv, enum reg index) { - return (u16)ioread32(priv->base + 2 * priv->regs[index]); + unsigned long flags; + u16 reg; + + spin_lock_irqsave(&priv->testlock, flags); + reg = (u16)ioread32(priv->base + 2 * priv->regs[index]); + spin_unlock_irqrestore(&priv->testlock, flags); + + return reg; } static void c_can_pci_write_reg_32bit(struct c_can_priv *priv, enum reg index, u16 val) { - iowrite32((u32)val, priv->base + 2 * priv->regs[index]); + u16 reg; + unsigned long flags; + int retries; + + retries = 0; + + spin_lock_irqsave(&priv->testlock, flags); + + do + { + iowrite32((u32)val, priv->base + 2 * priv->regs[index]);
I think it's enough to write the message once.
+ reg = (u16)ioread32(priv->base + 2 * priv->regs[index]);
+ if (reg != val)
+ {
+ netdev_err(priv->dev, "write 0x%x to offset 0x%x failed. got: 0x%x\n", val, 2 * priv->regs[index], reg);
+ }
+ } while ((reg != val) && (retries++ < 20));
+ spin_unlock_irqrestore(&priv->testlock, flags);
}
static void c_can_pci_reset_pch(const struct c_can_priv *priv, bool enable)Wolfgang.