Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries

From: Hao Wu <hidden>
Date: 2021-05-09 06:18:15

On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen [off-list ref] wrote:

On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:

quoted

On Oct 17, 2020, at 10:20 PM, Hao Wu [off-list ref] wrote:

quoted

On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen [off-list ref] wrote:

On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:

quoted

On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen [off-list ref] wrote:

On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:

quoted

On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:

quoted

On 10/1/20 12:53 AM, James Bottomley wrote:

quoted

On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:

quoted

On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:

quoted

On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:

[...]

quoted

I also wonder if we could adjust the frequency dynamically.
I.e. start with optimistic value and lower it until finding
the sweet spot.

The problem is the way this crashes: the TPM seems to be
unrecoverable. If it were recoverable without a hard reset of
the entire machine, we could certainly play around with it.  I
can try alternative mechanisms to see if anything's viable, but
to all intents and purposes, it looks like my TPM simply stops
responding to the TIS interface.

A quickly scraped idea probably with some holes in it but I was
thinking something like

1. Initially set slow value for latency, this could be the
original 15 ms.
2. Use this to read TPM_PT_VENDOR_STRING_*.
3. Lookup based vendor string from a fixup table a latency that
works
(the fallback latency could be the existing latency).

Well, yes, that was sort of what I was thinking of doing for the
Atmel ... except I was thinking of using the TIS VID (16 byte
assigned vendor ID) which means we can get the information to set
the timeout before we have to do any TPM operations.

I wonder if the timeout issue exists for all TPM commands for the
same manufacturer.  For example, does the ATMEL TPM also crash when 
extending  PCRs ?

In addition to defining a per TPM vendor based lookup table for
timeout, would it be a good idea to also define a Kconfig/boot param
option to allow timeout setting.  This will enable to set the timeout
based on the specific use.

I don't think we need go that far (yet).  The timing change has been in
upstream since:

commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
Author: Nayna Jain [off-list ref]
Date:   Wed May 16 01:51:25 2018 -0400

tpm: reduce polling time to usecs for even finer granularity

Which was in the released kernel 4.18: over two years ago.  In all that
time we've discovered two problems: mine which looks to be an artifact
of an experimental upgrade process in a new nuvoton and the Atmel. 
That means pretty much every other TPM simply works with the existing
timings

quoted

I was also thinking how will we decide the lookup table values for
each vendor ?

I wasn't thinking we would.  I was thinking I'd do a simple exception
for the Atmel and nothing else.  I don't think my Nuvoton is in any way
characteristic.  Indeed my pluggable TPM rainbow bridge system works
just fine with a Nuvoton and the current timings.

We can add additional exceptions if they actually turn up.

I'd add a table and fallback.

Hi folks,

I want to follow up this a bit and check whether we reached a consensus 
on how to fix the timeout issue for Atmel chip.

Should we revert the changes or introduce the lookup table for chips.

Is there anything I can help from Rubrik side.

Thanks
Hao

There is nothing to revert as the previous was not applied but I'm
of course ready to review any new attempts.

Hi Jarkko,

By “revert” I meant we revert the timeout value changes by applying
the patch I proposed, as the timeout value discussed does cause issues.

Why don’t we apply the patch and improve the perf in the way of not
breaking TPMs ? 

Hao

Hi Jarkko and folks,

It’s being a while since our last discussion. I want to push a fix in the upstream for ateml chip. 
It looks like we currently have following choices:
1.  generic fix for all vendors: have a lookup table for sleep time of wait_for_tpm_stat 
 (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch) 
2.  quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15ms.
 It is the current proposed patch
3. Fix regression by making exception for ateml chip.  

Should we reach consensus on which one we want to pursue before dig
into implementation of the patch? In my opinion, I prefer to fix the
regression with 2, and then pursue 1 as long-term solution. 3 is
hacky.

What does option 1 fix for *all* vendors?

quoted

Let me know what do you guys think

Hao

/Jarkko

Hi Jarkko and folks,

It has been a while again. In my previous message I answered Jarkko’s question about the option 1.
Jarkko, let me know if it is clear to you or you have further questions and suggestions on next to do.
Somehow I couldn’t found the last message I sent but it is in 
https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/

In high-level, the option 1 is to add a timing lookup table for each manufacture, hence we can
configure timing for each chip respectively. Then we don’t need to worry about fixing ATMEL
timing may cause performance degradation for other chips.

I do want to push the fix in TPM driver, which is likely to be hit going forward again when people are doing
refactoring without testing chips from all manufacturing.

Let me know how should I push this forward.

Thanks
Hao

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help