Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
From: Hao Wu <hidden>
Date: 2021-05-09 06:18:15
On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen [off-list ref] wrote: On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:quoted
quoted
On Oct 17, 2020, at 10:20 PM, Hao Wu [off-list ref] wrote:quoted
On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen [off-list ref] wrote: On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:quoted
quoted
On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen [off-list ref] wrote: On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:quoted
On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:quoted
On 10/1/20 12:53 AM, James Bottomley wrote:quoted
On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:quoted
On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:quoted
On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:[...]quoted
quoted
quoted
quoted
quoted
I also wonder if we could adjust the frequency dynamically. I.e. start with optimistic value and lower it until finding the sweet spot.The problem is the way this crashes: the TPM seems to be unrecoverable. If it were recoverable without a hard reset of the entire machine, we could certainly play around with it. I can try alternative mechanisms to see if anything's viable, but to all intents and purposes, it looks like my TPM simply stops responding to the TIS interface.A quickly scraped idea probably with some holes in it but I was thinking something like 1. Initially set slow value for latency, this could be the original 15 ms. 2. Use this to read TPM_PT_VENDOR_STRING_*. 3. Lookup based vendor string from a fixup table a latency that works (the fallback latency could be the existing latency).Well, yes, that was sort of what I was thinking of doing for the Atmel ... except I was thinking of using the TIS VID (16 byte assigned vendor ID) which means we can get the information to set the timeout before we have to do any TPM operations.I wonder if the timeout issue exists for all TPM commands for the same manufacturer. For example, does the ATMEL TPM also crash when extending PCRs ? In addition to defining a per TPM vendor based lookup table for timeout, would it be a good idea to also define a Kconfig/boot param option to allow timeout setting. This will enable to set the timeout based on the specific use.I don't think we need go that far (yet). The timing change has been in upstream since: commit 424eaf910c329ab06ad03a527ef45dcf6a328f00 Author: Nayna Jain [off-list ref] Date: Wed May 16 01:51:25 2018 -0400 tpm: reduce polling time to usecs for even finer granularity Which was in the released kernel 4.18: over two years ago. In all that time we've discovered two problems: mine which looks to be an artifact of an experimental upgrade process in a new nuvoton and the Atmel. That means pretty much every other TPM simply works with the existing timingsquoted
I was also thinking how will we decide the lookup table values for each vendor ?I wasn't thinking we would. I was thinking I'd do a simple exception for the Atmel and nothing else. I don't think my Nuvoton is in any way characteristic. Indeed my pluggable TPM rainbow bridge system works just fine with a Nuvoton and the current timings. We can add additional exceptions if they actually turn up.I'd add a table and fallback.Hi folks, I want to follow up this a bit and check whether we reached a consensus on how to fix the timeout issue for Atmel chip. Should we revert the changes or introduce the lookup table for chips. Is there anything I can help from Rubrik side. Thanks HaoThere is nothing to revert as the previous was not applied but I'm of course ready to review any new attempts.Hi Jarkko, By “revert” I meant we revert the timeout value changes by applying the patch I proposed, as the timeout value discussed does cause issues. Why don’t we apply the patch and improve the perf in the way of not breaking TPMs ? HaoHi Jarkko and folks, It’s being a while since our last discussion. I want to push a fix in the upstream for ateml chip. It looks like we currently have following choices: 1. generic fix for all vendors: have a lookup table for sleep time of wait_for_tpm_stat (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch) 2. quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15ms. It is the current proposed patch 3. Fix regression by making exception for ateml chip. Should we reach consensus on which one we want to pursue before dig into implementation of the patch? In my opinion, I prefer to fix the regression with 2, and then pursue 1 as long-term solution. 3 is hacky.What does option 1 fix for *all* vendors?quoted
Let me know what do you guys think Hao/Jarkko
Hi Jarkko and folks, It has been a while again. In my previous message I answered Jarkko’s question about the option 1. Jarkko, let me know if it is clear to you or you have further questions and suggestions on next to do. Somehow I couldn’t found the last message I sent but it is in https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/ In high-level, the option 1 is to add a timing lookup table for each manufacture, hence we can configure timing for each chip respectively. Then we don’t need to worry about fixing ATMEL timing may cause performance degradation for other chips. I do want to push the fix in TPM driver, which is likely to be hit going forward again when people are doing refactoring without testing chips from all manufacturing. Let me know how should I push this forward. Thanks Hao