Thread (11 messages), 7 authors, 2017-01-25

Re: [LSF/MM TOPIC] block level event logging for storage media management

From: Song Liu <hidden>
Date: 2017-01-23 06:00:47

Hi Dan,

I think the block level event log is more of a log-only system. When an
event happens, it is not necessary to take immediate action. (I guess this
is different from a bad block list?)

I would hope the event log can track more information. Some of these
individual events may not be very interesting, for example, soft errors or
latency outliers. However, when we gather event logs for a fleet of
devices, these "soft events" may become valuable for health monitoring.
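
To make the fleet-level idea concrete, here is a minimal sketch of how
collected events might be aggregated in userspace. It is only an
illustration: the record layout, the event kind names, and the
devices_to_watch helper are all hypothetical and are not taken from the
prototype or from any existing kernel interface.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockEvent:
    """One logged event. Every field name here is a hypothetical choice."""
    device: str       # e.g. "host12:sdb"
    kind: str         # e.g. "soft_error", "latency_outlier", "media_error"
    timestamp: float  # seconds since the epoch


# Assumed set of "soft" event kinds that need no immediate action.
SOFT_KINDS = {"soft_error", "latency_outlier"}


def devices_to_watch(events, threshold):
    """Return devices whose soft-event count reaches threshold, sorted by name."""
    counts = Counter(e.device for e in events if e.kind in SOFT_KINDS)
    return sorted(dev for dev, n in counts.items() if n >= threshold)
```

A single soft event on one device says little, but a count per device
across the fleet gives a simple signal for which devices deserve a closer
look.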

Thanks,
Song

On Jan 20, 2017, at 9:46 PM, Dan Williams [off-list ref] wrote:
On Wed, Jan 18, 2017 at 3:34 PM, Song Liu [off-list ref] wrote:
Media health monitoring is very important for large scale distributed
storage systems. Traditionally, enterprise storage controllers maintain
event logs for attached storage devices. However, these controller
managed logs do not scale well for large scale distributed systems.

While designing a more flexible and scalable event logging system, we
think it is better to build the log in the block layer. Block level event
logging covers all major storage media (SCSI, SATA, NVMe), and thus
minimizes redundant work for different protocols.

In this LSF/MM, we would like to discuss the following topics with the
community:
   1. Mechanism for drivers to report events (or errors) to the block layer.
      Basically, we will need a traceable function for the drivers to
      report errors (most likely right before calling end_request or
      bio_endio).

   2. Which mechanism (ftrace, BPF, etc.) is most preferred for the event
      logging?

   3. How should we categorize different events?
      Currently, there is existing code that translates ATA errors
      (ata_to_sense_error) and NVMe errors (nvme_trans_status_code) to
      SCSI sense codes. So we can leverage the SCSI Key Code Qualifier for
      event categorization.

   4. Detailed discussions on the data structure for event logging.

We will be able to show a prototype implementation during LSF/MM.

Hi Song,

How is this distinct from tracking a badblocks list?

I'm interested in this topic since we have both media error reporting
/ scrubbing for nvdimms as well as "SMART" media health retrieval
commands.
  