Re: [Intel-wired-lan] [PATCH 00/38] docs: several improvements to kernel-doc
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Date: 2026-03-13 10:48:49
Also in:
intel-wired-lan, linux-doc, linux-hardening, lkml
On Wed, 04 Mar 2026 12:07:45 +0200 Jani Nikula [off-list ref] wrote:
> On Mon, 23 Feb 2026, Jonathan Corbet [off-list ref] wrote:
> > Jani Nikula [off-list ref] writes:
> > > There's always the question, if you're putting a lot of effort into making kernel-doc closer to an actual C parser, why not put all that effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have contemplated this (using, say, tree-sitter), the real problem is that those parsers simply strip out the comments. Kerneldoc without comments ... doesn't work very well. If there were a parser without those problems, and which could be made to do the right thing with all of our weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The main problems with that are passing proper compiler options (because it'll need to include stuff to know about types etc. because it is a proper parser), preprocessing everything is going to take time, you need to invest a bunch into it to know how slow exactly compared to the current thing and whether it's prohibitive, and it introduces an extra dependency. So yeah, there are definitely tradeoffs there. But it's not like this constant patching of kernel-doc is exactly burden free either.
In my tests with a simple C tokenizer:

    https://lore.kernel.org/linux-doc/cover.1773326442.git.mchehab+huawei@kernel.org/

the tokenizer works fine and does not slow things down much: it increases the time to parse the entire kernel tree from 37s to 47s for man page generation. It should not change the time for htmldocs much, as right now only ~4 seconds are needed to read the files referenced by the Documentation kernel-doc tags and parse them.

The code can still be cleaned up, as some things are still hardcoded in the various dump_* functions that could be better implemented (*).

The advantage of the approach I'm using is that it allows migrating gradually to the tokenized code, as the conversion can be done incrementally.

(*) For instance, __attribute__ and a couple of other macros are parsed twice in the dump_struct() logic, in different places.
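To make the idea concrete, here is a minimal sketch of a regex-based C tokenizer that keeps comments as tokens instead of stripping them. It is an illustration only, not the code from the series above: the token classes, the TOKEN_SPEC table and the tokenize() helper are all hypothetical names.

```python
import re

# Hypothetical token classes, for illustration only; the tokenizer in the
# actual patch series may be organized quite differently.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/|//[^\n]*"),   # kept, since kernel-doc needs them
    ("STRING",  r'"(?:\\.|[^"\\])*"'),
    ("NUMBER",  r"\d\w*"),
    ("NAME",    r"[A-Za-z_]\w*"),
    ("PUNCT",   r"->|\.\.\.|[^\w\s]"),    # a few multi-char operators, then any single symbol
    ("SPACE",   r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(source):
    """Yield (kind, text) pairs, dropping whitespace but keeping comments."""
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "SPACE":
            yield match.lastgroup, match.group()


if __name__ == "__main__":
    src = "/** struct foo - demo */ struct foo { int a; };"
    for kind, text in tokenize(src):
        print(kind, text)
```

Since the comment is just another token in the stream, the kernel-doc block and the declaration that follows it can be handled in a single pass, with no preprocessing and no compiler options needed.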
> I don't know, is it just me, but I'd like to think as a profession we'd be past writing ad hoc C parsers by now.
Probably not, but I don't think we need a C parser, as kernel-doc just needs to understand data types (enum, struct, typedef, union, vars) and function/macro prototypes. For that purpose, a tokenizer sounds like enough.

Now, there is the code that currently lives in:

    https://github.com/mchehab/linux/blob/tokenizer-v5/tools/lib/python/kdoc/xforms_lists.py

which contains a list of C/gcc/clang keywords that will be ignored, like:

    __attribute__
    static
    extern
    inline

together with a sanitized version of the kernel macros it needs to handle or ignore:

    DECLARE_BITMAP
    DECLARE_HASHTABLE
    __acquires
    __init
    __exit
    struct_group
    ...

Once we finish cleaning up kdoc_parser.py to rely only on it for prototype transformations, this will be the only file that requires changes when more macros start affecting kernel-doc.

As this is complex and may require manual adjustments, it is probably better not to try to auto-generate the xforms list at runtime. A better approach is, IMO, to have some C preprocessor based code to help update it periodically, for example via a target like:

    make kdoc-xforms

that would use either cpp or clang to generate a patch updating the xforms_list content after new macros that affect docs generation are added.

-- 
Thanks,
Mauro
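
As an illustration of the transform-list idea described above, here is a minimal sketch of a table of (pattern, replacement) pairs applied to a prototype before it is parsed. The entries, the XFORMS name and the apply_xforms() helper are hypothetical and are not the contents of xforms_lists.py. The sketch also uses plain regular expressions for brevity, rather than operating on tokens.

```python
import re

# Hypothetical entries, for illustration only; they are not copied from
# xforms_lists.py and cover just a few of the cases listed above.
XFORMS = [
    # Drop __attribute__((...)), allowing one level of nested parentheses.
    (re.compile(r"__attribute__\s*\(\((?:[^()]|\([^()]*\))*\)\)"), ""),
    # Drop keywords and annotations that do not matter for documentation.
    (re.compile(r"\b(?:static|extern|inline|__init|__exit)\b"), ""),
    # Rewrite a macro into the declaration it stands for.
    (re.compile(r"\bDECLARE_BITMAP\s*\(\s*(\w+)\s*,\s*([^)]+?)\s*\)"),
     r"unsigned long \1[BITS_TO_LONGS(\2)]"),
]


def apply_xforms(proto):
    """Apply each transform in order, then collapse leftover whitespace."""
    for pattern, replacement in XFORMS:
        proto = pattern.sub(replacement, proto)
    return " ".join(proto.split())


if __name__ == "__main__":
    print(apply_xforms("static inline int __init foo(void) __attribute__((cold))"))
    print(apply_xforms("DECLARE_BITMAP(flags, NR_FLAGS);"))
```

With everything in one table like this, supporting a new macro means adding one entry rather than touching each dump_* function.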