Language Server Protocol from Debug Symbols
August 12th, 2024
I was recently writing some Jai code and ran into a problem - no intellisense. Unfortunately, Jai is a closed-beta language and no one has yet written an extension that allows for features such as goto definition.
Lack of intellisense makes me sad. Being able to tap F12 to goto any function is so helpful for learning your way around an unfamiliar code base.
Writing a custom Language Server Protocol implementation for Jai felt like a daunting task. It typically requires parsing the code to figure out what all the symbols are. I don't have the time or energy to do that.
Then I came up with a crazy and maybe terrible idea. What if instead of trying to parse Jai code I could leverage the debug symbols already produced by the compiler? In theory this should have "perfect" knowledge of symbols and it should work with effectively any compiled language. That seems pretty nifty!
Well, that's exactly what I did. The end result is a VSCode extension called fts-lsp-pdb and it provides goto definition support for any language that produces a pdb - Jai, C/C++, Zig, etc.
The current implementation is super limited. It only works on function calls, not variable declarations. It doesn't support inline functions or macros. It only supports Windows and .pdb files. It's untested with DLLs. It requires making a full debug build to produce a pdb. And it's probably buggy, as I've only tested on small projects.
But it totally works! It supports any language that has .pdb files, which includes pretty much anything compiled by Clang/LLVM. And this was achieved in just ~450 lines of Rust code. 😲
Here's a video that shows fts-lsp-pdb working for Jai, C++, Rust, Zig, Odin, and Nim.
This blog post isn't intended to be an announcement post for my new extension. To be honest I don't think more than a handful of folks will try it. Instead it's a place for me to share some of the things I learned. And at the end I'll spell out a vision for a future where intellisense is more reliable and requires less per-language work. I think my crazy idea is actually on the right track.
Language Server Protocol
Let's rewind for a moment. What is Language Server Protocol (LSP)? LSP is an open standard originally created by Microsoft for VS Code. It's a protocol that allows IDEs to send JSON messages to a separate process to query intellisense information. The core goal is to allow one language server to work with many different IDEs.
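For example, goto definition is just a pair of JSON-RPC messages between the IDE and the server. A minimal sketch (file paths invented for illustration; positions are zero-based per the LSP spec):

```
// IDE → server: "what does the symbol at line 4, column 11 point to?"
{
  "jsonrpc": "2.0", "id": 1,
  "method": "textDocument/definition",
  "params": {
    "textDocument": { "uri": "file:///project/main.jai" },
    "position": { "line": 4, "character": 11 }
  }
}

// server → IDE: a (uri, range) location
{
  "jsonrpc": "2.0", "id": 1,
  "result": {
    "uri": "file:///project/utils.jai",
    "range": {
      "start": { "line": 112, "character": 0 },
      "end":   { "line": 112, "character": 9 }
    }
  }
}
```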

This screenshot isn't wrong, but I think it's incomplete. It doesn't reflect the amount of work that is necessary to support a given language. Here's how many lines of code are in various LSP servers.
- Nim — 4,400
- C++ — 35,000
- Odin — 38,000
- Zig — 53,000
- Python — 154,000
- Rust-Analyzer — 365,000
Making matters worse, every build system may require the language server to be modified. For example, complex systems like Bazel and Buck don't work out of the box. The language server needs to know how to find toolchains, resolve includes/dependencies, etc.
LSP has been super successful and is much better than the before times. However it's a little unfortunate how much effort is necessary for each language.
Version 1: string to definition
On Windows, debug symbols are typically stored in .pdb files. MSVC and Clang/LLVM both produce pdb files, which enables high-quality debugging in Visual Studio, VSCode, WinDbg, etc. Given the amount of debug information in this file, surely we can leverage it, right?
I'm a big fan of the RAD Debugger project. My first idea was to use the RAD Debugger codebase and provide goto definition support by taking a procedure name and looking up its file/line source location.
This totally worked. The pure C source looked roughly like this:
```c
// Parse RDI file (debug symbols)
RDI_Parsed rdi_parsed = {0};
RDI_Parsed *rdi = &rdi_parsed;
RDI_ParseStatus status = rdi_parse(input_data.str, input_data.size, rdi);

// Procedure to find
String8 targetProcName = ...;

// Linearly loop over procedures
RDI_Procedure *proc = rdi->procedures;
for (U32 i = 0; i < rdi->procedures_count; i += 1, proc += 1)
{
    // Get name of this procedure
    String8 name = {0};
    name.str = rdi_string_from_idx(rdi, proc->name_string_idx, &name.size);

    // Check for match
    if (str8_match(targetProcName, name, 0))
    {
        // procedure -> root scope
        RDI_Scope *scope = &rdi->scopes[proc->root_scope_idx];

        // root_scope -> min_virtual_offset
        RDI_U64 voff = rdi->scope_voffs[scope->voff_range_first];

        // min_virtual_offset -> unit
        RDI_U64 vmap_idx = rdi_vmap_idx_from_voff(rdi->unit_vmap, rdi->unit_vmap_count, voff);
        RDI_Unit *unit = &rdi->units[vmap_idx];

        // unit -> line_info
        RDI_ParsedLineInfo lineInfo = {0};
        rdi_line_info_from_unit(rdi, unit, &lineInfo);

        // line_info * min_virtual_offset -> line
        RDI_U64 lineInfoIdx = rdi_line_info_idx_from_voff(&lineInfo, voff);

        // line -> source_file
        RDI_Line *line = &lineInfo.lines[lineInfoIdx];

        // source_file -> path
        RDI_SourceFile *sourceFile = &rdi->source_files[line->file_idx];
        String8 filePath = {0};
        filePath.str = rdi_string_from_idx(rdi, sourceFile->normal_full_path_string_idx, &filePath.size);

        // result is filePath + line_num
        return { filePath, line->line_num }; // tada! 🎉
    }
}
```
Unfortunately there was a problem. This worked by mapping string function names to their source location, which works great for C and vanilla functions. But it doesn't work at all with name mangling, function overloads, or templates/generics.
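To see why, consider C++ overloads. The names stored at the linker level are mangled, so one source-level name maps to several distinct symbols, none of which match the plain string. A small illustration (mangled forms follow the Itanium ABI used by Clang/GCC):

```
void log(int);          // mangled: _Z3logi
void log(double);       // mangled: _Z3logd
void log(const char*);  // mangled: _Z3logPKc
```

Searching the symbol table for the string "log" finds none of these, and demangling alone still can't tell you which overload a particular call site resolved to.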
Version 2: call instructions
For my second version I wanted to properly support the aforementioned cases. But how?
My idea was to take a given file and line_num and query all functions called on that line. Unfortunately, it turned out that this information is not stored in a pdb file! 😭
What a pdb does store is which instructions are associated with a given line. So the new plan is to take a given line, get the disassembled instructions from the exe, find function calls, and then use the pdb to map each function address back to a source location.
To do this I switched from RAD Debugger C to Rust. This was for a variety of reasons I won't get into. After much iteration and asking ChatGPT a million questions I wound up with something that worked.
The pseudocode looks something like:
```
// 1. Parse exe with goblin crate
// 2. Parse PDB with pdb crate
// 3. Create PDB symcache with symbolic crate
// 4. Find exe address_range for (source_file, line_number)
// 5. Get exe instructions for address_range
// 6. Identify `call` instructions
// 7. Map `call` instruction target_address to source_location
```
The actual code looks vaguely like the following. A lot of error handling and initialization code is omitted.
```rust
fn goto_definition(params: GotoDefinitionParams) -> anyhow::Result<GotoDefinitionResponse> {
    let exe_path = ...;
    let pdb_path = ...;
    let source_file = ...;
    let line_number = ...;

    // 1. Parse exe with goblin crate
    let exe_data = std::fs::read(exe_path)?;
    let pe = PE::parse(&exe_data)?;

    // 2. Parse PDB with pdb crate
    let pdb_file = File::open(pdb_path)?;
    let mut pdb = PDB::open(pdb_file)?;

    // 3. Create PDB symcache with symbolic crate
    let buffer = ByteView::from_vec(...);
    let symcache = SymCache::parse(&buffer)?;

    // 4. Find exe address_range for (source_file, line_number)
    let mut address_range: (u32, u32) = (0, 0);
    while let Some(module) = pdb.debug_information().modules().next()? {
        let info = pdb.module_info(&module)?.unwrap();
        for line in info.line_program().lines() {
            if line.line_start == line_number {
                let file_name = pdb.string_table.get(line_program.get_file_info(line.file_index).name)?;
                if file_path == file_name {
                    let rva = line.offset.to_rva(&pdb.address_map()).unwrap();
                    address_range = (rva.0, rva.0 + line.length.unwrap());
                    break break; // pseudocode!
                }
            }
        }
    }

    // 5. Get exe instructions for address_range
    let cs = Capstone::new().x86().mode(arch::x86::ArchMode::Mode64).build()?;
    let text_section = pe.sections.iter().find(|s| s.name().unwrap() == ".text").unwrap();
    let bytes = &exe_data[text_section.pointer_to_raw_data as usize
        ..text_section.pointer_to_raw_data as usize + text_section.size_of_raw_data as usize];
    let exe_instructions = cs.disasm_all(bytes, text_section.virtual_address as u64)?;

    // 6. Identify `call` instructions
    let mut result: Vec<(String, u32)> = Default::default();
    let (start, end) = address_range;
    let section = pe.sections.iter().find(|sec| {
        sec.virtual_address <= start && start < sec.virtual_address + sec.virtual_size
    });
    if let Some(section) = section {
        // Get instructions for address range
        let offset = (start - section.virtual_address) as usize;
        let size = (end - start) as usize;
        let bytes = &exe_data[section.pointer_to_raw_data as usize + offset
            ..section.pointer_to_raw_data as usize + offset + size];
        let instructions = cs.disasm_all(bytes, start as u64)?;
        for instruction in instructions.iter() {
            // Find all `call` instructions
            if instruction.mnemonic().unwrap_or_default() == "call" {
                let op_str = instruction.op_str().unwrap();
                let mut target_address =
                    u64::from_str_radix(op_str.trim_start_matches("0x"), 16).unwrap();

                // Lookup target instruction; it may be a jmp
                let maybe_inst = exe_instructions.get(&target_address);
                if let Some(instruction) = maybe_inst {
                    if instruction.mnemonic().unwrap_or_default() == "jmp" {
                        let new_target_address = u64::from_str_radix(
                            instruction.op_str().unwrap().trim_start_matches("0x"),
                            16,
                        ).unwrap();
                        target_address = new_target_address;
                    }
                }

                // 7. Map instruction target_address to source_location
                let sym = symcache.lookup(target_address).collect::<Vec<_>>();
                if let Ok((file, line)) = sym[0] {
                    result.push((file, line));
                }
            }
        }
    }

    // Omitted: convert result to lsp_types::Location
    return Ok(GotoDefinitionResponse::Array(result));
}
```
Tada! 🎉🍾🥇
If you read the code you may have noticed an interesting wrinkle when getting the target_address from the call instruction. When I first implemented this, I simply could not find the function at the supposed address. It was bewildering and frustrating.
Eventually I cracked open a debugger to inspect the disassembly. What I discovered is that the call target_address was indeed not the function I expected! Instead, it was the address of a jmp instruction to the actual function.
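The pattern looks something like this (addresses invented for illustration). MSVC's incremental linking, for example, routes calls through a table of jump thunks so that function bodies can move between builds without patching every call site:

```
0x140001234:  call 0x140001005      ; call site targets a thunk...
...
0x140001005:  jmp  0x14002a3f0      ; ...which jmps to the real function body
```

So to resolve the true target, the code follows one level of `jmp` indirection before asking the symcache for a source location.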


VSCode Extension
Armed with some Rust code I tied it all together in a VSCode extension - fts-lsp-pdb.
I've never written a VSCode Extension before. It wasn't too bad. Microsoft provides a spectacular LSP sample project that I hacked up. There's a delightful Rust lsp-server crate that handles all the JSON message passing I don't want to think about. Documentation for it is non-existent, but ChatGPT/Claude helped me bootstrap.
What's Next?
What's next? Honestly, probably nothing. If folks want to issue PRs on the GitHub repo I'll pay attention. It'd be nice if goto definition worked on variables and not just functions. But I'm honestly not sure if pdbs have that information or not.
One huge limitation of the current server is that it only works on Windows with pdb files. Maybe someone wants to update it to support DWARF? I'm not a Linux person, but it'd probably be straightforward.
RAD Debugger defines its own rdi file type. It converts pdb files to rdi and has preliminary support for DWARF. That could be a nice unification when it's ready.
The Future
Now I'd like to paint a picture of what the future could look like.
My fts-lsp-pdb server is neat, but limited by the data currently contained in debug symbols. However, I think the general approach is actually quite sound.
LSP servers fundamentally reimplement a large portion of the compiler. This is crazy! Resolving templates/generics/overloads can be insanely complex. Why is that work being reimplemented? I find it ludicrous that rust-analyzer is over 365,000 lines of Rust code (even if a large chunk of that is tests).
Here is an audacious proposal. I think Microsoft should create a new specification called "Intellisense Database". Then compilers, such as Clang and MSVC, should be updated to natively create and incrementally update .idb files. This data could then be used by either a generic or language-specific LSP server to provide intellisense capabilities.
I'm probably being over-optimistic, but I bet an extremely rich and robust idb file could be generated for super-complex languages like C++ in ~10 thousand lines of code. The data already exists during compilation. It just needs to not be thrown away.
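To make the proposal concrete, here's a purely hypothetical sketch of the kind of per-symbol record an .idb could store. No such spec exists; every type and field name below is invented:

```rust
use std::collections::HashMap;

// Hypothetical: what a compiler-emitted "Intellisense Database" record
// might contain for one fully-resolved symbol.
#[derive(Debug, Clone, PartialEq)]
enum SymbolKind { Function, Variable, Type }

#[derive(Debug, Clone, PartialEq)]
struct IdbSymbol {
    name: String,                   // post-overload-resolution, post-template name
    kind: SymbolKind,
    definition: (String, u32),      // (file, line) of the definition
    references: Vec<(String, u32)>, // every use site the compiler saw
}

// With compiler-emitted data, goto-definition degenerates into a hash lookup —
// no reparsing, no reimplementing overload resolution.
fn goto_definition<'a>(
    idb: &'a HashMap<String, IdbSymbol>,
    name: &str,
) -> Option<&'a (String, u32)> {
    idb.get(name).map(|sym| &sym.definition)
}

fn main() {
    let mut idb = HashMap::new();
    idb.insert(
        "foo<int>".to_string(),
        IdbSymbol {
            name: "foo<int>".to_string(),
            kind: SymbolKind::Function,
            definition: ("foo.cpp".to_string(), 42),
            references: vec![("main.cpp".to_string(), 7)],
        },
    );
    assert_eq!(goto_definition(&idb, "foo<int>"), Some(&("foo.cpp".to_string(), 42)));
    println!("definition: {:?}", goto_definition(&idb, "foo<int>"));
}
```

The point isn't this exact shape; it's that the compiler already computes all of this and could persist it incrementally instead of discarding it.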
Long term, I'd like to see both idb data and full source code embedded in debug symbol files. I'm a mega fan of symbol servers and source indexing. It should be possible to download symbols for a binary and get full source code with full intellisense. Oh how wonderful that would be!
Conclusion
I totally nerd-sniped myself over my desire to have goto definition support for Jai. I ended up making a new LSP server that works with not only Jai but any compiled language that has pdb files.
This isn't a replacement for any existing LSP server, nor is it an ideal long-term solution. However, for anyone using LLVM to make a new language it's a great way to get basic intellisense "for free".
I also happen to think the idea of using debug symbols for intellisense is clever, cool, and useful. Compilers should, imho, emit more data explicitly for the purpose of supporting intellisense.
Thanks for reading.
- Source Code: GitHub
- VSCode Extension: Marketplace
Bonus Section
Building Debug Symbols
One of the strongest arguments against my debug symbol approach is how slow and expensive it is to make a debug build. This is a fair and reasonable argument.
It's worth pointing out that typical LSP servers spend a lot of cycles parsing code. I'm not sure why, but rust-analyzer takes noticeably longer to initialize than a full, clean build does. Maybe because it's running single-threaded? Not sure.
Anyhow, slow debug builds are definitely a valid reason not to use fts-lsp-pdb! If my vision of the future comes to fruition and compilers generate .idb files, then I believe that can be done incrementally and quickly. It shouldn't be much different from what LSP servers do today. But it should be more accurate, reliable, and simple.
Notes on PDB Support
In theory fts-lsp-pdb should work with any language that generates pdbs. However, in practice I found some issues. For example, my test Odin pdb hits an assert in debug mode. And symbolic hits an unreachable! with D.
I'm not sure if these are errors in the pdb or symbolic crates, or if the language compilers are generating naughty .pdb files.
Bottom line is that it should work, but it might not!
A Rust Rant: Self-Referential Structs
I'm a big fan of Rust. I've written small amounts of Rust for years. I've written numerous blog posts about or using Rust. Generally I do not find the borrow checker difficult to satisfy. However for the very first time I fought and partially lost a gnarly borrow checker battle.
Not shown above is code to pre-process and cache the exe and pdb. Finding the right data requires some linear searches, which is inefficient. Ideally they would be loaded once and pre-processed into a cache. It turns out this is nigh impossible to do in Rust.
The well known issue is that Rust really, really, really hates self-referential structs. Even when it's PERFECTLY SAFE to do so. Which makes it immensely frustrating. The problem isn't that the code is wrong. The problem is the borrow checker isn't smart enough to prove it's correct!
I probably spent 12+ hours trying different strategies and crates. The fact that my cache is refreshed in a message loop when the user modifies the VSCode config made it even more painful. I found a solution using self_cell which mostly works. I failed to find a way to cache pdb.address_map().
The ideal Rust code should look like this:
```rust
struct ExeCache {
    // Owned data
    exe_bytes: Vec<u8>,
    exe_capstone: capstone::Capstone,
    pdb_path: PathBuf,
    pdb_bytes: Vec<u8>,

    // Self-referential data
    exe_parsed: goblin::pe::PE,
    exe_instructions: capstone::Instructions,
    exe_instructions_sorted: HashMap<u64, &capstone::Insn>,
    pdb_line_infos: HashMap<String, HashMap<u32, &pdb::LineInfo>>,
    sym_cache: symbolic::SymCache,
}
```
The horrible, ugly, disgusting, very bad code that actually works looks like this:
```rust
// Cache for a single exe/PDB
self_cell::self_cell!(
    struct ExeCache {
        owner: ExeCacheOwner,
        #[covariant]
        dependent: ExeCacheRefs,
    }
);

struct ExeCacheOwner {
    exe_bytes: Pin<Box<[u8]>>,
    exe_capstone: capstone::Capstone,
    pdb_path: PathBuf,
    files: HashMap<String, HashMap<u32, pdb::LineInfo>>,
    symcache_bytes: Pin<Box<[u8]>>,
}

self_cell::self_cell!(
    struct ExeCacheRefs<'a> {
        owner: ExeCacheRefs1<'a>,
        #[covariant]
        dependent: ExeCacheRefs2,
    }
);

struct ExeCacheRefs1<'a> {
    exe_parsed: goblin::pe::PE<'a>,
    exe_instructions: capstone::Instructions<'a>,
    symcache: SymCache<'a>,
}

struct ExeCacheRefs2<'a> {
    exe_instructions_sorted: HashMap<u64, &'a capstone::Insn<'a>>,
}
```
Gross gross gross. 🤢🤮
There are a variety of crates that help provide support for self-referential structs: self_cell, ouroboros, rental, and more. These crates provide an existence proof that it is possible to have safe and correct self-referential structs in Rust.
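When the borrowed data can be addressed by position, there's also a dependency-free workaround: store (offset, length) pairs instead of references and re-slice on demand. This sketch is illustrative only (it's not what fts-lsp-pdb does, and it doesn't help when the cached value is an opaque parsed structure like pdb.address_map()):

```rust
// Instead of storing `&[u8]` slices that borrow from `bytes` inside the
// same struct (self-referential, rejected by the borrow checker), store
// (offset, len) pairs and re-slice on demand.
struct Cache {
    bytes: Vec<u8>,
    spans: Vec<(usize, usize)>, // (offset, len) into `bytes`
}

impl Cache {
    fn new(bytes: Vec<u8>) -> Self {
        // Pretend "parsing" found two interesting regions.
        let spans = vec![(0, 2), (2, 3)];
        Cache { bytes, spans }
    }

    // Borrow from `self` only at the call site; no stored reference needed.
    fn get(&self, i: usize) -> &[u8] {
        let (off, len) = self.spans[i];
        &self.bytes[off..off + len]
    }
}

fn main() {
    let cache = Cache::new(vec![1, 2, 3, 4, 5]);
    assert_eq!(cache.get(0), &[1, 2]);
    assert_eq!(cache.get(1), &[3, 4, 5]);
    println!("ok");
}
```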
I'm sure this is a very difficult problem. A recent blog post discusses some of the backwards compatibility and ergonomic issues.
Unfortunately, I don't really care how hard it is! It's a black mark on the language. And it may never be fixed, not because it's intrinsically impossible, but because Rust is now a mature language with a variety of historical artifacts and missteps.
Grumble grumble rant rant. 😤