Language Server Protocol from Debug Symbols

August 12th, 2024

I was recently writing some Jai code and ran into a problem - no intellisense. Unfortunately Jai is a closed beta language and no one has yet written an extension that allows for features such as goto definition.

Lack of intellisense makes me sad. Being able to tap F12 to goto any function is so helpful for learning your way around an unfamiliar code base.

Writing a custom Language Server Protocol implementation for Jai felt like a daunting task. It typically requires parsing the code to figure out what all the symbols are and where they live. I don't have the time or energy to do that.

Then I came up with a crazy and maybe terrible idea. What if instead of trying to parse Jai code I could leverage the debug symbols already produced by the compiler? In theory this should have "perfect" knowledge of symbols and it should work with effectively any compiled language. That seems pretty nifty!

Well, that's exactly what I did. The end result is a VSCode extension called fts-lsp-pdb and it provides goto definition support for any language that produces a pdb - Jai, C/C++, Zig, etc.

The current implementation is super limited. It only works on function calls, not variable declarations. It doesn't support inline functions or macros. It only supports Windows and .pdb files. It's untested with dlls. It requires making a full debug build to produce a pdb. And it's probably buggy as I've only tested on small projects.

But it totally works! It supports any language that has .pdb files, which includes pretty much anything compiled by Clang/LLVM. And this was achieved in just ~450 lines of Rust code. 😲

Here's a video that shows fts-lsp-pdb working for Jai, C++, Rust, Zig, Odin, and Nim.

This blog post isn't intended to be an announcement post for my new extension. To be honest I don't think more than a handful of folks will try it. Instead it's a place for me to share some of the things I learned. And at the end I'll spell out a vision for a future where intellisense is more reliable and requires less per-language work. I think my crazy idea is actually on the right track.

Language Server Protocol

Let's rewind for a moment. What is Language Server Protocol (LSP)? LSP is an open standard originally created by Microsoft for VS Code. It's a protocol that allows IDEs to send JSON messages to a separate process to query intellisense information. The core goal is to allow one language server to work with many different IDEs.
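Concretely, "goto definition" is a single JSON-RPC request from the IDE to the server. A minimal sketch of the message (the file path and position are made up; LSP positions are zero-based):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "textDocument/definition",
  "params": {
    "textDocument": { "uri": "file:///project/main.jai" },
    "position": { "line": 41, "character": 17 }
  }
}
```

The server replies with a Location (or an array of them): a target URI plus a start/end range. That's the entire contract a goto-definition server has to fulfill.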

This screenshot isn't wrong, but I think it's incomplete. It doesn't reflect the amount of work that is necessary to support a given language. Here's how many lines of code are in various LSP servers.

Making matters worse, every build system may require modifying the language server. Complex systems like Bazel and Buck don't work out of the box: the language server needs to know how to find toolchains, resolve includes/dependencies, etc.

LSP has been super successful and is much better than the before times. However it's a little unfortunate how much effort is necessary for each language.

Version 1: string to definition

On Windows debug symbols are typically stored in .pdb files. MSVC and Clang/LLVM both produce pdb files which enables high quality debugging in Visual Studio, VSCode, WinDbg, etc. Given the amount of debug information in this file surely we can leverage it, right?

I'm a big fan of the RAD Debugger project. My first idea was to use the RAD Debugger codebase and provide goto definition support by taking a procedure name and looking up its file/line source location.

This totally worked. The pure C source looked roughly like this:

// Parse RDI file (debug symbols)
RDI_Parsed rdi_parsed = {0};
RDI_ParseStatus status = rdi_parse(input_data.str, input_data.size, &rdi_parsed);
RDI_Parsed *rdi = &rdi_parsed;

// Procedure to find
String8 targetProcName = ...;

// Linearly loop over procedures
RDI_Procedure *proc = rdi->procedures;
for(U32 i = 0; i < rdi->procedures_count; i += 1, proc += 1)
{
  // Get name of this procedure
  RDI_U32 idx = proc->name_string_idx;
  String8 name = {0};
  name.str = rdi_string_from_idx(rdi, proc->name_string_idx, &name.size);

  // Check for match
  if (str8_match(targetProcName, name, 0)) {
    // procedure -> root scope
    RDI_Scope* scope = &rdi->scopes[proc->root_scope_idx];

    // root_scope -> min_virtual_offset
    RDI_U64 voff = rdi->scope_voffs[scope->voff_range_first];

    // min_virtual_offset -> unit
    RDI_U64 vmap_idx = rdi_vmap_idx_from_voff(rdi->unit_vmap, rdi->unit_vmap_count, voff);
    RDI_Unit* unit = &rdi->units[vmap_idx];

    // unit -> line_info
    RDI_ParsedLineInfo lineInfo = { 0 };
    rdi_line_info_from_unit(rdi, unit, &lineInfo);

    // line_info * min_virtual_offset -> line
    RDI_U64 lineInfoIdx = rdi_line_info_idx_from_voff(&lineInfo, voff);

    // line -> source_file
    RDI_Line* line = &lineInfo.lines[lineInfoIdx];

    // source_file -> path
    RDI_SourceFile* sourceFile = &rdi->source_files[line->file_idx];
    String8 filePath = {0};
    filePath.str = rdi_string_from_idx(rdi, sourceFile->normal_full_path_string_idx, &filePath.size);

    // result is filePath + line_num
    return { filePath, line->line_num }; // tada! 🎉
  }
}

Unfortunately there was a problem. This worked by mapping string function names to their source location. Which works great for C and vanilla functions. But doesn't work at all with name mangling, function overloads, or templates/generics.
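To make the failure concrete, here's a tiny sketch. The two mangled strings are, to the best of my knowledge, the Itanium (Clang/GCC) and MSVC encodings of `int foo(int)`; a naive string match against the source-level name finds neither:

```rust
// Symbols as a compiler might record them for `int foo(int)`.
// Each overload gets its own mangled name; none equal the plain name.
fn recorded_symbols() -> Vec<&'static str> {
    vec![
        "_Z3fooi",      // Itanium ABI: foo(int)
        "?foo@@YAHH@Z", // MSVC: int __cdecl foo(int)
    ]
}

// Version 1's lookup strategy: exact string match on the name.
fn naive_lookup(target: &str, symbols: &[&str]) -> Option<usize> {
    symbols.iter().position(|s| *s == target)
}
```

Handling this properly would mean demangling every candidate symbol, which is exactly the kind of per-language knowledge I was trying to avoid.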

Version 2: call instructions

For my second version I wanted to properly support the aforementioned cases. But how?

My idea was to take a given file and line_num and query all functions called on that line. Unfortunately it turned out that this information is not stored in a pdb file! 😭

What a pdb does store is what instructions are associated with a given line. So the new plan is to take a given line, get the disassembled instructions from the exe, find function calls, and then use the pdb to map the function address to a source location.

To do this I switched from RAD Debugger C to Rust. This was for a variety of reasons I won't get into. After much iteration and asking ChatGPT a million questions I wound up with something that worked.

The pseudocode looks something like:

// 1. Parse exe with goblin crate
// 2. Parse PDB with pdb crate
// 3. Create PDB symcache with symbolic crate
// 4. Find exe address_range for (source_file, line_number)
// 5. Get exe instructions for address_range
// 6. Identify `call` instructions
// 7. Map `call` instruction target_address to source_location

The actual code looks vaguely like the following. A lot of error handling and initialization code is omitted.

fn goto_definition(params: GotoDefinitionParams) -> anyhow::Result<GotoDefinitionResponse> {
  let exe_path = ...;
  let pdb_path = ...;
  let source_file = ...;
  let line_number = ...;

  // 1. Parse exe with goblin crate
  let exe_data = std::fs::read(exe_path)?;
  let pe = PE::parse(&exe_data)?;

  // 2. Parse PDB with pdb crate
  let pdb_file = File::open(pdb_path)?;
  let mut pdb = PDB::open(pdb_file)?;

  // 3. Create PDB symcache with symbolic crate
  let buffer = ByteView::from_vec(...);
  let symcache = SymCache::parse(&buffer)?;

  // 4. Find exe address_range for (source_file, line_number)
  let mut address_range: (u32, u32) = (0,0);
  let dbi = pdb.debug_information()?;
  let string_table = pdb.string_table()?;
  let address_map = pdb.address_map()?;
  let mut modules = dbi.modules();
  while let Some(module) = modules.next()? {
    let info = pdb.module_info(&module)?.unwrap();
    let line_program = info.line_program()?;
    let mut lines = line_program.lines();
    while let Some(line) = lines.next()? {
      if line.line_start == line_number {
        let file_name = string_table.get(line_program.get_file_info(line.file_index)?.name)?;

        if source_file == file_name.to_string() {
          let rva = line.offset.to_rva(&address_map).unwrap();
          address_range = (rva.0, rva.0 + line.length.unwrap());
          break break; // pseudocode!
        }
      }
    }
  }

  // 5. Get exe instructions for address_range
  let cs = Capstone::new().x86().mode(arch::x86::ArchMode::Mode64).build()?;
  let text_section = pe.sections.iter()
    .find(|s| s.name().unwrap_or_default() == ".text")
    .context("no .text section")?;
  let text_start = text_section.pointer_to_raw_data as usize;
  let bytes = &exe_data[text_start..text_start + text_section.size_of_raw_data as usize];
  let exe_instructions = cs.disasm_all(bytes, text_section.virtual_address as u64)?;

  // 6. Identify `call` instructions
  let mut result: Vec<(String, u32)> = Default::default();
  let (start, end) = address_range;
  let section = pe.sections.iter().find(|sec| sec.virtual_address <= start && start < sec.virtual_address + sec.virtual_size);
  
  if let Some(section) = section {
    // Get instructions for address range
    let offset = (start - section.virtual_address) as usize;
    let size = (end - start) as usize;
    let bytes = &exe_data[section.pointer_to_raw_data as usize + offset..section.pointer_to_raw_data as usize + offset + size];
    let instructions = cs.disasm_all(bytes, start as u64)?;

    for instruction in instructions.iter() {
      // Find all `call` instructions
      if instruction.mnemonic().unwrap_or_default() == "call" {
        // Assumes a direct call with an immediate target, e.g. `call 0x140001000`
        let op_str = instruction.op_str().unwrap_or_default();
        let mut target_address = u64::from_str_radix(op_str.trim_start_matches("0x"), 16).unwrap();

        // lookup target instruction, may be a jmp thunk
        let maybe_inst = exe_instructions.iter().find(|i| i.address() == target_address);
        if let Some(instruction) = maybe_inst {
          if instruction.mnemonic().unwrap_or_default() == "jmp" {
            let target_op = instruction.op_str().unwrap_or_default();
            target_address = u64::from_str_radix(target_op.trim_start_matches("0x"), 16).unwrap();
          }
        }

        // 7. Map instruction target_address to source_location
        if let Some(location) = symcache.lookup(target_address).next() {
          let file = location.file().map(|f| f.full_path()).unwrap_or_default();
          result.push((file, location.line()));
        }
      }
    }
  }

  // Omitted: convert result to lsp_types::Location
  return Ok(GotoDefinitionResponse::Array(result));
}

Tada! 🎉🍾🥇

If you read the code you may have noticed an interesting wrinkle when getting the target_address from the call instruction. When I first implemented this I simply could not find the function at the supposed address. It was bewildering and frustrating.

Eventually I cracked open a debugger to inspect the disassembly. What I discovered is that the call target_address was indeed not the function I expected! Instead it was the address of a jmp instruction to the actual function.
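The resulting fix is a single extra hop: if the call target lands on a `jmp` (likely an incremental-linking thunk), follow it once. A toy model of that resolution, with made-up addresses standing in for Capstone's real disassembly:

```rust
use std::collections::HashMap;

/// A decoded instruction: its mnemonic plus an already-parsed branch target.
struct Insn {
    mnemonic: &'static str,
    target: u64,
}

/// Resolve a call target, hopping over a single `jmp` thunk if present.
fn resolve_call_target(mut target: u64, insns: &HashMap<u64, Insn>) -> u64 {
    if let Some(insn) = insns.get(&target) {
        if insn.mnemonic == "jmp" {
            // Thunk: the real function lives at the jmp's own target.
            target = insn.target;
        }
    }
    target
}
```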

VSCode Extension

Armed with some Rust code I tied it all together in a VSCode extension - fts-lsp-pdb.

I've never written a VSCode Extension before. It wasn't too bad. Microsoft provides a spectacular LSP sample project that I hacked up. There's a delightful Rust lsp-server crate that handles all the JSON message passing I don't want to think about. Documentation for it is non-existent, but ChatGPT/Claude helped me bootstrap.

What's Next?

What's next? Honestly, probably nothing. If folks want to issue PRs on the GitHub repo I'll pay attention. It'd be nice if goto definition worked on variables and not just functions. But I'm honestly not sure if pdbs have that information or not.

One huge limitation of the current server is that it only works on Windows with pdb files. Maybe someone wants to update it to support DWARF? I'm not a Linux person, but it'd probably be straightforward.

RAD Debugger defines its own rdi file type. It converts pdb files to rdi and has preliminary support for DWARF. That could be a nice unification when it's ready.

The Future

Now I'd like to paint a picture of what the future could look like.

My fts-lsp-pdb server is neat but limited by the data currently contained in debug symbols. However I think the general approach is actually quite sound.

LSP Servers fundamentally reimplement a large portion of the compiler. This is crazy! Resolving templates/generics/overloads can be insanely complex. Why is that work being reimplemented? I find it ludicrous that rust-analyzer is over 365,000 lines of Rust code (even if a large chunk of that is tests).

Here is an audacious proposal. I think Microsoft should create a new specification called "Intellisense Database". Then compilers, such as Clang and MSVC, should be updated to natively create and incrementally update .idb files. This data can then be used by either a generic or language specific LSP server to provide intellisense capabilities.

I'm probably being over-optimistic, but I bet an extremely rich and robust idb file could be generated for super complex languages like C++ in ~10 thousand lines of code. The data already exists during compilation. It just needs to not be thrown away.
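To make the proposal concrete, here's a purely hypothetical sketch of the kind of record an .idb file could store. Every name here is invented; the point is only that the compiler already knows all of this:

```rust
/// Hypothetical: where a symbol lives in source.
struct SourceLocation {
    file: String,
    line: u32,
    column: u32,
}

/// Hypothetical: what kind of thing a symbol is.
enum IdbSymbolKind { Function, Variable, Type, Macro }

/// Hypothetical: one record per symbol the compiler resolved,
/// with both its declaration and every resolved use site.
struct IdbSymbol {
    name: String, // source-level name, not mangled
    kind: IdbSymbolKind,
    decl: SourceLocation,
    references: Vec<SourceLocation>,
}
```

Goto definition is then a lookup from a use site to `decl`; find-all-references is the reverse. No reparsing, no reimplementing overload resolution.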

Long term I'd like to see both idb data and full source code embedded in debug symbol files. I'm a mega fan of symbol server and source indexing. It should be possible to download symbols for a binary and get full source code with full intellisense. Oh how wonderful that would be!

Conclusion

I totally nerd sniped myself over my desire to have goto definition support for Jai. I ended up making a new LSP server that works with not only Jai but any compiled language that has pdb files. This isn't a replacement for any existing LSP server, nor is it an ideal long-term solution. However, for anyone using LLVM to make a new language it's a great way to get basic intellisense "for free".

I also happen to think the idea of using debug symbols for intellisense is clever, cool, and useful. Compilers should, imho, emit more data explicitly for the purpose of supporting intellisense.

Thanks for reading.

Bonus Section

Building Debug Symbols

One of the strongest arguments against my debug symbol approach is how slow and expensive it is to make a debug build. This is a fair and reasonable argument.

It's worth pointing out that typical LSP servers spend a lot of cycles parsing code. I'm not sure why but rust-analyzer takes noticeably longer to initialize than a full, clean build does. Maybe because it's running single threaded? Not sure.

Anyhow, slow debug builds are definitely a valid reason to not use fts-lsp-pdb! If my vision of the future comes to fruition and compilers generate .idb files then I believe that can be done incrementally and quickly. It shouldn't be much different than what LSP servers do today. But it should be more accurate, reliable, and simple.

Notes on PDB Support

In theory fts-lsp-pdb should work with any language that generates pdbs. However in practice I found some issues. For example my test Odin pdb hits an assert in debug mode. And symbolic hits an unreachable! with D.

I'm not sure if these are errors in the PDB or symbolic crates or if the language compilers are generating naughty .pdb files.

Bottom line is that it should work, but it might not!

A Rust Rant: Self-Referential Structs

I'm a big fan of Rust. I've written small amounts of Rust for years. I've written numerous blog posts about or using Rust. Generally I do not find the borrow checker difficult to satisfy. However for the very first time I fought and partially lost a gnarly borrow checker battle.

Not shown above is code to pre-process and cache the exe and pdb. Finding the right data requires some linear searches which is inefficient. Ideally they would be loaded once and pre-processed into a cache. It turns out this is nigh impossible to do in Rust.

The well known issue is that Rust really, really, really hates self-referential structs. Even when it's PERFECTLY SAFE to do so. Which makes it immensely frustrating. The problem isn't that the code is wrong. The problem is the borrow checker isn't smart enough to prove it's correct!

I probably spent 12+ hours trying different strategies and crates. The fact that my cache is refreshed in a message loop when the user modifies the VSCode config made it even more painful. I found a solution using self_cell which mostly works. I failed to find a way to cache pdb.address_map().

The ideal Rust code should look like this:

struct ExeCache {
    // Owned data
    exe_bytes: Vec<u8>,
    exe_capstone: capstone::Capstone,
    pdb_path: PathBuf,
    pdb_bytes: Vec<u8>,

    // Self-referential data
    exe_parsed: goblin::pe::PE,
    exe_instructions: capstone::Instructions,
    exe_instructions_sorted: HashMap<u64, &capstone::Insn>,
    pdb_line_infos: HashMap<String, HashMap<u32, &pdb::LineInfo>>,
    sym_cache: symbolic::SymCache,
}

The horrible, ugly, disgusting, very bad code that actually works looks like this:

// Cache for a single exe/PDB
self_cell::self_cell!(
    struct ExeCache {
        owner: ExeCacheOwner,

        #[covariant]
        dependent: ExeCacheRefs,
    }
);

struct ExeCacheOwner {
    exe_bytes: Pin<Box<[u8]>>,
    exe_capstone: capstone::Capstone,

    pdb_path: PathBuf,
    files: HashMap<String, HashMap<u32, pdb::LineInfo>>,

    symcache_bytes: Pin<Box<[u8]>>,
}

self_cell::self_cell!(
    struct ExeCacheRefs<'a> {
        owner: ExeCacheRefs1<'a>,

        #[covariant]
        dependent: ExeCacheRefs2,
    }
);

struct ExeCacheRefs1<'a> {
    exe_parsed: goblin::pe::PE<'a>,
    exe_instructions: capstone::Instructions<'a>,
    symcache: SymCache<'a>,
}

struct ExeCacheRefs2<'a> {
    exe_instructions_sorted: HashMap<u64, &'a capstone::Insn<'a>>,
}

Gross gross gross. 🤢🤮

There are a variety of crates that help provide support for self-referential structs. self_cell, ouroboros, rental, and more. These crates provide an existence proof that it is possible to have safe and correct self-referential structs in Rust.
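There's also a cruder escape hatch that avoids these crates entirely: store indices into owned data instead of references. I could have used this for the instruction map; a minimal sketch:

```rust
use std::collections::HashMap;

/// Instead of `HashMap<u64, &Insn>` borrowing from the decoded list,
/// store each instruction's index in its owning Vec. No self-reference,
/// no borrow checker fight, at the cost of one extra indirection.
struct InsnCache {
    instructions: Vec<(u64, String)>, // (address, mnemonic), owned
    by_address: HashMap<u64, usize>,  // address -> index into `instructions`
}

impl InsnCache {
    fn new(instructions: Vec<(u64, String)>) -> Self {
        let by_address = instructions
            .iter()
            .enumerate()
            .map(|(i, (addr, _))| (*addr, i))
            .collect();
        InsnCache { instructions, by_address }
    }

    fn lookup(&self, addr: u64) -> Option<&(u64, String)> {
        self.by_address.get(&addr).map(|&i| &self.instructions[i])
    }
}
```

It doesn't solve everything (types like goblin's PE genuinely need to borrow their source bytes), but it covers the lookup-table cases.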

I'm sure this is a very difficult problem. A recent blog post discusses some of the backwards compatibility and ergonomic issues.

Unfortunately I don't really care how hard it is! It's a black mark on the language. And it may never be fixed, not because it's intrinsically impossible, but because Rust is now a mature language with a variety of historical artifacts and missteps.

Grumble grumble rant rant. 😤