Concept - B*Trees - NTFS Directory Trees

Previous Next

Home Add Delete


Here are some sites that I found helpful whilst writing the B-Tree code.

Basic Terminology

An object bearing data
A key with no children
A collection of keys
A node of order n, has a maximum of n-1 keys
An ordered data structure
Root Node
A node with no parent
The ceil((n-1)/2)th key in a node
Two keys in the same node, or two nodes with the same parent
The number of layers in the tree. Grandparent, parents, children = 3
A balanced tree
A balanced tree whose nodes are at least 1/2 full
A balanced tree whose nodes are at least 2/3 full



  fixed order
  height balanced
  during add/remove of keys
  minimal disturbance
  pointers downwards only

NTFS Trees

      index root
      index allocation
      dummy keys
      data in non-leaf keys
      on-disk pointer only point down

  What we have so far




  Add Rules

      Find the first key that is larger than the new key
      (this will be a necessarily be a leaf)
      Insert the new key before this key (in the same node)
      While the node is full
	  Split the current node in two
	  Promote the median key to the parent
	  Now consider the parent

  Delete Rules

      Delete the key
      If the key had children
	  Find the successor and move it to this node
	  Now consider the successor's old node
      While the node isn't full enough
	  If a sibling has enough keys
	    steal one
	    Combine with one of the sibling


This is a set of examples covering all the mechanisms for a B+Tree. There are other combinations which are the mirror images of some of the examples.

All the examples have nodes with order 4, i.e a maximum of 4 keys. With higher orders, some other manipulations will be necessary.

In each of the examples, a node (cyan) contains one or more keys (white, named). Each node is terminated with a dummy key (exclamation mark) which is lexicographically higher in any comparisons with normal key names.

Add      01 02 03 04 05 06 07 08 09 10 11 12 13 14
Delete      01 02 03 04 05 06 07

How to debug

Here's a picture of flatcap debugging the code using ddd - The Data Display Debugger.


A discussion log from #ntfs on IRC.

flatcaphi _Oracle_
_Oracle_hi there
flatcapanything I can do for you?
_Oracle_I was wondering about the B+ trees of ntfs
_Oracle_they seem to be a bit awkward, or at least - not what I expected :)
flatcapthey _do_ seem strange, but they are perfect for filesystems
_Oracle_no, i meant their on-disk representation
_Oracle_they have a dummy node of sorts?
flatcapthe trees in ntfs aren't proper b+trees
flatcapa dummy key
_Oracle_that's exactly what I was hoping to hear!
flatcap(thinking is still a bit hard this morning, bear with me :-)
_Oracle_no problem ;-)
flatcapthe trees consist of a node, which contains keys
flatcapthe keys in a real (ideal world) b+tree are just separators, and the data is only stored in the leaves
_Oracle_btw - how big is a node under ntfs? i mean, how many keys fit in there?
flatcapthe INDX record is 4k, an you can get 10's of filenames in it
flatcapbut..., that depends on the lengths of the filenames
_Oracle_i thought the number of keys in a node was a fixed property of a b+ tree?
flatcaphehe, usually, yes
flatcapthe keys of ntfs actually contain data and also a pointer to their children
_Oracle_so i noticed
AntonAone should add that INDX records of 2k size have also been seen in the wild -- by me (-:
_Oracle_what OS?
flatcapbecause there's one more child than key, there has to be a dummy key (no data, but has children)
AntonAsome of my directories (e.g. c:\winnt and c:\program files) have 2k INDX size while other dirs have 4k.
_Oracle_so the dummy key is always the "largest"?
_Oracle_i see...
_Oracle_so if the non-leaf nodes have data of themselves, wouldn't that make the tree a b-tree?
flatcapI've just written a test program to help me understand the trees, which is on bitkeeper
_Oracle_I'd love to see that
flatcapI read a lots of webpages and I think that the nearest term is a b*tree
_Oracle_and how is it different from a b-tree?
flatcapa b-tree maintains a minimum of 1/2 full nodes (except for the root node)
flatcapa b*tree changes the rules slightly and maintains 2/3 full
_Oracle_so it just changes the rules of combining two nodes to one and such?
_Oracle_let me think about that for a moment :)
flatcapin a true b+tree, the data keys (leaves) should also have pointers to the next (for quick sequential accesses), but that's probably just maintained in memory
flatcapI'm going to write up everything I know about ntfs trees soon
_Oracle_let me see if i got that...
_Oracle_the index root points to the root INDX record
flatcapyou can see my test prog at:
_Oracle_each INDX record contains keys that have pointers to the files themselves and to the keys with lower values
_Oracle_I see
flatcapthe index root lives in the MFT record
_Oracle_Yeah, this I managed to discover :)
flatcapall the rest (index allocations) are non-res
_Oracle_and the number of keys in a single INDX record is completely flexible?
flatcapyes, but there's a minimum
AntonAa minimum?
flatcapyes, that's part of the tree algorithm
AntonAsurely the minimum is a non-data containing terminator entry?
_Oracle_what's the minimum?
flatcapthe minimum for a b+tree is 1/2 full, b* 2/3 full
flatcaponly the root node may contain fewer
AntonAand the last node...
flatcapthe keys are moved about to keep this true
flatcapeven the last node will have the "right number" in it
AntonAthat would mean that in a really large directory deleting one file could take hours?
flatcapno, you might think that, but the balancing doesn't affect many other nodes
flatcapif the tree is 4 deep (NTFS equiv say 10^5 files), you'd only be altering 4 index records
flatcapI'll draw lots of pictures when I have a moment (probably tomorrow)
_Oracle_that should be interesting to read!
flatcapare you on our dev mailing list, _Oracle_
_Oracle_What mailing list? (er... no.)
AntonAthe major question that springs to my mind is what would windows ntfs do if it saw an imbalanced tree (because we messed up or because we simply chose to ignore balancing)
flatcaphehe, I hate to think :-)
_Oracle_I wouldn't want to be there, that's for sure
flatcapchkdsk would probably try and rebalance it and you might find that ntfs.sys would just balance it out as files were created/deleted
_Oracle_how do i join the list?
AntonAum, it would be a lot easier to get directory operations working while ignoring the existence of the tree (obviously operating on them correctly so we don't kill the fs)
flatcapI'll mail the list and answer questions there
AntonAif windows is able to pickup the pieces without complaint / failure, it would be worth considering as a first pass of implementation at least.
flatcapyes possibly, but I think I know enough now to build something close enough
flatcap(I just wanted a big project where I could start without tripping over you :-)
_Oracle_I've got a few more questions if you have the time
AntonAAs I said before. I am not going anywhere near directories. (-:
_Oracle_Smaller ones, though

Copyright ©