Glossary |
Here are some sites that I found helpful whilst writing the B-Tree code.
B+Trees
fixed order height balanced during add/remove of keys minimal disturbance pointers downwards only
index root index allocation dummy keys data in non-leaf keys on-disk pointer only point down What we have so far ... Overview ... Add Rules Find the first key that is larger than the new key (this will be a necessarily be a leaf) Insert the new key before this key (in the same node) While the node is full Split the current node in two Promote the median key to the parent Now consider the parent End Delete Rules Delete the key If the key had children Find the successor and move it to this node Now consider the successor's old node End While the node isn't full enough If a sibling has enough keys steal one Else Combine with one of the sibling End End
This is a set of examples covering all the mechanisms for a B+Tree. There are other combinations which are the mirror images of some of the examples.
All the examples have nodes with order 4, i.e a maximum of 4 keys. With higher orders, some other manipulations will be necessary.
In each of the examples, a node (cyan) contains one or more keys (white, named). Each node is terminated with a dummy key (exclamation mark) which is lexicographically higher in any comparisons with normal key names.
Add | 01 02 03 04 05 06 07 08 09 10 11 12 13 14 | |
Delete | 01 02 03 04 05 06 07 |
Here's a picture of flatcap debugging the code using ddd - The Data Display Debugger.
A discussion log from #ntfs on IRC.
flatcap | hi _Oracle_ |
_Oracle_ | hi there |
flatcap | anything I can do for you? |
_Oracle_ | I was wondering about the B+ trees of ntfs |
_Oracle_ | they seem to be a bit awkward, or at least - not what I expected :) |
flatcap | they _do_ seem strange, but they are perfect for filesystems |
_Oracle_ | no, i meant their on-disk representation |
_Oracle_ | they have a dummy node of sorts? |
flatcap | the trees in ntfs aren't proper b+trees |
flatcap | a dummy key |
_Oracle_ | that's exactly what I was hoping to hear! |
flatcap | (thinking is still a bit hard this morning, bear with me :-) |
_Oracle_ | no problem ;-) |
flatcap | the trees consist of a node, which contains keys |
flatcap | the keys in a real (ideal world) b+tree are just separators, and the data is only stored in the leaves |
_Oracle_ | right |
_Oracle_ | btw - how big is a node under ntfs? i mean, how many keys fit in there? |
flatcap | the INDX record is 4k, an you can get 10's of filenames in it |
flatcap | but..., that depends on the lengths of the filenames |
_Oracle_ | i thought the number of keys in a node was a fixed property of a b+ tree? |
flatcap | hehe, usually, yes |
flatcap | the keys of ntfs actually contain data and also a pointer to their children |
_Oracle_ | so i noticed |
AntonA | one should add that INDX records of 2k size have also been seen in the wild -- by me (-: |
_Oracle_ | really? |
_Oracle_ | what OS? |
AntonA | NT4 |
flatcap | because there's one more child than key, there has to be a dummy key (no data, but has children) |
_Oracle_ | interesting... |
AntonA | some of my directories (e.g. c:\winnt and c:\program files) have 2k INDX size while other dirs have 4k. |
_Oracle_ | so the dummy key is always the "largest"? |
flatcap | yes |
_Oracle_ | i see... |
_Oracle_ | so if the non-leaf nodes have data of themselves, wouldn't that make the tree a b-tree? |
flatcap | I've just written a test program to help me understand the trees, which is on bitkeeper |
_Oracle_ | I'd love to see that |
flatcap | I read a lots of webpages and I think that the nearest term is a b*tree |
_Oracle_ | and how is it different from a b-tree? |
flatcap | a b-tree maintains a minimum of 1/2 full nodes (except for the root node) |
flatcap | a b*tree changes the rules slightly and maintains 2/3 full |
_Oracle_ | so it just changes the rules of combining two nodes to one and such? |
flatcap | exactly |
_Oracle_ | hmmm... |
_Oracle_ | let me think about that for a moment :) |
flatcap | in a true b+tree, the data keys (leaves) should also have pointers to the next (for quick sequential accesses), but that's probably just maintained in memory |
flatcap | I'm going to write up everything I know about ntfs trees soon |
_Oracle_ | let me see if i got that... |
_Oracle_ | the index root points to the root INDX record |
flatcap | you can see my test prog at: http://linux-ntfs.bkbits.net:8080/tng-support/src/tree |
_Oracle_ | each INDX record contains keys that have pointers to the files themselves and to the keys with lower values |
flatcap | yes |
_Oracle_ | I see |
flatcap | the index root lives in the MFT record |
_Oracle_ | Yeah, this I managed to discover :) |
flatcap | all the rest (index allocations) are non-res |
_Oracle_ | and the number of keys in a single INDX record is completely flexible? |
AntonA | yes |
flatcap | yes, but there's a minimum |
AntonA | a minimum? |
flatcap | yes, that's part of the tree algorithm |
AntonA | surely the minimum is a non-data containing terminator entry? |
_Oracle_ | what's the minimum? |
flatcap | the minimum for a b+tree is 1/2 full, b* 2/3 full |
flatcap | only the root node may contain fewer |
_Oracle_ | oh. |
_Oracle_ | yeah |
AntonA | and the last node... |
flatcap | the keys are moved about to keep this true |
flatcap | even the last node will have the "right number" in it |
AntonA | that would mean that in a really large directory deleting one file could take hours? |
flatcap | no, you might think that, but the balancing doesn't affect many other nodes |
flatcap | if the tree is 4 deep (NTFS equiv say 10^5 files), you'd only be altering 4 index records |
flatcap | I'll draw lots of pictures when I have a moment (probably tomorrow) |
_Oracle_ | that should be interesting to read! |
flatcap | are you on our dev mailing list, _Oracle_ |
_Oracle_ | What mailing list? (er... no.) |
AntonA | the major question that springs to my mind is what would windows ntfs do if it saw an imbalanced tree (because we messed up or because we simply chose to ignore balancing) |
flatcap | hehe, I hate to think :-) |
_Oracle_ | I wouldn't want to be there, that's for sure |
flatcap | chkdsk would probably try and rebalance it and you might find that ntfs.sys would just balance it out as files were created/deleted |
_Oracle_ | how do i join the list? |
flatcap | http://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev |
AntonA | um, it would be a lot easier to get directory operations working while ignoring the existence of the tree (obviously operating on them correctly so we don't kill the fs) |
flatcap | I'll mail the list and answer questions there |
AntonA | if windows is able to pickup the pieces without complaint / failure, it would be worth considering as a first pass of implementation at least. |
flatcap | yes possibly, but I think I know enough now to build something close enough |
flatcap | (I just wanted a big project where I could start without tripping over you :-) |
AntonA | cool |
_Oracle_ | I've got a few more questions if you have the time |
AntonA | As I said before. I am not going anywhere near directories. (-: |
flatcap | sure |
_Oracle_ | Smaller ones, though |