This file documents how GtkTextView works, at least partially. You probably want to read the text widget overview in the reference manual to get an application programmer overview of the public API before reading this. The overview in the reference manual documents GtkTextBuffer, GtkTextView, GtkTextMark, etc. from a public API standpoint. The BTree === The heart of the text widget is a data structure called GtkTextBTree, which implements all the hard work of the public GtkTextBuffer object. The purpose of the btree is to make most operations at least O(log N), so application programmers can just use whatever API is convenient without worrying about O(N) performance pitfalls. The BTree is a tree of paragraphs (newline-terminated lines). The leaves of the tree are paragraphs, represented by a GtkTextLine. The nodes of the tree above the leaves are represented by GtkTextBTreeNode. The nodes are used to store aggregate data counts, so we can for example skip 100 paragraphs or 100 characters, without having to traverse 100 nodes in a list. You might guess from this that many operations are O(N) where N is the number of bytes in a paragraph, and you would be right. The text widget is efficient for huge numbers of paragraphs, but will choke on extremely long blocks of text without intervening newlines. ("newline" is a slight lie, we also honor \r, \r\n, and some funky Unicode characters for paragraph breaks. So this means annoyingly that the paragraph break char may be more than one byte.) The idea of the btree is something like: ------ Node (lines = 6) / Line 0 / Line 1 / Line 2 / Line 3 / Line 4 / Line 5 Node (lines = 12) \ \---------- Node (lines = 6) Line 6 Line 7 Line 8 Line 9 Line 10 Line 11 In addition to keeping aggregate line counts at each node, we count characters, and information about the tag toggles appearing below each node. Structure of a GtkTextLine === A GtkTextLine contains a single paragraph of text. It should probably be renamed GtkTextPara someday but ah well. GtkTextLine is used for the leaf nodes of the BTree. A line is a list of GtkTextLineSegment. Line segments contain the actual data found in the text buffer. Here are the types of line segment (see gtktextsegment.h, gtktextchild.h, etc.): Character: contains a block of UTF-8 text. Mark: marks a position in the buffer, such as a cursor. Tag toggle: indicates that a tag is toggled on or toggled off at this point. when you apply a tag to a range of text, we add a toggle on at the start of the range, and a toggle off at the end. (and do any necessary merging with existing toggles, so we always have the minimum number possible) Child widget: stores a child widget that behaves as a single Unicode character from an editing perspective. (well, stores a list of child widgets, one per GtkTextView displaying the buffer) Image: stores a GdkPixbuf that behaves as a single character from an editing perspective. Each line segment has a "class" which identifies its type, and also provides some virtual functions for handling that segment. The functions in the class are: - SplitFunc, divides the segment so another segment can be inserted. - DeleteFunc, finalizes the segment - CleanupFunc, after modifying a line by adding/removing segments, this function is used to try merging segments that can be merged, e.g. two adjacent character segments with no marks or toggles in between. - LineChangeFunc, called when a segment moves to a different line; according to comments in the code this function may not be needed anymore. - SegCheckFunc, does sanity-checking when debugging is enabled. Basically equivalent to assert(segment is not broken). The segment class also contains two data fields: - the name of the segment type, used for debugging - a boolean flag for whether the segment has right or left gravity. A segment with right gravity ends up on the right of a newly-inserted segment that's placed at the same character offset, and a segment with left gravity ends up on the left of a newly-inserted segment. For example the insertion cursor has right gravity, because as you type new text is inserted, and the cursor ends up on the right. The segment itself contains contains a header, plus some variable-length data that depends on the type of the segment. The header contains the length of the segment in characters and in bytes. Some segments have a length of zero. Segments with nonzero length are referred to as "indexable" and would generally be user-visible; indexable segments include text, images, and widgets. Segments with zero length occupy positions between characters, and include marks and tag toggles. The GtkText*Body structs are the type-specific portions of GtkTextSegment. Character segments have the actual character data allocated in the same malloc() block as the GtkTextSegment, to save both malloc() overhead and the overhead of a pointer to the character data. Storing and tracking tags in the BTree === A GtkTextTag is an object representing some text attributes. A tag can affect zero attributes (for example one used only for internal application bookkeeping), a single attribute such as "bold", or any number of attributes (such as large and bold and centered for a "header" tag). The tags that can be applied to a given buffer are stored in the GtkTextTagTable for that buffer. The tag table is just a collection of tags. The real work of applying/removing tags happens in the function _gtk_text_btree_tag(). Essentially we remove all tag toggle segments that affect the tag being applied or removed from the given range; then we add a toggle-on and a toggle-off segment at either end of the range; then for any lines we modified, we call the CleanupFunc routines for the segments, to merge segments that can be merged. This is complicated somewhat because we keep information about the tag toggles in the btree, allowing us to locate tagged regions or add/remove tags in O(log N) instead of O(N) time. Tag information is stored in "struct Summary" (that's a bad name, it could probably use renaming). Each BTreeNode has a list of Summary hanging off of it, one for each tag that's toggled somewhere below the node. The Summary simply contains a count of tag toggle segments found below the node. Views of the BTree (GtkTextLayout) === Each BTree has one or more views that display the tree. Originally there was some idea that a view could be any object, so there are some "gpointer view_id" left in the code. However, at some point we decided that all views had to be a GtkTextLayout and so the btree does assume that from time to time. The BTree maintains some per-line and per-node data that is specific to each view. The per-line data is in GtkTextLineData and the per-node data is in another badly-named struct called NodeData (should be PerViewNodeData or something). The purpose of these is to store: - aggregate height, so we can calculate the Y position of each paragraph in O(log N) time, and can get the full height of the buffer in O(1) time. The height is per-view since each GtkTextView may have a different size allocation. - maximum width (the longest line), so we can calculate the width of the entire buffer in O(1) time in order to properly set up the horizontal scrollbar. - a flag for whether the line is "valid" - valid lines have not been modified since we last computed their width and height. Invalid lines need to have their width and height recomputed. At all times, we have a width and height for each view that can be used. This starts out as 0x0. Lines can be incrementally revalidated, which causes the width and height of the buffer to grow. So if you open a new text widget with a lot of text in it, you can watch the scrollbar adjust as the height is computed in an idle handler. Lines whose height has never been computed are taken to have a height of 0. Iterators (GtkTextIter) === Iterators are fairly complex in order to avoid re-traversing the btree or a line in the btree each time the iterator is used. That is, they save a bunch of pointers - to the current segment, the current line, etc. Two "validity stamps" are kept in the btree that are used to detect and handle possibly-invalid pointers in iterators. The "chars_changed_stamp" is incremented whenever a segment with char_count > 0 (an indexable segment) is added or removed. It is an application bug if the application uses an iterator with a chars_changed_stamp different from the current stamp of the BTree. That is, you can't use an iterator after adding/removing characters. The "segments_changed_stamp" is incremented any time we change any segments, and tells outstanding iterators that any pointers to GtkTextSegment that they may be holding are now invalid. For example, if you are iterating over a character segment, and insert a mark in the middle of the segment, the character segment will be split in half and the original segment will be freed. This increments segments_changed_stamp, causing your iterator to drop its current segment pointer and count from the beginning of the line again to find the new segment. Iterators also cache some random information such as the current line number, just because it's free to do so. GtkTextLayout === If you think of GtkTextBTree as the backend for GtkTextBuffer, GtkTextLayout is the backend for GtkTextView. GtkTextLayout was also used for a canvas item at one point, which is why its methods are not underscore-prefixed and the header gets installed. But GtkTextLayout is really intended to be private. The main task of GtkTextLayout is to validate lines (compute their width and height) by converting the lines to a PangoLayout and using Pango functions. GtkTextLayout is also used for visual iteration, and mapping visual locations to logical buffer positions. Validating a line involves creating the GtkTextLineDisplay for that line. To save memory, GtkTextLineDisplay objects are always created transiently, we don't keep them around. The layout has three signals: - "invalidated" means some line was changed, so GtkTextView needs to install idle handlers to revalidate. - "changed" means some lines were validated, so the aggregate width/height of the BTree is now different. - "allocate_child" means we need to size allocate a child widget gtk_text_layout_get_line_display() is sort of the "heart" of GtkTextLayout. This function validates a line. Line validation involves: - convert any GtkTextTag on the line to PangoAttrList - add the preedit string - keep track of "visible marks" (the cursor) A given set of tags is composited to a GtkTextAttributes. (In the Tk code this was called a "style" and there are still relics of this in the code, such as "invalidate_cached_style()", that should be cleaned up.) There's a single-GtkTextAttributes cache, "layout->one_style_cache", which is used to avoid recomputing the mapping from tags to attributes for every segment. The one_style_cache is stored in the GtkTextLayout instead of just a local variable in gtk_text_layout_get_line_display() so we can use it across multiple lines. Any time we see a segment that may change the current style (such as a tag toggle), the cache has to be dropped. To compute a GtkTextAttributes from the GtkTextTag that apply to a given segment, the function is _gtk_text_attributes_fill_from_tags(). This "mashes" a list of tags into a single set of text attributes. If no tags affect a given attribute, a default set of attributes are used. These defaults sometimes come from widget->style on the GtkTextView, and sometimes come from a property of the GtkTextView such as "pixels_above_lines" GtkTextView === Once you get GtkTextLayout and GtkTextBTree the actual GtkTextView widget is not that complicated. The main complexity is the interaction between scrolling and line validation, which is documented with a long comment in gtktextview.c. The other thing to know about is just that the text view has "border windows" on the sides, used to draw line numbers and such; these scroll along with the main window. Invisible text === Invisible text doesn't work yet. It is a property that can be set by a GtkTextTag; so you determine whether text is invisible using the same mechanism you would use to check whether the text is bold, or orange. The intended behavior of invisible text is that it should vanish completely, as if it did not exist. The use-case we were thinking of was a code editor with function folding, where you can hide all function bodies. That could be implemented by creating a "function_body" GtkTextTag and toggling its "invisible" attribute to hide/show the function bodies. Lines are normally validated in an idle handler, but as an exception, lines that are onscreen are always validated synchronously. Thus invisible text raises the danger that we might have a huge number of invisible lines "onscreen" - this needs to be handled efficiently. At one point we were considering making "invisible" a per-paragraph attribute (meaning the invisibility state of the first character in the paragraph makes the whole paragraph visible or not visible). Several existing tag attributes work this way, such as the margin width. I don't remember why we were going to do this, but it may have been due to some implementation difficulty that will become clear if you try implementing invisible text. ;-) To finish invisible text support, all the cursor navigation etc. functions (the _display_lines() stuff) will need to skip invisible text. Also, various functions with _visible in the name, such as gtk_text_iter_get_visible_text(), have to be audited to be sure they don't get invisible text. And user operations such as cut-and-paste need to copy only visible text.