Aquaria/ExternalLibs/iprof/iprof.txt

IPROF: A Portable Industrial-Strength Interactive Profiler for C++ and C
by Sean Barrett

Version 0.2


CONTENTS
  Overview
  User Manual
    Platform
    Instrumentation
      Private Zones
      Public Zones
    Initialization
    Processing Data
    Displaying Results
    Controlling Display
    Understanding CALL GRAPH output
    Performance Expectation
  Implementation Notes
  Version History


OVERVIEW

IProf is an interactive profiler which works by intrusively instrumenting
code. Code is divided into zones by programmer-inserted statements. Zones
are both lexically and dynamically scoped--all time spent within a
lexically scoped zone, and any code which it calls which is not itself
zoned, is attributed to that zone.

Profiling occurs interactively; time is divided into "frames", and the
profiler shows time spent on the previous frame (or a smoothed average
or possibly even a frame a second or two ago).

Like a traditional profiler, IProf records or can compute the number of
times a zone is entered, the amount of time spent in the zone ("self
time"), and the amount of time spent in the zone and its descendents
("hierarchical time" -- "self + child time" in gprof).

Furthermore, IProf computes information along the lines of gprof--number of
times a given zone is entered from any other specific zone; self and
hierarchical time spent in a given zone on behalf of a specific parent
zone, etc. (However, where gprof estimates this information based only on
call counts, IProf measures the actual values. So, for example, IProf will
accurately report if a ray-casting routine called by both physics and AI
always spends longer per AI-call because the casts are longer.)

Precise information is available for recursive routines, including call
depths etc. [The current version of IProf does not yet completely handle
reporting of recursive data, although it is measured correctly.]

Additionally, IProf provides all numbers in instantaneous form or as two
differently weighted moving averages. It's easy to pause the profile
updating so that you can switch between multiple views of the paused data
set. Two optional flags allow trading off memory for deeper historical
views. The cheaper option provides only zone-self-time history, suitable
for a real-time graph of behavior. The more memory-expensive flag keeps a
history of all the data for a certain number of frames, allowing full
profile analysis of old frames.

IProf is designed for its monitoring/gathering mode to be "always on", even
in release/optimized builds. The monitoring routines are designed to be
reasonably efficient--the full hash on every function entry required by
gprof is avoided in most cases--and the programmer can minimize the impact
by limiting the instrumentation to relatively large routines. (One could
certainly instrument a vector add function and possibly get useful call
count data from it, but the monitoring overhead would be significant and
noticeable in that case.) In combination with history information, it
becomes possible to run an application, notice poor behavior, pause the
(always on) profiling and the application, and start browsing through the
historical profiling information.

IProf uses both per-call monitoring and a separate per-frame
gathering/analysis phase. The latter is itself instrumented so the overhead
due to it is easy to see.


USER MANUAL

These sections document the necessary code you must use and code changes
you must make to use the profiler.

The profiling system expects to be able to use any identifier which is
prefixed with "Prof_" with exactly that pattern of uppercase/lowercase
(i.e. "PROF_" and "prof_" can be used freely by other code).


COMPILING THE PROFILING SYSTEM

The profiler was developed using MSVC 6.0, but should be reasonably
portable. The implementation files are provided as .C files so they can be
used with C compilers; however, they can be renamed to C++ files and
compiled in that form. The implementations automatically insert extern "C"
on the public routines. Internal routines will use either C or C++ linkage
depending on which way you compile them; you must compile all the profiler
files as either C or C++, without intermixing.

[[ NOTE: Originally the code was written in C++, and then it was
   converted to compile with C, and then some additional small changes
   were made. As of this writing, I haven't actually tested compiling
   everything as C++ again. Feel free to test for me. Or just compile
   the C files as C--you can still USE the C++ interfaces fine.]]

Needed files:
   prof_win32.c    -- Win32 implementation of seconds-based timer
   prof_gather.c   -- raw data collection
   prof_process.c  -- high-level data collection, report generator
   prof_draw.c     -- opengl rendering interface

   prof.h          -- public front-end
   prof_win32.h    -- Win32 implementation of fast integer timestamp
   prof_gather.h   -- instrumentation macros (included by prof.h)
   prof_internal.h -- private interfaces


PLATFORM SUPPORT

IProf requires a small amount--less than fifty lines--of platform-specific
code.

Win32 under MSVC is automatically supported with no further effort on your
part, using the files prof_win32.c and prof_win32.h

To use other platforms, just create equivalent files for your platform. The
C file contains a routine for getting an accurate floating point time
reading; the H file contains the definition of a 64-bit integer type and a
fast routine for reading a timestamp of that size. If 64-bit math isn't
available on your platform, or if your timestamp is only 32-bit, you can
replace the 64-bit type with a 32-bit type, as long as that item won't
overflow in the course of running the application. (A 31-bit millisecond
timer is good for 24 days, but is very imprecise for this application.) If
reading the timestamp is slow, you will want to minimize how often the zone
entry and exit points are called.

Also required is a display interface; an opengl one is provided, although
others would be easy to code. (The primary display is purely textual, and
is available through a text interface.)


INSTRUMENTATION

First, #include "prof.h" in files that need profiling.

The flag Prof_ENABLED determines whether the monitoring code is compiled or
not, to make it easy to turn off all profiling code for final shippable
builds. Additional flags controlling amount of history data and memory
usage therein are defined at the top of the file prof_process.c and should
just be changed there since they affect no other files.

There are two main ways of instrumenting, and each offers a C++ interface
and a C interface.

  Private zones
     C++    Prof(zone);

     C      Prof_Begin(zone)
            Prof_End

  Public zones
            Prof_Define(zone);
            Prof_Declare(zone);

     C++    Prof_Scope(zone);

     C      Prof_Region(zone)
            Prof_End

Zone names--indicated by "zone" above--must obey the rules for identifiers,
although they can begin with a number, and they exist in a separate
namespace from regular identifiers.

So these are valid zone names:
    my_zone_2
    2_my_zone
    __

And these are NOT valid zone names:
    "my_zone"
    my_class::my_zone


PRIVATE ZONES

The simplest, and highly recommended, approach to instrumentation is to
create a private zone which only exists in a single location. In C++, you
do this by declaring a lexically scoped zone with a statement which behaves
semantically like a variable declaration:

   // C++ instrumentation
   void my_routine()
   {
      Prof(my_routine_name);
      ... my code ...
   }

This will cause all time spent after Prof(my_routine_name) to accumulate in
a zone in the profiling reports labeled "my_routine_name". The zone ends
when the name goes out of scope, that is, when a destructor would be called
corresponding to this declaration.

Zones don't have to appear at routine-level function scope; for example:

   // C++ instrumentation
   void my_routine()
   {
      Prof(my_routine);
      ...                // zone my_routine
      if (...)
      {
         Prof(my_routine_special_case);
         ...             // zone my_routine_special_case
      }
      ...                // zone my_routine
   }

Instrumenting in C requires more work, because C doesn't provide
destructors, so it's not possible to lexically scope zones automatically.
Instead, the programmer must insert Begin/End pairs and make sure those
pairs are accurately balanced. All paths out of a function must be
accounted for. A crash or severe slowdown is likely to occur with
unbalanced pairs.

   // C instrumentation
   void my_routine(void)
   {
      Prof_Begin(my_routine)
      int x = some_func();
      if (x == 0) {
         Prof_End
         return;
      }
      ...
      Prof_End
   }

Prof_Begin() is declaration-like; however, it takes no trailing semicolon.
(This is necessary so it can be compiled out; C doesn't allow the empty
statement ";" to precede variable declarations.) Prof_End takes no
trailing semicolon or parentheses to help remind you of this. (You can
change the definition of Prof_End in prof_gather.h if you don't like that.)

Profiling instructions like Prof() and Prof_Begin() can be placed anywhere
that variable declarations are legal; generally you want to define them
before other variables so the variable initializations are profiled.

The C interfaces are also available in C++ if you should want to use a not-
exactly-lexically-scoped zone, e.g. end a zone before the destructor would
go out of scope. (You can't, however, end Prof() with Prof_End.)


PUBLIC ZONES

If you define multiple private zones with the same name, they will be
treated as entirely unrelated zones that happen to have the same name, and
you will see the same name multiple times in the profiling output.

Instead, you probably want to use public zones, to use the same zone in
multiple regions of code. For example, we might have two routines that
serve the same purpose which we always want to measure as one. Or we might
have two blocks of code within a single routine which we want to credit to
the same zone.

To do this, first define the zone with Prof_Define(zone), and then use it
with Prof_Scope(zone) [C++] or Prof_Region(zone) ... Prof_End [C].

   // C++ instrumentation
   Prof_Define(my_routine);

   void my_routine_v1()
   {
      Prof_Scope(my_routine);
      ...
   }

   void my_routine_v2()
   {
      Prof_Scope(my_routine);
      ...
   }

or

   // C instrumentation
   Prof_Define(my_routine);

   void my_routine_v1(void)
   {
      Prof_Region(my_routine)
      ...
      Prof_End
   }

   void my_routine_v2(void)
   {
      Prof_Region(my_routine)
      ...
      Prof_End
   }

Because Prof_Define defines an actual global symbol (if used at file
scope), the symbol can even be referenced from other files by saying:

   extern Prof_Declare(my_routine);

   void my_routine()
   {
      Prof_Scope(my_routine);
   }

You can use 'extern "C" Prof_Declare()' or Prof_Define() to share a zone
between C and C++ code.


USER MANUAL - INIIALIZATION

The profiling system is self-initializing.


USER MANUAL - PROCESSING DATA

Every frame, you should call Prof_update(). Prof_update() will gather
results and record frame-history information on the assumption that each
call is a frame. Prof_update() takes a boolean flag which indicates whether
to update the history or not; passing in false means profiling is "paused"
and doesn't change.

You might wire this to its own toggle, or you might simply pass in a pre-
existing flag for whether the simulation itself is active or not, thus
allowing you to pause the simulation and automatically pause the profiling.
(On the other hand, if you're profiling a renderer, you might want to
pause the simulation and keep profiling.)


USER MANUAL - DISPLAYING RESULTS

IProf offers two separate types of display: the report, which is primarily
textual, and the graph, which is entirely graphical.

If you're using OpenGL, output is straightforward. For the text report,
call Prof_draw_gl() with the display set to a 2d rendering mode--one that
can use integer addressing, e.g. integers the size of pixels, virtual
pixels (e.g. a 640x480 screen regardless of actual dimension), or even
characters. Set the blending state to whatever blending mode you want for
the report display. For the  graphics report, call Prof_draw_graph_gl().
Details of the parameters to these functions are available in the header
file.

For other output devices (Direct3D, text), you'll have to write your own
functions equivalent to Prof_draw_gl() and Prof_draw_graph_gl(). These
should not be too difficult; these functions don't compute any of the
profiling information; they simply format a text report or dataset to the
screen. The text report format consists of several title fields to be
printed, and then a collection of data records. Each data record has a name
and an indentation amount for that name (used for call graph
parent/children formatting), a collection of unnamed data "values", and a
flag field indicating which of the data values should be displayed.
Additionally, data records have a "heat" which indicates how rapidly
changing they are, and one record may be "highlighted" indicating a virtual
cursor is on that line.

[[ In practice, Prof_draw_gl makes few enough GL calls that maybe it's
worth modularizing things out further. ]]


USER MANUAL - CONTROLLING DISPLAY

IProf features some easy-to-use UI elements that allow program-direct
control or user-interaction-based control over what data is reported.
Simply hook these calls up to hotkey presses to complete your working
profile system. (You could even write code to support mouse clicking on the
report by calling Prof_set_cursor and on the graph by calling
Prof_set_frame, but the hit detection is up to you.)

These are in rough order of the priority with which you might want to
implement them.

Most important

   Prof_set_report_mode(enum ...)
      Selects what to show in the report:
         Prof_SELF_TIME: flat times sorted by self time
         Prof_HIERARCHICAL_TIME: flat times sorted by hierarchical time
         Prof_CALL_GRAPH: call graph parent/children information

   Prof_move_cursor(int delta)
      Move the cursor up-or-down by delta lines

   Prof_select(void)
      Switch to call graph mode on whichever zone is currently selected

   Prof_select_parent(void)
      Go to largest-hierarchical-time parent of the active zone in
      the call graph. (Roughly like "go up a directory".)

Important if you support history

   Prof_move_frame(int delta)
      Move backwards or forwards in history by delta frames

Not too important

   Prof_set_average(int type)
      Selects which moving average to use (0 == instantaneous, 1=default);
      only meaningful if frame# = 0; when looking at history, instantaneous
      values are always used.

   Prof_set_frame(int frame)
      Selects which history entry to view (0==current, 1==previous, etc.)

   Prof_set_cursor(int pos)
      Set the position of the up-and-down cursor.

   Prof_set_recursion(enum ...)
      Selects whether to show recursive routines as a single zone or
      as a series of distinct zones for each recursion level.
      [[ currently unimplemented ]]


UNDERSTANDING CALL GRAPH OUTPUT

The call graph output focuses on a single zone, and provides information
about the parents (callers) and children (callees) of that zone.

The general format is something like this:

     zone                self   hier   count
       +my_parent1       0.75   2.50     4.0
       +my_parent2       1.00   3.25     6.0
    -my_routine          1.75   5.75    10.0
       +my_child1        1.00   2.00    15.0
       +my_child2        0.25   1.50   500.0
        my_child3        0.50   0.50     3.0

"self" indicates self-time (time in this zone), "hier" is hierarchical-time
(time in this zone or its descendents), and "count" is the number of times
the zone was entered. (Entry counts are inherently integral, but are shown
as floating point since they may be a moving average of several integers.)

Currently the zone "my_routine" is being examined. It accounts for 5.75
milliseconds of time between itself and the zones it calls. 1.75ms are
spent in itself. The zone was entered (called) 10 times this frame.

The difference between my_routine's self time and hierarchical time is
4.00ms; that much time must be being spent in its descendents. Its
immediate children--the zones that my_routine calls directly--appear below
it on the table. The hierarchical times of each child represents the time
spent in that child and all its descendents *on behalf of my_routine*--
other calls to that child are not counted. Thus, the sum of all the
children's hierarchical time should account for all time spent in
descendents of my_routine; hence, the sum of the child hier times is 4.00,
identical to the difference between self and hier for my_routine.

Above "my_routine" in the chart is information about the callers of
my_routine. However, the timings and counts in this section are not the
self and hierarchical times of the parent functions themselves--there is no
sensible meaning of "on behalf of my_routine" for the parents. Instead, the
self, hier, and count fields show the time spent *in my_routine* on behalf
of those parents. Thus, for each field, all of the parent entries sum to
the corresponding entry in my_routine. Again, these are computed exactly.
If my_routine was the public interface to a raycaster called by both AI and
physics, but it passed the raycast on to further routines which were
themselves explicitly zoned, then most of the my_routine time would be
spent in descendents. This would show up in the "hierarchical time" field,
and the parent zones, AI and physics, would show that hierarchical time
attributed accurately.

There is additional data available in the system--it would be possible to
drill down into lower-level functions and still attribute them to zones
several parent levels above; there just isn't currently any user interface
or computation functionality to do it.


PERFORMANCE EXPECTATION

Except for recursive routines (see Implementation Notes section), the
expected performance on zone entry comes from running roughly the following
code:

   extern Something *p0,*p1;
   if (p0->ptr_field != p1) { ... /* rarely runs */ }
   p0->int64_field0  = RDTSC; // read timestamp counter
   p0->int32_field  += 1;
   p1->int64_field1 += p0->int64_field0 - p1->int64_field0;
   p1 = p0;

Zone exit costs a bit less.


IMPLEMENTATION NOTES

IProf uses two relatively unknown techniques to produce accurate call
information with minimal overhead. The first technique produces accurate
call information at a similar cost to gprof's mcount monitoring; the second
reduces the overhead.

_Zone Stack Tracking_

gprof's mcount technique combines two separate measurements. At every
function entry, the function and the caller (grabbed from the return
address on the stack) are hashed to determine a unique "data-gathering
slot", and an integer in that slot is incremented. Thus, exact pairwise
call counts are computed. Simultaneously, gprof periodically samples the
instruction pointer to measure the time spent in any given routine--"self
times". Hierarchical times are computed by distributing the self times up
the tree based on the call graph counts. (If routine X is called 9 times
from routine Z, and one time from routine Y, then 90% of X's time is
attributed to Z, and 10% to Y.)

An intrusive profiler which samples a timer at zone entry and again at zone
exit will compute accurate hierarchical times. By keeping a stack of zones,
it's possible to compute accurate hierarchical and self times. The stack of
zones also provides caller information, so hierarchical and self times can
be attributed to each unique pair of caller & callee zones (via hashing).
This will allow much more accurate attribution. In fact, it is sufficient
to compute exact values for all the information gprof outputs, except in
the face of recursion. Performance is fairly good; unlike a single-zone
intrusive profiler, which must measure both self and hierarchical time,
since neither can be derived from the other, the zone-pair profiler can
only measure hierarchical time; self-time can be derived from hierarchical
time (but not vice versa).

A further improvement is, instead of having one data-gathering slot per
zone--that is, representing the state of the top of the zone stack--and
instead of having one data-gathering slot per caller/callee zone pair--that
is, representing the state of the top two entries of the zone stack--to
have one data-gathering slot per unique full stack state. This can be done
straightforwardly by building the stack as a linked list (creating an
inverted tree--a tree of all stack states with only parent-pointer links),
and hashing the "zone to be pushed" and the current stack to find the new
stack. Thus the cost of the hash computation is essentially identical to
the previous case. If every zone is only called from one specific place,
there will still only be one data-gathering slot per zone; if a routine is
recursive, it will create a large number of data-gathering slots, one for
each depth of recursion. (A complex mutually recursive program like a
compiler might generate an unreasonable number of unique states.)

With zone-stack tracking, it's possible to measure only either hierarchical
time or self-time and derive the other. Hierarchical time is actually more
efficient to measure, but it leaves handling the top-level overarching
global state as a special case (since it will have a timer that starts but
never ends). It's easier to instead measure self-time and rederive
hierarchical time. Moreover, a recursive routine will automatically
"overcount" hierchical time (it's accrued at each level of the hierarchy),
requiring significant fixup. It's more straightforward to just compute the
recursive data correctly from the self times in the first place.


_Hash Cacheing_

Although the hash lookup described above is coded to proceed as quickly as
possible if the hash hits on the first probe, it still requires enough
computation and a function call that it is worth avoiding if possible. To
that end, each zone-entry location declares a hidden static variable
private to that zone-entry point which caches the hash lookup. At zone-
entry, the code checks the cache's "next node in the linked list" field
with the current stack state. If the two are equal, then the cache is
valid, and no hash lookup occurs. If it does not much, then the cache is
wrong, and the hash lookup proceeds, and updates the cache. The cache is
initialized to a impossible value, so the first time the code is run a hash
lookup always occurs.

The result is that in the normal case, a routine called from a single
place, the cache is always valid (after the very first call). Furthermore,
the branch will always predict correctly, since it always branches
identically. However, for a routine that is called from several places,
there is a "switching" overhead each time it's called from a different
place. So, for example, a raycaster called by both physics and AI might pay
the overhead twice per frame, if all the AI calls occur before all the
physics calls. However, a common low-level routine (e.g. a vector add)
called alternately from two different zones would have to perform the hash
lookup every time.

The actual common "failure" case is a recursive routine, for which, each
time the routine is entered, the state of the call stack is different from
the last time, thus almost always paying the hash lookup case. For
something like a recursive linked list traversal, the hash occurs every
time. (It doesn't matter if the routine is tail-recursive; once you insert
the profiling instrumentation, it's no longer tail-recursive.) A full
binary tree traversal will always enter a different zone-stack-state from
last time, except after reaching a left-child leaf. (The recursion then
returns and then goes down to the right child, which is at the same height
as the left child.) So a full binary tree traversal will have to hash about
3/4 of the time. A full quadtree traversal will have to hash about 2/5 of
the time. If the traversal is doing anything complicated, this should not
be a problem; but if it's a simple traversal, the performance overhead may
be significant. Like the vector add case, it may be better to remove
instrumentation from low-inherent-cost recursive routines except when
absolutely needed. Of course, it's easy enough to compare performance
behavior before and after adding the instrumenting and see if the overhead
is acceptable.


VERSION HISTORY

version 0.2 -- 2003-02-06 STB
   - Significant interface changes to Prof_draw_gl:
      - accepts floating point instead of int for 2d screen metrics
      - accepts a total width and height of the display and conforms
        to that
      - accepts a precision specification for display of time values
   - added little '+' and '-' signs reminiscent of list displays
     so you know which ones can be drilled down on
   - expanded this doc's description of what's legal for a zone-name
   - fixed an error trying to compile the C files as C++
   - added Prof_select_parent() for moving up the tree

version 0.1 -- 2003-02-05 STB
   - First public version, heavily refactored, 1500 lines
   - win32 timing interface and smooth "moving average" code derived
        from Jonathan Blow's Game Developer Magazine articles
   - missing functionality:
      - correct attribution of time to zones that are parents of
        recursive zones in call graph view (hierarchical times don't
        bubble up correctly)
      - spread recursion display (displaying each depth of a recursive
        zone as if it were a separate zone)