libbpg-0.9.6

2015-10-27 11:46:00 +01:00 · 2015-10-27 11:46:00 +01:00 · 35a8402710
commit 35a8402710
parent 3035b41edf
248 changed files with 232891 additions and 100 deletions
--- a/x265/doc/reST/threading.rst
+++ b/x265/doc/reST/threading.rst
@ -0,0 +1,266 @@
+*********
+Threading
+*********
+
+.. _pools:
+
+Thread Pools
+============
+
+x265 creates one or more thread pools per encoder, one pool per NUMA
+node (typically a CPU socket). :option:`--pools` specifies the number of
+pools and the number of threads per pool the encoder will allocate. By
+default x265 allocates one thread per (hyperthreaded) CPU core on each
+NUMA node.
+
+If you are running multiple encoders on a system with multiple NUMA
+nodes, it is recommended to isolate each of them to a single node in
+order to avoid the NUMA overhead of remote memory access.
+
+Work distribution is job based. Idle worker threads scan the job
+providers assigned to their thread pool for jobs to perform. When no
+jobs are available, the idle worker threads block and consume no CPU
+cycles.
+
+Objects which desire to distribute work to worker threads are known as
+job providers (and they derive from the JobProvider class).  The thread
+pool has a method to **poke** awake a blocked idle thread, and job
+providers are recommended to call this method when they make new jobs
+available.
+
+Worker jobs are not allowed to block except when absolutely necessary
+for data locking. If a job becomes blocked, the work function is
+expected to drop that job so the worker thread may go back to the pool
+and find more work.
+
+On Windows, the native APIs offer sufficient functionality to discover
+the NUMA topology and enforce the thread affinity that libx265 needs (so
+long as you have not chosen to target XP or Vista), but on POSIX systems
+it relies on libnuma for this functionality. If your target POSIX system
+is single socket, then building without libnuma is a perfectly
+reasonable option, as it will have no effect on the runtime behavior. On
+a multiple-socket system, a POSIX build of libx265 without libnuma will
+be less work efficient, but will still function correctly. You lose the
+work isolation effect that keeps each frame encoder from only using the
+threads of a single socket and so you incur a heavier context switching
+cost.
+
+Wavefront Parallel Processing
+=============================
+
+New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
+be encoded in parallel, so long as each row stays at least two CTUs
+behind the row above it, to ensure the intra references and other data
+of the blocks above and above-right are available. WPP has almost no
+effect on the analysis and compression of each CTU and so it has a very
+small impact on compression efficiency relative to slices or tiles. The
+compression loss from WPP has been found to be less than 1% in most of
+our tests.
+
+WPP has three effects which can impact efficiency. The first is the row
+starts must be signaled in the slice header, the second is each row must
+be padded to an even byte in length, and the third is the state of the
+entropy coder is transferred from the second CTU of each row to the
+first CTU of the row below it.  In some conditions this transfer of
+state actually improves compression since the above-right state may have
+better locality than the end of the previous row.
+
+Parabola Research have published an excellent HEVC
+`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
+which visualizes WPP very well.  It even correctly visualizes some of
+WPPs key drawbacks, such as:
+
+1. the low thread utilization at the start and end of each frame
+2. a difficult block may stall the wave-front and it takes a while for
+   the wave-front to recover.
+3. 64x64 CTUs are big! there are much fewer rows than with H.264 and
+   similar codecs
+
+Because of these stall issues you rarely get the full parallelisation
+benefit one would expect from row threading. 30% to 50% of the
+theoretical perfect threading is typical.
+
+In x265 WPP is enabled by default since it not only improves performance
+at encode but it also makes it possible for the decoder to be threaded.
+
+If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
+scan order and the entropy overheads will be avoided.  If frame
+threading is not disabled, the encoder will change the default frame
+thread count to be higher than if WPP was enabled.  The exact formulas
+are described in the next section.
+
+Bonded Task Groups
+==================
+
+If a worker thread job has work which can be performed in parallel by
+many threads, it may allocate a bonded task group and enlist the help of
+other idle worker threads from the same thread pool. Those threads will
+cooperate to complete the work of the bonded task group and then return
+to their idle states. The larger and more uniform those tasks are, the
+better the bonded task group will perform.
+
+Parallel Mode Analysis
+~~~~~~~~~~~~~~~~~~~~~~
+
+When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
+8x8) will distribute its analysis work to the thread pool via a bonded
+task group. Each analysis job will measure the cost of one prediction
+for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP).
+
+At slower presets, the amount of increased parallelism from pmode is
+often enough to be able to reduce or disable frame parallelism while
+achieving the same overall CPU utilization. Reducing frame threads is
+often beneficial to ABR and VBV rate control.
+
+Parallel Motion Estimation
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When :option:`--pme` is enabled all of the analysis functions which
+perform motion searches to reference frames will distribute those motion
+searches to other worker threads via a bonded task group (if more than
+two motion searches are required).
+
+Frame Threading
+===============
+
+Frame threading is the act of encoding multiple frames at the same time.
+It is a challenge because each frame will generally use one or more of
+the previously encoded frames as motion references and those frames may
+still be in the process of being encoded themselves.
+
+Previous encoders such as x264 worked around this problem by limiting
+the motion search region within these reference frames to just one
+macroblock row below the coincident row being encoded. Thus a frame
+could be encoded at the same time as its reference frames so long as it
+stayed one row behind the encode progress of its references (glossing
+over a few details). 
+
+x265 has the same frame threading mechanism, but we generally have much
+less frame parallelism to exploit than x264 because of the size of our
+CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
+rows available each frame while x265 only has 17 64x64 CTU rows.
+
+The second extenuating circumstance is the loop filters. The pixels used
+for motion reference must be processed by the loop filters and the loop
+filters cannot run until a full row has been encoded, and it must run a
+full row behind the encode process so that the pixels below the row
+being filtered are available. On top of this, HEVC has two loop filters:
+deblocking and SAO, which must be run in series with a row lag between
+them. When you add up all the row lags each frame ends up being 3 CTU
+rows behind its reference frames (the equivalent of 12 macroblock rows
+for x264). And keep in mind the wave-front progression pattern; by the
+time the reference frame finishes the third row of CTUs, nearly half of
+the CTUs in the frame may be compressed (depending on the display aspect
+ratio).
+
+The third extenuating circumstance is that when a frame being encoded
+becomes blocked by a reference frame row being available, that frame's
+wave-front becomes completely stalled and when the row becomes available
+again it can take quite some time for the wave to be restarted, if it
+ever does. This makes WPP less effective when frame parallelism is in
+use.
+
+:option:`--merange` can have a negative impact on frame parallelism. If
+the range is too large, more rows of CTU lag must be added to ensure
+those pixels are available in the reference frames.
+
+.. note::
+
+	Even though the merange is used to determine the amount of reference
+	pixels that must be available in the reference frames, the actual
+	motion search is not necessarily centered around the coincident
+	block. The motion search is actually centered around the motion
+	predictor, but the available pixel area (mvmin, mvmax) is determined
+	by merange and the interpolation filter half-heights.
+
+When frame threading is disabled, the entirety of all reference frames
+are always fully available (by definition) and thus the available pixel
+area is not restricted at all, and this can sometimes improve
+compression efficiency. Because of this, the output of encodes with
+frame parallelism disabled will not match the output of encodes with
+frame parallelism enabled; but when enabled the number of frame threads
+should have no effect on the output bitstream except when using ABR or
+VBV rate control or noise reduction.
+
+When :option:`--nr` is enabled, the outputs of each number of frame threads
+will be deterministic but none of them will match becaue each frame
+encoder maintains a cumulative noise reduction state.
+
+VBV introduces non-determinism in the encoder, at this point in time,
+regardless of the amount of frame parallelism.
+
+By default frame parallelism and WPP are enabled together. The number of
+frame threads used is auto-detected from the (hyperthreaded) CPU core
+count, but may be manually specified via :option:`--frame-threads`
+
+	+-------+--------+
+	| Cores | Frames |
+	+=======+========+
+	|  > 32 |  6..8  |
+	+-------+--------+
+	| >= 16 |   5    |
+	+-------+--------+
+	| >= 8  |   3    |
+	+-------+--------+
+	| >= 4  |   2    |
+	+-------+--------+
+
+If WPP is disabled, then the frame thread count defaults to **min(cpuCount, ctuRows / 2)**
+
+Over-allocating frame threads can be very counter-productive. They
+each allocate a large amount of memory and because of the limited number
+of CTU rows and the reference lag, you generally get limited benefit
+from adding frame encoders beyond the auto-detected count, and often
+the extra frame encoders reduce performance.
+
+Given these considerations, you can understand why the faster presets
+lower the max CTU size to 32x32 (making twice as many CTU rows available
+for WPP and for finer grained frame parallelism) and reduce
+:option:`--merange`
+
+Each frame encoder runs in its own thread (allocated separately from the
+worker pool). This frame thread has some pre-processing responsibilities
+and some post-processing responsibilities for each frame, but it spends
+the bulk of its time managing the wave-front processing by making CTU
+rows available to the worker threads when their dependencies are
+resolved.  The frame encoder threads spend nearly all of their time
+blocked in one of 4 possible locations:
+
+1. blocked, waiting for a frame to process
+2. blocked on a reference frame, waiting for a CTU row of reconstructed
+   and loop-filtered reference pixels to become available
+3. blocked waiting for wave-front completion
+4. blocked waiting for the main thread to consume an encoded frame
+
+Lookahead
+=========
+
+The lookahead module of x265 (the lowres pre-encode which determines
+scene cuts and slice types) uses the thread pool to distribute the
+lowres cost analysis to worker threads. It will use bonded task groups
+to perform batches of frame cost estimates, and it may optionally use
+bonded task groups to measure single frame cost estimates using slices.
+(see :option:`--lookahead-slices`)
+
+The main slicetypeDecide() function itself is also performed by a worker
+thread if your encoder has a thread pool, else it runs within the
+context of the thread which calls the x265_encoder_encode().
+
+SAO
+===
+
+The Sample Adaptive Offset loopfilter has a large effect on encode
+performance because of the peculiar way it must be analyzed and coded.
+
+SAO flags and data are encoded at the CTU level before the CTU itself is
+coded, but SAO analysis (deciding whether to enable SAO and with what
+parameters) cannot be performed until that CTU is completely analyzed
+(reconstructed pixels are available) as well as the CTUs to the right
+and below.  So in effect the encoder must perform SAO analysis in a
+wavefront at least a full row behind the CTU compression wavefront.
+
+This extra latency forces the encoder to save the encode data of every
+CTU until the entire frame has been analyzed, at which point a function
+can code the final slice bitstream with the decided SAO flags and data
+interleaved between each CTU.  This second pass over the CTUs can be
+expensive, particularly at large resolutions and high bitrates.