Welcome to the pChase Benchmark Page!
pChase is a memory performance benchmark written by Doug Pase. It can tell you both the
latency and bandwidth of different access patterns, for various levels
of cache and for main memory. The access patterns may have a constant
stride or be completely random. The benchmark gets its name from the fact
that it chases pointers in memory. Chasing pointers ensures that we
actually measure the latency and bandwidth of memory references, as the
next reference cannot be generated until the contents of the pointer
are actually retrieved. Other benchmark approaches (for example, STREAM) can often generate addresses arithmetically, so successive references do not depend on one another; such benchmarks may measure memory bandwidth but not latency.
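The dependency between successive references can be illustrated with a minimal Python sketch. This is not pChase's implementation: a list of indices stands in for pointers in memory, and the names `build_chain` and `chase` are hypothetical, chosen only for this illustration.

```python
import random

def build_chain(n, seed=0):
    """Return a list where chain[i] is the index to visit after i.

    The n slots are linked into a single cycle in shuffled order.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chain = [0] * n
    for pos, idx in enumerate(order):
        chain[idx] = order[(pos + 1) % n]
    return chain

def chase(chain, start, steps):
    """Follow the chain for a number of steps and return the final index."""
    i = start
    for _ in range(steps):
        # The next index is only known once chain[i] has been read,
        # so the references cannot be generated ahead of time.
        i = chain[i]
    return i
```

Because the chain is a single cycle, `chase(build_chain(8), 0, 8)` returns to the starting index 0 after touching every slot once.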
The conceptual model for this benchmark is that memory is organized as
a hierarchy: the cache line, the DRAM page, and the memory pool within a
NUMA domain (here called a "chain"). The size of each level in
the hierarchy can be specified when the benchmark is run. The benchmark
progresses by selecting a page to reference. Within a selected page all
cache lines are referenced before the next page is selected. One
iteration walks through all pages within a chain. One experiment walks
through a chain for a specified number of iterations.
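As a rough illustration of how the three sizes relate, the following Python sketch derives the counts implied by the description above. The function names are hypothetical, and the use of plain integer division is an assumption; pChase itself may adjust option values that do not divide evenly.

```python
def chain_geometry(line_bytes, page_bytes, chain_bytes):
    """Return (lines per page, pages per chain) for the given sizes in bytes."""
    lines_per_page = page_bytes // line_bytes
    pages_per_chain = chain_bytes // page_bytes
    return lines_per_page, pages_per_chain

def references_per_experiment(line_bytes, page_bytes, chain_bytes, iterations):
    """Count cache-line references made by one experiment on one chain."""
    lines_per_page, pages_per_chain = chain_geometry(
        line_bytes, page_bytes, chain_bytes)
    # One iteration references every line of every page in the chain.
    return lines_per_page * pages_per_chain * iterations
```

For example, with illustrative sizes of 64-byte lines, 4096-byte pages and a 1 MiB chain, `chain_geometry(64, 4096, 1048576)` gives 64 lines per page and 256 pages per chain, so ten iterations make 163840 references.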
Cache lines may be selected in random order or by using a constant
stride. Strided access may be forward (increasing addresses) or reverse
(decreasing addresses). When the access is random, the page selection
is also random. When the access is strided, the next contiguous page is
selected in the direction of the stride.
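The selection rules above can be sketched in Python as a function that lists the (page, line) pairs in the order one iteration would visit them. This is an illustration under stated assumptions, not pChase's code: `visit_order` is a hypothetical name, and the way a stride larger than one is mapped onto the lines of a page (offsets 0, s, 2s, ... and then 1, 1+s, ...) is assumed here, since the text does not specify it.

```python
import random

def visit_order(lines_per_page, pages_per_chain, access="random",
                stride=1, seed=0):
    """Return the (page, line) pairs in the order one iteration visits them."""
    if access == "random":
        rng = random.Random(seed)
        pages = list(range(pages_per_chain))
        rng.shuffle(pages)                    # random page selection
        order = []
        for page in pages:
            lines = list(range(lines_per_page))
            rng.shuffle(lines)                # random lines within the page
            order.extend((page, line) for line in lines)
        return order
    if access not in ("forward", "reverse"):
        raise ValueError("access must be 'random', 'forward' or 'reverse'")
    # Assumed stride layout: every line is still visited exactly once.
    lines = [line for start in range(stride)
             for line in range(start, lines_per_page, stride)]
    order = [(page, line)
             for page in range(pages_per_chain) for line in lines]
    if access == "reverse":
        order.reverse()                       # decreasing addresses
    return order
```

In every case all lines of a page are visited before the next page is selected; only the order of pages and of lines within a page changes.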
An experiment may specify the number of threads that access memory
concurrently. This is useful for measuring contention between
different paths to memory within a system. In a NUMA architecture, the
contention between threads should be minimal when each thread accesses
only its own local memory. However, in SMP and multi-core
architectures, two threads may share a path to memory, causing
contention for the shared path.
An experiment may also specify the number of concurrent references
that are allowed per thread. This allows the benchmark to load up the
memory paths with references, showing more accurately what the
sustainable throughput of the system may be. Two references per thread
(that is, two chains per thread) means that two memory fetches will take
place concurrently from the same thread. This is different from two
references taking place concurrently in separate threads, as the memory
paths and the effect on resource usage will be different.
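The structure of several concurrent references in one thread can be sketched as a single loop that advances several independent chains in lockstep. The Python below only shows that structure; `chase_concurrently` is a hypothetical name, and Python itself does not overlap the loads. In compiled code, the loads from different chains do not depend on one another, which is what lets the hardware keep more than one fetch in flight.

```python
def chase_concurrently(chains, starts, steps):
    """Advance several independent chains from one thread, one hop each per step."""
    cursors = list(starts)
    for _ in range(steps):
        for k, chain in enumerate(chains):
            # Each hop depends only on the previous hop in the same chain,
            # never on the other chains.
            cursors[k] = chain[cursors[k]]
    return cursors
```

With two three-element chains `[[1, 2, 0], [2, 0, 1]]` both started at index 0, one step yields `[1, 2]` and three steps bring both cursors back to `[0, 0]`.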
The benchmark options are as follows:
usage: ./pChase options
where options are selected from the following:
    [-h|--help]                   # this message
    [-l|--line]        number     # bytes per cache line (cache line size)
    [-p|--page]        number     # bytes per page (page size)
    [-c|--chain]       number     # bytes per chain (used to compute pages per chain)
    [-r|--references]  number     # chains per thread (memory loading)
    [-t|--threads]     number     # number of threads (concurrency and contention)
    [-i|--iterations]  number     # iterations
    [-e|--experiments] number     # experiments
    [-a|--access]      pattern    # memory access pattern
    [-o|--output]      format     # output format
    [-n|--numa]        placement  # numa placement
    [-s|--seconds]     number     # number of seconds to run each experiment
    [-x|--strict]                 # fail rather than adjust options to sensible values

pattern is selected from the following:
    random                        # all chains are accessed randomly
    forward stride                # chains are in forward order with constant stride
    reverse stride                # chains are in reverse order with constant stride
Note: stride is always a small positive integer.

format is selected from the following:
    hdr                           # csv header only
    csv                           # results in csv format only
    both                          # header and results in csv format
    table                         # human-readable table of values

placement is selected from the following:
    local                         # all chains are allocated locally
    xor mask                      # exclusive OR and mask
    add offset                    # addition and offset
    map map                       # explicit mapping of threads and chains to domains

map has the form
    t1:c11,c12,...,c1m;t2:c21,...,c2m;...;tn:cn1,...,cnm
where ti is the NUMA domain where the ith thread is run,
and cij is the NUMA domain where the jth chain in the ith thread is allocated.
(The values ti and cij must all be zero or small positive integers.)

Note: for maps, each thread must have the same number of chains,
maps override the -t or --threads specification,
NUMA domains are whole numbers in the range of 0..N, and
thread or chain domains that exceed the maximum NUMA domain
are wrapped around using a MOD function.

To determine the number of NUMA domains currently available on your system, use a command such as "numastat".
Final note: strict is not yet implemented, and maps do not gracefully handle ill-formed map specifications.
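To make the map format concrete, here is a minimal Python sketch of how such a string could be interpreted. It is not pChase's parser: `parse_map` and its `num_domains` parameter are hypothetical, and `num_domains` is assumed to be N+1 for domains numbered 0..N. Unlike the real tool as described above, the sketch rejects maps whose threads have different numbers of chains.

```python
def parse_map(spec, num_domains):
    """Parse 't1:c11,c12;t2:c21,c22' into [(thread_domain, [chain_domains])]."""
    placements = []
    for entry in spec.split(";"):
        thread_part, chain_part = entry.split(":")
        # Domains beyond the maximum are wrapped around with a MOD.
        thread_domain = int(thread_part) % num_domains
        chain_domains = [int(c) % num_domains for c in chain_part.split(",")]
        placements.append((thread_domain, chain_domains))
    if len({len(chains) for _, chains in placements}) != 1:
        raise ValueError("each thread must have the same number of chains")
    return placements
```

For example, `parse_map("0:0,1;1:1,0", 2)` describes two threads with two chains each: the first runs in domain 0 with chains in domains 0 and 1, the second runs in domain 1 with chains in domains 1 and 0. The number of entries in the result is the thread count, which is why a map overrides the -t setting.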