We show the effect of using a strip mining optimization with the 
matrix multiply algorithm. Although both the regular multiply 
algorithm and the strip mining algorithm have the same number of floating
point operations, they have dramatically different cache behaviors. 
Using PAPI we can see the exact number of floating point operations and
secondary data cache misses as shown below. To use papi, configure TAU 
with 
% configure -papi=<dir>

pyros [tau2/examples/papi]% setenv PAPI_EVENT PAPI_FP_INS
pyros [tau2/examples/papi]% simple
pyros [tau2/examples/papi]% pprof
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time   Exclusive   Inclusive       #Call      #Subrs Count/Call Name
           counts total counts                            
---------------------------------------------------------------------------------------
100.0           6   5.409E+07           1           1   54090019 main() int (int, char **)
100.0   9.001E+04   5.409E+07           1           2   54090013 multiply void (void)
 49.9     2.7E+07     2.7E+07           1           0   27000001 multiply-with-strip-mining-optimization void (void)
 49.9     2.7E+07     2.7E+07           1           0   27000001 multiply-regular void (void)
pyros [tau2/examples/papi]% setenv PAPI_EVENT PAPI_L2_DCM
pyros [tau2/examples/papi]% simple
pyros [tau2/examples/papi]% pprof
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time   Exclusive   Inclusive       #Call      #Subrs Count/Call Name
           counts total counts                            
---------------------------------------------------------------------------------------
100.0           3   1.553E+04           1           1      15534 main() int (int, char **)
100.0        1295   1.553E+04           1           2      15531 multiply void (void)
 66.9    1.04E+04    1.04E+04           1           0      10398 multiply-regular void (void)
 24.7        3838        3838           1           0       3838 multiply-with-strip-mining-optimization void (void)

