Explanation Missing p-values

Description of a significant program error in TestU01 / BigCrush

94 (37%) of the 254 possible p-values are never included in the summary!

Introduction

In our analyses of the 250,000 BigCrush tests, we used only the p-values from the body of the reports in our calculations. As a result, we did not immediately notice that a significant number of the p-values marked as "outsiders" (*****) were missing from the summaries. When publishing the results, we asked: "Are 'outsiders' sometimes not reported in the summary?"

We have since studied this issue and present the results here. All details are available in a total of 32 files under the navigation point "Download Reports and Statistics." One example shows why the results matter. In the file "s Missing in Summaries" (BW or Color), the top line reads: Report 29845 of AHS-RNG-2, statistic 73.2, p-value 1 - 2.3e-8, i.e. .999999977 in decimal. Its distance from 1 is thus about 43,478 times smaller than the distance of .001 that defines the outsider range (> .999). Nevertheless, the summary concludes that "All tests were passed." In the calculation programs, these outsiders are correctly marked with *****. Our 250,000 reports contained 131,756 outsider lines, but only 87,659 of them appeared in the summaries.
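The figures in this example can be checked with a few lines of arithmetic. The following Python sketch uses only numbers quoted in the text; the variable names are ours and purely illustrative.

```python
# Worked check of the figures quoted above (all inputs are taken from the text).
outsider_distance = 0.001    # p > .999 means a distance to 1 of less than .001
reported_distance = 2.3e-8   # p-value "1 - 2.3e-8" from report 29845

# How many times closer to 1 is the reported p-value than the outsider limit?
ratio = outsider_distance / reported_distance
print(f"decimal p-value : {1 - reported_distance:.9f}")
print(f"ratio to limit  : {ratio:.0f}")

# Share of outsider lines that never reach a summary.
total_outsider_lines = 131_756
lines_in_summaries = 87_659
missing = total_outsider_lines - lines_in_summaries
missing_share = 100 * missing / total_outsider_lines
print(f"missing lines   : {missing} ({missing_share:.2f}%)")
```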

Description of our analysis

The best overview of the problem is provided by the topmost files, "Statistic ok-to-errors," in BW or Color. The black-and-white (BW) version contains no ANSI color codes and is best suited for machine processing, because every field then sits at its fixed position in the record. This is very important when sorting or selecting with awk, for example. The Color version is intended for viewing on screen (in Notepad++, hence the .log extension) or for printing from Notepad++. Tip: If you have a DIN A3 color printer, print the 18 pages of this file in color on one side and lay them side by side on a conference table, or pin them to a large pinboard. This will allow you to grasp the problem at a glance.

The summaries show only the serial number of the statistic, not the sub-statistic. We refer to statistics that output a single p-value as mono-pv. There are 61 such statistics, contributing 61 of the 254 p-values. Of the remaining 45 statistics, 6 have 2 sub-statistics, 7 have 3, 12 have 4, 12 have 5, 4 have 6, and 4 have 7, i.e. 45 statistics with 193 sub-statistics. The problem is that, for a number of statistics, only 1 sub-statistic (or 2 in the case of statistics with 7 sub-statistics) is included in the summaries. In total, the results of 94 sub-statistics are missing. This is 37% of the 254 p-values. Measured by the number of outsiders, 33.47% are missing. In our analysis, these missing outsiders are divided into categories according to their significance.
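The counts in this paragraph can be verified as follows. This is only a consistency check of the numbers stated above; the dictionary layout and names are our own choice.

```python
# Consistency check of the statistic counts given in the text.
mono_pv = 61  # statistics that output exactly one p-value

# number of sub-statistics -> number of statistics with that many sub-statistics
multi = {2: 6, 3: 7, 4: 12, 5: 12, 6: 4, 7: 4}

multi_statistics = sum(multi.values())                   # statistics with sub-statistics
sub_statistics = sum(k * n for k, n in multi.items())    # their p-values
total_p_values = mono_pv + sub_statistics

missing_sub_statistics = 94  # figure stated in the text
missing_share = 100 * missing_sub_statistics / total_p_values

print(multi_statistics, sub_statistics, total_p_values)
print(f"{missing_share:.0f}% of all p-values never reach a summary")
```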

Classification of missing outsiders: Red

Not every missing outsider is equally important for practical work with BigCrush. We divide them into 5 categories, 3 of which are combined under red. Red means that the summary contains no row with the same statistic number as the missing p-value.

Subgroups of red are:

- Despite the presence of the outsider, the summary states: "All tests were passed." We have included this line as the comment in the detailed statistics. This is not meant ironically; it simply seemed the most fitting explanation to us. As mentioned in the introduction, this category has the most serious consequences: if the summary says this, no one will manually scroll through the 48 pages of the report.

- The second category covers cases in which one or more statistics are listed in the summary, but none of them is the outsider's statistic. If the p-value of the unlisted outsider is smaller than the smallest p-value of all statistics in the summary, we consider the case relevant. The omitted p-value is then annotated with the comment "only xx.xxx% of min.Sum.", i.e., this invisible result is only xx.xxx percent of the smallest of all summary values.

- The third category is the same as the second, except that the p-value is equal to or greater than the smallest in the summary. Although this may be of interest for an analysis of the RNG, we do not consider this case as serious and do not add a comment in the detailed statistics.

Since these three categories are shown in the same "red" column, here is a breakdown of the 32,638 "red" cases in total: "All tests were passed" = 23,116; "only xx.xxx%" = 3,931; no comment = 5,591. The value of 23,116 does not mean that this many runs were classified incorrectly, because a single run can contain several cases of this category. A check reveals that the wrong conclusion was drawn in 17,845 runs.

Classification of missing outsiders: magenta and blue

The other two categories concern cases in which a missing p-value, which belongs in the summary but does not appear there, is "represented" in the summary by a p-value that has the same statistic number but a different sub-statistic number. Here we again distinguish between two cases. Magenta marks cases where the missing p-value is smaller than the displayed p-value of the same statistic. The following comment is added to the detailed statistics: "only xx.xxx % of nnn.n," where xx.xxx is the percentage value and nnn.n is the complete name of the statistic/sub-statistic.

However, if the missing value is equal to or greater than the value of the same statistic shown in the summary, it is colored blue. This case seems least relevant to us, but could possibly be helpful for a detailed analysis of the RNG.
In the files, these two cases appear as magenta and blue.

As you might intuitively guess, green marks the cases that are considered correct: the p-value with ***** is shown under its statistic number in the summary. We have inserted the number of the sub-statistic into the summary line record; it can be determined from the p-value (test) number of the outsider. It would be helpful if a future revision of TestU01 also printed the sub-statistic in the summary.
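The color scheme described above can be summarized as a small decision procedure. The following Python sketch only illustrates the rules as stated in the text; it is not the program used for the study. The names PValue and classify are hypothetical. We assume that the sub-statistic of each summary entry has already been resolved, that p-values are compared via their distance to the nearest endpoint (0 or 1), and that, where several summary entries share the statistic number, the one with the smallest distance serves as the reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PValue:
    statistic: int    # statistic number as shown in the summary
    substat: int      # sub-statistic number, 0 for a mono-pv statistic
    distance: float   # distance of the p-value to its nearest endpoint (0 or 1)


def classify(outsider, summary):
    """Return (color, comment) for one outsider, given the summary entries of its run."""
    same_stat = [s for s in summary if s.statistic == outsider.statistic]

    # Green: the outsider itself is listed in the summary.
    if any(s.substat == outsider.substat for s in same_stat):
        return "G", ""

    # Magenta / blue: another sub-statistic of the same statistic is listed instead.
    if same_stat:
        shown = min(same_stat, key=lambda s: s.distance)  # assumption: most extreme entry
        if outsider.distance < shown.distance:
            pct = 100.0 * outsider.distance / shown.distance
            return "M", f"only {pct:.3f} % of {shown.statistic}.{shown.substat}"
        return "B", ""

    # Red: no summary row carries the outsider's statistic number.
    if not summary:
        return "R", "All tests were passed"
    smallest = min(s.distance for s in summary)
    if outsider.distance < smallest:
        return "R", f"only {100.0 * outsider.distance / smallest:.3f}% of min.Sum."
    return "R", ""


# Example from the introduction: statistic 73.2 with an empty summary.
print(classify(PValue(73, 2, 2.3e-8), []))
```

An empty summary list stands for a run whose summary reads "All tests were passed"; how the special case MT19937 is mapped onto this is not covered by the sketch.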

Special case MT19937

It has been known for many years that the Mersenne Twister MT19937 systematically fails the "scomp_LinearComp" test in the second sub-statistic (normal statistic for the number of jumps). In our 50,000 runs, the value 1 - eps1 was reported without exception (eps1 denotes a value smaller than 1.0e-15). The cause is that MT19937 belongs to the family of F2-linear generators. To enable our evaluations, we initially hid these 100,000 p-values (numbers 177 and 179). Strictly speaking, the comment "All tests were passed" could then no longer be inserted, because this sentence does not appear in the summary of any MT19937 run. We did so anyway, based on the following consideration: both "failures" are well known, and their exclusion has been clearly communicated. We therefore treat the cases in which only these two outsiders appear in the summary as "All tests were passed." In fact, below these two outsiders the summary says "All other tests were passed," which logically supports our decision.

A brief description of the data records in the various files:

The reports were assigned run numbers in advance, and the p-values were numbered consecutively. All p-values marked with ***** were included in the file and marked with a "V" in the 81st position of the record. All run numbers were included and marked with an "N." The results listed in the summary were transferred and marked with a "Z." The original reports of the 250,000 test runs have exactly 946,739,188 lines of text.
The ***** p-value lines were supplemented with the statistic/sub-statistic number (via a concordance table for the p-value number) and with the decimal representation of the p-value.
The following information was added, in this order:

- pos. 83 to 87: Run no.
- pos. 89 to 91: Statistic number, right-aligned and padded with blanks on the left
- pos. 93: Sub-statistic number; 0 denotes a mono-pv statistic
- pos. 95 to 97: Three-digit p-value number (test no.)
- pos. 99 to 110: Decimal representation of the distance to the endpoint, 0 or 1. For the right tail, the complement to 1 is calculated so that the values are easy to compare and sort
- pos. 112 to 125: Original expression from the report, for reconciling the p-value with the summary
- pos. 127: Color status: G=green, R=red, M=magenta, B=blue. All summary records are green, as each of them found a corresponding p-value
- From pos. 129: Optional comment
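The record layout above lends itself to fixed-width parsing. The following Python sketch shows one possible way to read a "V" or "Z" record from a BW file. The names Record, field, and parse_record are hypothetical, the demo line is constructed by hand rather than taken from a real file (its p-value number 123 is invented), and we assume that all fields are filled and that the first 80 positions hold the original report line.

```python
from typing import NamedTuple


class Record(NamedTuple):
    kind: str         # pos. 81: "V" outsider line, "Z" summary line ("N" = run number)
    run_no: int       # pos. 83 to 87
    statistic: int    # pos. 89 to 91
    substat: int      # pos. 93, 0 = mono-pv statistic
    pv_no: int        # pos. 95 to 97
    distance: float   # pos. 99 to 110, distance to the endpoint 0 or 1
    original: str     # pos. 112 to 125, expression as printed in the report
    color: str        # pos. 127: G, R, M or B
    comment: str      # from pos. 129


def field(line, first, last=None):
    """Return the 1-based, inclusive column range first..last of line, stripped."""
    end = len(line) if last is None else last
    return line[first - 1:end].strip()


def parse_record(line):
    """Parse one fully populated V or Z record of a BW file (no ANSI color codes)."""
    line = line.rstrip("\n")
    return Record(
        kind=field(line, 81, 81),
        run_no=int(field(line, 83, 87)),
        statistic=int(field(line, 89, 91)),
        substat=int(field(line, 93, 93)),
        pv_no=int(field(line, 95, 97)),
        distance=float(field(line, 99, 110)),
        original=field(line, 112, 125),
        color=field(line, 127, 127),
        comment=field(line, 129),
    )


# Hand-built demo record: 80 blanks stand in for the original report line.
demo_line = (" " * 80 + "V " + "29845 " + " 73 " + "2 " + "123 "
             + ".00000002300 " + "1 - 2.3e-8".ljust(14) + " R "
             + "All tests were passed")
print(parse_record(demo_line))
```

Parsed records can then be ordered with, for example, sorted(records, key=lambda r: (r.run_no, r.pv_no)).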

Note: Always use the BW files for "sort"; otherwise the ANSI color codes shift the fields and the positions will no longer be correct. Tip: The Unix "sort" works with fields. To treat the entire line as a single field, simply define a separator that does not occur in the records, such as "@". For example, to sort by run no.: # sort -t "@" -k 1.83n,1.87

Conclusion

Even though these errors did not interfere with our previous evaluation, one is nonetheless shocked by the nonchalance with which a half-finished product was handed over to the general public. No responsible project manager would allow a software package with such glaring errors to go into production. What is even more surprising is that no report on this has been published to date, even though 18 years have passed since then. When asked, ChatGPT 5 was not aware of any studies on this subject. From my own experience, I know that developers of new RNGs depend on the results of BigCrush. But users also want to be properly informed when testing different RNGs.
My recommendation is therefore to keep all BigCrush reports in full, and not only to look at the summary but also to search the report text for unreported outsiders, e.g. with # grep "  \*\*\*\*\*" (two spaces before the asterisks, to avoid matching the headers).
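The same search can be scripted when many reports have to be checked. The following Python sketch mirrors the grep pattern above (two spaces followed by five asterisks). The function names are hypothetical, and the sample text is a hand-made stand-in for a report, not real BigCrush output; we only assume that outsider lines end in ***** and that a clean summary contains the sentence "All tests were passed."

```python
import re

# Two spaces followed by five asterisks, as in the grep pattern above.
OUTSIDER = re.compile(r"  \*{5}")


def find_outsiders(report_text):
    """Return (line number, line) for every line flagged with ***** in the report."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(report_text.splitlines(), start=1)
        if OUTSIDER.search(line)
    ]


def summary_hides_outsiders(report_text):
    """True if the report claims a clean pass although it contains flagged lines."""
    return "All tests were passed" in report_text and bool(find_outsiders(report_text))


# Hand-made sample: a header of asterisks, one flagged p-value, a clean summary.
sample = "\n".join([
    "***********************************************************",
    "p-value of test                       : 1 - 2.3e-8    *****",
    "All tests were passed",
])
for number, line in find_outsiders(sample):
    print(number, line)
print(summary_hides_outsiders(sample))
```

The header line consisting only of asterisks is not matched, because it has no two blanks in front of the asterisks.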