I was investigating whether the 'R' register behaves the same on Z80 and R800.
I initially wrote this program:

	di
	ld	hl,#0100
	xor	a
	ld	b,a
	ld	r,a
loop:	ld	a,r        ; (2)
	ld	(hl),a     ; (4)
	inc	hl         ; (1)
	djnz	loop       ; (3)
	ei
	ret


On Z80, for every iteration of this loop, R is increased by 5 (+2 for 'ld a,r'
and +1 for the other 3 instructions). On R800 R is also increased by 5 each
iteration. Though sometimes it's increased by 6!

The increase by 6 instead of 5 happened every 18th or 19th (alternating)
Iteration. One iteration takes 10 cycles (taking R800-page-breaks into account),
so that's on average every 185th clock cycle.

This number 185 is very close to the number of useful cycles in a full R800
refresh cycle. Currently (2010-08-16) R800-refresh in openmsx stalls the R800
for 26 cycles every 210 cycles (so that leaves 210-26=184 useful cycles).

So all this lead me to the hypotheses that a R800-refresh cycle also increases
the R register by one.

This would be very useful because it would allow to measure in more detail how
refresh works exactly on R800 (the current model in openMSX (210/26) is not
100% correct). So that's what I'll do in the rest of this document.




New test program. This records sequences longer than 256, it's important that
each iteration takes the same amount of cycles. So I had to replace the djnz
instruction.

	di
	ld	hl,#0100
	xor	a
	ld	r,a
loop:	ld	a,r        ; (2)
	ld	(hl),a     ; (4)
	inc	hl         ; (1)
	ld	a,h        ; (1)
	cp	#c0        ; (2)
	jr	nz,loop    ; (3)

	[ some code to transform memory block #0100-#C000 into
	  the differential of this block, so it's easy to see by
	  how much R increased each iteration ]

Each iteration takes 13 cycles. Per iteration R is increased by either 7 or 8.

The difference table (a small portion) looked like this:
   07 07 07 07 07 07 07 07  07 07 07 07 07 08 07 07
   07 07 07 07 07 07 07 07  07 07 07 08 07 07 07 07
   07 07 07 07 07 07 07 07  07 08 07 07 07 07 07 07
   07 07 07 07 07 07 07 08  07 07 07 07 07 07 07 07
   07 07 07 07 07 07 08 07  07 07 07 07 07 07 07 07
   07 07 07 07 08 07 07 07  07 07 07 07 07 07 07 07
   07 07 08 07 07 07 07 07  07 07 07 07 07 07 07 07
   08 07 07 07 07 07 07 07  07 07 07 07 07 07 07 08
   07 07 07 07 07 07 07 07  07 07 07 07 07 07 08 07
   07 07 07 07 07 07 07 07  07 07 07 07 08 07 07 07
   ...

These tables always have only two different numbers, in this case a series of
7's followed by one 8, then again a series of 7's followed by one 8 and so one.
The actual number '7' or '8' is not important for timing, only the pattern of
these numbers in the table is important. So I'll create a more compact notation
for the table above:

  14 14 14 14 15 14 14 14 14 15 14 ...       (these are decimal numbers)
or even
  4*14 15 ...

This means there are 4 repetitions of '13*0x07 + 1* 0x08' (length 14) followed by
one time '14*0x07 + 1*0x08' (length 15).

So this notation indicates how many iterations there are before there is a
refresh.

For this test on average there are '(4 * 14 + 1 * 15) / 5 = 14.2' iterations
between two refresh cycles. Each iteration takes 13 cycles. So that's 184.6
useful clock cycles between refresh cycles.




Next I realized this test program could be speed up one cycle:

	ld	c,#c0
loop:	ld	a,r        ; (2)
	ld	(hl),a     ; (4)
	inc	hl         ; (1)
	ld	a,h        ; (1)
	cp	c          ; (1)
	jr	nz,loop    ; (3)

An iteration now takes 12 cycles (R is still increased by 7 or 8).

Difference table looks like this
  16 3*15 16 2*15 16 2*15    (this sequence repeats all the time)

That's (3*16+7*15)/10 = 15.3 iteration
or 15.3*12 = 183.6 clock cycles between refresh.




Then I started inserting extra instructions in this test program:

	ld	c,#c0
loop:	ld	a,r        ; (2)
	ld	(hl),a     ; (4)
	[***]
	inc	hl         ; (1)
	ld	a,h        ; (1)
	cp	c          ; (1)
	djnz	loop       ; (3)


* NOP
  13 cycles per iteration  (R increases 8 or 9)
  pattern: 8*14 15
  cycles between refresh: 183.44      (8*14+15)/9*13

* 2 x NOP
  14 cycles/iteration  (R incr 9 or 10)
  pattern: 5*13 14
  cycles/refresh = 184.33

  I noticed the start of the difference table was a bit different, it went like
  this:
     9*13 14 7*13 14 5*13 14 5*13 14 5*13 14
  So it took some time before it stabilized on the '5*13 14' pattern. For the
  rest of the tests I didn't look at the start of the table anymore. I only
  searched for the 'stable' pattern.

  I ran this same test again, but now the (stable) pattern was
     9*13 14
  that gives
     cycles/refresh = 183.4

  This is very interesting behaviour, depending on some (yet unknown) initial
  conditions, _the_whole_test_ runs slightly faster or slower. This is
  interesting because it may give a clue to why there is considerable variation
  to *some* speed measurements on a real R800 (see doc/r800-test.txt) while for
  other tests the results are much more stable. (Also in the current R800 openMSX
  emulation, the speed measurements are always stable).

  I didn't repeat the previous tests. Maybe they also show different patterns
  on different runs. I did repeat all future tests (there always seems to be
  either only 1 or 2 different stable patterns)

* NEG   (like 2xNOP also takes 2 cycles)
  14 cycles/iteration (R incr 9 or 10)
  pattern: 7*13 14
  cycles/refresh = 183.75

* 3 x NOP
  15 cycles/iteration (R incr 10 or 11)
  pattern: 2*12 13
           3*12 13
  cycles/refresh = 185
                   183.75

* IM 1   (3 cycles, like 3xNOP)
  15 cycles/iteration (R incr 9 or 10)
  pattern: same as 3 x NOP

* 4 x NOP
  16 cycles/iteration (R incr 11 or 12)
  pattern: 11 12
  cycles/refresh = 184

* LD (HL),A  (4 cycles (with page-breaks))
  16 cycles/iteration (R incr 8 or 9)
  pattern: same as 4 x NOP

* 5 x NOP
  17 cycles/iteration (R incr 12 or 13)
  pattern: 4*11 10
           7*11 10 6*11 10
  cycles/refresh = 183.6
                   184.73

* BIT 0,(HL)   (5 cycles)
  17 cycles/iteration (R incr 9 or 10)
  pattern: 5*11 10 4*11 10
           6*11 10
  cycles/refresh = 183.90
                   184.57





I still need to do more experiments, but some *guesses* so far: useful number
of cycles always seems to be within (183, 185]. That's a variation of more than
one clock cycle. Maybe refresh also waits for an even clock cycle (just like IO
does). This could explain why there are two stable patterns for some tests (in
one case you have to insert an extra cycle to align to an even number of
cycles, in the other case you're already aligned). This could explain why there
can be variation in speed between different runs of the same test. And finally
it explains why the documentation (e.g. atoc or even the turbor datapack) talks
about half clock cycles for the duration of the refresh (in 50% of the cases it
needs to add one cycle). Those docs talk about 21.5 cycles refresh every 222
cycles, those number seem wrong to me (don't match measurements on real HW),
but at least I now have an idea about that half clock cycle.


---------


I implemented the above guess (refresh waits for even clock cycle) in openMSX
revision 11643. I can now more or less reproduce the results above: The
number of useful clock cycles per refresh is also in range (183, 185], but the
number per test is not the same as above (the 'stable patterns' from the tests
above are different). Also in openMSX there's always only one stable pattern,
while on a real R800, for some test, there clearly were two different possible
patterns.






New test: above we measured the number of useful clock cycles per 'refresh
cycle'. Now we're going to measure how many cycles the refresh itself takes.
The idea goes like this: by observing the R register we can detect that there
was a refresh, so by doing a longer test we can count how many refresh cycles
there were. We can combine this with measuring the total time of the test
(using the E6-timer). We can also calculate how many 'useful' cycles there
were in the whole test. So the difference between actual and useful cycles
must be cycles spend on refresh. And finally if we divide that by the counted
number of refreshes, we should know how many cycles a single refresh takes.

Full test program:

        org     #c000

        di
        ld      hl,#0100
        ld      c,#bc

        ld      e,3
l1      ld      a,r        ; This loop waits till there was a refresh.
        ld      d,a        ; It's an attempt to get more stable measurements.
        ld      a,r        ; Note that this loop will not terminate in Z80
        sub     d          ; mode.
        sub     e
        jr      z,l1

        out     (#e6),a

loop    ld      a,r        ; (2)   Actual test loop, same as in tests above
        ld      (hl),a     ; (4)   One iteration takes 12 cycles
        inc     hl         ; (1)
        ;[***]             ;       extra instructions here (see below)
        ld      a,h        ; (1)
        cp      c          ; (1)
        jr      nz,loop    ; (3/2) 3 cycles when jump is taken, 2 otherwise

        in      a,(#e6)
        ld      l,a
        in      a,(#e7)
        ld      h,a
        ld      (#be00),hl ; store total duration of the test

        ld      hl,#0100   ; Next we do some post-processing on the data:
l2      inc     hl         ; First calculate the difference table, same
        ld      a,(hl)     ; routine as in tests above (though not shown
        dec     hl         ; there).
        sub     (hl)
        ld      (hl),a
        inc     hl
        ld      a,h
        cp      c
        jr      nz,l2

        ld      hl,#bc00
        ld      de,#bc00+1
        ld      bc,#200-1
        ld      (hl),0
        ldir

        ld      de,#100    ; Next we calculate a histogram.
        ld      hl,#bc00   ; We expect this histogram to be all zero, except for
l3      ld      a,(de)     ; two entries: the iterations with and without a
        ld      l,a        ; refresh cycle.
        inc     (hl)       ; There is a single other point in this histogram,
        jr      nz,nc      ; that's because we don't calculate the difference of
        inc     h          ; the very last iteration correctly (we'd have to
        inc     (hl)       ; sample the R register one more time outside the
        dec     h          ; loop). But for now we ignore this outlier.
nc      inc     de
        ld      a,d
        cp      h
        jr      nz, l3

        ei
        ret


Test results:

* no extra instructions:
   real machine:
     histogram: 0x07: 0xAECE
                0x08: 0x0C31
                0x66: 0x0001 -> belongs to 0x07, can be seen by the pattern
                                in the difference table, I won't show this
                                entry anymore in the next results
     E6-ticks:  0x5B75

     total cycles: E6-timer * 28 cycles/E6-tick = 655564 cycles
     useful cycles: 0xBB00 iterations * 12 cycles/iteration = 574464 cycles
     overhead cycles: 655564 - 574464 = 81100 cycles
     cycles per refresh = 81100 cycles / 0xc31 refreshes
                        = 25.985 cycles/refresh

   openMSX revision 11648:
     histogram: 0x07: 0xAECD+1
                0x08: 0x0C32
     E6-ticks:  0x5B78
      --> 26.004 cycles/refresh

* extra instruction: NOP (1 cycles, increases R by 1)
   real machine:
     histogram: 0x8: 0xADCA
                0x9: 0x0D36
     E6-ticks:  0x6318
      --> 26.011 cycles/refresh

   openMSX revision 11648:
     histogram: 0x8: 0xADC8
                0x9: 0x0D38
     E6-ticks:  0x632A
      --> 26.144 cycles/refresh

* extra instruction: IM 1 (3 cycles, incr R 2)
   real machine:
     histogram: 0x9: 0xABCA
                0xA: 0x0F36
     E6-ticks:  0x721B
      --> 25.636 cycles/refresh

   openMSX revision 11648:
     histogram: 0x9: 0xABBC
                0xA: 0x0F44
     E6-ticks:  0x727E
      --> 26.25 cycles/refresh

   !! Big difference between openmsx and real MSX !!

* extra instructions: EXX ; MULUW HL,BC ; EXX (1+36+1 cycles, incr R 1+2+1)
   real machine:
     histogram: 0xB: 0x882D
                0xC: 0x32D3
     E6-ticks:  0x17D33  (of course I couldn't measure how many times the
                          timer overflowed, but it must be one time)
      --> 26.042 cycles/refresh

   openMSX revision 11648:
     histogram: 0xB: 0x8801
                0xC: 0x32FF
     E6-ticks:  0x17E80
      --> 26.669 cycles/refresh

   !! Big difference between openmsx and real MSX !!



Preliminary conclusion:
  Refresh seems to take about 26 cycles, this is also the value we currently
  use in openMSX. Though especially for the 'IM 1' case there's still a
  relatively big difference between openMSX and the real hardware. Needs
  more experiments.





One other difference between Z80 and R800
	di
	xor	a
	ld      r,a
	ld	a,r
	ld	c,a	; Z80: c=2   R800: c=1
	ld	a,r
	ld	b,a     ; Z80: b=5   R800: b=4
	ld	a,r     ; Z80: a=8   R800: a=7
And of course once in a while the numbers for R800 are (partially) increased by
one.

This difference can possibly be explained by a difference in the order of
storing the result to the 'R' register and increasing the 'R' register on each
M1 cycle.
