Optimize the LC_LZ2 decompression!

The DMA version actually makes the screen flicker during OW->level load. The color of the flicker is the level's BG color I think (most likely it is).

edit: apparently this only happens in snes9x 1.51 debugger.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
Sweet. Sounds like we need to double buffer. xP
A non-DMA version is worth making for those who want to use something like this mid-frame. MVN + unrolling would still perform OK. NMI can be turned off/on as an option; you just have to be careful about when to re-enable it. That's worth trying actually: use a non-HDMA channel (the same one SMW uses to upload stuff) and disable NMI until it is done.

Anyone plan on unrolling it? I can do it, but if anyone is willing, go ahead. I'll get to it if no one bothers.

edit: DMA mid-frame can also mess up an IRQ status bar or anything else IRQ-related, since the CPU can't respond to interrupts while a transfer is running.
I actually think we should just have one standard version that's compatible with (basically) everything, otherwise it'll only cause confusion. Even if it's not 100% optimal, there are many other ways to improve the performance. It's already magnitudes better than SMW's original routine, and levels with 14 ExGFX files (ExGFX Revolution option #2, plus sprite GFX) load just as fast as unmodified SMW levels did previously.

Plus, LC_LZ2 decompression isn't the only thing slowing down level load.

Edit: so looks like this is the final patch: (for now at least)
copypasta from Ersanio's post

Code
header
lorom

!Freespace = $1D8000|$800000

macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE +
LDX #$8000
INC $06
INC $8C
+
endmacro

org $80B8E3
	JSL CodeStart		;was JML before
	RTS
	
org !Freespace
	reset bytes
	db "STAR"
	dw CodeEnd-CodeStart-$01
	dw CodeEnd-CodeStart-$01^$FFFF

CodeStart:
	PHB
	PEI ($03)
	PEI ($05)
	PEI ($07)
	PEI ($09)
	PEI ($0B)
	PEI ($8A)
	SEP #$20
	REP #$10
	LDA $02
	PHA
	PLB
	STA $05		; dest_bank
	INC
	STA $03		; dest_bank [plus or minus]
	LDA #$54
	STA $04		; mvn
	LDA #$4C
	STA $07		; jump
	LDA $8C
	STA $06		; src_bank
	LDX.w #.back
	STX $08
	
	LDY $00		; dest_low
	LDX $8A		; src_low
	STZ $8A
	STZ $8B
	BRA .main
	
.case_e0
	AND #$03
	STA $8E
	EOR $8D
	ASL
	ASL
	ASL
	XBA
	%ReadByte()
	STA $8D
	XBA
	BRA .type

.case_80_or_e0
	BPL .lz
	LDA $8D
	CMP #$1F
	BNE .case_e0
	PLX : STX $8A
	PLX : STX $0B
	PLX : STX $09
	PLX : STX $07
	PLX : STX $05
	PLX : STX $03
	REP #$20
	TYA
	SEC
	SBC $00
	STA $8D		; size!!!
	SEP #$30
	PLB
	RTL			;JML $80B8EA
	
.lz
	%ReadByte()
	XBA
	%ReadByte()
	
	STX $0B
	REP #$21
	ADC $00
	TAX
	LDA $8D
	SEP #$20
	BIT $03
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	LDX $0B
	
.main
	%ReadByte()
	STA $8D
	STZ $8E
	AND #$E0
	TRB $8D

.type
	ASL
	BCS .case_80_or_e0
	BMI .case_40_or_60
	ASL
	BMI .case_20

.case_00
	REP #$20
	LDA $8D
	STX $8D
	
-	SEP #$20
	JMP $0004
	
.back
	CPX $8D
	BCS .main
	
	INC $06
	INC $8C
	CPX #$0000
	BEQ ++
	
	DEX
	STX $0B
	REP #$21
	LDX #$8000
	STX $8D
	TYA
	SBC $0B
	TAY
	LDA $0B
	BRA -
	
++	LDX #$8000
	BRA .main

.case_20
	%ReadByte()
	STX $0B
	PHA
	PHA
	REP #$20
	
.case_20_main
	LDA $8D
	INC
	LSR
	TAX
	PLA
	
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	
	SEP #$20
	BCC +
	STA $0000,Y
	INY
+	LDX $0B
	BRA .main
	
.case_40_or_60
	ASL
	BMI .case_60
	%ReadByte()
	XBA
	%ReadByte()

	XBA
	STX $0B
	REP #$20
	PHA
	BRA .case_20_main
	
.case_60
	%ReadByte()
	STX $0B
	LDX $8D
-	STA $0000,Y
	INC
	INY
	DEX
	BPL -
	LDX $0B
	JMP .main
	

CodeEnd:
print "Insert Size: ",bytes," bytes"
So I guess we'll go by edit's code.

I think all that's left is unrolling. However, I don't really know how to do that, especially with variable loop counts. I've sorted out the loops though:

Code
- STA $0000,Y 
INY 
INY 
DEX 
BNE - 
SEP #$20 
BCC + 
STA $0000,Y 
INY 

I'd try to unroll this, but the part after BNE - confused me.
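
For what it's worth, the part after BNE - handles odd lengths. Here is a rough Python model of the loop, assuming a run of at least two bytes. The names `word_fill`, `low`, `high` and `count` are made up for illustration. The 16-bit loop writes two bytes per pass, the earlier LSR leaves the leftover bit in carry, and if carry is set one more 8-bit store writes the last byte. As far as I can tell, a run of exactly one byte would enter the assembly loop with X = 0, so the sketch does not try to mirror that case.

```python
def word_fill(out: bytearray, low: int, high: int, count: int) -> None:
    """Append `count` bytes alternating low, high, low, ... to `out`."""
    # LSR splits the count: the quotient goes to X, the remainder to carry.
    pairs, odd = divmod(count, 2)
    # The 16-bit STA / INY / INY / DEX / BNE loop: two bytes per pass.
    for _ in range(pairs):
        out.append(low)
        out.append(high)
    # Carry set means BCC + is not taken: one 8-bit STA writes the low byte.
    if odd:
        out.append(low)


buf = bytearray()
word_fill(buf, 0x12, 0x34, 5)
print(buf.hex())
```

The byte fill case goes through the same loop; it just pushes the same byte twice (PHA PHA), so both halves of the word are equal.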

Code
- STA $0000,Y 
INC 
INY 
DEX 
BPL - 

One can easily change this into a mass of INC : STA $xxxx,y. However, it'll eat up many bytes. But I think it's worth it due to performance gain. I'll attempt to unroll this one.
e: I'm having trouble with it though. >_>
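
A small Python model of what this loop produces could serve as a reference to check an unrolled version against. `increasing_fill` and its parameters are hypothetical names. The only things taken from the assembly are that X holds the length minus one (so DEX / BPL runs X+1 times) and that A is 8-bit here, so INC wraps from $FF to $00.

```python
def increasing_fill(out: bytearray, start: int, length_minus_one: int) -> None:
    """Append start, start+1, start+2, ... (mod 256) to `out`."""
    value = start & 0xFF
    # DEX / BPL keeps looping until X goes negative, i.e. X+1 passes.
    for _ in range(length_minus_one + 1):
        out.append(value)             # STA $0000,Y : INY
        value = (value + 1) & 0xFF    # INC on an 8-bit accumulator wraps


buf = bytearray()
increasing_fill(buf, 0xFE, 3)
print(buf.hex())
```

Whatever the unrolled code looks like, it should write exactly these bytes for the same start value and length.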
Originally posted by Ersanio
The DMA version actually makes the screen flicker during OW->level load. The color of the flicker is the level's BG color I think (most likely it is).

edit: apparently this only happens in snes9x 1.51 debugger.


Shouldn't that not happen if the screen is turned off?
I don't really know.

Also, apparently I'm having difficulties with unrolling since I don't even know how I could start.

Could some expert out there unroll the loops in this code or something, or at least help me? D:
I started to unroll the loops, but I didn't bother finishing: by the time I got close to a working piece of code, it was already less efficient than the loops themselves for small amounts of data.
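
That trade-off can be put into rough numbers. The sketch below compares the increasing-fill loop against a hypothetical unrolled chain of `STA $xxxx,Y : INC`. The per-instruction figures are nominal 65816 CPU cycle counts for 8-bit A and 16-bit index registers and ignore memory access speed. `DISPATCH_OVERHEAD` is purely an assumption standing in for whatever it costs to pick the entry point into the chain and fix up Y afterwards, so the break-even point it prints is only as good as that guess.

```python
# Nominal 65816 CPU cycle counts with 8-bit A and 16-bit index registers.
LOOP_CYCLES_PER_BYTE = 5 + 2 + 2 + 2 + 3  # STA abs,Y + INC + INY + DEX + BPL (taken)
UNROLLED_CYCLES_PER_BYTE = 5 + 2          # STA abs,Y + INC, no counter or branch
DISPATCH_OVERHEAD = 40                    # ASSUMPTION: setup cost of the unrolled path


def loop_cost(n: int) -> int:
    """Cycles for the looped version; the final BPL is not taken (2, not 3)."""
    return LOOP_CYCLES_PER_BYTE * n - 1


def unrolled_cost(n: int, overhead: int = DISPATCH_OVERHEAD) -> int:
    """Cycles for the unrolled version, including the assumed fixed overhead."""
    return overhead + UNROLLED_CYCLES_PER_BYTE * n


def break_even(overhead: int = DISPATCH_OVERHEAD, limit: int = 1024):
    """Smallest run length where unrolling wins, or None if it never does."""
    for n in range(1, limit + 1):
        if unrolled_cost(n, overhead) < loop_cost(n):
            return n
    return None


for n in (2, 8, 32):
    print(n, loop_cost(n), unrolled_cost(n))
print("break-even run length:", break_even())
```

If most runs in real graphics data are shorter than the break-even length, the unrolled version loses on average, which would match the experience described above.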
I see. So this is as close as we can get?
Code
SWR says (2:56 PM):
 Hey, I noticed the LZ2 goes past where scratch RAM is allowed
 The PEIs prevent bad stuff from happening, right?
Ersanio says (2:56 PM):
 yeah
Ersanio says (2:57 PM):
 PEI pushes the stuff at the RAM addresses
 and later on
SWR says (2:57 PM):
 What if an interrupt occurs?
Ersanio says (2:57 PM):
 PLX STX blah recovers them
 hm
 idk
 D:
SWR says (2:57 PM):
 =|

Would an interrupt cause any issues?

I found a way to further optimize the routine by replacing BRA .case_20_main with the routine itself, which should save about 3 cycles every time it gets called. Personally I think the saved time is disappointingly small, but still good enough.

Code
HEADER
LOROM

!Freespace = $178000|$800000

macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE +
LDX #$8000
INC $06
INC $8C
+
endmacro

org $80B8E3
	JSL CodeStart		;was JML before
	RTS
	
org !Freespace
	reset bytes
	db "STAR"
	dw CodeEnd-CodeStart-$01
	dw CodeEnd-CodeStart-$01^$FFFF

CodeStart:
	PHB
	PEI ($03)
	PEI ($05)
	PEI ($07)
	PEI ($09)
	PEI ($0B)
	PEI ($8A)
	SEP #$20
	REP #$10
	LDA $02
	PHA
	PLB
	STA $05		; dest_bank
	INC
	STA $03		; dest_bank [plus or minus]
	LDA #$54
	STA $04		; mvn
	LDA #$4C
	STA $07		; jump
	LDA $8C
	STA $06		; src_bank
	LDX.w #.back
	STX $08
	
	LDY $00		; dest_low
	LDX $8A		; src_low
	STZ $8A
	STZ $8B
	BRA .main
	
.case_e0
	AND #$03
	STA $8E
	EOR $8D
	ASL
	ASL
	ASL
	XBA
	%ReadByte()
	STA $8D
	XBA
	BRA .type

.case_80_or_e0
	BPL .lz
	LDA $8D
	CMP #$1F
	BNE .case_e0
	PLX : STX $8A
	PLX : STX $0B
	PLX : STX $09
	PLX : STX $07
	PLX : STX $05
	PLX : STX $03
	REP #$20
	TYA
	SEC
	SBC $00
	STA $8D		; size!!!
	SEP #$30
	PLB
	RTL			;JML $80B8EA
	
.lz
	%ReadByte()
	XBA
	%ReadByte()
	
	STX $0B
	REP #$21
	ADC $00
	TAX
	LDA $8D
	SEP #$20
	BIT $03
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	LDX $0B
	
.main
	%ReadByte()
	STA $8D
	STZ $8E
	AND #$E0
	TRB $8D

.type
	ASL
	BCS .case_80_or_e0
	BMI .case_40_or_60
	ASL
	BMI .case_20

.case_00
	REP #$20
	LDA $8D
	STX $8D
	
-	SEP #$20
	JMP $0004
	
.back
	CPX $8D
	BCS .main
	
	INC $06
	INC $8C
	CPX #$0000
	BEQ ++
	
	DEX
	STX $0B
	REP #$21
	LDX #$8000
	STX $8D
	TYA
	SBC $0B
	TAY
	LDA $0B
	BRA -
	
++	LDX #$8000
	BRA .main

.case_20
	%ReadByte()
	STX $0B
	PHA
	PHA
	REP #$20
	
.case_20_main
	LDA $8D
	INC
	LSR
	TAX
	PLA
	
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	
	SEP #$20
	BCC +
	STA $0000,Y
	INY
+	LDX $0B
	BRA .main
	
.case_40_or_60
	ASL
	BMI .case_60
	%ReadByte()
	XBA
	%ReadByte()

	XBA
	STX $0B
	REP #$20
	PHA
;Replaced BRA .case_20_main with the code itself
	LDA $8D
	INC
	LSR
	TAX
	PLA
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	SEP #$20
	BCC +
	STA $0000,Y
	INY
+	LDX $0B
	JMP .main
	
.case_60
	%ReadByte()
	STX $0B
	LDX $8D
-	STA $0000,Y
	INC
	INY
	DEX
	BPL -
	LDX $0B
	JMP .main
CodeEnd:
print "Insert Size: ",bytes," bytes"
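
To test a modified routine, it helps to have something slow but easy to read to compare against. Below is a minimal Python sketch of an LC_LZ2 decompressor, written from the command layout the assembly above dispatches on: the top three bits of the header select the command, the low five bits are the length minus one, command 7 switches to a 10-bit length, and $FF ends the stream. `lz2_decompress` is a name made up here and is not part of the patch or of Lunar Compress. Rejecting commands 5 and 6 is an assumption of this sketch; the assembly does not special-case them.

```python
def lz2_decompress(data: bytes) -> bytes:
    """Reference model of LC_LZ2 decompression (illustrative sketch only)."""
    out = bytearray()
    pos = 0
    while True:
        header = data[pos]
        pos += 1
        if header == 0xFF:                    # end of stream
            return bytes(out)
        command = header >> 5
        length = header & 0x1F
        if command == 7:                      # long header: 3-bit command, 10-bit length
            command = (header >> 2) & 0x07
            length = ((header & 0x03) << 8) | data[pos]
            pos += 1
        length += 1                           # lengths are stored minus one
        if command == 0:                      # direct copy
            out += data[pos:pos + length]
            pos += length
        elif command == 1:                    # byte fill
            out += bytes([data[pos]]) * length
            pos += 1
        elif command == 2:                    # word fill: alternate two bytes
            pair = (data[pos], data[pos + 1])
            pos += 2
            out += bytes(pair * ((length + 1) // 2))[:length]
        elif command == 3:                    # increasing fill, wraps at 8 bits
            start = data[pos]
            pos += 1
            out += bytes((start + i) & 0xFF for i in range(length))
        elif command == 4:                    # repeat from an absolute output offset
            src = (data[pos] << 8) | data[pos + 1]   # big-endian, as in .lz
            pos += 2
            for i in range(length):           # byte by byte, so overlap works like MVN
                out.append(out[src + i])
        else:
            raise ValueError("command %d is not defined in LC_LZ2" % command)


sample = bytes([0x02, 0x41, 0x42, 0x43, 0x23, 0x00, 0x62, 0x10, 0x80, 0x00, 0x00, 0xFF])
print(lz2_decompress(sample).hex())
```

Dumping the RAM the patched routine wrote and comparing it with this output for the same compressed file is one way to catch regressions after inlining or unrolling.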

That's a nice job you people did!

Has anybody also considered doing something similar for the LC_LZ3 compression? I haven't looked into the algorithm that LM uses when this mode is selected. In the documentation I've found so far, it seems that two extra modes are added that were unused in LC_LZ2. It could be helpful to improve that version as well. I don't mind looking into that myself, unless it would be considered useless...
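
For reference, here is a small Python sketch of how an LC_LZ2 header byte breaks down, which also shows where the two spare modes sit. `COMMANDS` and `parse_header` are illustrative names, not anything from LM or Lunar Compress. Commands 5 and 6 are the ones LC_LZ2 leaves undefined; what LC_LZ3 does with them is not covered here.

```python
COMMANDS = {
    0: "direct copy",
    1: "byte fill",
    2: "word fill",
    3: "increasing fill",
    4: "repeat",
    5: "unused",        # not defined in LC_LZ2
    6: "unused",        # not defined in LC_LZ2
    7: "long length",
}


def parse_header(data: bytes, pos: int = 0):
    """Return (command name, length, position of the command's data)."""
    header = data[pos]
    pos += 1
    if header == 0xFF:
        return ("end", 0, pos)
    command = header >> 5                 # top three bits
    length = header & 0x1F                # low five bits, stored minus one
    if command == 7:
        # Long header: bits 4-2 hold the real command, bits 1-0 plus the
        # next byte form a 10-bit length (again stored minus one).
        command = (header >> 2) & 0x07
        length = ((header & 0x03) << 8) | data[pos]
        pos += 1
    return (COMMANDS[command], length + 1, pos)


for raw in (bytes([0x23]), bytes([0xE4, 0xFF]), bytes([0xA0]), bytes([0xFF])):
    print(raw.hex(), parse_header(raw))
```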
Originally posted by Last post date
24/06/2010

Read rule 2.8:
Originally posted by Part of the rules
8. Don't bump old threads.

Posting in threads older than one month is considered "bumping", and should be avoided. An exception to this rule is if you are reviving an old thread you created.

I don't know if this post was back mod-y, if yes, sorry.
Yes it was. The thread got bumped for a good reason, no need to shoot the guy down. Don't do this again.

As for your question Diortem, I wouldn't know. Maybe someone else can answer it for you, though.
Actually, isn't the LZ3 algorithm already based off the optimized LZ2 one?

Yes, it was already based off this patch.
Thanks for clearing this up.

LZ3 is an improvement on LZ2 (adding two reserved opcodes IIRC), so that would mean the optimized LZ3 always beats the original LZ2 compression in both size and speed. Unless there's some custom ASM that depends on it, wouldn't it be better to go for this option by default?