Optimize the LC_LZ2 decompression!

Lately I've been attempting to optimize the LC_LZ2 decompression routine of SMW. It went perfectly fine until I hit my limit.

Why am I optimizing this? To decrease the level loading time, even if only by a split second. My code currently looks like this:

Code
HEADER
LOROM

!Freespace = $1D8000

ORG $00B8E3
JML Decomp_start

macro ReadByte()
	LDA [$8A]
	LDX $8A
	INX
	BNE +
	LDX.w #$8000
	INC $8C
+	STX $8A
endmacro

ORG !Freespace

Decomp:
.return			JML $00B8EA
.start			%ReadByte()
			CMP.b #$FF
			BEQ .return

			STA $8F
			AND.b #$E0
			CMP.b #$E0
			BEQ +
			PHA 
			LDA $8F
			REP #$20
			AND.w #$001F
			BRA .label2

+			LDA $8F
			ASL
			ASL
			ASL
			AND.b #$E0
			PHA
			LDA $8F
			AND.b #$03
			XBA 
			%ReadByte()
			REP #$20
.label2			INC A
			STA $8D
			SEP #$20
			PLA
			BEQ .label3
			BPL .nextup

.label4			%ReadByte()
			XBA
			%ReadByte()
			TAX

-			PHY
			TXY
			LDA [$00],Y
			PLY
			STA [$00],Y
			INY
			INX
			REP #$20
			DEC $8D
			SEP #$20
			BNE -
			JMP.w .start

.nextup			ASL
			BPL .label5
			ASL
			BPL .label6
			%ReadByte()
			LDX $8D
-			STA [$00],Y
			INC A
			INY
			DEX
			BNE -
			JMP.w .start

.label3			%ReadByte()
			STA [$00],Y
			INY
			LDX $8D
			DEX
			STX $8D
			BNE .label3
			JMP .start

.label5			%ReadByte()
			LDX $8D
-			STA [$00],Y
			INY
			DEX
			BNE -
			JMP .start

.label6			%ReadByte()
			XBA
			%ReadByte()
			LDX $8D
-			XBA
			STA [$00],Y
			INY
			DEX
			BEQ +
			XBA 
			STA [$00],Y
			INY
			DEX
			BNE -
+			JMP .start


I'm pretty sure this can be optimized further, but having hit my limit, I have no idea how. I'd prefer that advanced ASM hackers contribute to this optimization, but other people can try too.

By optimizing, I mean faster code, even if it costs ROM space: the code should use as few cycles as possible. It's all for the sake of decreasing level loading times. It would be awesome if we actually saw visibly faster loading.
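For anyone checking an optimized routine against known-good output, here is a minimal reference decoder in Python. It is only a sketch: the command layout is read off the assembly posted above, the function name `lc_lz2_decompress` is made up for illustration, and the bank wrapping of the source pointer ($FFFF to $8000 in the next bank) is not modelled.

```python
def lc_lz2_decompress(data: bytes) -> bytes:
    """Reference decoder mirroring the command dispatch of the posted routine."""
    out = bytearray()
    pos = 0
    while True:
        header = data[pos]
        pos += 1
        if header == 0xFF:                  # end-of-stream marker
            return bytes(out)
        command = header >> 5               # top three bits select the command
        length = (header & 0x1F) + 1        # low five bits hold length - 1
        if command == 7:                    # long form: command in bits 4-2, 10-bit length
            command = (header >> 2) & 0x07
            length = (((header & 0x03) << 8) | data[pos]) + 1
            pos += 1
        if command == 0:                    # direct copy from the compressed stream
            out += data[pos:pos + length]
            pos += length
        elif command == 1:                  # byte fill
            out += bytes([data[pos]]) * length
            pos += 1
        elif command == 2:                  # word fill: two bytes alternate
            pair = data[pos:pos + 2]
            pos += 2
            out += bytes(pair[i & 1] for i in range(length))
        elif command == 3:                  # increasing fill, wrapping at $FF
            start = data[pos]
            pos += 1
            out += bytes((start + i) & 0xFF for i in range(length))
        else:                               # top bit set: repeat earlier output
            src = (data[pos] << 8) | data[pos + 1]   # big-endian offset from buffer start
            pos += 2
            for i in range(length):         # byte by byte, so overlapping ranges work
                out.append(out[src + i])
```

Feeding the same compressed data to this model and to a modified routine (in an emulator) gives a quick way to spot regressions in a faster version.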

You can find a list of cycles in this document.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
I'm sorry. I can't speak English.
So I'm posting only the code.
I'm really sorry.

Code
HEADER
LOROM
		!Freespace = $1D8000
		org $80B8E3
		JML Decomp_Start

macro ReadByte()
		LDA [$8A]
		LDX $8A
		INX
		BNE +
		LDX.w #$8000
		INC $8C
+		STX $8A
endmacro

		org !Freespace
Decomp:
.Return		PLB
		JML $80B8EA
.Start		PHB
		LDA $02
		PHA
		PLB
.Loop		%ReadByte()
		CMP #$FF
		BEQ .Return
		STA $8F
		AND #$E0
		CMP #$E0
		BEQ +
		PHA
		LDA $8F
		REP #$20
		AND.w #$001F
		BRA .Label2
+		LDA $8F
		ASL #3
		AND #$E0
		PHA
		LDA $8F
		AND #$03
		XBA
		%ReadByte()
		REP #$20
.Label2		INC A
		STA $8D
		SEP #$20
		PLA
		BEQ .Label3
		BPL .NextUp

.Label4		%ReadByte()
		XBA
		%ReadByte()
		TAX
		REP #$20
		LSR $8D
		LDA $8D
		BEQ .LoopEnd0
-		PHY
		TXY
		LDA ($00),y
		PLY
		STA ($00),y
		INY #2
		INX #2
		DEC $8D
		BNE -
.LoopEnd0	SEP #$20
		BCS +
		JMP .Loop
+		PHY
		TXY
		LDA ($00),y
		PLY
		STA ($00),y
		INY
		JMP .Loop

.NextUp		ASL A
		BPL .Label5
		ASL A
		BPL .Label6

		%ReadByte()
		LDX $8D
-		STA ($00),y
		INC A
		INY
		DEX
		BNE -
		JMP .Loop

.Label3		%ReadByte()
		STA ($00),y
		INY
		LDX $8D
		DEX
		STX $8D
		BNE .Label3
		JMP .Loop

.Label5		%ReadByte()
		LDX $8D
-		STA ($00),y
		INY
		DEX
		BNE -
		JMP .Loop

.Label6		%ReadByte()
		XBA
		%ReadByte()
		XBA
		REP #$20
		LSR $8D
		LDX $8D
		BEQ .LoopEnd
-		STA ($00),y
		INY #2
		DEX
		BNE -
.LoopEnd	SEP #$20
		BCS +
		JMP .Loop
+		STA ($00),y
		INY
		JMP .Loop
I just added in a little piece of code that stores the Y value (which contains the size of the decompressed ExGFX file) to $8D/$8E before the high byte gets destroyed by the SEP #$10 at $00B8EA.

This is useful for patches that need to upload arbitrary-sized ExGFX files without having to include a full copy of this routine in their code just to get the decompressed size.

$8D/$8E is already overwritten many times in the decompression routine and doesn't contain any useful information afterwards, so it's a good address to use.

EDIT: I'm also still leaning towards coding a GZIP decompression routine for SMW and implementing it into LM - not necessarily for faster decompression speeds, but because GZIP can compress to decently smaller sizes than LC_LZ2 in some cases.

GZIP is a pretty standardized compression format, and I've already found quite a few documents on it; I just haven't gotten around to coding a 65c816 version of the decompressor.

Code
HEADER
LOROM
		!Freespace = $1D8000
		org $80B8E3
		JML Decomp_Start

macro ReadByte()
		LDA [$8A]
		LDX $8A
		INX
		BNE +
		LDX.w #$8000
		INC $8C
+		STX $8A
endmacro

		org !Freespace
Decomp:
.Return		PLB
		STY $8D		; store size to $8D
		JML $80B8EA
.Start		PHB
		LDA $02
		PHA
		PLB
.Loop		%ReadByte()
		CMP #$FF
		BEQ .Return
		STA $8F
		AND #$E0
		CMP #$E0
		BEQ +
		PHA
		LDA $8F
		REP #$20
		AND.w #$001F
		BRA .Label2
+		LDA $8F
		ASL #3
		AND #$E0
		PHA
		LDA $8F
		AND #$03
		XBA
		%ReadByte()
		REP #$20
.Label2		INC A
		STA $8D
		SEP #$20
		PLA
		BEQ .Label3
		BPL .NextUp

.Label4		%ReadByte()
		XBA
		%ReadByte()
		TAX
		REP #$20
		LSR $8D
		LDA $8D
		BEQ .LoopEnd0
-		PHY
		TXY
		LDA ($00),y
		PLY
		STA ($00),y
		INY #2
		INX #2
		DEC $8D
		BNE -
.LoopEnd0	SEP #$20
		BCS +
		JMP .Loop
+		PHY
		TXY
		LDA ($00),y
		PLY
		STA ($00),y
		INY
		JMP .Loop

.NextUp		ASL A
		BPL .Label5
		ASL A
		BPL .Label6

		%ReadByte()
		LDX $8D
-		STA ($00),y
		INC A
		INY
		DEX
		BNE -
		JMP .Loop

.Label3		%ReadByte()
		STA ($00),y
		INY
		LDX $8D
		DEX
		STX $8D
		BNE .Label3
		JMP .Loop

.Label5		%ReadByte()
		LDX $8D
-		STA ($00),y
		INY
		DEX
		BNE -
		JMP .Loop

.Label6		%ReadByte()
		XBA
		%ReadByte()
		XBA
		REP #$20
		LSR $8D
		LDX $8D
		BEQ .LoopEnd
-		STA ($00),y
		INY #2
		DEX
		BNE -
.LoopEnd	SEP #$20
		BCS +
		JMP .Loop
+		STA ($00),y
		INY
		JMP .Loop
I got rid of the indirect addressing because absolute indexed addressing is faster, and it lets you use X, so there's no expensive shuffling in and out of Y. Your size value should be preserved too. Edit:

Code
HEADER
LOROM
		!Freespace = $1D8000
		org $80B8E3
		JML Decomp_Start

macro ReadByte()
		LDA [$8A]
		LDX $8A
		INX
		BNE +
		LDX.w #$8000
		INC $8C
+		STX $8A
endmacro

		org !Freespace
Decomp:
.Return		PLB
		REP #$20
		TYA
		SEC
		SBC $00		;sub starting pointer
		STA $8D		; store size to $8D
		SEP #$20
		JML $80B8EA
.Start		PHB
		LDA $02
		PHA
		PLB
		LDY $00		;16bit pointer in Y	
.Loop		%ReadByte()
		CMP #$FF
		BEQ .Return
		STA $8F
		AND #$E0
		CMP #$E0
		BEQ +
		PHA
		LDA $8F
		REP #$20
		AND.w #$001F
		BRA .Label2
+		LDA $8F
		ASL #3
		AND #$E0
		PHA
		LDA $8F
		AND #$03
		XBA
		%ReadByte()
		REP #$20
.Label2		INC A
		STA $8D
		SEP #$20
		PLA
		BEQ .Label3
		BPL .NextUp

.Label4		%ReadByte()
		XBA
		%ReadByte()
		REP #$21
		ADC $00		;X needs to be offset by original pointer
		TAX
		LSR $8D
		LDA $8D
		BEQ .LoopEnd0
-		LDA $0000,x
		STA $0000,y
		INY #2
		INX #2
		DEC $8D
		BNE -
.LoopEnd0	SEP #$20
		BCS +
		JMP .Loop
+		LDA $0000,x
		STA $0000,y
		INY
		JMP .Loop

.NextUp		ASL A
		BPL .Label5
		ASL A
		BPL .Label6

		%ReadByte()
		LDX $8D
-		STA $0000,y
		INC A
		INY
		DEX
		BNE -
		JMP .Loop

.Label3		%ReadByte()
		STA $0000,y
		INY
		LDX $8D
		DEX
		STX $8D
		BNE .Label3
		JMP .Loop

.Label5		%ReadByte()
		LDX $8D
-		STA $0000,y
		INY
		DEX
		BNE -
		JMP .Loop

.Label6		%ReadByte()
		XBA
		%ReadByte()
		XBA
		REP #$20
		LSR $8D
		LDX $8D
		BEQ .LoopEnd
-		STA $0000,y
		INY #2
		DEX
		BNE -
.LoopEnd	SEP #$20
		BCS +
		JMP .Loop
+		STA $0000,y
		INY
		JMP .Loop


I will contribute more when I'm not so tired. Pardon any stupid mistakes, but they should be easy to spot and correct.
I optimized it a little.
If there is a bug, I am sorry.

Code
HEADER
LOROM
		!Freespace = $1D8000
		org $80B8E3
		JML Decomp_Start

macro ReadByte()
		LDA [$00],y
		INY
		BMI +
		LDY.w #$8000
		INC $02
+
endmacro

macro ReadWord()
		LDA [$00],y
		INY #2
		BMI +
		PHA
		TYA
		ORA #$8000
		TAY
		SEP #$20
		INC $02
		REP #$20
		PLA
+
endmacro

		org !Freespace
Decomp:
.Return		PLY
		STY $00
		LDA $02
		STA $8C
		STA $8F
		PHB
		PLA
		STA $02
		PLB
		REP #$20
		TXA
		SEC
		SBC $00		;sub starting pointer
		TXY
		STA $8D		; store size to $8D
		SEP #$20
		JML $80B8EA
.Start		PHB
		LDA $02
		PHA
		PLB
		LDX $00		;16bit pointer in X
		PHX
		STZ $00
		STZ $01
		LDY $8A
		LDA $8C
		STA $02
.Loop		LDA $7F8182
		%ReadByte()
		CMP #$FF
		BEQ .Return
		STA $8F
		AND #$E0
		CMP #$E0
		BEQ +
		PHA
		LDA $8F
		REP #$20
		AND.w #$001F
		BRA .Label2
+		LDA $8F
		ASL #3
		AND #$E0
		PHA
		LDA $8F
		AND #$03
		XBA
		%ReadByte()
		REP #$20
.Label2		INC A
		STA $8D
		SEP #$20
		PLA
		BEQ .Label3
		BPL .NextUp

.Label4		%ReadByte()
		XBA
		%ReadByte()
		PHY
		REP #$21
		ADC $03,s		;Y needs to be offset by original pointer
		TAY
		LSR $8D
		LDA $8D
		BEQ .LoopEnd0
-		LDA $0000,y
		STA $0000,x
		INY #2
		INX #2
		DEC $8D
		BNE -
.LoopEnd0	SEP #$20
		BCS +
		PLY
		JMP .Loop
+		LDA $0000,y
		STA $0000,x
		INX
		PLY
		JMP .Loop

.NextUp		ASL A
		BPL .Label5
		ASL A
		BPL .Label6

		%ReadByte()
		PHY
		LDY $8D
-		STA $0000,x
		INC A
		INX
		DEY
		BNE -
		PLY
		JMP .Loop

.Label3		REP #$20
		LSR $8D
		LDA $8D
		BEQ .LoopEnd1
-		%ReadWord()
		STA $0000,x
		INX #2
		DEC $8D
		BNE -
.LoopEnd1	SEP #$20
		BCS +
		JMP .Loop
+		%ReadByte()
		STA $0000,x
		INX
		JMP .Loop

.Label5		%ReadByte()
		PHY
		LDY $8D
-		STA $0000,x
		INX
		DEY
		BNE -
		PLY
		JMP .Loop

.Label6		REP #$20
		%ReadWord()
		LSR $8D
		PHY
		LDY $8D
		BEQ .LoopEnd
-		STA $0000,x
		INX #2
		DEY
		BNE -
.LoopEnd	PLY
		SEP #$20
		BCS +
		JMP .Loop
+		STA $0000,x
		INX
		JMP .Loop
Nice work, guys! I've done a small, extremely inaccurate calculation, and it seems the level loading time improved by 0.5 seconds.

I'm sure this can be optimized even more, though, so I'll probably find another way.
Half a second is good. I may use this for the SMAS compression routine, seeing as I used the original SMB1 code anyway. =P
----------

Interested in MushROMs? View its progress, source code, and make contributions here.

Over a million cycles saved is good, and it hasn't even been unrolled yet. There's still room for improvement; the thread is still pretty new.

Just a small bug I noticed (but only happens on bank crossing):

Code
macro ReadWord() 
LDA [$00],y 
INY #2 
BMI + 
PHA 
TYA 
ORA #$8000 
TAY 
SEP #$20 
INC $02 
REP #$20 
PLA 
+ 
endmacro


If [$00] points to $12:FFFF, for example, the word read takes its upper byte from $13:0000 instead of $13:8000.
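The bank-crossing problem can be shown with a toy model in Python. This is only an illustrative sketch: the `rom` dictionary, its byte values and both function names are invented, and the entry at $13:0000 is just a marker for "the wrong byte" (on real hardware that address is not ROM in LoROM).

```python
def read_word_linear(rom, bank, addr):
    """16-bit read as a plain long-indexed load does it: the address carries into the next bank."""
    lo = rom[(bank, addr)]
    if addr == 0xFFFF:
        hi = rom[(bank + 1, 0x0000)]    # lands below $8000, outside LoROM ROM space
    else:
        hi = rom[(bank, addr + 1)]
    return lo | (hi << 8)


def read_word_lorom(rom, bank, addr):
    """16-bit read that continues at $8000 of the next bank, as LoROM data is laid out."""
    lo = rom[(bank, addr)]
    if addr == 0xFFFF:
        hi = rom[(bank + 1, 0x8000)]    # next ROM byte after $xx:FFFF
    else:
        hi = rom[(bank, addr + 1)]
    return lo | (hi << 8)


# Toy memory: a word $1234 stored across the $12/$13 bank boundary.
rom = {
    (0x12, 0xFFFF): 0x34,   # low byte, last byte of bank $12
    (0x13, 0x8000): 0x12,   # high byte, first ROM byte of bank $13
    (0x13, 0x0000): 0xEE,   # marker for whatever sits at $13:0000
}

wrong = read_word_linear(rom, 0x12, 0xFFFF)   # picks up the marker byte
right = read_word_lorom(rom, 0x12, 0xFFFF)    # picks up the real high byte
```

Only the high byte is affected, which is why a fix only needs to re-read one byte after switching banks.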

Code
macro ReadWord() 
LDA [$00],y 
INY #2 
BMI +
PHP
LDY #$8000
SEP #$20    ;is $03 used for anything? can save a cycle without the SEP
INC $02 
PLP
BEQ +
XBA ;it's the high byte that got affected
SEP #$20
LDA [$00],Y
XBA
INY
REP #$20
+
endmacro


I didn't test that, so point out anything that doesn't look quite right, of course =)

Edit: since push/pull is actually slower than STA dp/LDA dp, some cycles can be saved by keeping those temporaries in direct page space that the routine doesn't otherwise use.
Going through the current code carefully, I've noticed that the following pieces of code can be unrolled:

Code
-		LDA $0000,y ;direct copy?
		STA $0000,x
		INY #2
		INX #2
		DEC $8D
		BNE -


Code
		LDY $8D
-		STA $0000,x ;direct fill?
		INX
		DEY
		BNE -


Code
-		STA $0000,x ;direct word fill?
		INX #2
		DEY
		BNE -


I'd try to do this myself, but I have school today...
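To get a feel for what unrolling buys, here is a rough cycle estimate for the byte-fill loop (STA $0000,x / INX / DEY / BNE) in Python. It is a back-of-the-envelope sketch: it uses the nominal 65c816 cycle counts with an 8-bit accumulator, assumes the loop counter then counts groups of bytes instead of single bytes, and ignores memory access speed, the final not-taken branch and any code needed for leftover bytes. The function name is made up.

```python
STA_ABS_X = 5    # STA $0000,x with an 8-bit accumulator
INX = 2
DEY = 2
BNE_TAKEN = 3


def cycles_per_byte(unroll: int) -> float:
    """Average CPU cycles per byte when the store/INX pair is repeated `unroll` times per iteration."""
    body = unroll * (STA_ABS_X + INX)    # work that scales with the byte count
    overhead = DEY + BNE_TAKEN           # loop control, paid once per iteration
    return (body + overhead) / unroll


for factor in (1, 2, 4, 8):
    print(factor, cycles_per_byte(factor))
```

Under these assumptions the loop control is 5 of the 12 cycles per byte, so most of the possible gain is already there at an unroll factor of 4 or 8, and the cost approaches 7 cycles per byte beyond that.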
MVN test.

Code
header
lorom

!ofs = $8FF000

macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE $03
JSR BANK_INC
endmacro

org $80B8E3
	JSL !ofs
	RTS
	
org !ofs
	PHB
	PEI ($03)
	PEI ($05)
	PEI ($07)
	PEI ($09)
	PEI ($0B)
	PEI ($8A)
	SEP #$20
	REP #$10
	LDA $02
	PHA
	PLB
	STA $05		; dest_bank
	INC
	STA $03		; dest_bank [plus or minus]
	LDA #$54
	STA $04		; mvn
	LDA #$4C
	STA $07		; jump
	LDA $8C
	STA $06		; src_bank
	LDX.w #.back
	STX $08
	
	LDY $00		; dest_low
	LDX $8A		; src_low
	STZ $8A
	STZ $8B
	BRA .main

.case_80_or_e0
	BPL .lz
	LDA $8D
	CMP #$1F
	BNE .case_e0
	PLX : STX $8A
	PLX : STX $0B
	PLX : STX $09
	PLX : STX $07
	PLX : STX $05
	PLX : STX $03
	SEP #$10
	PLB
	RTL
	
.case_e0
	AND #$03
	STA $8E
	EOR $8D
	ASL
	ASL
	ASL
	XBA
	%ReadByte()
	STA $8D
	XBA
	BRA .type
	
.lz
	%ReadByte()
	XBA
	%ReadByte()
	
	STX $0B
	REP #$21
	ADC $00
	TAX
	LDA $8D
	SEP #$20
	BIT $03
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	LDX $0B
	
.main
	%ReadByte()
	STA $8D
	STZ $8E
	AND #$E0
	TRB $8D

.type
	ASL
	BCS .case_80_or_e0
	BMI .case_40_or_60
	ASL
	BMI .case_20

.case_00
	REP #$20
	LDA $8D
	STX $8D
	
-	SEP #$20
	JMP $0004
	
.back
	CPX $8D
	BCS .main
	
	JSR BANK_INC_2
	CPX #$0000
	BEQ ++
	
	DEX
	STX $0B
	REP #$21
	LDX #$8000
	STX $8D
	TYA
	SBC $0B
	TAY
	LDA $0B
	BRA -
	
++	LDX #$8000
	BRA .main

.case_20
	%ReadByte()
	STX $0B
	PHA
	PHA
	REP #$20
	
.case_20_main
	LDA $8D
	INC
	LSR
	TAX
	PLA
	
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	
	SEP #$20
	BCC +
	STA $0000,Y
	INY
+	LDX $0B
	BRA .main
	
.case_40_or_60
	ASL
	BMI .case_60
	%ReadByte()
	XBA
	%ReadByte()

	XBA
	STX $0B
	REP #$20
	PHA
	BRA .case_20_main
	
.case_60
	%ReadByte()
	STX $0B
	LDX $8D
-	STA $0000,Y
	INC
	INY
	DEX
	BPL -
	LDX $0B
	JMP .main
	
BANK_INC:
	LDX #$8000
BANK_INC_2:
	INC $06
	INC $8C
	RTS
@ersanio: the unrollable loops would be good candidates for DMA, except word fill, which would be pretty awkward. The copy and byte-fill loops are suitable, though.

@Min: Self-modifying code with MVN is better than load/store, but DMA WRAM->SRAM->WRAM is 2 cycles/byte, not counting setup. A few days ago there was a chat about just using DMA (it would also work for byte copy).

@ersanio/others: For byte fill you can just use DMA and set it not to increment the source address, so it reads the same byte every time. Transferred to $2180 like that, it costs 1 cycle per byte. You just have to store the byte somewhere in SRAM, because WRAM->WRAM transfers are not allowed.

Copy can go $2180->$70xxxx->$2180, but some SRAM must be reserved. It's much faster, but more awkward to use. MVN is easy, so it depends on what we decide. If we ever have to copy something like 200 bytes (a large single-colored area, a repeating tile sequence, etc.), DMA will be massively faster than MVN. For small quantities, MVN is OK because of the DMA setup time.
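The trade-off between DMA setup time and per-byte cost can be put into numbers with a small Python sketch. The 2 cycles/byte figure is the one quoted above for WRAM->SRAM->WRAM, 7 cycles/byte is the nominal cost of MVN, and the setup cost is left as a parameter because it depends on how the DMA registers are written. The value 100 below is purely an assumed example, MVN's own setup is ignored, and the function name is hypothetical.

```python
MVN_CYCLES_PER_BYTE = 7    # nominal cost of MVN per byte moved
DMA_CYCLES_PER_BYTE = 2    # figure quoted in the post for WRAM->SRAM->WRAM


def break_even_bytes(setup_cycles: int) -> int:
    """Smallest byte count n for which setup + 2*n is strictly less than 7*n."""
    saved_per_byte = MVN_CYCLES_PER_BYTE - DMA_CYCLES_PER_BYTE
    return setup_cycles // saved_per_byte + 1


# Assumed example: if programming the DMA registers cost 100 cycles,
# DMA would start to win at 21 bytes.
example = break_even_bytes(100)
```

Since a single command without the long-length form copies at most 32 bytes, whether DMA pays off for typical data depends heavily on how small the setup can be made.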

By the end when everyone has put in their contribution it should be much, much faster than the original.

@Japan guys: if there is anyone else you can bring, please do, since the communities are kind of divided for whatever reason. Even if your English is not too good, it is still very easy to contribute to projects like these.
Re-added the code that stores the size to $8D. I had to move one block of code to keep it from causing out-of-range branches.
It would be very helpful if this stays in.

I also added a RATS tag and an insert-size counter (current size is 351 bytes).
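For reference, the RATS tag emitted by the `db "STAR"` / `dw` lines in the listing below is the string "STAR", then the protected size minus one, then that value XORed with $FFFF, both as little-endian words. A small Python sketch of the same layout follows (the function name is made up, and the size passed in is just an example):

```python
import struct


def rats_tag(size: int) -> bytes:
    """Build the 8-byte RATS tag for a protected block of `size` bytes."""
    if not 1 <= size <= 0x10000:
        raise ValueError("size must fit in a 16-bit 'size - 1' field")
    n = size - 1
    # Two little-endian words: size - 1, then its bitwise complement.
    return b"STAR" + struct.pack("<HH", n, n ^ 0xFFFF)


# Example: a 256-byte block gives 53 54 41 52 FF 00 00 FF.
tag = rats_tag(0x100)
```

The complement word lets tools check that a "STAR" string found in the ROM really is a tag and not coincidental data.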

Code
header
lorom

!ofs = $8FF000

macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE $03
JSR BANK_INC
endmacro

org $80B8E3
	JSL CodeStart
	RTS
	
org !ofs
	reset bytes
	db "STAR"
	dw CodeEnd-CodeStart-$01
	dw CodeEnd-CodeStart-$01^$FFFF

CodeStart:
	PHB
	PEI ($03)
	PEI ($05)
	PEI ($07)
	PEI ($09)
	PEI ($0B)
	PEI ($8A)
	SEP #$20
	REP #$10
	LDA $02
	PHA
	PLB
	STA $05		; dest_bank
	INC
	STA $03		; dest_bank [plus or minus]
	LDA #$54
	STA $04		; mvn
	LDA #$4C
	STA $07		; jump
	LDA $8C
	STA $06		; src_bank
	LDX.w #.back
	STX $08
	
	LDY $00		; dest_low
	LDX $8A		; src_low
	STZ $8A
	STZ $8B
	BRA .main
	
.case_e0
	AND #$03
	STA $8E
	EOR $8D
	ASL
	ASL
	ASL
	XBA
	%ReadByte()
	STA $8D
	XBA
	BRA .type

.case_80_or_e0
	BPL .lz
	LDA $8D
	CMP #$1F
	BNE .case_e0
	PLX : STX $8A
	PLX : STX $0B
	PLX : STX $09
	PLX : STX $07
	PLX : STX $05
	PLX : STX $03
	REP #$20
	TYA
	SEC
	SBC $00
	STA $8D		; size!!!
	SEP #$30
	PLB
	RTL
	
.lz
	%ReadByte()
	XBA
	%ReadByte()
	
	STX $0B
	REP #$21
	ADC $00
	TAX
	LDA $8D
	SEP #$20
	BIT $03
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	LDX $0B
	
.main
	%ReadByte()
	STA $8D
	STZ $8E
	AND #$E0
	TRB $8D

.type
	ASL
	BCS .case_80_or_e0
	BMI .case_40_or_60
	ASL
	BMI .case_20

.case_00
	REP #$20
	LDA $8D
	STX $8D
	
-	SEP #$20
	JMP $0004
	
.back
	CPX $8D
	BCS .main
	
	JSR BANK_INC_2
	CPX #$0000
	BEQ ++
	
	DEX
	STX $0B
	REP #$21
	LDX #$8000
	STX $8D
	TYA
	SBC $0B
	TAY
	LDA $0B
	BRA -
	
++	LDX #$8000
	BRA .main

.case_20
	%ReadByte()
	STX $0B
	PHA
	PHA
	REP #$20
	
.case_20_main
	LDA $8D
	INC
	LSR
	TAX
	PLA
	
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	
	SEP #$20
	BCC +
	STA $0000,Y
	INY
+	LDX $0B
	BRA .main
	
.case_40_or_60
	ASL
	BMI .case_60
	%ReadByte()
	XBA
	%ReadByte()

	XBA
	STX $0B
	REP #$20
	PHA
	BRA .case_20_main
	
.case_60
	%ReadByte()
	STX $0B
	LDX $8D
-	STA $0000,Y
	INC
	INY
	DEX
	BPL -
	LDX $0B
	JMP .main
	
BANK_INC:
	LDX #$8000
BANK_INC_2:
	INC $06
	INC $8C
	RTS

CodeEnd:
print "Insert Size: ",bytes," bytes"
Eliminated unnecessary JSRs. The code is about 0.7 seconds faster now o_o
Code
HEADER
LOROM

!Freespace = $1D8000|$800000

macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE +
LDX #$8000
INC $06
INC $8C
+
endmacro

org $80B8E3
	JSL CodeStart		;was JML before
	RTS
	
org !Freespace
	reset bytes
	db "STAR"
	dw CodeEnd-CodeStart-$01
	dw CodeEnd-CodeStart-$01^$FFFF

CodeStart:
	PHB
	PEI ($03)
	PEI ($05)
	PEI ($07)
	PEI ($09)
	PEI ($0B)
	PEI ($8A)
	SEP #$20
	REP #$10
	LDA $02
	PHA
	PLB
	STA $05		; dest_bank
	INC
	STA $03		; dest_bank [plus or minus]
	LDA #$54
	STA $04		; mvn
	LDA #$4C
	STA $07		; jump
	LDA $8C
	STA $06		; src_bank
	LDX.w #.back
	STX $08
	
	LDY $00		; dest_low
	LDX $8A		; src_low
	STZ $8A
	STZ $8B
	BRA .main
	
.case_e0
	AND #$03
	STA $8E
	EOR $8D
	ASL
	ASL
	ASL
	XBA
	%ReadByte()
	STA $8D
	XBA
	BRA .type

.case_80_or_e0
	BPL .lz
	LDA $8D
	CMP #$1F
	BNE .case_e0
	PLX : STX $8A
	PLX : STX $0B
	PLX : STX $09
	PLX : STX $07
	PLX : STX $05
	PLX : STX $03
	REP #$20
	TYA
	SEC
	SBC $00
	STA $8D		; size!!!
	SEP #$30
	PLB
	RTL			;JML $80B8EA
	
.lz
	%ReadByte()
	XBA
	%ReadByte()
	
	STX $0B
	REP #$21
	ADC $00
	TAX
	LDA $8D
	SEP #$20
	BIT $03
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	LDX $0B
	
.main
	%ReadByte()
	STA $8D
	STZ $8E
	AND #$E0
	TRB $8D

.type
	ASL
	BCS .case_80_or_e0
	BMI .case_40_or_60
	ASL
	BMI .case_20

.case_00
	REP #$20
	LDA $8D
	STX $8D
	
-	SEP #$20
	JMP $0004
	
.back
	CPX $8D
	BCS .main
	
	INC $06
	INC $8C
	CPX #$0000
	BEQ ++
	
	DEX
	STX $0B
	REP #$21
	LDX #$8000
	STX $8D
	TYA
	SBC $0B
	TAY
	LDA $0B
	BRA -
	
++	LDX #$8000
	BRA .main

.case_20
	%ReadByte()
	STX $0B
	PHA
	PHA
	REP #$20
	
.case_20_main
	LDA $8D
	INC
	LSR
	TAX
	PLA
	
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	
	SEP #$20
	BCC +
	STA $0000,Y
	INY
+	LDX $0B
	BRA .main
	
.case_40_or_60
	ASL
	BMI .case_60
	%ReadByte()
	XBA
	%ReadByte()

	XBA
	STX $0B
	REP #$20
	PHA
	BRA .case_20_main
	
.case_60
	%ReadByte()
	STX $0B
	LDX $8D
-	STA $0000,Y
	INC
	INY
	DEX
	BPL -
	LDX $0B
	JMP .main
	

CodeEnd:
print "Insert Size: ",bytes," bytes"

Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond, give or take. Also, because a hacker may need to call the decompression routine from elsewhere for whatever reason, I argue that it should stay as an RTL. I hate to keep going back to SMAS, but I do intend to use this routine in more than one spot. The point is that it can happen in SMW too, so I'd say the convenience of keeping an RTL outweighs the microsecond we would save.

Just my opinion. Anyone can rebut.
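The size of that overhead can be estimated with a few lines of Python. This is a rough sketch under stated assumptions: nominal cycle counts of 4 for JML, 8 for JSL and 6 for RTL, so JSL+RTL costs about 6 cycles more than two JMLs per call; the NTSC master clock of about 21.477 MHz; and 6 to 8 master clocks per CPU cycle depending on memory speed. The 20 calls per level is the figure from the post, and the function name is made up.

```python
MASTER_CLOCK_HZ = 21_477_272    # NTSC master clock, approximate

JML, JSL, RTL = 4, 8, 6         # nominal 65c816 cycle counts
EXTRA_CYCLES_PER_CALL = (JSL + RTL) - (JML + JML)    # 6 cycles


def overhead_us(calls: int, master_clocks_per_cycle: int) -> float:
    """Extra time in microseconds spent on JSL/RTL instead of JML/JML."""
    master_clocks = calls * EXTRA_CYCLES_PER_CALL * master_clocks_per_cycle
    return master_clocks / MASTER_CLOCK_HZ * 1e6


fast = overhead_us(20, 6)    # every cycle at the fast access speed
slow = overhead_us(20, 8)    # every cycle at the slow access speed
```

Under these assumptions the difference comes out at a few tens of microseconds per level load, which is still far too small to notice.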

I would just have the JML as default and add comments for JSL conversion of the hijack.
Originally posted by spel werdz rite
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL.


Agreeing with this. It's literally impossible to notice the difference, and the JML just causes inconvenience, so there's little reason to keep it. The focus should be on optimizing the loop content.
@smkdan and SWR: you both have good points actually.

I'll change them to JSL/RTL again.
@smkdan: Thanks.

I rewrote the code.

Code
lorom
header

!Freespace = $1D8000|$800000

macro ReadByte()
	STX $8A
	LDA [$8A]
	INX
	BNE +
	JSR BANK_INC
+
endmacro

macro ReadWord()
	STX $8A
	LDA [$8A]
	INX
	INX
	BMI +
	JSR BANK_INC_2
+
endmacro

org $80B8E3
	JSL CodeStart
	RTS
	
org !Freespace
	reset bytes
	db "STAR"
	dw CodeEnd-CodeStart-$0001
	dw CodeEnd-CodeStart-$0001^$FFFF
	
CodeStart:
	PHB
	PEI ($03)
	PEI ($05)
	PEI ($07)
	PEI ($09)
	PEI ($0B)
	PEI ($8A)
	SEP #$20
	REP #$10
	LDA $02
	PHA
	PLB
	STA $05		; dest_bank
	INC
	STA $03		; dest_bank [plus or minus]
	LDA #$54
	STA $04		; mvn
	LDA #$4C
	STA $07		; jump
	LDA $8C
	STA $06		; src_bank
	LDX.w #.back
	STX $08
	
	LDY $00		; dest_low
	LDX $8A		; src_low
	STZ $8A
	STZ $8B
	JMP .main

.case_ff
	PLX : STX $8A
	PLX : STX $0B
	PLX : STX $09
	PLX : STX $07
	PLX : STX $05
	PLX : STX $03
	REP #$20
	TYA
	SBC $00		; carry = 1
	STA $8D		; size
	SEP #$30
	PLB
	RTL

.case_e0
	LDA $8D
	CMP #$1F
	BEQ .case_ff
	AND #$03
	STA $8E
	EOR $8D
	ASL
	ASL
	ASL
	XBA
	%ReadByte()
	STA $8D
	XBA
	BRA .type
	
.case_00
	LDA $8E
	XBA
	LDA $8D
-	JMP $0004
	
.back
	CPX #$0000
	BMI .main
	
	INC $06
	INC $8C
	DEX
	BMI ++
	STX $0B
	REP #$21
	LDX #$8000
	TYA
	SBC $0B
	TAY
	LDA $0B
	SEP #$20
	BRA -
	
++	LDX #$8000
	BRA .main
	
.case_80_or_e0
	BMI .case_e0
	REP #$21
	%ReadWord()
	XBA
	STX $8A
	ADC $00
	TAX
	LDA $8D
	SEP #$20
	BIT $03
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	LDX $8A
	BRA .bra

.main
	STX $8A
.bra
	LDA [$8A]
	INX
	BNE +
	JSR BANK_INC
+	STA $8D
	STZ $8E
	AND #$E0
	TRB $8D

.type
	ASL
	BCS .case_80_or_e0
	BEQ .case_00
	BMI .case_40_or_60

.case_20
	%ReadByte()
	STX $0B
	PHA
	PHA
	REP #$20
	
.case_20_main
	LDA $8D
	INC
	LSR
	TAX
	PLA
	
-	STA $0000,Y
	INY
	INY
	DEX
	BNE -
	
	SEP #$20
	LDX $0B
	BCC .main
	
	STA $0000,Y
	INY
	BRA .main
	
.case_40_or_60
	ASL
	BMI .case_60
	REP #$20
	%ReadWord()
	STX $0B
	PHA
	BRA .case_20_main
	
.case_60
	%ReadByte()
	STX $0B
	LDX $8D
-	STA $0000,Y
	INC
	INY
	DEX
	BPL -
	LDX $0B
	JMP .main
	
BANK_INC:
	LDX #$8000
	INC $06
	INC $8C
	RTS

BANK_INC_2:
	CPX #$0001
	LDX #$8000
	INC $06		; $07($8D) is not affected.
	INC $8C
	BCC +
	
	SEP #$20
	XBA
	STX $8A
	LDA [$8A]
	XBA
	INX
	REP #$21
+	RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"


DMA version (does not use SRAM)
// updated

Code
lorom
header

!Freespace = $1D8000|$800000

; Note: 
; - This routine uses $00:211B-$00:211C,
;   $00:2134-$00:2136, $00:435x-$00:436x.
;
; - This routine doesn't use WRAM->SRAM->WRAM DMA,
;   because it will fail in some cases.
;
;   Error case: 
;   CompData: <02:01 02 03> <85:00 00> <FF>
;   dest: $7F0000
;
;   <02:01 02 03>
;   01 02 03 .. .. .. .. .. ..
;   01 02 03 .. .. .. .. .. ..
;
;   <85:00 00>
;   01 02 03 01 02 03 01 02 03 ; Default / MVN
;   01 02 03 01 02 03 .. .. .. ; (for example)
;    WRAM->SRAM ($7F:0000-$7F:0005 -> $70:0000-$70:0005)
;    SRAM->WRAM ($70:0000-$70:0005 -> $7F:0003-$7F:0008)
;
; - NMI/IRQ handlers must change the D register
;   from $4300 back to $0000 on entry.
;

macro ReadByte()
	LDA [$67],Y
	INY
	BNE +
	JSR BANK_INC
+
endmacro
	
macro ReadWord()
	LDA [$67],Y
	INY
	INY
	BMI +
	JSR BANK_INC_2
+
endmacro

org $80B8E3
	JSL CodeStart
	RTS
	
org !Freespace
	reset bytes
	db "STAR"
	dw CodeEnd-CodeStart-$0001
	dw CodeEnd-CodeStart-$0001^$FFFF
	
CodeStart:
	PHB
	PHK
	PLB
	PHD
	PEA $4300
	PLD
	
	LDX #$3480
	STX $50		; dma_param
	LDX $0000
	STX $58		; dest_low[start] / (HDMA Table Address)
	LDA $0002
	STA $54		; dest_bank
	STA $2183
	INC
	STA $57		; [Plus or Minus] / (HDMA Indirect Bank)
	
	LDY #$8000
	STY $60		; dma_param
	LDY $008A
	STZ $67		; src_low
	STZ $68		; src_high
	LDA $008C
	STA $64		; src_bank
	STA $69		; src_bank
	JMP .main
	
.end
	PLD
	REP #$20
	TXA
	SBC $00		; carry = 1
	STA $8D		; size
	SEP #$30
	PLB
	RTL
	
.case_e0
	LDA $65
	CMP #$1F
	BEQ .end
	AND #$03
	STA $66
	EOR $65
	ASL #3
	XBA
	%ReadByte()
	STA $65
	XBA
	BRA .type
	
.case_00
	REP #$21
	INC $65		; bytecount
	STX $2181
	TXA
	ADC $65
	TAX
	STY $62
	
-	SEP #$20
	LDA #$40
	STA $420B
	LDY $62
	
	BMI .main
	INC $64
	INC $69
	CPY #$0000
	BEQ ++
	
	REP #$20
	STY $65
	LDY #$8000
	STY $62
	TXA
	SBC $65		; carry = 1
	STA $2181
	BRA -
	
++	LDY #$8000
	BRA .main

.case_80_or_e0
	BMI .case_e0
	REP #$21
	%ReadWord()
	STY $52		; tmp
	TXY
	XBA
	ADC $58
	TAX
	LDA $65
	SEP #$20
	PHB
	BIT $57
	BPL +
	MVN $7F7F
	BRA ++
+	MVN $7E7E
++	TYX
	LDY $52
	PLB
	
.main
	%ReadByte()
	STA $65
	STZ $66
	AND #$E0
	TRB $65
	
.type
	ASL
	BCS .case_80_or_e0
	BEQ .case_00
	BMI .case_40_or_60

.case_20
	%ReadByte()
	STA $211B
	STZ $211B
	LDA #$80
	STA $50		; param
	STX $52
	LDX $65
	INX
	STX $55
	LDA #$01
	STA $211C
	LDA #$20
	STA $420B
	LDX $52
	BRA .main

.case_40_or_60
	ASL
	BMI .case_60

.case_40	
	REP #$20
	%ReadWord()
	SEP #$20
	STA $211B
	XBA
	STA $211B
	LDA #$81
	STA $50		; param
	STX $52
	LDX $65
	INX
	STX $55
	LDA #$01
	STA $211C
	LDA #$20
	STA $420B
	LDX $52
	BRA .main
	
.case_60
	%ReadByte()
	STX $2181
	STY $52		; tmp
	
	LDY $65
-	STA $2180
	INC
	INX
	DEY
	BPL -
	LDY $52
	JMP .main
	
BANK_INC:
	LDY #$8000
	INC $64
	INC $69
	RTS
	
BANK_INC_2:
	CPY #$0001
	LDY #$8000
	INC $64
	INC $69
	BCC +
	
	SEP #$20
	XBA
	LDA [$67],Y
	INY
	XBA
	REP #$21
+	RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"
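The error case described in the comments of the listing above can be reproduced with a small Python model. It is only a sketch with made-up function names: `copy_forward` stands in for a byte-by-byte copy such as MVN or the original loop, and `copy_staged` stands in for a transfer that first moves the whole source range to a scratch area (as WRAM->SRAM->WRAM would) and only then writes it back.

```python
def copy_forward(buf: bytearray, src: int, dst: int, count: int) -> None:
    """Byte-by-byte forward copy: bytes written early are read again later."""
    for i in range(count):
        buf[dst + i] = buf[src + i]


def copy_staged(buf: bytearray, src: int, dst: int, count: int) -> None:
    """Two-step copy through a scratch buffer: the source is snapshotted first."""
    staged = bytes(buf[src:src + count])    # like WRAM -> SRAM
    buf[dst:dst + count] = staged           # like SRAM -> WRAM


# Output buffer after <02:01 02 03>; the zeros stand for not-yet-written bytes.
expected = bytearray([1, 2, 3, 0, 0, 0, 0, 0, 0])
broken = bytearray(expected)

# <85:00 00>: copy 6 bytes from offset 0 to the current position 3.
copy_forward(expected, 0, 3, 6)    # 01 02 03 01 02 03 01 02 03
copy_staged(broken, 0, 3, 6)       # 01 02 03 01 02 03 00 00 00
```

The source and destination ranges overlap, so the last three bytes only exist once the first three have been copied. A staged transfer reads them before they are written.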
For some reason the DMA version is screwing up the new (yet-to-be-released) Layer3ExGFX, which reloads GFX on submap change (including FG slots) without fblank. It causes layer 1 and sprites to "flicker" above the windowing effects. I think it might have something to do with the DMA transfer that uses channel 6, because when I comment out the STA $420B it stops flickering (though of course that also breaks the GFX load), but changing the channel doesn't help at all.

I'm looking into a solution right now, because I really think this should be addressed.


EDIT: I just tested it in SNES9x, and the problem isn't exactly the same, but it is there: the whole screen flickers black as if entering fblank every other frame.
I guess the routine has a problem running outside of blanking, which is a problem for any "OW ExGraFix" patches that reload GFX on submap change without fblank.

EDIT2: BSNES does the same

EDIT3: I'm not even sure if the DMA version is faster. And if it is slightly faster, I think we should continue to use the non-DMA version because compatibility > tiny speed increases.
@edit: It writes to the CH5/CH6 registers; is there any possibility CH5 is messing things up for whatever you are doing? The registers themselves are readable, so maybe push them before entering the routine and see if it helps.