I have never used a Mac, and never will. Professionally, I use DOS or nothing at all. DOS gets out of my way and lets me program the hardware directly. Win98se, booted into real mode. I disable interrupts while running, and execute reserved and undocumented instructions. Phar Lap extended DOS when I need a lot of memory; I wrote all of my own low-level support routines for interrupt handling and for bypassing their "Shadow-Interrupt vector table", because theirs was too slow. As for finding hardware to run on: custom made.
Once in school, in CS class, I wrote a demo that had to do fast graphics output for sprites, so I needed to figure out a fast way to write sprite data to video memory. The fastest way to write stuff into memory was using PUSH instructions with fixed data values, with which I could write 32 bits, or four pixels, to the stack in two clock cycles. This was on a 386 and there was no acceleration from the graphics card.
So I wrote a routine that read in all sprites and transformed them into assembler code that consisted only of PUSH instructions (and stack pointer adjustments for line breaks and for sprites with holes in them). This worked well, and PUSH is also a really compact instruction: the 32-bit operand-size prefix, the opcode (0x68), and the four data bytes. That is six bytes for four pixels, so for a sprite containing N pixels with no holes, the assembled sprite routine for that sprite would be 1.5 N bytes in length. Sprites with holes were smaller.
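As a rough illustration of the encoding described above, here is a small Python sketch that turns a hole-free run of pixels into the raw bytes of PUSH instructions. The function name `encode_push_run` is made up for this example, and 0x66 as the prefix byte is an assumption (it is the standard operand-size prefix on x86, but the post does not name it). The original tool emitted assembler source, not bytes, so this only shows the layout and the 1.5 bytes-per-pixel figure.

```python
OPERAND_SIZE_PREFIX = 0x66  # assumed: standard x86 prefix selecting a 32-bit operand in 16-bit code
PUSH_IMM32 = 0x68           # PUSH immediate opcode, as named in the post


def encode_push_run(pixels: bytes) -> bytes:
    """Encode a hole-free run of 8-bit pixels as PUSH imm32 instructions.

    The run length must be a multiple of four. The stack grows downward,
    so the last four pixels of the run are pushed first. The immediate is
    stored little-endian, which means the four pixel bytes can be copied
    into the instruction in screen order.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("run length must be a multiple of 4")
    code = bytearray()
    for offset in range(len(pixels) - 4, -1, -4):
        code.append(OPERAND_SIZE_PREFIX)
        code.append(PUSH_IMM32)
        code += pixels[offset:offset + 4]
    return bytes(code)


pixels = bytes(range(1, 9))          # eight example pixels
code = encode_push_run(pixels)
print(len(code) / len(pixels))       # 1.5 bytes of code per pixel
```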
Whenever I wanted to draw a sprite, I'd put the stack pointer on the video memory segment at 0xA000 and call the respective sprite routine, which wrote the whole sprite to the stack. After the sprite was written, I'd put the stack pointer back where it belonged (which required a bit of a hack, too, because obviously I couldn't save the old stack pointer location to the stack or else there would be no way to get it back). Anyway, this method allowed me to write four pixels in two clock cycles, and a hole in a sprite cost one more clock cycle to adjust the stack pointer by the length of the hole (as opposed to the traditional method, which uses a comparison for every pixel). Two pixels per clock cycle on average. There is no faster way to write stuff to the screen; the limit is the bandwidth to the VGA card. Of course there was no way to do page-flipping etc., because all the memory access I had was the single 64k segment at 0xA000.
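To make the drawing step concrete, here is a toy Python simulation of a compiled sprite running against a 64 KB buffer that stands in for the segment at 0xA000. Everything in it is an assumption for illustration: the op names `push` and `adjust`, the 320-byte row pitch, the top-to-bottom row order, and the sprite itself. It only models how the stack pointer moves; it is not the original routine.

```python
WIDTH = 320          # assumed row pitch in bytes; the post does not state the video mode
VRAM_SIZE = 0x10000  # the single 64 KB segment


def run_sprite(ops, vram, sp):
    """Execute ("push", four_bytes) and ("adjust", delta) ops on vram.

    A push moves the pointer down by four and stores four pixels there.
    An adjust moves the pointer without writing, which is how holes and
    line breaks are skipped. Returns the final stack pointer offset.
    """
    for kind, arg in ops:
        if kind == "push":
            sp = (sp - 4) & 0xFFFF
            for i, value in enumerate(arg):
                vram[(sp + i) & 0xFFFF] = value
        elif kind == "adjust":
            sp = (sp + arg) & 0xFFFF
        else:
            raise ValueError(f"unknown op: {kind}")
    return sp


# Hypothetical 12x2 sprite: row 0 has a 4-pixel hole in the middle.
row1 = bytes(range(0x30, 0x3C))
ops = [
    ("push", b"\x22" * 4),       # row 0, right-hand four pixels
    ("adjust", -4),              # skip the hole, no per-pixel test needed
    ("push", b"\x11" * 4),       # row 0, left-hand four pixels
    ("adjust", WIDTH + 12),      # line break: jump to the end of row 1
    ("push", row1[8:12]),
    ("push", row1[4:8]),
    ("push", row1[0:4]),
]

vram = bytearray(VRAM_SIZE)
x, y = 10, 5
start_sp = y * WIDTH + x + 12    # just past the right edge of row 0
final_sp = run_sprite(ops, vram, start_sp)
```

The hole costs one pointer adjustment regardless of its length, and the bytes under it are left untouched.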
This worked really nicely, except when there was a timer interrupt. This usually happens 18.2 times per second in plain DOS and there was no predicting it. You can set it faster than 18.2 times per second, but not slower. Whenever there was a timer interrupt while a sprite was being written, the interrupt routine would write some junk to the stack, which ended up in video memory as randomly-coloured pixels, and all the subsequent rows of my sprite would be off. So what I had to do was to disable RAM refresh while a sprite was being written, because on our hardware that would disable the timer interrupt, too. Except that one shouldn't leave it disabled for too long, because one would end up with memory errors. But for one sprite at a time it was fine. Of course this was utterly incompatible with EMM386, Windows, or DOS extenders, because virtual 8086 mode didn't allow putting the stack in random locations where it didn't belong.
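The "faster but not slower" limit on the timer falls out of simple arithmetic, sketched below in Python. It assumes the standard PC timer chip (8253/8254) with its 1,193,182 Hz input clock and a 16-bit divisor where a programmed value of 0 counts as 65536; the function name `timer_rate_hz` is made up for this example.

```python
PIT_CLOCK_HZ = 1_193_182  # input clock of the standard PC interval timer


def timer_rate_hz(divisor: int) -> float:
    """Interrupt rate for a 16-bit timer divisor.

    A programmed divisor of 0 is treated by the chip as 65536, which is
    the largest possible divisor and therefore the slowest rate.
    """
    if not 0 <= divisor <= 0xFFFF:
        raise ValueError("divisor must fit in 16 bits")
    effective = divisor if divisor != 0 else 0x10000
    return PIT_CLOCK_HZ / effective


slowest = timer_rate_hz(0)
print(round(slowest, 1))  # 18.2
```

Every other divisor gives a higher rate, so 18.2 interrupts per second is the floor.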
It was a big hack, but when it worked, the graphics were really, really fast. 🙂