ESP8266 GPIO output performance

posted by on 2015.05.14, under collected, electronics, programming

While building an extreme feedback device utilizing the ESP8266 and a bunch of WS2812B LEDs I missed some detailed information about the GPIO output performance of the ESP8266. This was a more general demand – but I ended up with my own WS2812 driver. And it was fun to use NOPs to achieve a nearly perfect timing (… cause I still remember the good old days when we used NOPs to create awesome rasterbars on the C64).

setup / basics

The XTENSA lx106 runs at 80 Mhz and has interrupts and watchdog disabled during measurement (see below).
Excerp from – I/O addresses used to control the GPIO hardware:

0x60000304 - set GPIO pin HIGH
0x60000308 - set GPIO pin LOW
0x60000310 - set GPIO pin direction output
0x60000314 - set GPIO pin direction input

Xtensa calling convention

*** this part is just here for completeness ***

The lx106 used in the ESP8266 implements the CALL0 ABI offering a 16-entry register file (see Diamond Standard 106Micro Controller brochure). By this we can apply the calling convention outlined in the Xtensa ISA reference manual, chapter 8.1.2 CALL0 Register Usage and Stack Layout:

a0 - return address
a1 - stack pointer
a2..a7 - arguments (foo(int16 a,long long b) -> a2 = a, a4/5 = b), if sizefof(args) > 6 words -> use stack
a8 - static chain (for nested functions: contains the ptr to the stack frame of the caller)
a12..a15 - callee saved (registers containing values that must be preserved for the caller)
a15 - optional stack frame ptr

Return values are placed in AR2..AR5. If the offered space (four words) does not meet the required amount of memory to return the result, the stack is used.

disabling interrupts

According to the Xtensa ISA (Instruction Set Architecture) manual the Xtensa processor cores supporting up to 15 interrupt levels – but the used lx106 core only supports two of them (level 1 and 2). The current interrupt level is stored in CINTLEVEL (4 bit, part of the PS register, see page 88). Only interrupts at levels above CINTLEVEL are enabled.

In esp_iot_sdk_v1.0.0/include/osapi.h the macros os_intr_lock() and os_intr_unlock() are defined to use the ets_intr_lock() and ets_intr_unlock() functions offered by the ROM. The disassembly reveals nothing special:

disassembly – ets_intr_lock() and ets_intr_unlock():

40000f74:  006320      rsil  a2, 3           // a2 = old level, set CINTLEVEL to 3 -> disable all interrupt levels supported by the lx106
40000f77:  fffe31      l32r  a3, 0x40000f70  // a3 = *mem(0x40000f70) = 0x3fffcc0
40000f7a:  0329        s32i.n  a2, a3, 0       // mem(a3) = a2 -> mem(0x3fffdcc0) = old level -> saved for what?
40000f7c:  f00d        ret.n

40000f80:  006020      rsil  a2, 0           //enable all interrupt levels
40000f83:  f00d        ret.n

To avoid the overhead of the function call and the unneeded store operation the following macros to enable / disable interrupts can be used:

macros for disabling/enabling interrupts:

#define XT_CLI __asm__("rsil a2, 3");
#define XT_STI __asm__("rsil a2, 0");

NOTE: the ability to run rsil from user code without triggering a PriviledgedInstruction exception implies that all code is run on ring0. This matches the information given here

the watchdog

Just keep it disabled. I run into a lot of trouble with it – seemed the wdt_feed() didn’t work for me.
…and its not (well) documented at all.

Some pieces of information I found in the net:

gpio_output_set(uint32 set_mask, uint32 clear_mask, uint32 enable_mask, uint32 disable_mask)

declared in: esp_iot_sdk_v1.0.0/include/gpio.h
defined in: ROM (eagle.rom.addr.v6.ld -> PROVIDE ( gpio_output_set = 0x40004cd0 ))

C example:

gpio_output_set(BIT2, 0, BIT2, 0);  // HIGH
gpio_output_set(0, BIT2, BIT2, 0);  // LOW
gpio_output_set(BIT2, 0, BIT2, 0);  // HIGH
gpio_output_set(0, BIT2, BIT2, 0);  // LOW

disassembly – call to gpio_output_set(BIT2, 0, BIT2, 0):

40243247:       420c            movi.n  a2, 4                                     // a2 = 4
40243249:       030c            movi.n  a3, 0                                     // a3 = 0
4024324b:       024d            mov.n   a4, a2                                    // a4 = 4
4024324d:       205330          or      a5, a3, a3                                // a5 = 0
40243250:       f79001          l32r    a0, 40241090 <system_relative_time+0x18>  // a0 = *mem(40241090) = 0x40004cd0
40243253:       0000c0          callx0  a0                                        // call 0x40004cd0 - gpio_output_set

disassembly – gpio_output_set (thanks to By0ff for the ROM dump):

> xtensa-lx106-elf-objdump -m xtensa -EL  -b binary --adjust-vma=0x40000000 --start-address=0x40004cd0 --stop-address=0x40004ced -D 0x4000000-0x4011000/0x4000000-0x4011000.bin

0x4000000-0x4011000/0x4000000-0x4011000.bin:     file format binary

Disassembly of the .data section:

40004cd0 <.data+0x4cd0>:
40004cd0:       f0bd61          l32r    a6, 0x40000fc4  // a6 = *mem(0x40000fc4) = 0x60000200
40004cd3:       0020c0          memw                    // finish all mem operations before next op
40004cd6:       416622          s32i    a2, a6, 0x104   // mem(a6 + 0x104) = a2 -> mem(0x60000304) = 4 (SET)
40004cd9:       0020c0          memw
40004cdc:       426632          s32i    a3, a6, 0x108   // mem(a6 + 0x108) = a3 -> mem(0x60000308) = 0 (CLR)
40004cdf:       0020c0          memw
40004ce2:       446642          s32i    a4, a6, 0x110   // mem(a6 + 0x110) = a4 -> mem(0x60000310) = 4 (DIR -> OUTPUT)
40004ce5:       0020c0          memw
40004ce8:       456652          s32i    a5, a6, 0x114   // mem(a6 + 0x114) = a5 -> mem(0x60000314) = 0 (DIR -> INPUT)
40004ceb:       f00d            ret.n                   // return to the caller

> od -A x -j 0xfc4 -N 4 -x 0x4000000-0x4011000/0x4000000-0x4011000.bin
000fc4 0200 6000            // *mem(0x40000fc4) = 0x60000200



The whole cycle of a HIGH-RISE/LOW/HIGH-RISE transition takes 1160 nano seconds – the execution of one call to gpio_output_set() takes ~580ns (~46 cycles@80Mhz). Since the clear operation is executed after the set operation (see the code above) the LOW period is slightly shorter then the HIGH period (HIGH: 675ns, LOW: 485ns). By setting an initial LOW GPIO to HIGH and LOW in the same command a short pulse of 88 nano seconds length (~7 cycles) is created.

The macro GPIO_OUTPUT_SET(gpio_no, bit_value) – defined in esp_iot_sdk_v1.0.0/include/gpio.h – is just a wrapper for gpio_output_set():

#define GPIO_OUTPUT_SET(gpio_no, bit_value) \
    gpio_output_set(bit_value<<gpio_no, ((~bit_value)&0x01)<<gpio_no, 1<<gpio_no,0)

WRITE_PERI_REG(addr, val)

Macro defined in esp_iot_sdk_v1.0.0/include/eagle_soc.h:

#define WRITE_PERI_REG(addr, val) (*((volatile uint32_t *)ETS_UNCACHED_ADDR(addr))) = (uint32_t)(val)

It boils down to the following assembly instructions:

4024323a:       080000          excw
4024323d:       600003          excw                // --> *mem(4024323c) -> 0x60000308

40243240:       000304          excw
40243243:       f03d60          subx8   a3, a13, a6 // --> *mem(40243240) -> 0x60000304

// set
40243252:       fffb21          l32r    a2, 40243240 <eagle_lwip_getif+0x28>    // a2 = 0x60000304
40243255:       230c            movi.n  a3, 4                                   // a3 = 4
40243257:       0020c0          memw
4024325a:       0239            s32i.n  a3, a2, 0                               // mem(a2 + 0) = a3 -> mem(0x60000308) = 4

// clear
4024325c:       fff821          l32r    a2, 4024323c <eagle_lwip_getif+0x24>    // a2 = 0x60000308
4024325f:       430c            movi.n  a3, 4                                   // a3 = 4
40243261:       0020c0          memw
40243264:       0239            s32i.n  a3, a2, 0                               // mem(a2 + 0) = a3 -> mem(0x60000308) = 4   

To avoid optimization by the compiler I used the following hand crafted code for the measurement:

__asm__("movi    a2, 0x60000304  \n" // will be converted to literal load - l32r    a2, 40243240 <eagle_lwip_getif+0x28>
    "movi.n  a3, 4       \n"
    "memw           \n"
    "s32i.n  a3, a2, 0     \n"

__asm__("movi    a2, 0x60000308  \n"
    "movi.n  a3, 4       \n"
    "memw           \n"
    "s32i.n  a3, a2, 0     \n"

The disassembly shows that the movi is converted to a literal load (as expected):

40243248:       fffd21          l32r    a2, 4024323c <eagle_lwip_getif+0x24>
4024324b:       430c            movi.n  a3, 4
4024324d:       0020c0          memw
40243250:       0239            s32i.n  a3, a2, 0
40243252:       fffb21          l32r    a2, 40243240 <eagle_lwip_getif+0x28>
40243255:       430c            movi.n  a3, 4
40243257:       0020c0          memw
4024325a:       0239            s32i.n  a3, a2, 0



The whole cyle of a HIGH-RISE/LOW/HIGH-RISE transition takes 237ns nano seconds with a HIGH-period
of 100 ns (8 cycles) and a LOW-period of 137 ns (11 cycles – the HIGH-period is three cycles shorter
then the LOW period – maybe caused by one additional instruction fetch).

By avoiding the mov and removing the memw operations I was able to generate pulses with a period time of 150ns (12 cycles, HIGH: ~52ns, LOW: ~98ns).

__asm__("movi    a2, 0x60000304  \n"
    "movi    a4, 0x60000308  \n" 
    "movi.n  a3, 4       \n"  // GPIO2
    "memw           \n"
    "s32i.n  a3, a2, 0     \n"
    "s32i.n  a3, a4, 0     \n"
    "s32i.n  a3, a2, 0     \n"
    "s32i.n  a3, a4, 0     \n"
    "s32i.n  a3, a2, 0     \n"
    "s32i.n  a3, a4, 0     \n"

fastest - T: ~150ns

fastest – T: ~150ns

WS2812B timing

…this was the starting point that forced me to have a deeper look into that topic. There are already some implementations for controlling the WS2812B. They work – just use them. This part is for fun and education … ahh, lets do it in plain assembly.

static inline void WS2812B_SEND_1(int port) 
  //800ns HIGH & 450ns LOW
  __asm__ volatile ("movi    a2, 0x60000304  \n"
            "movi    a3, 0x60000308  \n"
            "movi    a4, %0      \n"
            "s32i    a4, a2, 0     \n"
            "memw            \n"
            "movi    a5, 14         \n"
            "3:            \n"
            "addi    a5, a5, -1    \n"
            "bnez    a5, 3b      \n"
            "nop.n           \n"
            "nop.n           \n"
            "nop.n           \n"
            "nop.n           \n"
            "nop.n           \n"
            "s32i    a4, a3, 0     \n"
            "memw            \n"
            "movi    a5, 2         \n"
            "4:            \n"
            "addi    a5, a5, -1    \n"
            "bnez    a5, 4b      \n"
           :: "g" (port)
           : "a2", "a3", "a4", "a5"

ws2812b - send logical 1

ws2812b – send logical 1

SEND_1_HIGH: 796ns
SEND_1_LOW:  454ns

static inline void WS2812B_SEND_0(int port) 
  //400ns HIGH & 850ns LOW
  __asm__ volatile ("movi    a2, 0x60000304  \n"
            "movi    a3, 0x60000308  \n"
            "movi    a4, %0      \n"
            "s32i    a4, a2, 0     \n"
            "memw            \n"
            "movi    a5, 7         \n"
            "1:            \n"
            "addi    a5, a5, -1    \n"
            "bnez    a5, 1b      \n"
            "nop.n           \n"
            "s32i    a4, a3, 0     \n"
            "memw            \n"
            "movi    a5, 10         \n"
            "2:            \n"
            "addi    a5, a5, -1    \n"
            "bnez    a5, 2b      \n"
            "nop.n           \n"
            "nop.n           \n"
              "nop.n           \n"
           :: "g" (port)
           : "a2", "a3", "a4", "a5"

ws2812b - send logical 0

ws2812b – send logical 0

SEND_0_HIGH: 397ns
SEND_0_LOW: 855ns


  1. Add a ~400 ohm resistor into the WS2812 data input line to avoid oscillation caused by reflections of the input of the first LED.

    write_peri_reg() - no resistor

    write_peri_reg() – no resistor

  2. Use decoupling capacitors – without I got strange noise on the power supply line.

    pwr supply - noise bursts

    pwr supply – noise bursts

    pwr supply - single bursts

    pwr supply – single bursts

  3. Ensure the WS2812 data line uses a proper signal level. According to the data sheet DATA_IN is treated as HIGH above 3.5V (0.7*Vdd) and LOW below 1.5V (0.3*Vdd)). The Vhigh of 3.3V offered by the ESP9266 is not enough for the WS2812B to be detected as a clean HIGH signal (I got some strange flickers – after adding a 4050 as level shifter everything was fine).
  4. Disable interrupts during bit-banging the WS2812. Avoid disabled interrupts for more then 10ms or wireless connections will act wired. Keeping that in mind you will be able to write ~300 LEDs at once (1250ns per bit, 8 bit per color, 3 colors = 30us, internal overhead when switching to the next pixel = 225ns, keep 50us between each write for reset condition). If you need to run the code from above from flash check the timing using an oscilloscope. I have seen NOPs taking more then one cycle…
  5. Wear sunscreen.

Useful links

multimeter GVA-18B protocol & dump tool

posted by on 2012.02.04, under electronics, programming

A wile ago I bought an inexpensive multimeter on ebay: G VA 18 B. For ~30€ it comes with autorange measurement for voltage, current, frequency, resistant, capacity and … temperature (with an internal and external sensor). The meter has serial interface based on an infrared diode on the top.The connection to the pc is done with a CP2030-based serial-usb adapter cable (also in the package).

GVA18B, also sold as VA18

GVA18B, also sold as VA18

Unhappily the offered software won’t work for me – it does not see the virtual COM-port (and it only works on Windows). So I decided to write my own. I connected to the meter via putty – and got only binary crap. With the help of some lines C# (I decided to train my C#-„skills“ – can’t remember the reason) to dump the output in hex/binary and some sample data I figured out the protocol… great hardware but ugly protocol. It seems our friends in HongKong simply map the data of the display controller to some bytes… The facts:

Every sample is decoded into 14 bytes. The high-nibble of each byte contains the position within the stream (bits 4,5,6,7; P={16,32,48,64,..,224} or shifted by 4 bits P={1,..14}). The bytes 1, 10, 11, 12, 13, 14 containing control informations about the selected unit, the range and further functions. The table below shows witch bit in each byte is set for a specific function (high nibble set to zero, first number is decimal value, in brackets the relevant bits):

Each numeric position on the display is encoded into a pair of bytes [1,2], …, [7,8]. The association is given in the following graphic. Bytes 1,3,5,7 represent position „a“, 2,4,6,8 position „b“:

The meaning of Bit 3 in byte „a“ depends on the position of the digit. For the leftmost digit it indicates a leading minus. For the other positions it indicates that the digit is the first part of the fraction.

In the attached code you find a class that handles all the encoding stuff – GVA18BProtocolDecoder. You can drop the received data into it – it does the rest. For simple use it offers an interface for registering a handler that is called when new data arrives. The data can then be fetched by using some convenience methods. The dump2display does what is names – it simply puts the data on the screen (and shows how to use the decoder). And yes, the code is over sized and not very sexy – yet. But it is under the GPL3 – use it as you can: GVA18BDataDump

collected: svn relocate & add all subdirs but not…

posted by on 2011.11.14, under general, programming

After some changes in our infrastructure@work svn failed to find the server… (svn info shows you the url). huh, seen&solved years ago… and now here so I can’t miss the information in the future.

~>svn switch --relocate <old_addr> <new_addr>

I also found a piece of old script I wrote years ago for adding subdirs in svn but not a given set of name directorys and file-extensions:


# script add all but not
# add all files not versioned in and below the current dir to the svn
# exeptions: ignore_dir and ignore_file_suffix (seperated by |)

#ignore directory paths with names...
#ignore files with names...

list=`svn status | grep ? | awk {'print $2'}`

for l in $list

  to_add=`echo "$l \<($ignore_dir)\> \<($ignore_file_suffix)\>$" \
      | awk '{
        where = match($1, $2)
          if (!where) {
            where = match($1, $3)
              if (!where)
                print 1
                print 0
          } else
            print 0   

  if [ $to_add -gt 0 ] ; then 
    echo "+++ADDITION+++"$l
    svn add $l
    echo "---IGNORE---"$l


By commenting out the „svn add$I“ you can do a „try run“.


posted by on 2011.10.31, under programming

I searched for a tool for managing project tasks for me and my colleagues – and I found – a great service offered by the 6 Wunderkinder GmbH. It’s a free task list manager that runs in a browser or – as a rich client – on nearly every (common) platform.

But I missed something like a dedicated page that shows only one auto-updated task list so I can display it on a big screen in our office.

But hey – this is the world of web-services – everyone can access any data in any needed way. Due to the SOP there is no way to call the AJAX-service directly from my page – but with a little help of a JSON-P-alike php-script for reading the data the „magic“ is done. Hint: the script returns all lists with active entries. A simple html/javascript-page is used for displaying the (selected) data. You can test it under

The needed files can be downloaded here.

(Don’t forget to set your email/pwd in getTaskList.php and the name of the list in the view-page index.html)

Edit: update the view-page to look a little…
Edit2: update the view-page – now has a css-progressbar/shows time&date…

collected: TI Chronos display addressing

posted by on 2011.04.12, under collected, programming, TI Chronos

Update: Got time to play again with the Chronos and … huuu, fu**. There was a big mistake in the old version – all bits must be  (old+1)%8… its corrected now.

This is a „cheat sheet“ that contains the assignment of the LCD_B memory (0x0a20 to 0x0a2b) and its bits to the wristwatch’s display elements.

src: base image taken from SLAU292C (p.77, LCD segment map)

as .pdf
as zipped svg