Paged Vector

date: 2023-10-08
author: Peterino

A wild data structure appeared! - Paged Vectors

I haven't found an official name for them yet, but I have seen them in quite a few places, so I'm going to call them "Paged Vectors". The idea is simple. Let me walk you through a common issue with "normal" vectors (or dynamic arrays, or array lists) in programming.

We all know a normal vector is just an array of contiguous data. When you write to it, it grows. But what happens when you reach the end of the vector's allocated buffer?

What happens in 90% of dynamic array implementations goes a little like this:

  1. A new buffer is allocated (usually 2x larger than the previous buffer).

  2. Data is copied from the old buffer to the new one.

  3. The old buffer is then invalidated and destroyed.

So when we hit this resize event, two extremely expensive operations are conducted. Both the allocation and the copy are potentially incredibly expensive for very large collections.

Not to mention that during the interim, while both the old and new buffers are alive, you're sitting on a lot of extra memory for no reason.
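To make that cost concrete, here's a small Python sketch (a toy counter, not any real vector implementation) that tallies how many element copies a doubling strategy performs:

```python
class DoublingVector:
    """Toy dynamic array that doubles its buffer on overflow, counting copies."""

    def __init__(self):
        self.buf = [None]   # start with capacity 1
        self.len = 0
        self.copies = 0     # elements moved during resize events

    def append(self, x):
        if self.len == len(self.buf):
            new_buf = [None] * (len(self.buf) * 2)  # 1. allocate a 2x buffer
            for i in range(self.len):               # 2. copy everything over
                new_buf[i] = self.buf[i]
            self.copies += self.len
            self.buf = new_buf                      # 3. the old buffer is dropped
        self.buf[self.len] = x
        self.len += 1

v = DoublingVector()
for i in range(100_000):
    v.append(i)
# the final resize alone moved 65,536 elements in one shot
```

Amortized over all appends that's cheap, but the individual resize events are exactly the spikes described above.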

Layout of a paged vector

I first came across the idea after seeing how Unreal Engine internally stores visual log data. They didn't have it as a dedicated standalone data structure, rather it was embedded in some internal logging structure.

But it was a cool enough idea for me to spend an afternoon coding it up and documenting it.

The paged vector looks a little bit different from our normal vector. Basically it's a "vector of fixed-size vectors".

Rather than resizing the whole buffer that has been allocated, all we do when we run out of space is allocate one new fixed-size page.

This has several interesting positives:

  1. Resizing is no longer an expensive allocate-copy-free; it's just an allocate.
  2. Because the pages are fixed size, indexing is a constant-time divide/modulo operation (with power-of-two page sizes, just a shift and a mask).
  3. Because elements within a page are contiguous, linear iteration is still cache friendly.
  4. Memory usage grows linearly, so we can predict whether we're going to OOM in the future.
  5. Pages can be sized according to whatever is natural for the operating system, reducing fragmentation.

And there's one more tiny benefit:

POINTERS ARE NEVER INVALIDATED
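Here's a minimal sketch of the layout in Python (illustrative only; the page size and names here are my own choices, not taken from any particular implementation):

```python
PAGE_SIZE = 1024  # fixed page size; a real version might match the OS page size

class PagedVector:
    """A vector of fixed-size pages: growth allocates, never copies or moves."""

    def __init__(self):
        self.pages = []  # each entry is one fixed-size page
        self.len = 0

    def append(self, value):
        if self.len % PAGE_SIZE == 0:
            self.pages.append([None] * PAGE_SIZE)  # resize is just an allocate
        self.pages[self.len // PAGE_SIZE][self.len % PAGE_SIZE] = value
        self.len += 1

    def __getitem__(self, i):
        if not 0 <= i < self.len:
            raise IndexError(i)
        # constant-time divide/modulo indexing; elements never move,
        # so a reference into a page stays valid forever
        return self.pages[i // PAGE_SIZE][i % PAGE_SIZE]

v = PagedVector()
for i in range(5000):
    v.append(i * i)
```

Because elements are written into a page once and never relocated, the pointer-stability property above falls out for free.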

If you're interested, I've got a battle-tested and unit-tested implementation of it up on GitHub:

https://github.com/peterino2/p2-algorithms/blob/integration/structures/paged-vector.zig

(It's Zig, but you could convert this code to C, C++, or any other language very easily.)


Making fast buffered logging.

date: 2023-02-27
author: Peterino

For those of you who don't know, I've been working on a game engine completely from scratch in my (very limited) spare time. Its name is NeonWood.

Oddly enough I think this is the first actual post I've ever made about something I've made in zig... interesting.

I was debugging my multithreaded job dispatch system in my game engine, and it seemed to be doing fine, dispatching jobs to decode some 8K textures from highly compressed PNGs.

Now I'm not the brightest tool in the shed when it comes to multithreading, but something seems really off here. This is the time it takes to dispatch a job.

I dunno about you guys, but.. 1.75 ms to dispatch a task is kind of unacceptable.

Why was it even this bad? Let's see what the code is doing.

var z = tracy.ZoneN(@src(), "AssetReferenceSys LoadRef");

core.engine_log("loading asset {s} ({s}) [{s}]", .{ asset.name.utf8, asset.assetType.utf8, asset.path });
try self.loaders.LoadAssetByType(asset.assetType, asset);

z.End();

Uh.. ok? It seems like all it's doing is creating a tracy zone and then calling the loading function?

I bet it's the engine_log. It's printing directly to stdout under the hood.

Sigh.. let's comment that out.

Now with the debug prints removed we get some much more reasonable dispatch times.

Honestly, I kind of hate this. I don't necessarily want to lose the convenience of logging.

But while working on the engine's internals I want to have a finger on the pulse when it comes to performance.

What to do..

Well, about a day later, I had massively reduced the time it takes to dispatch the job and massively improved the performance of debug prints in the engine as well.

Here it is loading the previous test.

Here it is running the entire test, and the time it takes for the first loader function to start executing on the worker thread.

The technique I'm using is to write to an in-memory buffer first. All my _log functions now kick the print request over to a logger system. This has a fallback to debug print if for some reason the logger print fails, or if it hasn't been initialized yet.

fn printInner(comptime fmt: []const u8, args: anytype) void {
    if (gLoggerSys) |loggerSys| {
        loggerSys.print(fmt, args) catch std.debug.print("!> " ++ fmt, args);
    } else {
        std.debug.print("> " ++ fmt, args);
    }
}

This logger system will periodically flush out pending writes in its write buffer. It has an internal buffer size of about 16k characters.

.writeOutBuffer = try std.ArrayList(u8).initCapacity(allocator, 8192 * 2),

When a write causes the buffered character count to go above about half of the buffer size, it forces a flush right away. In this situation the forced flush causes a stall in the 1 ms range.

if (self.writeOutBuffer.items.len > 8192) {
    try self.flush();
}

There's some stuff I can address with that later though.

The logger is of course also thread safe as it is guarded by a single mutex.

Fixing the stall.

That stall from flush() is still horrendously bad to me, but at least we're not paying the cost of it in the hot path anymore.

However, look at this situation.

This is really bad. We can get rid of it by double buffering: the expensive flush operation gets kicked out to a worker thread via a swap operation.

var swap = self.writeOutBuffer;
self.writeOutBuffer = self.flushBuffer;
self.flushBuffer = swap;

const L = struct {
    loggerSys: *LoggerSys,

    pub fn func(ctx: @This(), _: *core.JobContext) void {
        var z1 = tracy.ZoneN(@src(), "flushing output buffer");
        defer z1.End();
        ctx.loggerSys.flushFromJob() catch unreachable;
    }
};

try core.dispatchJob(L{ .loggerSys = self });

And of course all of this is done with zero allocations in the hot path.

But what about overflow?

Well this is a case where stalls must be handled immediately.

if (self.writeOutBuffer.items.len > 8192) {
    if (flushBufferLen == 0) {
        try self.flush();
    } else {
        try self.flushWriteBuffer();
    }
}

Basically, I add a separate synchronous flush function that immediately flushes out whatever is currently in the write buffer. This emergency synchronous flush is only used if:

  1. we are above our write buffer size for a single message, AND
  2. the flush buffer is still flushing.

I think those are acceptable situations for a stall.
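Putting the whole scheme together, here's a rough Python sketch of the double-buffered logger (the names are my own and raw threads stand in for the engine's job system; the real code is Zig):

```python
import io
import threading

FLUSH_THRESHOLD = 8192  # flush once buffered bytes exceed half the buffer size

class BufferedLogger:
    """Double-buffered logger: the hot path only appends to memory."""

    def __init__(self, sink):
        self.sink = sink              # e.g. an open file or sys.stdout
        self.write_buf = []           # hot path appends here
        self.flush_buf = []           # a worker thread drains this one
        self.lock = threading.Lock()
        self.flushing = False

    def log(self, msg):
        with self.lock:
            self.write_buf.append(msg)
            if sum(len(m) for m in self.write_buf) <= FLUSH_THRESHOLD:
                return
            if not self.flushing:
                # swap buffers and hand the full one to a worker thread
                self.write_buf, self.flush_buf = self.flush_buf, self.write_buf
                self.flushing = True
                threading.Thread(target=self._flush_job).start()
            else:
                # emergency synchronous flush: the worker is still busy,
                # so we accept a stall here rather than growing unbounded
                for m in self.write_buf:
                    self.sink.write(m)
                self.write_buf.clear()

    def _flush_job(self):
        for m in self.flush_buf:
            self.sink.write(m)
        with self.lock:
            self.flush_buf.clear()
            self.flushing = False

# usage sketch: small writes stay in memory until the threshold is crossed
log = BufferedLogger(io.StringIO())
log.log("hello")
```

Note this sketch doesn't guarantee output ordering between the worker flush and the emergency flush; the point is only the buffer-swap structure.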

There is just one mild issue left:

When the program crashes, anything still left in the buffer will not get dumped out. But there are lots of juicy details to talk about there; that will probably be a post for another time.


Just make it a zig package dude

date: 2023-02-27
author: Peterino



AI Behaviour Trees

date: 2023-01-31
author: Peterino

AI behavior trees can be broken into multiple assets.

Here's an example.

New Behavior tree runs NewBehaviourTree1. Statuses and events will propagate up from one to the other.

You can run subtrees dynamically with Run Dynamic Behavior.

This node for example will run a dynamic behavior tree.

From C++ there is a function called UBehaviorTreeComponent::SetDynamicSubtree.

This function takes in a gameplay tag and a reference to a UBehaviorTree asset.

Notice the Injection tag parameter.

All nodes matching this tag will have their behavior trees set by the SetDynamicSubtree call.

Putting it together, we can author a bunch of attacks using dynamic subtrees and run them all through two tasks.

The second task there will run multiple attack behaviours randomly now.

This is pretty awesome because it allows us to reuse the same attacks in various different situations.


Random Rust Snippets

date: 2021-04-12
author: Peterino

Static Strlist

const BROWSERS: &'static [&'static str] = &["firefox", "chrome"];

Singletons in Rust

use std::mem::replace;

struct SerialPort;

struct Peripherals {
    serial: Option<SerialPort>,
}

impl Peripherals {
    fn take_serial(&mut self) -> SerialPort {
        // moves the SerialPort out, leaving None behind
        // (equivalent to the more idiomatic `self.serial.take()`)
        let p = replace(&mut self.serial, None);
        p.unwrap()
    }
}

static mut PERIPHERALS: Peripherals = Peripherals {
    serial: Some(SerialPort),
};

Rust Macros

macro_rules! info {
    ($($arg:tt)*) => ({
        ::std::print!("[sear ]: ");
        ::std::println!($($arg)*);
    })
}

Random Python Snippets

date: 2020-11-09
author: Peterino

Automation Scripts and stuff

This contains some random Python snippets I keep forgetting but keep using.

import os
orig_dir = os.path.dirname(__file__)
os.chdir(orig_dir)
# now you can os.system like it's in the same folder

Removing stuff

import shutil
shutil.rmtree(muh_folder)  # hits harder than rmdir


EAN-8 Barcode, Code golf

date: 2018-03-25
author: Peterino


/*
An EAN-8 barcode includes 7 digits of information and an 8th checksum digit.

The checksum is calculated by multiplying the digits by 3 and 1 alternately,
adding the results, and subtracting from the next multiple of 10.

For example, given the digits 2103498:

Digit:        2   1   0   3   4   9   8 
Multiplier:   3   1   3   1   3   1   3
Result:   sum(6   1   0   3  12   9  24) = 55 

The sum of these resulting digits is 55, 
so the checksum digit is 60 - 55 = 5

The Challenge:

Your task is to, given an 8 digit barcode, verify if it is valid - returning a
truthy value if the checksum is valid, and falsy otherwise.

You may take input in any of the following forms:

  1. A string, 8 characters in length, representing the barcode digits.
  2. A list of 8 integers, the barcode's digits.
  3. A non-negative integer (you can either assume leading zeroes where none
     are given, i.e. 1 = 00000001, or request input with the zeroes given).

Builtins that compute the EAN-8 checksum (i.e., take the first 7 digits and
calculate the last) are banned. This is code-golf, so the shortest program
(in bytes) wins!

Test Cases

20378240 -> True
33765129 -> True
77234575 -> True
00000000 -> True

21034984 -> False
69165430 -> False
11965421 -> False
12345678 -> False


Solution as follows:
118 chars
By Peter Li
*/
int i,j,k;int main(int,char**v){for(;i<7;(j+=(v[1][i]-48)*(i++%2?1:3)));
    for(;(k+=10)<j;);return v[1][7]-48==(k-j)%10;}
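For reference, here's an un-golfed sketch of the same check in Python (not part of the golf entry):

```python
def ean8_valid(code: str) -> bool:
    """Check the EAN-8 checksum: weight digits 3,1,3,1,3,1,3 and compare."""
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(digits[:7]))
    # the checksum digit is the distance up to the next multiple of 10
    return digits[7] == (-total) % 10

assert ean8_valid("20378240")
assert not ean8_valid("21034984")
```

The `% 10` also covers the edge case where the sum is already a multiple of 10 (as in 00000000), where "next multiple minus sum" would otherwise give 10 instead of 0.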

Random C/C++ Snippets

date: 2018-03-15
author: Peterino

Cross platform fixed size printing

In order to write cross platform code with fixed sizes you often have to deal with uint32_t and other fixed size vars.

But printing them in a cross platform manner is a PITA.

printf("%u", (uint32_t)x); // the format and the type's width can mismatch across platforms

Use inttypes instead.

#include <inttypes.h>
printf("%" PRIu32, (uint32_t)x);

Hot Swapping Cpp Code (demo)

date: 2018-03-15
author: Peterino

A quick demo of an idea I had for a kind of plugin system that allows live hot-swapping of C++ modules. Turns out it's pretty easy to do so long as you have simple C-style interfaces between the executable and the library objects you link against. That is, so long as the executable doesn't ever directly reference objects whose implementations can change as development goes on.


Hardware NES Emulator Part 1: Video Output

date: 2018-02-22
author: Peterino

This is the first part of a series where I detail what makes up a game console emulator. The first few parts will cover the video and the GPU logic that allows you to draw to the screen. This encompasses the scope of what was probably the most enjoyable project I did during my undergrad; the report for it can be seen here.

The first step, in my opinion, of making any video game console is deciding how to draw to the screen. When I set out to make an NES on an FPGA, I was immediately overwhelmed before I even wrote a single line of SystemVerilog.

This is because working in hardware allows you an almost exhilarating amount of flexibility when it comes to making your choices. HDMI? VGA? Composite? Any kind of video output format you want, we can do it. Let's look at some of the most popular options available on most devices.

Let's say you have a frame buffer that contains RGB color values for each pixel on a physical screen. Let's see how composite, HDMI, and VGA would draw this frame buffer to the screen.

Composite video

Here's what the composite video looks like for a single line of grayscale data.

On each scanline the voltage gets pulled very low to the 'blanking level'. This signifies the start of each horizontal blank; after this blank occurs, a small period of setup happens. During this time the colorburst is sent and the level for 'black' is established. Then the device starts tracing the input data with intensity values proportional to the input voltage. The higher the voltage, the brighter the trace is at that instant in time.

When color is added, the gravy thickens quite a bit. A color burst happens during the setup time period. This causes the receiving device to synchronize a sine wave to the colorburst's phase. Then, when the actual tracing begins, a hue/saturation signal in the form of an amplitude/phase-modulated signal appears superimposed on the grayscale signal. The phase determines the hue while the amplitude determines the saturation.

In a nutshell Composite video's workflow is:

  1. Serialize each row of your array and sweep each individual value in that row.
  2. Intensity is transformed into an analog waveform of varying amplitude, corresponding to each pixel's value.
  3. This sweep happens for each line, and each one is called a scanline.
  4. A colorburst signal is sent at the beginning of each line and acts as a reference for a later modulated chroma signal on the line, with different phase lags signifying colors.
  5. There's also some horizontal and vertical blanking inserted for synchronization purposes.

This tech was originally designed for TV signals that flew over the air and into an antenna, from an era when analog electronics were much faster than digital. That's why it may feel a bit unintuitive, especially for somebody like me who feels much more at home in the world of computers and digital logic. It makes sense that it's a pretty complicated signal electrically. To properly create a composite adapter we would need:

  1. Digital to phase modulation for the Hue
  2. Digital to amplitude modulation for the Saturation
  3. DAC for the Intensity
  4. Oh and convert our RGB domain colors into HSI

There are ways to create simple versions of each piece. For example, the colorburst signal can be done via some crude DDS through a filtered square wave generated by a constant, clock-fed counter. Its amplitude can then be adjusted in a number of ways: a resistor ladder into a transistor gate, an op-amp with a digital potentiometer, maybe even a DAC with a digital modulator; the sky's the limit here.

So while doable, it is still quite complicated and would definitely add a couple of days to the project.

HDMI

What about HDMI? Well, despite being modern, fully digital, and loaded with features, its general workflow is actually surprisingly simple (if we choose to ignore the frills that are available):

  1. Convert each channel into 8-bit color (if it isn't 8-bit already).
  2. Encode each 8-bit color value into the 10-bit TMDS format.
  3. Juice up your pixel clock by 10x and shift out each 10-bit encoded color value for each pixel on the screen.
  4. Pulse the per-pixel clock.
  5. Repeat until the scanline is completed, add blank time for hsync, and repeat for each remaining scanline.
  6. Add blanking time for vsync and wait until the next frame draw.

It's got a pretty snazzy physical interface too, and it doesn't require nearly as much circuitry.

It's literally 4 differentially paired digital lines. All you would have to do is find some kind of breakout for it (Adafruit sells them pretty cheap). But my only problem with it is... I actually don't have any computer monitors at the lab that can take HDMI. But I do have a ton of old VGA monitors! Also some VGA headers from a previous project. So let's look at VGA as well.

VGA

Alright, VGA is kind of a hybrid between the two. (Kinda makes sense, considering it was most popular in the era directly after the one and shortly before the other.) It's 3 channels of analog output, but the time steps can be thought of as discrete and digital.

The workflow for VGA shakes out like this:

  1. For each row in your buffer, scan across each row.
  2. For each discrete value in each row output an analog voltage onto each of the color channels.
  3. Add time for hsync and vsync

This is great for pretty much the same reasons why HDMI is great: no complicated circuitry or wizardry required. However, it has the added benefit of a lower data rate, and it actually works with the crappy monitors I didn't mind bringing into the lab? Yes please! There's only one hiccup here, though: that little point about outputting an analog voltage onto each color channel.

Well, this is where we can take advantage of some of the properties of the video we are outputting.

For example, take a look at the NES's video. The colour palette has a total of 64 available colors. This means we don't actually need very good resolution. As you can see from my NES to 9-bit color conversion spreadsheet, 9 bits of color (3 bits per colour channel) offers some pretty darn accurate colors. This requires nothing more than 3 bits of a resistor divider network (the resistor values are available here).
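The digital half of that conversion is just keeping the top 3 bits of each channel. A quick sketch (hypothetical helper names, Python purely for illustration; in hardware each 3-bit channel feeds the resistor divider):

```python
def rgb888_to_rgb333(r, g, b):
    """Quantize 8-bit channels down to 3 bits each, packed into 9 bits."""
    return ((r >> 5) << 6) | ((g >> 5) << 3) | (b >> 5)

def rgb333_to_rgb888(c):
    """Expand back for previewing: replicate the 3 bits across the byte."""
    r, g, b = (c >> 6) & 7, (c >> 3) & 7, c & 7
    spread = lambda v: (v << 5) | (v << 2) | (v >> 1)
    return spread(r), spread(g), spread(b)

# pure white and black survive the round trip exactly
assert rgb888_to_rgb333(255, 255, 255) == 0b111111111
assert rgb333_to_rgb888(0) == (0, 0, 0)
```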

Furthermore VGA allows me to do some slightly nasty things such as smearing my scanlines and delaying my pixel outputs which would've been slightly more difficult to do in HDMI (more details on that in a different post).

Concluding the video output

But that doesn't mean VGA is the best choice, or even the right choice in the long run. HDMI is incredible in that it can output high-definition video without much more in hardware resources than a fast clock, unlike a VGA output, which requires a low-speed DAC. Composite video, meanwhile, is the most true-to-form option, as the original NES did indeed output composite video.

At the end of the day, the goal is to get video onto a screen. For this project it didn't really matter how, so the plan is to do whatever makes sense given the situation. In my case, both HDMI and VGA are excellent choices. They are both digital, widely used, flexible, and compatible with most devices out there. Disregarding frame buffers, they are both also quite cheap in terms of logic elements on an FPGA. In fact, it's easy to see why many commercial products just go ahead and offer both of them.


Hello World

date: 2018-01-24
author: Peterino

So here's my site, it's not much but my blogging needs aren't much either. It's a fairly simple flat CMS controlled through a series of Python Scripts that generate static HTML files. My goal here is to be as minimal as possible in terms of deployment requirements as this is running off of an extremely tiny server.

Update 2022:

As the years have gone on, I'm transitioning this site into more of a notebook. I've added some light Obsidian integration, so creating pages and posts is a breeze, with the ability to copy-paste screenshots as I keep working.

Markdown

The site is generated with markdown so here's a quick cheatsheet/reminder to myself on how that works.

// This is a comment
int x;
x += 2;

Code blocks are generated by enclosing with ~~~

Math equations don't work normally with the $ delimiter, but instead require:

\\[ 2 + 2 = 4 \\]

\[ 2 + 2 = 4 \]