Metal newBufferWithBytes usage - ios

I have a basic question about allocating new Metal device buffers. Most of the sample code I see creates an MTLBuffer at setup time and does not modify it. But if the vertex data changes on every render call, is it okay to create a new MTLBuffer each time (using -[MTLDevice newBufferWithBytes:length:options:]) to send data to shaders running on the GPU, or should an MTLBuffer of a given size be created once and its bytes modified on every render call? What is the recommended approach according to best practices?

If it is a small amount of data (under 4K) I believe you can use setVertexBytes() and setFragmentBytes():
For larger amounts of data that change every frame, Apple recommends a triple-buffered approach so that successive frames' accesses to the buffer do not interfere:
This tutorial shows how to set up triple buffering:
That is actually the third part of the tutorial, but it is the part that shows the triple-buffering setup, under "Reusing Uniform Buffers".
What I am unsure about is whether it is better/faster to use triple-buffering on small amounts of data as well -- they do it in that tutorial for only a couple of matrices, so maybe it is better to always triple buffer.
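As a rough illustration of the triple-buffered approach, the sketch below shows only the slot-rotation logic in plain C, with no Metal calls. The names FrameRing, ring_acquire and ring_release are hypothetical. In a real renderer each slot would be an MTLBuffer, the release would typically happen when the GPU signals that a frame has completed, and the CPU would wait on a semaphore rather than get the -1 return used here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FRAMES_IN_FLIGHT 3   /* triple buffering */
#define SLOT_BYTES 256       /* hypothetical per-frame data size */

/* Hypothetical stand-in for three GPU-visible buffers. */
typedef struct {
    unsigned char slots[FRAMES_IN_FLIGHT][SLOT_BYTES];
    size_t next;       /* slot the CPU writes next */
    int in_flight;     /* frames submitted but not yet finished by the GPU */
} FrameRing;

/* Copy this frame's data into the next free slot.
   Returns the slot index, or -1 if all slots are still in use
   (a real renderer would block on a semaphore here instead). */
static int ring_acquire(FrameRing *r, const void *data, size_t len)
{
    assert(len <= SLOT_BYTES);
    if (r->in_flight == FRAMES_IN_FLIGHT)
        return -1;
    int slot = (int)r->next;
    memcpy(r->slots[slot], data, len);
    r->next = (r->next + 1) % FRAMES_IN_FLIGHT;
    r->in_flight++;
    return slot;
}

/* Call when the GPU has finished the oldest submitted frame. */
static void ring_release(FrameRing *r)
{
    assert(r->in_flight > 0);
    r->in_flight--;
}
```

Because frames complete in submission order, the slot freed by ring_release is always the one ring_acquire hands out next, so the CPU never overwrites data the GPU may still be reading.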


DirectX 9.0 vertex buffer updating

I've made a small 3D viewer application using DirectX 9.0. Now I would like to add some editing functionality. Suppose that I want to edit a large polygon. While the user edits the shape, vertices would be added, deleted and moved. Right now each polygon is stored as a vertex buffer.
My question is, what is the best way to store and display the shape while the user edits it? If I destroy and re-create the vertex buffer each time a change happens, I believe it will be too resource intensive and suboptimal.
I thought I should ask my questions in the main post as well, so here they are:
I have three different tasks, two of which I don't know how to implement :
1) Editing a vertex : Easy, I will just update the old vertex with the new position in the VB.
2) Deleting a vertex : What happens here? How do I remove it without creating a new VB? Should I just add a blank VB?
3) Adding a vertex : What about here? Can I change the length of a dynamic VB and add vertices at the end?
Another thought here that I believe would work :
1) Editing is easy
2) Deleting would just mean that I will overwrite the deleted vertex with possibly the position of the previous or the next vertex, so it will not be visible.
3) Addition will create a new vertex buffer, but it will resize like a Vector or a List. Each time it is re-created its size would be something like
NewSize = OldSize * 1.1 (adding 10% more)
so that successive additions do not have to re-create the VB.
So 1 & 2 never create a new VB and 3 might sometimes create one. How does this sound?
You don't have to destroy the buffer every frame. Create a dynamic buffer.
Check out the article Using Vertex Buffers With DirectX, if you haven't already:
For dynamic vertex buffers, which contain information about primitives that change often in the scene, we need to specify the D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY flags for the Usage and the D3DPOOL_DEFAULT flag for the Pool.
Remember though, the data still has to be sent to the pipeline for every update. So there will definitely be some performance cost compared to static buffers with consistent data.
Also you should really try keeping buffer updates and also buffer switches to a minimum. Are there many such editable polygons in your application? If the answer is yes, maybe consider putting them into one buffer.
The official Q/A site for graphics/game developers:
Sounds pretty good to me.
Dynamic Vertex Buffers, on the other hand, are filled and tossed away every frame.
This is taken from the article and is pretty much the answer for 1. and 2.. Instead of updating single vertices or carefully selecting which of them should be overwritten I would just update the whole buffer content. The complete buffer has to be sent to the device anyways. You would have to test it though, just to be sure.
Concerning 3.: You cannot change the buffer's size after it has been created, but you can create a larger buffer than actually needed. So try to estimate a good margin. If the buffer is still too small you will have to create a new one. There is no other solution to this. You have already found a possible algorithm for increasing the buffer size dynamically.
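The growth policy from the question (about 10% extra on each re-creation) could be sketched like this. The function name grow_capacity is hypothetical and the sketch only computes the new capacity in vertices. Creating the larger buffer and refilling it is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Return the capacity (in vertices) a vertex buffer should have so that it
   holds at least `needed` vertices. Keeps the current capacity when it is
   large enough; otherwise grows by about 10%, as proposed in the question,
   and never returns less than `needed`. */
static size_t grow_capacity(size_t current, size_t needed)
{
    if (needed <= current)
        return current;                          /* no re-creation required */
    size_t grown = current + current / 10 + 1;   /* ~1.1x, always > current */
    return grown > needed ? grown : needed;
}
```

If the returned value equals the current capacity, the existing buffer can simply be refilled. A different value means a new buffer of that size has to be created.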
There are so many parameters when it comes to graphics performance, it is nearly impossible to give definite answers. Are you already encountering limits? If not, do not bother too much. Be generous with your resources while you are still developing.

D3D Performance comparison, shaders vs built in shading

I have a running 3D engine built in D3D (via SlimDX). To avoid interrupting the rendering pipeline I have batched together many objects with the same material into bigger meshes (to reduce state switching). This has worked well and gives a good performance level for my needs.
The issue I am running into is that I need to change material properties on some subsets of those larger batched meshes at runtime. I'm using the attribute buffer for that and it has been working reasonably well. Until now I have used a limited number of active attributes (about 5 per mesh), but I now need many more material variations (different shades of opacity/color blends) and thus end up with possibly a hundred or more combinations. And since those changes happen at runtime I can't bundle them together before rendering starts. Sure, I could reconstruct the meshes, but I'd rather not since it is slow, and switching back and forth between materials needs to happen at interactive speeds.
So my question is, what is the best route to take?
Should I implement a more robust attribute handling system that dynamically masks faces with available attribute IDs on demand and then resets them when done? I have heard that fragmentation in the attribute buffer adds a performance hit, and I am also unsure about the performance hit of subsequent DrawSubset() calls with material switches in between (i.e. when is it too much, and when should I optimize my attribute arrays?). Does anyone have experience with this?
My other idea is to use a parametrized pixel shader. I don't need any fancy effects, just the bare minimum (currently the built-in flat shader with color only and transparency on some objects), so shader model 1 is more than enough for my needs. The idea here is to use one all-purpose shader and, instead of switching material between calls, just alter some shader parameters. But I don't know if this is faster than switching materials and/or if programmable shaders are slower than the built-in ones (given the same result).
I'm also curious about the difference in performance hit between switching mesh or drawing different subsets in one big mesh (given the same number of material switches for both cases).
I understand that the answers might differ somewhat between GFX cards and their respective performance/age, but I'm just looking for general guidelines on where to focus most effort (i.e. what type of state switches/CPU interference gives the biggest GPU hit). Memory is also a concern, so any implementation that duplicates whole meshes (or large parts of them) is not possible for me.
My focus is performance on older(5y)/less capable/integrated GFX cards and not necessarily top of the line gamer cards or work station cards (like Quadro). Which I guess could make or break the solution using shaders depending on how good the shader performance is on a particular board.
Any and all suggestions and feedback are greatly appreciated.
Many thanks in advance!
Altering shader parameters will be just as slow. Ideally you want to write a Shader Model 2 based shader that uploads a large section of the attribute buffer to the graphics card. You then have a per-vertex attribute field which can select the appropriate attribute buffer.
Your performance problems will come from the number of draw calls you use. The more draw calls you use, the more performance suffers. Any change of shader constants or texture will require a new DIP (DrawIndexedPrimitive) call. What you want to do is minimise the number of shader constant modifications AND the number of DIP calls.
It becomes quite an involved process, mind.
For example if you are processing a skeletal model with 64 bones in it then you have 2 options. 1 you set the world matrix for each bone's mesh data and call DIP. OR you load as many of the bone matrices as you can in one go and use a value on the vertex to select which bone it will use and call DIP once. The second WILL be faster. You will probably find you can do some multi-bone based skinning quite easily this way as well.
This goes for anything that causes a shader constant change. Do note that, even though you may use the fixed function pipeline, most modern graphics hardware (i.e. anything since the Radeon 9700 released in 2002) will translate fixed function to shader based, so the same performance issues apply.
Basically the thing to avoid above all else is anything that causes you to make another DIP call. Obviously it's impractical to avoid doing this for everything, and certain changes are less costly. As a rough rule of thumb the things to avoid, in order of expense, are as follows: (Be warned it's been a while since I tested this, so you may want to do some testing and alternative reading on the subject)
1) Change Shader
2) Change Texture
3) Change Shader constant
4) Change Vertex Buffer
5) Change Index Buffer
1 is by far the most expensive.
4 and 5 are pretty cheap by comparison to the rest but a different vertex format will likely cause bigger problems as it may cause the shader to get changed.
Edit: I'm not entirely sure why changing constants hurts so much. I would have thought such changes would pipeline well. Maybe on modern hardware it's not such a problem. On some early hardware the constants were compiled into the shader, so changing constants resulted in a full shader change.
As with anything, you are best off trying it and seeing what happens.
An interesting solution for sorting all your calls can be to use a sort key where the top 8 bits give you a shader id. The next 10 bits for texture and so on. You then perform a normal numeric sort and you have an easy way of playing with different sort orders to see what gives best performance :)
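A minimal sketch of such a sort key, assuming a 32-bit key with the shader id in the top 8 bits and the texture id in the next 10. The remaining 14 bits, the function names and the exact layout are assumptions for illustration, not something the answer specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 32-bit key layout: [31..24] shader id, [23..14] texture id,
   [13..0] free for further state (constants, buffers, ...). */
static uint32_t make_sort_key(uint32_t shader_id, uint32_t texture_id, uint32_t rest)
{
    assert(shader_id < (1u << 8));
    assert(texture_id < (1u << 10));
    assert(rest < (1u << 14));
    return (shader_id << 24) | (texture_id << 14) | rest;
}

/* Plain numeric comparison; the most expensive state sits in the top bits,
   so it changes least often after sorting. */
static int compare_keys(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

static void sort_draw_keys(uint32_t *keys, size_t count)
{
    qsort(keys, count, sizeof keys[0], compare_keys);
}
```

Trying a different sort order then only means moving fields to different bit positions in make_sort_key. The sort itself stays the same.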
Edit2: It's worth noting that anything that changes the pixel shader state is more costly than changing the vertex buffer state because it is deeper in the pipeline. Thus it takes longer for it to bubble through ...

unaligned memory accesses

I'm working on an embedded device that does not support unaligned memory accesses.
For a video decoder I have to process pixels (one byte per pixel) in 8x8 pixel blocks. The device has some SIMD processing capabilities that allow me to work on 4 bytes in parallel.
The problem is, that the 8x8 pixel blocks aren't guaranteed to start on an aligned address and the functions need to read/write up to three of these 8x8 blocks.
How would you approach this if you want very good performance? After a bit of thinking I came up with the following three ideas:
Do all memory accesses as bytes. This is the easiest way to do it, but it is slow and does not work well with the SIMD capabilities (it's what I currently do in my reference C code).
Write four copy-functions (one for each alignment case) that load the pixel-data via two 32-bit reads, shift the bits into the correct position and write the data to some aligned chunk of scratch memory. The video processing functions can then use 32 bit accesses and SIMD. Drawback: The CPU will have no chance to hide the memory latency behind the processing.
Same idea as above, but instead of writing the pixels to scratch memory do the video-processing in place. This may be the fastest way, but the number of functions that I have to write for this approach is high (around 60 I guess).
Btw: I will have to write all functions in assembler because the compiler generates horrible code when it comes to the SIMD extension.
Which road would you take, or do you have another idea how to approach this?
You should first break your code into fetch/processing sections.
The fetch code should copy into a working buffer and have cases for memory that is aligned (where you should be able to copy using the SIMD registers) and non-aligned memory where you need to copy byte by byte (if your platform can't do unaligned access, and your source/dest have different alignments, then this is the best you can do).
Your processing code can then be SIMD with the guarantee of working on aligned data. For any real degree of processing doing a copy+process will definitely be faster than non-SIMD operations on unaligned data.
Assuming your source & dest are the same, a further optimization would be to only use the working buffer if the source is unaligned, and do the processing in-place if the memory's aligned. The benefits of this will depend upon the characteristics of your data.
Depending on your architecture you may get further benefits by prefetching data before processing. This is where you can issue instructions to fetch areas of memory into the cache before they're needed, so you would issue a fetch for the next block before processing the current.
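The fetch/process split described in this answer might look roughly like the following in portable C. The AlignedBlock type and the function names are hypothetical, and memcpy stands in for whatever copy routine is fastest on the target. The SIMD processing stage is hardware specific and is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_DIM 8

/* Aligned working buffer: 8 rows of two 32-bit words each. */
typedef struct {
    uint32_t words[BLOCK_DIM][BLOCK_DIM / 4];
} AlignedBlock;

/* Fetch stage: copy an 8x8 block of bytes starting at `src` (any alignment,
   rows `stride` bytes apart) into the aligned working buffer. */
static void fetch_block(AlignedBlock *dst, const uint8_t *src, size_t stride)
{
    for (int row = 0; row < BLOCK_DIM; row++)
        memcpy(dst->words[row], src + (size_t)row * stride, BLOCK_DIM);
}

/* Store stage: write the working buffer back to a possibly unaligned block. */
static void store_block(uint8_t *dst, size_t stride, const AlignedBlock *src)
{
    for (int row = 0; row < BLOCK_DIM; row++)
        memcpy(dst + (size_t)row * stride, src->words[row], BLOCK_DIM);
}
```

Between fetch_block and store_block, the processing code can read and write dst->words with 32-bit accesses only.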
You can use memcpy (which if I recall can be optimized to perform word copies if possible) to copy to an aligned data structure (e.g. something allocated on the stack or from malloc). Then perform your processing on that aligned data structure.
Most likely, though, you'd want to handle things in your processor's registers and not in memory. How you'd approach your task depends on the capabilities of the hardware (e.g. can a 32-bit register be split into four 8-bit ones? What registers do the SIMD operations operate on?) If you're going the simple route, you can have a small loader function be called which performs your unaligned read(s) for you.
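Such a loader function could be sketched as below: two aligned 32-bit reads combined with shifts, which is also the idea behind option 2 in the question. The function name is hypothetical, and the sketch assumes little-endian byte numbering (byte 0 is the low byte of a word). A big-endian target would swap the shift directions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load the 32-bit value that starts `byte_offset` bytes into an array of
   aligned 32-bit words, using only aligned word reads.
   Assumes little-endian byte numbering (byte 0 is the low byte of a word). */
static uint32_t load_u32_shifted(const uint32_t *words, size_t byte_offset)
{
    size_t index = byte_offset / 4;
    unsigned shift = (unsigned)(byte_offset % 4) * 8;
    uint32_t lo = words[index];
    if (shift == 0)
        return lo;                   /* aligned case: single read */
    uint32_t hi = words[index + 1];  /* second aligned read */
    return (lo >> shift) | (hi << (32 - shift));
}
```

For the words 0x44332211 and 0x88776655, a load at byte offset 1 yields 0x55443322: the upper three bytes of the first word followed by the lowest byte of the second.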
Align the data first, and then take the aligned-SIMD approach.
This is less work than option 3, and with luck your code will be top-speed 25% of the time (i.e. the already-aligned case). You can happily re-use the code in future in situations where you know the input will be properly aligned.
Only if this doesn't work to your satisfaction should you consider hardcoding all four alignment possibilities into your functions.
I'd go with option 1) until you know that it's too slow (slow is ok, too slow is bad)
General advice: why don't you go with something that sounds reasonable (like #2) and then measure the performance? If it's not acceptable, you can go back to the drawing board.
Surely handcrafting 60ish functions in assembler before measuring would count as "premature optimization". :)

efficiency of storing in MTLBuffer

I want to store image float data in an unformatted memory location from within my shader, and MTLBuffer seems to be the answer. Firstly, is this possible? I saw in the Apple docs for MTLBuffer that one can access a pointer to the buffer's contents. Is it possible to use that pointer from within the shader to fill up the allocated memory? Secondly, I would like some idea of how fast this operation will be; will it be as fast and efficient as operating with textures?
I ask this as I need to re-engineer a lot of my code if it is confirmed that MTLBuffer's speed of access is comparable to operating with MTLTexture.
First, it's important to be aware of several restrictions:
Fragment shaders support writing neither to textures nor to buffers. Your only option is rendering to textures.
Vertex shaders support writes to buffers but not to textures.
So if you're using either of these shader types, you don't have a choice between texture writes and buffer writes, regardless of performance.
If you're using a compute shader and your write pattern is relatively simple (i.e. a 1-to-1 thread id to pixel correspondence), I'd expect buffer writes to be faster. That said, there is no good general advice, so the best solution is to try both and profile.

glDrawElements massive cpu usage on iOS

Hardware: iPad2
Software: OpenGL ES 2.0 C++
glDrawElements seems to take up about 25% of the CPU, making the CPU time 18 ms and the GPU time 10 ms per frame.
When I don't use an index buffer and use glDrawArrays instead, it speeds up and glDrawArrays barely shows up in the profiler. Everything else is the same, except that glDrawArrays has more verts because I have to duplicate verts in the VBO without the index buffer.
so far:
virtually the same amount of state changes between the two methods
vertex structure is two floats(8 bytes).
indexbuffer is 16bit(tried 32bit as well)
GL_STATIC_DRAW for both buffers
buffers don't change after load
the same VBO and the indexbuffer render multiple times per frame, with different offsets and sizes
no opengl errors
So it looks like it's doing a software fallback of some sort. But I can't figure out what would cause OpenGL to fallback.
There are a few things that immediately jump to mind that might affect speed the way you describe.
For one, many commands are issued passively to reduce the number of bus transfers. They are queued up and wait for the next batch transfer. State changes, texture changes, and similar commands all accumulate. It is possible that the draw commands are triggering a larger transfer in the one case but not in the other, or that you are triggering more frequent transfers in one case than in the other.
For another, your specific models might be better organized for one or the other of the draw calls. You need to look at how big they are, whether they reuse index values, and whether they are optimized or reordered for rendering. glDrawArrays may require more data to be transferred, but if your models are small the overhead may not be much of a concern.
Draw frequency also matters: you want to send off calls frequently to keep the card busy and let your CPU do other work rather than letting them accumulate in the command buffer waiting to be sent, but this needs to be balanced against the cost of those transfers.
And to top it off, indexed values can benefit from cache effects when they are frequently reused, while unindexed arrays can benefit from cache effects because they are accessed linearly, so you need to know your data, since different types of data benefit from different methods.
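To get a feel for how much extra data an unindexed draw can involve, here is a small size calculation. It assumes a hypothetical mesh, a regular grid of quads drawn as a triangle list, which is not necessarily how the models in the question are laid out. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of vertex data for an unindexed triangle list covering a
   cols x rows grid of quads: 2 triangles * 3 vertices per quad. */
static size_t unindexed_bytes(size_t cols, size_t rows, size_t vertex_size)
{
    return cols * rows * 6 * vertex_size;
}

/* Bytes for the indexed version: each grid corner stored once, plus
   6 indices per quad. */
static size_t indexed_bytes(size_t cols, size_t rows, size_t vertex_size,
                            size_t index_size)
{
    size_t vertices = (cols + 1) * (rows + 1);
    size_t indices = cols * rows * 6;
    return vertices * vertex_size + indices * index_size;
}
```

With the 8-byte vertices and 16-bit indices from the question, a 10x10 grid comes to 4800 bytes unindexed versus 2168 bytes indexed, while a single quad is 48 versus 44 bytes. For small meshes the difference is minor, which fits the point that data size alone does not decide which call is faster.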
Even Apple seems to be unsure which method to use.
Up to and including iOS 7, the OpenGL ES Programming Guide for iOS said:
For best performance, your models should be submitted as a single unindexed triangle strip using glDrawArrays with as few duplicated vertices as possible. If your models require many vertices to be duplicated (...), you may obtain better performance using a separate index buffer and calling glDrawElements instead. ... For best results, test your models using both indexed and unindexed triangle strips, and use the one that performs the fastest.
But their updated OpenGL ES Programming Guide for iOS that applies to iOS8 offers the opposite:
For best performance, your models should be submitted as a single indexed triangle strip. To avoid specifying data for the same vertex multiple times in the vertex buffer, use a separate index buffer and draw the triangle strip using the glDrawElements function
It looks like in your case you have just tried both, and found that one method is better suited for your data.