Because rectangles can overlap, it's possible to reduce the amount drawing done when using them for dirty rect updating. If there's two rectangles overlapping, then we don't need to over draw the overlapping area twice. Normally the dirty rect update algorithm used just makes a bigger rectangle which contains two of the smaller rectangles. But that's not optimal.
But, as with all over draw algorithms, can it be done fast enough to be worth it?
Here's an article on the topic:
jmm0 made some code here:
DR0ID, also did some with tests and faster code...
So far DR0ID says it's not really fast enough to be worth it. However, there are opportunities to improve it. Perhaps a more optimal algorithm, or one which uses C or Cython.
"worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
If it's 20x faster in C, then that gets down to 0.016666. Still too slow perhaps.
It's not an embarrassingly parallel problem... I think. But I haven't thought on it much at all. Maybe there is a use for multi core here.
Anyone done any other work like this? Or know of some good algos for this?
Just to clarify, this is an optimization to update the display surface? TBH, if DR0ID says that it isn't a good optimization, then I'll have to follow him. Thinking deeply about the issue, the would be better payoffs by finding way to not flip bits in the display surface in the first place. To say it another way, if we can inform developers that they are drawing to specific areas in the display surface several times, then it would help profile the game better and reduce extraneous draws. Android has options to detect extraneous overdraw, and I'm sure other platforms do as well.Also, if we are going to promote pygame group rendering, a cheap 15-20% performance increase (based on my system) could be had by creating a C function that accepts a list of surfaces and their positions, then executing the blits in C. I've brought it up before, but nothing came if it. In fact, I was asked more than once if I had converted my surfaces, so maybe people just don't read threads? ¯\_(ツ)_/¯
On Sun, Mar 26, 2017 at 2:50 PM, René Dudfield <[hidden email]> wrote:
The use case is to speed up naieve drawing. With things like pygame zero, where they update the whole screen no matter what you draw. Where if people just draw several images to the screen, we should be able to easily track their changes, and only update what has changed.
Yeah! Batching APIs would be a useful additions indeed. Ones which do _multi things at once. The tricky bit is to finding a set of those functions which are useful for many cases. Also extending the API without adding a million different functions here and there. fill_multi, blit_multi etc. However, what does blit_multi do? Blit one surface to multiple places? What about blit blend arguments? What about sub rects? Shouldn't there be a batch API for blitting a lot of different surfaces to different rects at once? After all this you can see how the APIs could get complicated and the number of them large. We have drawline, and drawlines, do we have blit and blits?
Relatedly, there is also a possibility of compiling the sprite classes in Cython. This may also give a decent speedup in some cases. As shown by things like psyco, and pypy speeding them up. They're loops in python after all. But I wonder if they can be compiled easily, or if they're too dynamic.
Apart from the jit work going on using libjit, it would be useful to do data aware blits. What I mean by that is by looking at an image, you can find better ways to draw it. Eg. does the image only contain 6bits worth of colour? Then an RLE blit is the fastest, and using 32bit alpha transparency is wrong. Are there large areas of the same colour (which can be more quickly drawn with fill())? Can we split it into 5 surfaces (four on the edges, one in the middle) where only the edge surfaces are drawn with the much slower alpha transparency blitters? For many games, the images don't change, so this works quite well. However this would perhaps require quite an API change (not entirely sure of that), or could likely be implemented inside the sprite classes with some trickery.
Another area is better memory alignment on a platform specific basis can give considerable gains. On the slower raspberry pi for example, 16 byte, and 32 byte aligned memory goes much faster. On many platforms this is 4 bytes, or 8 bytes. Gains of around 600MB/s ->1300MB/s have been seen. This is one area where a libjit based blitter (or one like in ANGLE) can make significant improvements. This type of change only needs to happen where we allocate memory for surfaces (or maybe people have done that already).
Finally, hardware support for things like jpeg drawing, and DMA are possibilities. For example, the raspberrypi has dedicated hardware for jpeg decoding (as does most hardware these days). This can be used to draw a jpeg from memory into a buffer, so if you are drawing photos onto the screen, then this is a very fast way to do it. However, this would require a more abstract Sprite class which loads files itself. Sprite('mybigphoto.jpg'). But interestingly (but not unexpectedly) the CPUs are outperforming the dedicated hardware, if you dedicate the CPUs to the task. Also for when the images are small. However, the older, and slower CPUs don't. https://info-beamer.com/blog/omx-jpeg-decoding-performance-vs-libjpeg-turbo But either way, keeping the jpg photos in memory allows you to draw a lot more of them, than if you decode them and use up all of the limited memory on small platforms like the Orange|Raspberry PI. So we need a Sprite('photo.jpg') style api for this reason.
I'm interested in if anyone has done any work on dirty rect optimizations for reducing over draw for now. Because that could be applied to something like pygame zero with almost no work on their part, and no work on the users part. For non-worst cases, like where there's < 50 rects, I guess simply compiling an algorithm like DR0ID has done in C/Cython will work nicely enough - if the python code isn't fast enough already. Would need some representative pygamezero examples to check this.
On Mon, Mar 27, 2017 at 3:55 AM, Leif Theden <[hidden email]> wrote:
Anyway, all of this chatter is moot, if pygame moves to a hardware sprites soon.
I think a function that accepts a sequence of tuples in the form of (dest_surface, dest_position, source_area, blend_mode) would be enough. It is only needed as a power-user optimization and I don't seem much value in watering it down or splitting it into multiple functions for beginners. No need to overthink and complicate it. Cython may help sprite rendering in Groups, but it would have to be implemented and tested.
Has anyone considered releasing the GIL during blit operations? Is it possible the release the GIL during a blit, then block when needed, in the cases of subsequent draw operations? The optimized case would be to allow math operations to execute while the display buffer is being modified, while you are iterating over sprites.
Here is my preliminary C code for a "blit_multi" function. It is called "blit_list". It only supports (dest_surface, dest_position) tuples, but it works as so far.https://gist.github.com/bitcraft/1785face7c5684916cde
Use it like this:
sprite_image_list = [(s.image, s.rect) for s in self.sprites]
Jitblit sounds interesting, but how is that different from just using Surface.convert() with the correct flags? I may be mistaken here, but on all modern computers, using 32-bit surfaces can be optimized into a single memcpy call...bytealigned and whatnot. No need to iterate over pixel data (RLE). If that's the case, then it doesn't make sense to support legacy 8/16/24-bit image formats, and validates pygame_sdl2's position not to support them.
I don't see a way to combat overdraw, except by informing users that it is happening.
On Mon, Mar 27, 2017 at 6:18 AM, René Dudfield <[hidden email]> wrote:
Cool about the blit function. Since the rect and surfaces are by reference, then you only need to make sprite_image_list once (if you are blitting the same thing every time). Additionally, I think it would likely be fast enough for another commonly requested case (blit the same image to many different spots). It would also work nicely for sprite groups. I think it's worth some more API discussion, before merging it in. We'd also need some tests and docs.
A couple of quick points.
Another software improvement would be using the new METH_FASTCALL on python 3.6+ Which can see 15%-30% improvements for cases like calling lots of functions.
But I'm still interested in the dirty rect handling, for the naieve case.
On Mon, Mar 27, 2017 at 4:59 PM, Leif Theden <[hidden email]> wrote:
Just a couple quick, miscellaneous notes on this problem:
--IDK if pygame does this, but I expect blitting, with overdraw, to a fullscreen temporary surface, and then sending the whole surface every frame to the GPU would be faster than updating multiple smaller portions of the GPU memory with CPU data.
--Dirty rects were never really a performance issue for me. This is partly because I usually implement the scheme above, more-or-less, partly because I avoid it algorithmically, and partly because blitting is inexpensive and typical overdraw rates are low (as they would be in a good GPU renderer too).
--IIRC the fastest algorithm for rect-rect collision detection is the max-of-mins/min-of-maxes method, which is widely known. It maps nicely to SSE too.
--I don't doubt there's no clear performance gain for current implementations, especially if the algorithm is in Python.
--There are two key cases to think about: (1) You have 1x1 pixel rects for each pixel on the screen. No overdraw, so the algorithm is complete waste. (2) You have a bunch of fullscreen rects that of course all overlap 100%.
On 27.03.2017 20:08, Ian Mallett wrote:
> max-of-mins/min-of-maxes method
Could you share more about this method? Seems I can't find something
related to dirty rects with that.
In reply to this post by Ian Mallett-2
On Mon, 27 Mar 2017 12:08:10 -0600
Ian Mallett <[hidden email]> wrote:
> --There are two key cases to think about: (1) You have 1x1 pixel
> rects for each pixel on the screen. No overdraw, so the algorithm is
> complete waste.
I think it would make sense to merge dirty rectangles even if they don't
actually overlap, but simply touch. This would improve this use case
(and a common use case of a tiled map) greatly.
In reply to this post by DR0ID
On Mon, Mar 27, 2017 at 12:16 PM, DR0ID <[hidden email]> wrote:
On <a href="tel:27.03.2017%2020" value="+12703201720" target="_blank">27.03.2017 20:08, Ian Mallett wrote:
It's a rect-rect collision detection algorithm, which could be applied to detect which rects overlap (and gives the overlapping rect also). It's the standard algorithm people use for this purpose, although people tend to think of it in terms of multiple tests, which is less intuitive (to me, anyway) and slower.
The test is basically (from memory):
In 1D, you can think of "rect.min" as being "rect.left" and "rect.max" being "rect.right". In nD rect.min is a vector [left,bottom,back,...] and rect.max is a vector [right,top,front,...] and the "<" must apply along all axes. That's why it's great for SSE. You can also subtract the terms to get the dimensions of the overlapping area.
On Mon, Mar 27, 2017 at 12:23 PM, Radomir Dopieralski <[hidden email]> wrote:
On Mon, 27 Mar 2017 12:08:10 -0600
I agree; this is desirable.
In reply to this post by Ian Mallett-2
On Mon, Mar 27, 2017 at 8:08 PM, Ian Mallett <[hidden email]> wrote:
Double buffering works like that. Except with update you can tell it to update only parts of the back buffer. When you display.flip()/update() it sends the data. It's faster because less data is sent over the bus. But even on OpenGL slower implementations (or with higher resolutions) you can gain benefits by only updating part of the buffer(s). Not on monster dedicated cards at lower resolutions of course.
Sounds interesting. Maybe that's what we're using now for rect collision... I'm not sure of the algo name. It's only in C however.
Below is one of the cases that benefits the most. (actually after drawing this I'm not sure if this is actually true or I dreamed it up... but I drew the diagrams... so it feels a waste to delete them).
Currently the update method combines two rects like this:
Into a big rect like this:
| XX | # - updated, but no need.
|### | X - over drawn.
Note in these crude diagrams the # area is drawn even though it doesn't need to be, and the XX is over draw. When there are some overlapping rects like this that are large, that can be quite a saving of pixels not needed to be drawn.
On Mon, Mar 27, 2017 at 12:34 PM, René Dudfield <[hidden email]> wrote:
I suppose. Still, this should be asynchronous. Maybe making it work in Python isn't possible.
Yeah I'm not sure it has its own name. And definitely you'd want it in C or equivalent.
the thing to consider is what you do about it. You can't just "not draw" those pixels, because you still have to iterate over them, which defeats the point. Instead, you have to decompose the combination of rects into disjoint subrects that span the region that needs to be updated. For your example:
. . . the decomposition could be something like:
│ │ │
The key point is that doing this decomposition needs to be faster than just drawing. Unfortunately, computing those rectangles is annoying, and also O(n^2) in the number of rectangles (whereas the number of overdrawn pixels you save can be linear-ish).
Ian Mallett wrote:
> Unfortunately, computing those rectangles is annoying, and
> also O(n^2) in the number of rectangles
With the right data structures and algorithms, it doesn't have to be.
Apple's Quickdraw had some very efficient ways of representing and
operating on regions made up of collections of rectangles.
There's some info and further pointers about it here:
On Mon, Mar 27, 2017 at 5:26 PM, Greg Ewing <[hidden email]> wrote:
Ian Mallett wrote:
I mean this in a fundamental, mathematical sense. With O(n) rectangles, you can produce O(n^2) regions that cannot be decomposed into fewer than O(n^2) disjoint rectangles.
Ian Mallett wrote:
> I mean this in a fundamental, mathematical sense. With O(n) rectangles,
> you can produce O(n^2) regions that cannot be decomposed into fewer than
> O(n^2) disjoint rectangles.
True, but that's a worst case, unlikely to be approached in practice.
Also, for this application, there will be some minimum size of
rectangle below which the overhead of decomposing it further would
be a net loss. If you stop at that point, there will be an upper
bound on the number of rectangles you're dealing with.
In reply to this post by René Dudfield
I tested the effect of my optimize_dirty_rects script that René mentioned, and I found that the time spent in my optimization algorithm exceeded the time saved by blitting fewer pixels. :-( I've thought about reimplementing it in Cython to see if that would make the algorithm run fast enough that its overall effect is to save time, but I haven't gotten around to doing that.
On Mar 26, 2017 2:50 PM, "René Dudfield" <[hidden email]> wrote:
In my experience, I've found it is better to optimize the game as if the whole screen was being updated at the same time. The effect being, I never bother with dirty updates. It seems that many games will have moments of brief full or near full screen updates, and time spent optimizing the screen updates is effectively wasted.
On Tue, Mar 28, 2017 at 7:19 AM, Jason Marshall <[hidden email]> wrote:
On Tue, Mar 28, 2017 at 8:20 AM, Leif Theden <[hidden email]> wrote:
I've already said this, but it bears amplifying: this has been my experience also. If you're fillrate-bound enough to be worrying about overdraw and minimal blit updates, you're probably fillrate&fragment-bound enough to want to invest in a GL renderer anyway.
In reply to this post by Leif Theden
In my experience it depends on the kind of game you are making. If it is round based and only some parts need update then dirty updates might be worth.
On the other hand, this is why the dirty sprite group kept it simple and just does union overlapping rects. But that is in no way optimal in all situations. Like here, more pixels would be updated instead of just the two rects and overdraw:
| | ' | | '
| | ' | | '
| | ' | | '
| | ' | | '
| | ==> ' | | '
| | ' | | '
| | | |
| | ' | | '
And at a certain point I think updating the entire screen is faster than the calculations needed to find out witch areas are dirty or not (maybe this could be a percentage of the screen area?).
Also when using multiple layers of sprites and marking one sprite dirty, it should mark all overlapping sprites in all layers as dirty and re-draw them too (clipped to the dirty rect, in the right order).
The fastest thing would probably be to have a dirty bit mask where you mark rectangles of bits that are dirty and just blit as many sprites you want, only the dirty pixels should then actually update on screen. There would still be overdraw (in memory) but clipped inside of the dirty areas and not on the screen. Not sure if single pixels can be updated and how fast it would be to just update certain pixels.
I don't think that overdraw can be avoided in all cases (what if the sprites are half transparent on different layers? Or have holes in it? Then you need overdraw!).
I think more important would be to have a nice and fast way to cull not visible sprites. This gets more difficult if parallax scrolling is used besides of scrolling in the world (then a world -> screen transformation and vice versa is needed (for mouse click to object conversion)). Culling enabled me (and my team) to write a game like as in the last pyweek  . It would not been possible otherwise (too many sprites to draw outside of the screen, probably the python part of handling those slowed it down too much, but with culling it was not such a problem).
So at the end, it still depends on the kind of game. For fast graphics use dedicated hardware, on slower machines dirty update might still help.
I still would like to try certain things out, but not sure how much time I will have in the next months.
On 28.03.2017 16:20, Leif Theden wrote:
In reply to this post by Jason Marshall-2
Thanks for your code, it inspired me to try to make a faster version.
At first I tried other ways but at the end I implemented almost the same as you did, but just using the pygame methods for speed up (in exchange it is tied to pygame now, yours works just for python). For next time at least there will be some tests that can be re-used.
IIRC the algorithm tries to make as big rectangles if possible, so it does some sort of union them (without extra pixels). One can see how the algorithm splits the rects using the sript 'gumm_needs_to_see.py' (needs pygame) in my repo . Use arrow keys to move forward, space to change between the two algorithms (so you see differences between them in the result). It will go through all rect configurations (and some more) from the article .
My code has been tried to be written for performance that is why it might not be that readable (and also because there are many cases). Also I'm not sure anymore why it is around 2x faster only with -OO flags (looking at the code there aren't any asserts or such).
On 28.03.2017 14:19, Jason Marshall wrote:
In reply to this post by Leif Theden
I agree with Leif.
I have now used Cython to compile and test my optimize_dirty_rects script (along with the bisect standard library module it uses). The disappointing result was that my test suite ran a little bit slower than it did using pure Python. Maybe there would've been some performance improvement if I had reduced overhead by using structs and C++ standard library containers instead of Rect objects and Python lists & dicts, but I don't know how to do that. (I also found that psyco on 32-bit Python 2.7 helps performance a little bit, but not enough to make optimize_dirty_rects have a positive overall effect on performance.)
So, for the kind of programs I make with pygame, I've determined that it's faster to just update one big rectangular area that covers all of the blits. Maybe someone else could contrive a better optimization algorithm and implement it in C to minimize overhead, but I don't think I'll be able to do that.
On Mar 28, 2017 9:21 AM, "Leif Theden" <[hidden email]> wrote:
|Free forum by Nabble||Edit this page|