More new information: the reasons for choosing DDR3 instead of GDDR5, the use of the ESRAM chip, the Data Move Engines, and more. Really interesting information that shows how Microsoft can get 100% efficiency out of the hardware and make full use of its bandwidth; you can also see the footprints of John Carmack's techniques such as Virtual Texturing.
Durango makes us take the DeLorean
Posted on 8 February, 2013 by Urian
First of all, sorry for the delay on this entry; I wanted to do it right in this case, and I think that once you read it you will understand why.
[TABLE="class: aligncenter"]
[TR]
[TD="align: center"]Microsoft[/TD]
[TD="align: center"]AMD[/TD]
[/TR]
[TR]
[TD="align: center"]Shader Core (SC)[/TD]
[TD="align: center"]Compute Unit (CU)[/TD]
[/TR]
[TR]
[TD="align: center"]Local Shared Memory[/TD]
[TD="align: center"]Local Data Share[/TD]
[/TR]
[TR]
[TD="align: center"]Global Shared Memory[/TD]
[TD="align: center"]Global Data Share[/TD]
[/TR]
[TR]
[TD]Color Block (CB) + Depth Block (DB)[/TD]
[TD]Raster Back End (RBE)[/TD]
[/TR]
[/TABLE]
But to better understand the "custom" parts, we must take the DeLorean.
First trip with the DeLorean: a journey to the past
In terms of custom parts, the system's GPU is reminiscent of a professional graphics processor little known to the public, since it was never popular in the consumer space, but which, in 2002, included a technology that had never before been integrated into a graphics card and would not be seen again at the hardware level until the arrival of the GCN architecture.
I'm talking about the P10 from 3Dlabs. Beyond the memory bus hardware implementation, 3Dlabs believed that the "virtual memory" system used in the P10 had far more significance and potential impact on the 3D market. In fact, it is something that John Carmack of id had been asking for in hardware for a long time. The concept of virtual memory is very similar to that used in the CPU memory system: it removes the barriers between the different memory subsystems in the computer, such as the local frame buffer, main RAM, or even hard disk space, and allows the 3D processor to access them freely.
In the P10's virtual memory system there is a logical address space of up to 16 GB, divided entirely into 4 KB pages. The RAM on the card essentially becomes a huge L2 cache for the chip, a scheme that is easy for compilers to understand.
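To make the paging idea concrete, here is a minimal sketch of the translation step in C++; the structure and names are my own illustration of 4 KB paging over a 16 GB logical space, not anything from 3Dlabs:

[CODE]
#include <cstdint>
#include <unordered_map>

// Illustration of 4 KB paging as described for the P10: a logical
// address splits into a page index and a byte offset, and a page table
// records which backing store currently holds each logical page.
constexpr uint64_t kPageSize = 4 * 1024;         // 4 KB pages
constexpr uint64_t kLogicalSpace = 16ULL << 30;  // 16 GB logical space

enum class Backing { LocalVram, SystemRam, Disk, NotResident };

struct PageEntry {
    Backing backing = Backing::NotResident;
    uint64_t physicalBase = 0;   // base address inside the backing store
};

// Sparse page table: only pages that have been touched get an entry.
std::unordered_map<uint64_t, PageEntry> pageTable;

// Translate a logical address; false means a page fault, at which point
// the driver (or hardware) would stream the 4 KB page in.
bool translate(uint64_t logical, uint64_t& physical) {
    if (logical >= kLogicalSpace) return false;  // outside the 16 GB window
    auto it = pageTable.find(logical / kPageSize);
    if (it == pageTable.end() || it->second.backing == Backing::NotResident)
        return false;
    physical = it->second.physicalBase + logical % kPageSize;
    return true;
}
[/CODE]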
Not for nothing had Carmack been pushing for the implementation of virtual memory on GPUs for some time: in a letter dated March 7, 2000, he had already explained the need to move to a virtual memory system, the reason being to avoid so-called "texture thrashing". Carmack described the problem as follows:
Almost all drivers do purely LRU memory management. This works correctly as long as the total textures needed in a frame fit into memory once they have been loaded. The moment you need slightly more memory than fits on the card, performance falls off a cliff. If you need 14 MB of textures to render a frame and your graphics card has only 12 MB available after its image buffers, instead of having to upload just the 2 MB that do not fit, you will have the CPU generate 14 MB of command traffic, which can drop the frame rate to single digits on many drivers.
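A toy simulation makes the cliff obvious. The numbers below are Carmack's 14 MB and 12 MB, treated as 14 textures against a 12-slot LRU cache; this is my own illustration, not driver code:

[CODE]
#include <cstdio>
#include <list>
#include <unordered_set>

// Toy model of the cliff: a 12-slot LRU cache and a 14-texture working
// set touched in the same order each frame. Every access misses, so all
// 14 textures are re-uploaded per frame instead of just the 2 extra.
int main() {
    const int cacheCapacity = 12, workingSet = 14, frames = 3;
    std::list<int> lru;                  // front = most recently used
    std::unordered_set<int> resident;
    for (int f = 0; f < frames; ++f) {
        int uploads = 0;
        for (int tex = 0; tex < workingSet; ++tex) {
            if (!resident.count(tex)) {  // miss: upload and maybe evict
                ++uploads;
                if ((int)lru.size() == cacheCapacity) {
                    resident.erase(lru.back());
                    lru.pop_back();      // evict least recently used
                }
                lru.push_front(tex);
                resident.insert(tex);
            } else {                     // hit: refresh recency
                lru.remove(tex);
                lru.push_front(tex);
            }
        }
        std::printf("frame %d: %d of %d textures re-uploaded\n",
                    f, uploads, workingSet);
    }
}
[/CODE]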
His idea for solving the problem is one we all know by now, Virtual Texturing. Carmack once described it as follows; keep in mind that this is the year 2000:
Problems with large textures can be solved by simply not using large textures. The waste from unreferenced texels can be reduced by cutting all textures into 64 × 64 or 128 × 128 pieces. This requires pre-processing, adds geometry, and requires a messy overlap of the textures to fix the seams between them.
It is currently possible to estimate which mip map levels are needed and swap in only those. An application cannot calculate exactly which mip map levels will be referenced by the hardware; because of this there are small variations between chips, and the slope calculation can add significant processing overhead. A conservative upper bound can be taken by looking at the minimum normal distance of any vertex referencing the texture in a frame. This over-estimates the necessary textures by about 2x, and it would still leave a big hit when the top mip level of a large texture loads, but it can make grand cathedral-style scenes workable without swapping.
Smart developers can always work hard to overcome the obstacles, but in this case there is a clear hardware solution that simply gives more performance than anything possible in software and makes everyone's life easier: virtualize the card's view of its own memory.
With page tables, fragmentation of the address space is not a problem, and with the graphics rasterizer only causing a page load when something from that exact 4 KB block is needed, the mip level and hidden texture problems simply disappear. The application or driver doesn't have to do anything sneaky, just manage the page indexes.
The hardware requirements are not very heavy. You need the graphics card to be able to automatically load translation lookaside buffers (TLBs) from the page tables in local memory, and the ability to move a page into graphics memory over AGP or PCI and update the page tables and reference counts. You don't even need many TLB entries, since the access patterns don't hop all over memory the way a CPU's can. Even with a single TLB per texture unit, refills would occur on only 1/32 of memory accesses if the textures were in 4 KB blocks. All you would want as an upper bound is a TLB large enough that each texture covers the texels referenced in a typical scanline rasterization.
Some developers will say "I don't want the system managing textures, I want total control." There are a couple of answers to this. First, page-level management has a flexibility that you don't get from any software scheme, so you gain new capabilities. Second, you can still treat it as if it were a fixed texture buffer and do your own updates. Third, even if this were slower than the cleverest possible software scheme (which I seriously doubt), it trades development time for something that is theoretically more efficient and faster. We don't code overlays in assembly language any more!
Some hardware designers will say something about the graphics engine sitting idle while a page is fetched over the bus. Sure, it will always be better to have enough texture space and never have to swap, and this feature won't let you boast about more megapixels or millions of triangles, but every card ends up without enough memory at some point. Ignoring those real-world cases does not help your customers. In any case, that infernal waiting is much less than if the texture were loaded through the command FIFO.
3Dlabs is supposed to have some form of virtual memory management on the Permedia 3; I'm not familiar with the details (if someone from 3Dlabs can send me the latest register specs, do it!).
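Carmack's single-TLB-per-texture-unit point is easy to model. Below is a minimal sketch: the refill behaviour and 4 KB page size come from the quote, everything else is assumed for illustration:

[CODE]
#include <cstdint>
#include <cstdio>

// One TLB entry per texture unit, as in the quote: fetches that stay on
// the same 4 KB page reuse the cached translation; only a page change
// forces a refill from the page tables in local memory.
struct TinyTlb {
    uint64_t cachedPage = UINT64_MAX;
    uint64_t cachedPhysBase = 0;
    uint64_t refills = 0;

    uint64_t translate(uint64_t logical) {
        uint64_t page = logical >> 12;               // 4 KB pages
        if (page != cachedPage) {                    // TLB miss
            ++refills;
            cachedPage = page;
            cachedPhysBase = walkPageTable(page);
        }
        return cachedPhysBase + (logical & 0xFFF);
    }
    // Stand-in for the hardware page-table walk in local memory.
    static uint64_t walkPageTable(uint64_t page) { return page << 12; }
};

int main() {
    TinyTlb tlb;
    // A scanline sweep through one 4 KB block: 1024 sequential 32-bit texels.
    for (uint64_t texel = 0; texel < 1024; ++texel)
        tlb.translate(texel * 4);
    std::printf("1024 fetches, %llu TLB refill(s)\n",
                (unsigned long long)tlb.refills);
}
[/CODE]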
(P10 = Permedia 3.) Later he comments on the use of embedded RAM for this concept:
Embedded DRAM should be a driving force. It is possible to put a good number of high-bandwidth megabytes on a chip together with a video controller, but it will not be possible (for now) to put a GeForce's 64 MB there. With virtualized texturing, the pressure on memory is drastically reduced. Even an 8 MB card would be enough to game at 1024 × 768 in 16-bit, or at 800 × 600 in 32-bit, no matter what the texture load.
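A quick sanity check of that 8 MB figure, under an assumed buffer layout (double-buffered 16-bit colour plus a 16-bit depth buffer; the split is my guess, not Carmack's):

[CODE]
#include <cstdio>

// Rough check of the "8 MB is enough" claim at 1024x768 in 16-bit,
// assuming double-buffered colour plus a 16-bit depth buffer.
int main() {
    double oneBuffer = 1024.0 * 768 * 2 / (1024 * 1024); // 1.5 MB each
    double buffers = oneBuffer * 2 /*front+back*/ + oneBuffer /*depth*/;
    std::printf("buffers: %.1f MB, leaving %.1f MB of 8 MB for resident"
                " texture pages\n", buffers, 8.0 - buffers);
}
[/CODE]

With roughly 4.5 MB in buffers, some 3.5 MB remains for the resident texture pages, which virtual texturing keeps small regardless of total texture load.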
Second trip with the DeLorean: the near present
In one of the earlier entries on Durango I talked about precisely the same topic, the implementation of Virtual Texturing, in relation to a patent filed by Microsoft with Mark S. Grossman as inventor. What comes now is a déjà vu, but it helps to connect the dots between the past and the future.
________________________________________________________________________________________________________________________
On the other hand, there is an element that makes me think the architecture chosen by Microsoft is the HD 7x00/GCN: partially resident textures.
The thing is that our friend Grossman has a patent assigned to Microsoft, with himself as inventor, that describes this topic perfectly. Even though this technology is not part of the DirectX 11 specification, the fact that Microsoft holds a patent on the same technology is another clue linking Microsoft's next console with this GPU.
As you can see from the Texture Unit in this diagram, it can not only read textures from main RAM, it can also read a tile map. The patent reads as follows:
The tile map can specify the tiles that are stored in the texture memory. In one embodiment, the tile map may contain one or more tables that can be used to determine the level of detail (if any) available for each tile, and the tile map can reside on a memory unit. In some embodiments the tile map can be found in the texture memory, but this is not required.
A hash map, or hash table, is a data structure that associates a particular key with an index value; the keys stored in the tile map each correspond to a concrete memory address, and that memory address contains the data. The patent says that the tile map lives in a memory other than the one where the textures are stored. And in which other memory do we have a tile map available and accessible by all the TMUs? The GPU's second-level cache. This means that in the case of Xbox 8, given that the CPU and GPU communicate through that cache, the CPU could write the tile map into the same L2 cache so that the GPU can read it.
On the other hand, the statement that the tile map can be in the texture memory does not just mean that the tile map can be placed in main memory; it also means that textures can sit in the same memory as the tile map, which implies the use of a memory embedded in the GPU. Obviously not all the textures can be placed in the eDRAM, and not even the whole scene would fit, but that can be handled with a tile rendering scheme similar to the one in PowerVR and Xbox 360, which would also mean using this memory as an accumulation buffer.
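Here is the patent's tile map read as a hash table, in a short sketch; all names and the mip fallback policy are mine, for illustration only:

[CODE]
#include <cstdint>
#include <optional>
#include <unordered_map>

// The tile map as a hash table: the key identifies a tile, the value
// says whether it is resident, where, and at what level of detail.
struct TileKey {
    uint32_t textureId, mip, tileX, tileY;
    bool operator==(const TileKey& o) const {
        return textureId == o.textureId && mip == o.mip &&
               tileX == o.tileX && tileY == o.tileY;
    }
};
struct TileKeyHash {
    std::size_t operator()(const TileKey& k) const {
        uint64_t h = k.textureId;
        h = h * 1099511628211ULL ^ k.mip;
        h = h * 1099511628211ULL ^ k.tileX;
        h = h * 1099511628211ULL ^ k.tileY;
        return (std::size_t)h;
    }
};
struct TileEntry {
    uint64_t address;   // where the tile's texels live (e.g. an ESRAM offset)
    uint32_t bestMip;   // finest level of detail currently resident
};

using TileMap = std::unordered_map<TileKey, TileEntry, TileKeyHash>;

// TMU-side lookup: a resident tile resolves to an address; a missing one
// falls back to the nearest coarser mip, or to nothing at all.
std::optional<TileEntry> lookupTile(const TileMap& map, TileKey key) {
    for (;; ++key.mip) {
        auto it = map.find(key);
        if (it != map.end()) return it->second;
        if (key.mip >= 12) return std::nullopt;  // assumed mip chain depth
        key.tileX /= 2; key.tileY /= 2;          // coarser mip, half the tiles
    }
}
[/CODE]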
The accumulation buffer, known as a Framebuffer Object in OpenGL jargon, is a section of memory where a frame can be computed, with the peculiarity that it is not the final frame and the same data is computed over several passes. It is used in techniques such as tile rendering and in post-processing effects based on manipulating the final image. You can work with it using a full frame, or by using small pieces of the frame, which are much easier to store in buffers. Effects such as alpha blending, motion blur, and the various types of AA depend on a great deal of bandwidth to compute; that is why the frame is divided into fragments that are processed in memories closer to the processor, which in the case of Xbox 8/Durango would be the caches.
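To see why working in fragments helps, here is a small sketch of a blend pass done tile by tile so that each working set fits in a small fast memory; the sizes and the 50% blend are my own illustrative choices:

[CODE]
#include <cstddef>
#include <cstdint>
#include <vector>

// Blending a full 1920x1080 RGBA frame repeatedly hammers external
// bandwidth; doing it tile by tile keeps each working set inside a small
// fast memory, the role the ESRAM/caches would play on Durango.
constexpr int kWidth = 1920, kHeight = 1080;
constexpr int kTile = 64;   // one 64x64 RGBA8 tile = 16 KB

void blendPass(std::vector<uint32_t>& frame,
               const std::vector<uint32_t>& layer) {
    for (int ty = 0; ty < kHeight; ty += kTile)
        for (int tx = 0; tx < kWidth; tx += kTile)
            // Everything below touches only one 16 KB tile, so a
            // multi-pass effect re-reads fast memory, not DRAM.
            for (int y = ty; y < ty + kTile && y < kHeight; ++y)
                for (int x = tx; x < tx + kTile && x < kWidth; ++x) {
                    std::size_t i = (std::size_t)y * kWidth + x;
                    // Per-channel 50% blend on packed RGBA8.
                    frame[i] = ((frame[i] >> 1) & 0x7F7F7F7Fu) +
                               ((layer[i] >> 1) & 0x7F7F7F7Fu);
                }
}
[/CODE]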
To loop the loop, take into consideration the VG Leaks leak about the alpha kits for the next Xbox: "The alpha kit uses a separate graphics card similar in capacity and speed to the GPU to be included in the final design. The card does not have the ESRAM that the final GPU design will have."
_________________________________________________________________________________________________________________________
Looking through a bit more documentation, I found information about a technique presented by Sean Barrett at GDC 2008 called Sparse Virtual Texturing, or SVT, which is the same idea as AMD's PRT.
Sparse Virtual Texturing is a way of simulating very large textures using far less memory than they would normally require, by loading data only when it is needed and using a pixel shader to map from the huge virtual texture to the actual physical texture.
The technique can be used for very large textures, or simply for large numbers of small textures (grouping them all into one huge texture, or using multiple page tables).
It was inspired by John Carmack's descriptions of MegaTexturing in several private forums and emails. It is not exactly the same as MegaTexture, but it comes close.
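Real SVT implementations do this lookup in a pixel shader (GLSL/HLSL); here is a CPU-side C++ approximation of the two dependent reads, with the table layout and field names assumed for illustration:

[CODE]
#include <cstdint>

// Read 1 resolves the indirection table; read 2 (sampling the physical
// texture cache at pu,pv) is left out, being an ordinary texture fetch.
struct PageTableTexel {       // one entry per virtual page
    float physU, physV;       // top-left of the resident page in the cache
    float scale;              // virtual-to-physical scale for that mip
};

constexpr int kVirtPagesX = 256, kVirtPagesY = 256;  // huge virtual texture
PageTableTexel pageTable[kVirtPagesY][kVirtPagesX];  // kept current by the streamer

void virtualToPhysical(float u, float v, float& pu, float& pv) {
    int px = (int)(u * kVirtPagesX); if (px >= kVirtPagesX) px = kVirtPagesX - 1;
    int py = (int)(v * kVirtPagesY); if (py >= kVirtPagesY) py = kVirtPagesY - 1;
    const PageTableTexel& e = pageTable[py][px];      // the indirection read
    pu = e.physU + (u * kVirtPagesX - px) * e.scale;  // fractional position
    pv = e.physV + (v * kVirtPagesY - py) * e.scale;  //   within the page
}
[/CODE]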
Back in 2008, GPU memory management still lacked hardware-level virtual memory support, so the idea had to be implemented in software; only recently, with the GCN architecture, has AMD implemented all of this in hardware.
The problem is that this implementation is not part of the current version of either DirectX or OpenGL, so it is not a fully standardized technology. On the other hand, some of you will ask: what makes this special if it belongs to the GCN architecture that PS Orbis will also use? Well, bear in mind that we have not yet arrived at Xbox 8/Durango; this preamble is necessary to understand the architecture of the new system.
Third trip with the DeLorean: Durango
There are three interesting elements of Microsoft's next console. They are the following:
- Virtual Texturing
- ESRAM
- Data Move Engines
All these parts revolve around the same concept: the implementation of Virtual Texturing in hardware.
ESRAM
Durango has no video memory (VRAM) in the traditional sense, but the GPU does contain 32 MB of fast embedded SRAM (ESRAM). ESRAM on Durango is free from many of the restrictions that affect EDRAM on Xbox 360. Durango supports the following scenarios:
- Texturing from the ESRAM
- Rendering to surfaces in main RAM
- Read from a render target without performing a resolve (in certain cases)
The difference in bandwidth between the ESRAM and main RAM is moderate: 102.4 GB/sec versus 68 GB/sec. The advantages of the ESRAM are lower latency and freedom from contention with other memory clients, e.g. the CPU, I/O, and display output. Low latency is particularly important for sustaining the performance of the color blocks (CB) and depth blocks (DB).
However strange and surprising it may seem for a next-generation console, Xbox 8/Durango has "only" 32 MB of video memory. Do you now understand why I made reference to the 3Dlabs Permedia 3? The memory included on the 3Dlabs card did the same job that the ESRAM does in Microsoft's next console. This means that the GPU's L2 cache and the memory controllers, which in a traditional configuration would connect to external memory, are here connected directly to the ESRAM. Current AMD APUs on the PC use the Radeon Memory Bus for the GPU to communicate with external memory, which has a width of 256 bits in each direction per memory channel (256-bit read and 256-bit write); in the case of Kryptos we find that the width is 1024 bits in total, so either a second controller was added or the existing width was increased.
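Those two bandwidth figures drop out of bus width times clock. The 1024-bit ESRAM path is from the article; the 800 MHz GPU clock and DDR3-2133 speed are assumptions on my part, but they reproduce the quoted numbers exactly:

[CODE]
#include <cstdio>

// Deriving the quoted figures from bus width x clock. The 1024-bit ESRAM
// path is from the article; the 800 MHz GPU clock and DDR3-2133 speed
// are assumed here.
int main() {
    double esram = 1024.0 / 8.0 * 800e6 / 1e9;  // bytes/cycle * cycles/s
    double ddr3  = 256.0  / 8.0 * 2133e6 / 1e9; // 256-bit DDR3 at 2133 MT/s
    std::printf("ESRAM: %.1f GB/s, DDR3: %.1f GB/s\n", esram, ddr3);
    // Prints 102.4 and 68.3, matching the 102.4 vs 68 GB/s in the text.
}
[/CODE]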
This is the first piece of the puzzle; we are still missing the other two.
Virtual addressing
All GPU memory accesses on Durango use virtual addresses and therefore pass through a translation table before being resolved into a physical address. This layer of indirection solves the problem of memory fragmentation in hardware: a single resource can occupy several non-contiguous pages of physical memory without penalty.
Virtual addresses can target pages in main RAM or the ESRAM, or can be unmapped. Shader reads and writes to unmapped pages return well-defined results, including optional error codes, rather than locking up the GPU. This ability is important for supporting "tiled" resources, which are only partially resident in physical memory.
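What "well-defined results rather than locking up" can look like, as a minimal sketch; the page size, default value, and status enum are my own modelling, not SDK definitions:

[CODE]
#include <array>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// A read from an unmapped page returns a defined default plus a status
// code the shader can test, instead of stalling the GPU.
enum class ReadStatus { Resident, NotMapped };

using Page = std::array<uint32_t, 1024>;        // one page of texels

struct TiledResource {
    std::vector<std::optional<Page>> pages;     // sparse physical backing

    std::pair<uint32_t, ReadStatus> read(uint64_t texel) const {
        const auto& page = pages[texel / 1024];
        if (!page)                               // unmapped: no hang, no fault
            return { 0u, ReadStatus::NotMapped };
        return { (*page)[texel % 1024], ReadStatus::Resident };
    }
};
[/CODE]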
There is no need to repeat the graphics benefits of virtual addressing described above, but what is most noteworthy is its use for Virtual Texturing, which is its main function. Its most direct utility is to break the direct link between the information available in each frame and the memory bandwidth. The reason is simple: in the traditional method you have to push entire textures through the bandwidth, while here only what is needed in each frame is loaded. But this has already been discussed above.
But of course, at this point we are still missing the description of the last piece of the puzzle. Granted, the GPU can see all of main memory, but... how does it access it if its memory controllers point to the ESRAM? This is where the DMEs, or Data Move Engines, come in: the third and final part of the puzzle.
Data Move Engine
The Data Move Engines are something that can initially cause headaches, but they are easy to understand once you consider that their function is the same as that of PCI Express on the PC. While unifying the CPU and GPU on a single chip seems to strip PCI Express of its raison d'être, PCI Express actually has a lesser-known, less-used function: providing devices connected to the port with direct access to main system memory. In the same way, the GPU here gets access to the 8 GB of DDR3 system memory.
But the DMEs go further. The problem is that virtual memory support in the GCN architecture is not complete: it is not 100% hardware, but rather a combination of software (through shaders) and hardware. The idea of the DMEs is that virtual memory management can be performed automatically by them, or alternatively the developer is free not to use this type of memory management, as they see fit.
The concept is as follows: doing memory management in shaders eats up their cycles, so even though the theoretical peak is higher, the real available throughput may be lower. And if you read the Carmack text quoted above, you will see that it refers to the time spent waiting for data to arrive from main memory to video memory via DMA. Well, the DMEs eliminate this problem by offloading the task from the shaders and doing the work in parallel with the computation of the scene.
The advantage of the Move Engines lies in the fact that they can operate in parallel with computation. During times when the GPU is compute-bound, Move Engine operations are effectively free. Even when the GPU is limited by bandwidth, the Move Engines can use different paths. For example, a Move Engine copy from RAM to RAM will not be impacted by a shader that only accesses the ESRAM.
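The overlap described here is plain double buffering. Below is a sketch that uses a second thread as a stand-in for the dedicated copy hardware; the tile sizes and the arithmetic "shader" are invented for illustration:

[CODE]
#include <cstring>
#include <cstdint>
#include <future>
#include <vector>

// The "shader" works on the tile already in fast memory while the copy
// engine fetches the next tile in parallel, so the copy costs the shader
// no cycles.
void copyEngine(const uint32_t* src, uint32_t* dst, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(uint32_t));   // the DME's whole job
}
void shaderWork(uint32_t* tile, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tile[i] *= 3;  // pretend compute
}

int main() {
    const std::size_t kTileSize = 1 << 16, kTiles = 8;
    std::vector<uint32_t> mainRam(kTileSize * kTiles, 7);  // "DDR3"
    std::vector<uint32_t> esram(kTileSize * 2);            // two "ESRAM" slots

    copyEngine(mainRam.data(), esram.data(), kTileSize);   // prefetch tile 0
    for (std::size_t t = 0; t + 1 < kTiles; ++t) {
        uint32_t* cur  = esram.data() + (t % 2) * kTileSize;
        uint32_t* next = esram.data() + ((t + 1) % 2) * kTileSize;
        // The copy of tile t+1 overlaps the processing of tile t.
        auto dma = std::async(std::launch::async, copyEngine,
                              mainRam.data() + (t + 1) * kTileSize,
                              next, kTileSize);
        shaderWork(cur, kTileSize);
        dma.wait();
    }
    shaderWork(esram.data() + ((kTiles - 1) % 2) * kTileSize, kTileSize);
}
[/CODE]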
With the DMEs the puzzle is complete, and with it comes full 100% hardware support: virtual texturing management does not have to be implemented inside each graphics engine, and can therefore be used by all graphics engines and thus in all games. The advantage is that the FLOPS that would otherwise be spent on memory management can be directed to other tasks, allowing greater use of the GPU for graphics. But the use of the DMEs is not limited to Virtual Texturing; keep in mind that they can load any data into the ESRAM, from where it can be loaded directly into the GPU's caches. Can you imagine loading an octree node, for example? Their use goes beyond Virtual Texturing.
But can this be applied to PS Orbis? Basically, to implement virtual memory management you need two levels of memory: an upper level with enough storage capacity for the image buffers (color, depth, and stencil) and the textures needed for the scene, plus a lower level beneath it. In PS Orbis the GPU's caches do not have enough storage capacity for this, and the GDDR5 is a single level of memory for the whole GPU. Obviously the ESRAM and all the associated machinery cost die area, which is a sacrifice in terms of compute capability. But the biggest advantage comes from the fact that this allows access to large amounts of memory per frame without having to rely on the huge bandwidth of expensive, power-hungry memory such as GDDR5. The reason Xbox 8/Durango does not use GDDR5 is that it would be completely redundant: GDDR5 exists on GPUs precisely to avoid texture thrashing through higher bandwidth, while virtual memory on the GPU and Virtual Texturing are another solution to the same problem, and the two would conflict within one system.
I hope this article has made things clear and cleared up any confusion.