The power of C++ Project Austin app. Ale Contenti Visual C++ | Principal Dev Manager 4-001. Diving deep into project Austin. What’s Austin Why we built it C++ at work Go build amazing apps!. Austin. Austin is a digital note-taking app for Windows 8
The power of C++ Project Austin app Ale Contenti Visual C++ | Principal Dev Manager 4-001
Diving deep into project Austin • What’s Austin • Why we built it • C++ at work • Go build amazing apps!
Austin • Austin is a digital note-taking app for Windows 8 • You can add pages to your notebook, delete them, or move them around • You can use digital ink to write or draw things on those pages • You can add photos from your computer, from SkyDrive, or directly from your computer's camera • You can share the notes you create to other Windows 8 apps such as Email or SkyDrive • Beautiful and simple
Austin: why we built it • We used Visual C++ 2012 to build an amazing app: • Written in “modern C++” • DirectX and XAML for the UI • C++/CX to interact with WinRT • Auto-vectorizer for faster ink smoothing • C++ AMP for faster page curling • …and it was fun (the code is available on CodePlex, too) • Showcase the power of Windows 8, the native platform, and C++
Modern C++ • We strived to write Austin in a “modern” way: • C++ Standard Library, augmented with PPL and Boost • Smart pointers instead of raw pointers • Pervasive RAII pattern • Handle errors using C++ exceptions • Coding conventions inspired by Boost • No bare pointers, no delete
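A minimal sketch of what these guidelines look like in practice. Page and Notebook are hypothetical types invented for this illustration, not Austin's actual classes: ownership lives in std::unique_ptr, cleanup happens in destructors, and errors are reported with exceptions.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical page type, for illustration only.
struct Page {
    explicit Page(std::string t) : title(std::move(t)) {}
    std::string title;
};

// Hypothetical notebook: it owns its pages through unique_ptr, so removing
// a page or destroying the notebook frees memory without any delete.
class Notebook {
public:
    void addPage(std::string title) {
        pages_.push_back(std::unique_ptr<Page>(new Page(std::move(title))));
    }

    // Errors are reported with a C++ exception, not an error code.
    void removePage(std::size_t index) {
        if (index >= pages_.size()) throw std::out_of_range("no such page");
        pages_.erase(pages_.begin() + static_cast<std::ptrdiff_t>(index));
    }

    std::size_t pageCount() const { return pages_.size(); }
    const Page& page(std::size_t index) const { return *pages_.at(index); }

private:
    std::vector<std::unique_ptr<Page>> pages_;
};
```

The only new expression is wrapped immediately in a unique_ptr; with a C++14 compiler, std::make_unique would remove even that.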
DirectX and XAML • DirectX creates an immersive, fluid user interface that's built as a 3D scene with lights, shadows, and a camera • On the DirectX render target, we draw the notebook's pages, photos, ink strokes, and background • A 3D engine library abstracts some of the DirectX complexity • DirectX for a fast, fluid, true-to-life experience • XAML UI is used for the settings menu, the app bar, and the rest of the user interface • The SwapChainBackgroundPanel hosts the 3D scene inside the XAML UI page
C++/CX • C++/CX is used at the “boundary” to interact with Windows (via the WinRT objects) and to leverage XAML UI • Used for loading and saving images, the file picker, the camera, storage files and folders (SkyDrive, etc.), and implementing the “share” contract • Very useful for XAML UI: UI elements and event hook-ups • We were careful not to let C++/CX code “bleed” too much into our Standard C++ code (15 files out of 350) • Windows is the RunTime
Ink smoothing: the problem • We have on the order of 5 ms or less to smooth the strokes • In real time, please…
Ink smoothing: the code
The C++ compiler is obsessed with optimization: in this case, it will auto-vectorize the loop
    for (int j = 0; j < numPoints; j++)
    {
        float t = (float)j / (float)(numPoints - 1);
        smoothedPressure[j] = (1-t)*p2p + t*p3p;
        smoothedPoints_X[j] = (2*t*t*t - 3*t*t + 1) * p2x
                            + (-2*t*t*t + 3*t*t) * p3x
                            + (t*t*t - 2*t*t + t) * L*(p3x-p1x)
                            + (t*t*t - t*t) * L*(p4x-p2x);
        smoothedPoints_Y[j] = (2*t*t*t - 3*t*t + 1) * p2y
                            + (-2*t*t*t + 3*t*t) * p3y
                            + (t*t*t - 2*t*t + t) * L*(p3y-p1y)
                            + (t*t*t - t*t) * L*(p4y-p2y);
    }
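Auto-vectorizer (super simplified view)
Scalar loop, one element per iteration:
    for (i = 0; i < 1000; i++) { C[i] = A[i] + B[i]; }
Vectorized loop, four elements per iteration:
    for (i = 0; i < 1000; i += 4) { C[i:i+3] = A[i:i+3] + B[i:i+3]; }
Each vectorized iteration maps to a single SSE instruction, “addps xmm1, xmm0”: xmm0 + xmm1 -> xmm1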
Auto-vectorizer: info from the compiler • When does the auto-vectorizer kick in? • On the command line, /Qvec-report:1 reports the vectorized loops • /Qvec-report:2 reports both vectorized and non-vectorized loops, and the reason why some loops were not vectorized • Refer to the Vectorizer and Parallelizer Messages in MSDN • From the build output, with /Qvec-report:1: ink_renderer.cpp(1092) : info C5001: loop vectorized
Auto-vectorizer: it’s not always easy
    #include <vector>
    void test1()
    {
        std::vector<int> a(100000), b(100000), c(100000);
        for (int i = 0; i < a.size(); ++i)
        {
            a[i] = b[i] + c[i];
        }
    }
info C5002: loop not vectorized due to reason ‘501’
Auto-vectorizer: it’s not always easy
    #include <vector>
    void test1()
    {
        std::vector<int> a(100000), b(100000), c(100000);
        // Read the size once into a local int so the upper bound is loop-invariant.
        for (int i = 0, iMax = (int)a.size(); i < iMax; ++i)
        {
            a[i] = b[i] + c[i];
        }
    }
info C5001: loop vectorized
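The same idea as a reusable function, sketched under the assumption that the trip count fits in an int. The addVectors name is invented for this example. Whether a given compiler version actually vectorizes the loop still has to be confirmed from its vectorization report.

```cpp
#include <algorithm>
#include <vector>

// Element-wise sum a[i] = b[i] + c[i], written so that the loop bound is a
// local, loop-invariant int, as in the slide's second version.
void addVectors(const std::vector<int>& b, const std::vector<int>& c,
                std::vector<int>& a)
{
    // Never read past the shorter input.
    a.resize(std::min(b.size(), c.size()));

    const int count = static_cast<int>(a.size());
    for (int i = 0; i < count; ++i)
    {
        a[i] = b[i] + c[i];
    }
}
```

Sizing the output from the inputs also rules out the out-of-bounds reads that mismatched vector lengths would cause.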
Auto-vectorizer at work in Austin • The compiler will analyze the loop and emit the right code • For the ink-smoothing algorithm, we got a 30% speed-up • For the first part of the page curling algorithm, we got a 175% speed-up • Auto-vectorizer can analyze very complex loops • Always measure with a profiler to understand which loops you need to speed up • Use the Vectorizer and Parallelizer Messages guide for help
Page curling: calculating normals
Lots of triangles: we have less than 15 ms to “turn a page” in real time, so we need to parallelize this algorithm
    // pseudo-code
    for each triangle
    {
        Position vertex1Pos = triangle.vertex1.position;
        Position vertex2Pos = triangle.vertex2.position;
        Position vertex3Pos = triangle.vertex3.position;
        Normal triangleNormal = cross(vertex2Pos - vertex1Pos, vertex3Pos - vertex1Pos);
        triangleNormal.normalize();
        triangle.vertex1.normal += triangleNormal;
        triangle.vertex2.normal += triangleNormal;
        triangle.vertex3.normal += triangleNormal;
    }
C++ AMP is a good candidate, since the data size is pretty large
Page curling: calculating normals
    // pseudo-code
    // We're looping over each triangle
    for each triangle
    {
        // This set of operations is safe: it works on a single triangle at a time, so no races
        Position vertex1Pos = triangle.vertex1.position;
        Position vertex2Pos = triangle.vertex2.position;
        Position vertex3Pos = triangle.vertex3.position;
        Normal triangleNormal = cross(vertex2Pos - vertex1Pos, vertex3Pos - vertex1Pos);
        triangleNormal.normalize();
        // But here we're updating vertices that are shared between triangles -> race!
        triangle.vertex1.normal += triangleNormal;
        triangle.vertex2.normal += triangleNormal;
        triangle.vertex3.normal += triangleNormal;
    }
As written, this algorithm only works on a single thread
Page curling: split the loop to make it parallelizable
Before: for each triangle { calculate triangle normals; calculate vertex normals }
After: for each triangle { calculate triangle normals; cache triangle normals }, then for each vertex { calculate vertex normals }
First, loop for each triangle…
We use C++ AMP
    c::array<b::float32, 2> tempTriangleNormals(3, (int)triangleCount());
    parallel_for_each(extent<1>(triangleCount()), [=, &tempTriangleNormals](index<1> idx) restrict(amp)
    {
        Position vertex1Pos = triangle.vertex1.position;
        Position vertex2Pos = triangle.vertex2.position;
        Position vertex3Pos = triangle.vertex3.position;
        Normal triangleNormal = cross(vertex2Pos - vertex1Pos, vertex3Pos - vertex1Pos);
        triangleNormal.normalize();
        tempTriangleNormals[idx] = triangleNormal;
    });
Same as before, we calculate the normals for each triangle
We collect the normals into a temporary array, which stays in GPU memory
…then, loop for each vertex
    parallel_for_each(extent<2>(vertexCountY, vertexCountX), [=](index<2> idx) restrict(amp)
    {
        Normal vertexNormal = vertexNormalView(idx);
        // go find the normals from nearby triangles
        vertexNormal += sumTriangleNormals(idx);
        vertexNormal.normalize();
        vertexNormalView(idx) = vertexNormal;
    });
We go over each vertex, so no races
In sumTriangleNormals, we fetch the normals from tempTriangleNormals, i.e., the temporary array we kept in GPU memory
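To make the two-pass structure concrete, here is a minimal CPU-only sketch in standard C++, without C++ AMP. Vec3, Triangle, and computeVertexNormals are names invented for this example. The adjacency list stands in for Austin's sumTriangleNormals lookup, whose real implementation is not shown on the slides.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
typedef std::array<int, 3> Triangle;   // three indices into the vertex list

static Vec3 sub(const Vec3& a, const Vec3& b) {
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}
static Vec3 cross(const Vec3& a, const Vec3& b) {
    Vec3 r = { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    return r;
}
static Vec3 normalized(const Vec3& v) {
    const float len = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    if (len == 0.0f) return v;         // leave degenerate normals untouched
    Vec3 r = { v.x / len, v.y / len, v.z / len };
    return r;
}

std::vector<Vec3> computeVertexNormals(const std::vector<Vec3>& positions,
                                       const std::vector<Triangle>& triangles)
{
    // Pass 1: one normal per triangle. Each iteration writes only its own
    // slot, so the iterations are independent of each other.
    std::vector<Vec3> triangleNormals(triangles.size());
    for (std::size_t t = 0; t < triangles.size(); ++t) {
        const Vec3& p1 = positions[triangles[t][0]];
        const Vec3& p2 = positions[triangles[t][1]];
        const Vec3& p3 = positions[triangles[t][2]];
        triangleNormals[t] = normalized(cross(sub(p2, p1), sub(p3, p1)));
    }

    // Which triangles touch each vertex (built once, serially).
    std::vector<std::vector<std::size_t> > adjacent(positions.size());
    for (std::size_t t = 0; t < triangles.size(); ++t)
        for (int k = 0; k < 3; ++k)
            adjacent[triangles[t][k]].push_back(t);

    // Pass 2: one normal per vertex. Each iteration only reads the cached
    // triangle normals and writes its own vertex, so there is no race.
    std::vector<Vec3> vertexNormals(positions.size());
    for (std::size_t v = 0; v < positions.size(); ++v) {
        Vec3 sum = { 0.0f, 0.0f, 0.0f };
        for (std::size_t i = 0; i < adjacent[v].size(); ++i) {
            const Vec3& n = triangleNormals[adjacent[v][i]];
            sum.x += n.x; sum.y += n.y; sum.z += n.z;
        }
        vertexNormals[v] = normalized(sum);
    }
    return vertexNormals;
}
```

Because each pass writes only to the slot it owns, either loop could be handed to a parallel construct (PPL on the CPU, or C++ AMP as on the slides) without locks.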
Page curling: C++ AMP at work • Massive parallelism with GPU and WARP • Running this algorithm on the GPU yields between 3x and 7x speed-ups • CPU is now free to execute other code • Even when DirectX 11 capable GPU hardware is not present, C++ AMP will fall back to WARP, which leverages multi-core and SSE2
Key takeaways • Use modern C++: RAII, r-value references, lambdas, const, Standard C++ Libraries, Boost, other 3rd party libraries, etc. • DirectX for fast and powerful graphics • XAML UI for standard UI elements • C++/CX to talk to Windows, to other components and to other languages (e.g., JS) • Auto-vectorizer and PPL to distribute work on the CPU • C++ AMP to leverage the GPU's massively parallel compute power • C++ Rocks! Go write great apps!!
Related Sessions • Tue/5:45/B92 Odyssey: Connecting C++ Apps to the Cloud via Casablanca • Wed/11:15/B92 Odyssey: It’s all about performance: Using Visual C++ 2012 to make the best use of your hardware • Wed/1:45/B92 Stinger: DirectX Graphics Development with Visual Studio 2012
Related Sessions • Wed/5:15/B33 Cascade: Diving deep into C++/CX and WinRT • Thu/5:15/B92 Nexus/Normandy: Building a Windows Store app using XAML and C++ - Photo app, the Hilo project • Fri/12:45/B33 McKinley: The Future of C++
Resources • vcblog • Project Austin Part 1 of 6: Introduction • Project Austin on CodePlex • Auto-Vectorizer in Visual Studio 2012 • C++ AMP in a nutshell • Parallel Patterns Library (PPL) • alecont@microsoft.com Please submit session evals on the Build Windows 8 App or at http://aka.ms/BuildSessions
Ink smoothing: the math • The line must be continuous, as well as its first and second derivatives • We approximate with the “cardinal” spline solution • With the auto-vectorizer, we get a nice 30% speed-up
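For reference, the cardinal spline evaluated by the ink-smoothing loop can be written out as follows. The symbols are read off that code: P_1 to P_4 are four consecutive input points, L is the tangent scale, and t runs from 0 to 1 between P_2 and P_3. The notation itself is chosen for this note.

```latex
\begin{aligned}
P(t) &= h_{00}(t)\,P_2 + h_{01}(t)\,P_3
      + h_{10}(t)\,L\,(P_3 - P_1) + h_{11}(t)\,L\,(P_4 - P_2),
      \qquad t \in [0, 1] \\
h_{00}(t) &= 2t^3 - 3t^2 + 1 \\
h_{01}(t) &= -2t^3 + 3t^2 \\
h_{10}(t) &= t^3 - 2t^2 + t \\
h_{11}(t) &= t^3 - t^2
\end{aligned}
```

At t = 0 only h_00 is non-zero and equals 1, so P(0) = P_2. At t = 1 only h_01 is non-zero and equals 1, so P(1) = P_3. Pressure is interpolated linearly, as (1 - t) p_2 + t p_3.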
Page curling: how do we turn the page • Brilliant paper by Hong et al., Turning Pages of 3D Electronic Books • Turning a page of a physical book can be simulated as deforming a page around a cone • Each “page” in Austin is made of a bunch of triangles • In C++, we apply the page turning algorithm to all triangles • The auto-vectorizer comes to the rescue again with a sweet 1.7x speed-up
Page curling: vertex normals and shading • Vertex normals are typically calculated as the normalized average of the surface normals of all triangles containing the vertex • Using this approach, computing the vertex normals on the CPU simply involves iterating over all triangles depicting the page surface and accumulating the triangle normals in the normals of the respective vertices • To me, the above screams “massively parallel”
Page curling: C++ AMP
    // first calculate the triangle normals
    c::array<b::float32, 2> triangleNormals(3, (int)triangleCount());
    c::parallel_for_each(c::extent<1>(triangleCount()), [=, &triangleNormals](c::index<1> idx) restrict(amp)
    {
        b::float32 v1PosX = vertexPositionArray(0, indexArray(2, idx[0])[0]);
        b::float32 v1PosY = vertexPositionArray(1, indexArray(2, idx[0])[0]);
        b::float32 v1PosZ = vertexPositionArray(2, indexArray(2, idx[0])[0]);

        b::float32 v2PosX = vertexPositionArray(0, indexArray(1, idx[0])[0]);
        b::float32 v2PosY = vertexPositionArray(1, indexArray(1, idx[0])[0]);
        b::float32 v2PosZ = vertexPositionArray(2, indexArray(1, idx[0])[0]);

        b::float32 v3PosX = vertexPositionArray(0, indexArray(0, idx[0])[0]);
        b::float32 v3PosY = vertexPositionArray(1, indexArray(0, idx[0])[0]);
        b::float32 v3PosZ = vertexPositionArray(2, indexArray(0, idx[0])[0]);

        b::float32 x1 = v2PosX - v1PosX;
        b::float32 y1 = v2PosY - v1PosY;
        b::float32 z1 = v2PosZ - v1PosZ;

        b::float32 x2 = v3PosX - v1PosX;
        b::float32 y2 = v3PosY - v1PosY;
        b::float32 z2 = v3PosZ - v1PosZ;

        // cross them
        b::float32 x3 = y1 * z2 - z1 * y2;
        b::float32 y3 = z1 * x2 - x1 * z2;
        b::float32 z3 = x1 * y2 - y1 * x2;

        NORMALIZE(x3, y3, z3);

        triangleNormals(0, idx[0]) = x3;
        triangleNormals(1, idx[0]) = y3;
        triangleNormals(2, idx[0]) = z3;
    });