Another attempt at supporting non-contiguous arrays #171
Closed
SyamGadde wants to merge 25 commits into inducer:master from
Conversation
added 7 commits
February 20, 2018 12:44
…avior of users of get_elwise_module or get_elwise_range_module, to test backwards compatibility and performance.
…ctual GPUArrays and using ARRAY_i indices (rather than just i).
Owner
Wow. This looks like an impressive bit of work. I agree with the objectives here, and I would like to see this (or something like it) get merged. Unfortunately, I won't be able to take a close look before mid-April (work deadlines). Sorry about the long delay. I've made a note to take a look then. In the meantime, I think it would be a good idea to solicit reviews from folks on the mailing list. Most of all, thanks for working on this!
Contributor
Author
Absolutely. I will subscribe to the list and post. Thanks!
(Function may change based on kernel call arguments)
Contributor
Author
Closing and creating a new pull request #172 with more fixes, more reasonably arranged.
NOTE: SUPERSEDED BY #172.
Inspired by:
https://lists.tiker.net/pipermail/pycuda/2016-December/004986.html
https://gitlab.tiker.net/inducer/pycuda/merge_requests/1
I tried a new approach to supporting non-contiguous arrays in PyCUDA (could be ported to PyOpenCL somewhat easily I think). The goals (some elicited by the above discussion and comments in the WIP) were:
preserving the behavior seen by existing users of get_elwise_module and get_elwise_range_module (backwards compatibility)

The only way I could think of to support all those goals was to delay compilation (and source generation) until call time, to take advantage of knowledge of the input array strides. Contiguous arrays get the kernels that PyCUDA has always given them; non-contiguous arrays get specialized kernels. The nice thing about doing this is that the actual shape and strides can be sent as '#define's to aid compiler optimization (this could even help with the contiguous kernels, though I have not tried that). The tricky thing is that some functions in the current implementation require the Module/Function before call time, to get texrefs etc. So I basically implemented a proxy class for SourceModule, called DeferredSourceModule, which also defers the generation of the values created by get_function(), get_texref(), etc. until call time.

To make this all work, indexing (for non-contiguous arrays at least) for an array A needs to be A[A_i], rather than A[i]. If matching contiguous arrays are detected as inputs, then X_i is '#define'd to be i, so kernels written the old way will still work (as long as the input arrays are contiguous and match in strides). No regexes are needed to transform the user-specified kernel fragments; it is all directed by the user. Also, if you want to support non-contiguous arrays, you need to send the actual GPUArray objects, rather than their gpudata members, to the call or prepare_ functions.

All existing tests succeed. More would probably need to be added if it made sense to integrate this into PyCUDA.
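For illustration, here is a minimal sketch of how a user-written elementwise operation might look under the A[A_i] convention, assuming this branch's behavior; the dest_i/a_i/b_i index names, the exact ElementwiseKernel wiring, and the array shapes are my assumptions based on the description above, not code taken from the branch.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# Hypothetical kernel fragment: each array is indexed by its own derived
# index (dest_i, a_i, b_i) instead of the single contiguous index i.
add = ElementwiseKernel(
    "float *dest, float *a, float *b",
    "dest[dest_i] = a[a_i] + b[b_i]",
    "strided_add")

a = gpuarray.to_gpu(np.random.rand(1000, 100).astype(np.float32))
b = gpuarray.to_gpu(np.random.rand(1000, 100).astype(np.float32))

# Non-contiguous views; the GPUArray objects themselves (not .gpudata) are
# passed, so the deferred module can see their shapes and strides at call time.
a_view = a[::2, :-1]
b_view = b[::2, :-1]
dest = gpuarray.empty(a_view.shape, a_view.dtype)

add(dest, a_view, b_view)
print(dest.get()[:2, :2])       # retrieve the result
print(a_view.get()[:2, :2])     # .get() on a sliced/strided view also works per this PR
```

If all the inputs were contiguous and matched in strides, dest_i/a_i/b_i would simply be '#define'd to i and the usual contiguous kernel would be generated, so existing fragments keep working.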
Positive side-effects:
GPUArray.get(), GPUArray.set() and GPUArray.copy() now work for arbitrarily sliced/strided arrays.

The performance hit for contiguous arrays is around 15% for modest-sized arrays (e.g. the 1000x100 array tested by Keegan in the above discussion) and, looking at profiling output, I think the hit is due to detecting contiguity/order (in ElementwiseSourceModule.create_key()). This could probably be improved. Performance for non-contiguous arrays is infinitely better, given that they weren't supported before, but I've seen a 40% slowdown relative to the contiguous version for the b1[::2,:-1]**2 test Keegan tried, due to the need to calculate indices at each iteration of the loop. It tries to do this in a smart way, by pre-calculating the per-thread step for each dimension, and only using division/modulo to calculate the starting indices for each thread before the loop.

Independently of whether these changes are merged in, I will continue to use and develop them to support some local needs, so comments are welcome. I hope this is useful!
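To make that indexing strategy concrete, here is a small CPU-side Python sketch of the general technique described (my reconstruction, not code from this branch): division and modulo are used only to compute each thread's starting multi-index and the decomposed per-thread step, and inside the loop the multi-index and memory offset advance with additions and carry propagation only.

```python
def visit_strided(shape, strides, tid, nthreads):
    """CPU emulation of one GPU thread's element visits over a strided array.

    The thread handles logical (C-order) element numbers tid, tid + nthreads,
    tid + 2*nthreads, ...  Division/modulo appears only in the setup below.
    Assumes nthreads <= total element count and strides in element units.
    """
    ndim = len(shape)
    size = 1
    for extent in shape:
        size *= extent

    # Starting multi-index and memory offset for this thread (div/mod here only).
    idx = [0] * ndim
    rem = tid
    for d in range(ndim - 1, -1, -1):
        idx[d] = rem % shape[d]
        rem //= shape[d]
    offset = sum(i * s for i, s in zip(idx, strides))

    # Per-thread step, decomposed once into the same mixed radix as the shape.
    step = [0] * ndim
    rem = nthreads
    for d in range(ndim - 1, -1, -1):
        step[d] = rem % shape[d]
        rem //= shape[d]

    flat = tid
    while flat < size:
        yield offset            # the kernel body would read/write at this offset
        flat += nthreads
        # Advance the multi-index by the precomputed step, with carries;
        # the offset is updated incrementally, with no division/modulo.
        carry = 0
        for d in range(ndim - 1, -1, -1):
            new = idx[d] + step[d] + carry
            carry = 0
            if new >= shape[d]:
                new -= shape[d]
                carry = 1
            offset += (new - idx[d]) * strides[d]
            idx[d] = new

# Example: the b1[::2, :-1] view of a 1000x100 array from the discussion has
# shape (500, 99) and element strides (200, 1); thread 3 of a 256-thread launch:
# offsets = list(visit_strided((500, 99), (200, 1), tid=3, nthreads=256))
```

In the generated kernels, per the description, the shape and strides would be baked in as '#define's, presumably allowing the compiler to constant-fold much of this setup arithmetic.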