Documentation update for rustworkx migration #437
peterrrock2 merged 2 commits into mggg:wip/rustworkx-migration from
| # frm: TODO: Documentation: Ask Peter for better description of region surcharge | ||
| # | ||
| # I have to admit that I am a bit confused about how region surcharge works. | ||
| # | ||
| # It has the effect of retaining the edges of nodes that are in the "same region". | ||
| # There is no guarantee where these nodes will end up in the tree, however, | ||
| # because the root of the tree that is eventually chosen in the find_balanced_edge_fn | ||
| # is typically randomly selected. This seems to just have the effect of keeping | ||
| # nodes that are in the same region close to each other in the tree. Stated | ||
| # differently, if an edge for nodes in the same region was NOT included, | ||
| # then the two nodes could end up farther apart in the spanning tree, and hence | ||
| # more likely to be put in separate districts. | ||
| # | ||
| # What has me puzzled, however, is that the nature of GerryChain graphs is that | ||
| # because nodes represent geographic areas, nodes that are in the "same region" | ||
| # are already "close" to each other geographically and hence close to each | ||
| # other in the graph. So the effect of a region surcharge is probably | ||
| # pretty small - since nodes in the same region are already typically close | ||
| # to each other in the spanning tree. | ||
| # | ||
| # The choice of the root of the spanning tree seems to be pretty important. | ||
| # If the root is in a region that has a surcharge then all of the nodes in | ||
| # the shared region will be at the top of the tree. This would seem to | ||
| # increase the likelihood that those nodes would be put in the same district. | ||
| # Actually, what you really want is a way to have all of the nodes in the | ||
| # same region be grouped together in a subtree at the BOTTOM of the | ||
| # spanning tree, because the find_balanced_edge_cuts functions all go | ||
| # bottom up. | ||
| # | ||
| # In short, I find that I do not really grok why region surcharge works, | ||
| # how effective it really is, and why there is not a better solution to | ||
| # keeping regions together... | ||
| # | ||
| # Another issue is how users should think about the weights used for | ||
| # a region surcharge. This is especially interesting when there is | ||
| # more than one "region" being surcharged - for instance "muni" and "water". |
The region_surcharge parameter is a clever trick to modify the way that Kruskal's algorithm samples from the space of all possible spanning trees.
For the sake of simplicity, imagine that we have a set of regions that tile our map (e.g. counties). Then the region_surcharge parameter will look at the base graph and assign a surcharge weight, say 0.3 for the sake of example, to edges between distinct regions of the same type. So if there is an edge between a node in County A and a node in County B, then, when we roll the random weight for that edge for Kruskal's algorithm (a value between 0 and 1), we add the surcharge weight to that edge after the fact. This makes edges between regions less likely to be selected early on by Kruskal's algorithm.
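As a rough illustration of the weight roll described above, here is a minimal sketch in plain Python. It is not GerryChain's implementation: the helper name `roll_edge_weights`, the toy path graph, and the `county` mapping are made up for the example, and 0.3 is just the surcharge value used in the paragraph above.

```python
import random


def roll_edge_weights(edges, region_of, surcharge, rng):
    """Roll a uniform weight in [0, 1) for each edge, then add `surcharge`
    to every edge whose endpoints lie in different regions."""
    weights = {}
    for u, v in edges:
        weight = rng.random()
        if region_of[u] != region_of[v]:
            weight += surcharge  # applied after the random roll
        weights[(u, v)] = weight
    return weights


# Hypothetical toy graph: a path of four nodes, two per county.
edges = [(0, 1), (1, 2), (2, 3)]
county = {0: "A", 1: "A", 2: "B", 3: "B"}

weights = roll_edge_weights(edges, county, surcharge=0.3, rng=random.Random(0))
for (u, v), weight in weights.items():
    kind = "crosses counties" if county[u] != county[v] else "within a county"
    print(f"edge {u}-{v} ({kind}): {weight:.3f}")
```

Because Kruskal's algorithm considers edges in increasing weight order, the surcharged edge 1-2 tends to be considered later than it would be without the surcharge.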
Pushing this to the extreme, it is not too hard to see that if you assign a surcharge weight of 1.0, then at some point in Kruskal's algorithm you are forced into a spanning forest where each tree in the forest is its own region. In fact, if there are
Then comes the population balancing step. Since you only have one edge in between each region of interest when the surcharge is 1.0, when you go to bisect the tree, you are then FORCED to leave at least
In this way, you can think of region_surcharge as a parameter that modifies the graph partition that appears at step
When there are multiple types of region with a surcharge of 1.0, the idea is the same, but the exact math of it gets a little messy since you need to consider intersections. TL;DR is that there will be some step in Kruskal's algorithm where you have
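The single-region-type case with a surcharge of 1.0 can be checked numerically. The sketch below is an illustration under stated assumptions, not GerryChain code: it uses a hand-rolled Kruskal's algorithm, a hypothetical 4x6 grid split into three connected counties, and made-up helpers `sample_tree` and `count_cross_edges`. Since within-county weights land in [0, 1) and cross-county weights are at least 1.0, every within-county edge is processed first, so the finished tree should contain exactly (number of counties - 1) cross-county edges.

```python
import random


def kruskal_tree(nodes, weighted_edges):
    """Return minimum spanning tree edges using Kruskal's algorithm with union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for _, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


# Hypothetical 4x6 grid; each pair of adjacent columns forms one county.
ROWS, COLS = 4, 6
NODES = [(r, c) for r in range(ROWS) for c in range(COLS)]
EDGES = [((r, c), (r, c + 1)) for r in range(ROWS) for c in range(COLS - 1)]
EDGES += [((r, c), (r + 1, c)) for r in range(ROWS - 1) for c in range(COLS)]
COUNTY = {(r, c): c // 2 for r, c in NODES}


def sample_tree(surcharge, seed):
    """Roll surcharged random weights and return the resulting spanning tree."""
    rng = random.Random(seed)
    weighted = [
        (rng.random() + (surcharge if COUNTY[u] != COUNTY[v] else 0.0), u, v)
        for u, v in EDGES
    ]
    return kruskal_tree(NODES, weighted)


def count_cross_edges(tree):
    """Count tree edges whose endpoints are in different counties."""
    return sum(COUNTY[u] != COUNTY[v] for u, v in tree)


for surcharge in (0.0, 1.0):
    counts = [count_cross_edges(sample_tree(surcharge, seed)) for seed in range(5)]
    print(f"surcharge {surcharge}: cross-county tree edges per sample = {counts}")
```

The guarantee relies on each county being connected on its own; if a county were split into disconnected pieces, extra cross-county edges would be needed to join them.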
| # frm: TODO: Documentation: What values make sense for region surcharge? | ||
| # | ||
| # We should make clear what the issues are for different values of | ||
| # region_surcharge. The random values are between 0-1, so any region | ||
| # surcharge value greater than 1 will dominate. What happens if the | ||
| # user provides several region surcharge values for different | ||
| # node attributes? I do not know, and I presume users won't know. | ||
| # | ||
| # Note that the existing documentation sometimes says that | ||
| # region surcharge values should be between 0-1 and then it has | ||
| # example code where the values are 1 or greater than 1... |
Noted. Values above 1 can be useful, since they let you place an importance ordering on the common refinement, but for most people there is not a whole lot of utility once the values get above 1.
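To make the "importance ordering" idea concrete, here is a small sketch. It assumes that the surcharges for different node attributes are simply summed onto the random weight. The helper `edge_weight`, the node attribute values, and the surcharges 1.0 and 2.0 are hypothetical, chosen only so that each combination of crossed boundaries falls into its own band of weights.

```python
import math
import random


def edge_weight(attrs_u, attrs_v, region_surcharge, rng):
    """Random weight in [0, 1) plus one surcharge per attribute that differs
    between the two endpoints (assumed behavior for this sketch)."""
    weight = rng.random()
    for key, value in region_surcharge.items():
        if attrs_u[key] != attrs_v[key]:
            weight += value
    return weight


region_surcharge = {"muni": 1.0, "water": 2.0}  # hypothetical values
rng = random.Random(0)

base = {"muni": "m1", "water": "w1"}
neighbors = {
    "same muni, same water": {"muni": "m1", "water": "w1"},
    "crosses muni only": {"muni": "m2", "water": "w1"},
    "crosses water only": {"muni": "m1", "water": "w2"},
    "crosses both": {"muni": "m2", "water": "w2"},
}

bands = {}
for label, attrs in neighbors.items():
    weight = edge_weight(base, attrs, region_surcharge, rng)
    bands[label] = math.floor(weight)  # integer part identifies the band
    print(f"{label}: weight {weight:.3f}, band {bands[label]}")
```

With these values, Kruskal's algorithm would work through the edges band by band: first edges inside the common refinement, then edges that cross only a muni boundary, then only a water boundary, then both. If both surcharges were 1.0 instead, the two single-crossing cases would share a band and no ordering between them would be imposed.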
| # frm: TODO: BUG? Why do we surcharge when either node is not in the region? | ||
| # | ||
| # In the paper that Peter sent me titled, Models of Random Spanning Trees, it states: | ||
| # | ||
| # When users desire to make it more likely that two nodes are placed in | ||
| # different pieces of a partition, they can add a positive “surcharge” | ||
| # to the weight on the edge between those nodes. When a collection of | ||
| # contiguous nodes makes up a region that users prefer to keep in the | ||
| # same piece, the same idea can be used to surcharge the boundary edges | ||
| # of the region. This has the effect that minimum spanning tree is more | ||
| # likely to restrict to a tree on the designated region; in the | ||
| # bipartition step, that means the region will be kept whole or split | ||
| # at most once. | ||
| # | ||
| # But the code below surcharges both when the nodes are in different regions | ||
| # AND when either of the nodes is not in a region at all. From what the | ||
| # article says, we should not surcharge when BOTH nodes are not in a region | ||
| # at all. | ||
| # | ||
| # ...confused... | ||
| # | ||
| # Ask Peter... | ||
| # |
This is something that we have gone back and forth on in the lab. There are meaningful differences in whether you should treat members of the "void" (places without a region assignment) as all being members of some "void" region or as individual regions unto themselves (the implementation given here). Currently, this is here because it has worked in the past for the outcomes that we desired, but we are actively working on figuring out the best practice here.
What will probably happen is we will introduce another parameter that will allow you to deal with "void" nodes in a variety of ways.
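One way such a parameter might look is sketched below. This is purely hypothetical: the function `crosses_region`, the name `void_policy`, and its values `"separate"` and `"shared"` are invented for illustration and are not part of GerryChain. Here `None` stands for a node with no region assignment.

```python
def crosses_region(region_u, region_v, void_policy="separate"):
    """Decide whether the edge (u, v) should receive the surcharge.

    void_policy (hypothetical):
      "separate": every unassigned node is its own region, so any edge
                  touching an unassigned node is surcharged.
      "shared":   all unassigned nodes form one common "void" region.
    """
    if void_policy not in ("separate", "shared"):
        raise ValueError(f"unknown void_policy: {void_policy!r}")
    if void_policy == "separate" and (region_u is None or region_v is None):
        return True
    # Under "shared", None compares equal to None, so two void nodes match.
    return region_u != region_v


for pair in [("A", "A"), ("A", "B"), ("A", None), (None, None)]:
    results = {p: crosses_region(*pair, void_policy=p) for p in ("separate", "shared")}
    print(pair, results)
```

In this sketch the two policies differ only for edges between two unassigned nodes: `"separate"` surcharges them and `"shared"` does not.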
This is almost all documentation changes. I wanted to get them in before you (Peter) did the docstring overhaul.
There is one question in the code that needs to be answered for the documentation to be accurate - the question is about how region_surcharge works in random_spanning_tree(). Since region_surcharge is so useful, I think it makes sense to be crystal clear about how it works (and when it doesn't work).