Documentation update for rustworkx migration #437

Merged
peterrrock2 merged 2 commits into mggg:wip/rustworkx-migration from chief-dweeb:frm_rustworkx_post_alpha
Feb 25, 2026

Conversation

@chief-dweeb

This is almost all documentation changes. I wanted to get them in before you (Peter) did the docstring overhaul.

There is one question in the code that needs to be answered for the documentation to be accurate: how region_surcharge works in random_spanning_tree(). Since region_surcharge is so useful, I think it makes sense to be crystal clear about how it works (and when it doesn't work).

Comment on lines +120 to +155
# frm: TODO: Documentation: Ask Peter for better description of region surcharge
#
# I have to admit that I am a bit confused about how region surcharge works.
#
# It has the effect of retaining the edges of nodes that are in the "same region".
# There is no guarantee where these nodes will end up in the tree, however,
# because the root of the tree that is eventually chosen in the find_balanced_edge_fn
# is typically randomly selected. This seems to just have the effect of keeping
# nodes that are in the same region close to each other in the tree. Stated
# differently, if an edge for nodes in the same region was NOT included, then
# the two nodes could end up farther apart in the spanning tree, and hence
# more likely to be put in separate districts.
#
# What has me puzzled, however, is that the nature of GerryChain graphs is that
# because nodes represent geographic areas, nodes that are in the "same region"
# are already "close" to each other geographically and hence close to each
# other in the graph. So the effect of a region surcharge is probably
# pretty small - since nodes in the same region are already typically close
# to each other in the spanning tree.
#
# The choice of the root of the spanning tree seems to be pretty important.
# If the root is in a region that has a surcharge then all of the nodes in
# the shared region will be at the top of the tree. This would seem to
# increase the likelihood that those nodes would be put in the same district.
# Actually, what you really want is a way to have all of the nodes in the
# same region be grouped together in a subtree at the BOTTOM of the
# spanning tree, because the find_balanced_edge_cuts functions all go
# bottom up.
#
# In short, I find that I do not really grok why region surcharge works,
# how effective it really is, and why there is not a better solution to
# keeping regions together...
#
# Another issue is how users should think about the weights used for
# a region surcharge. This is especially interesting when there is
# more than one "region" being surcharged - for instance "muni" and "water".
Collaborator

The region_surcharge parameter is a clever trick to modify the way that Kruskal's algorithm samples from the space of all possible spanning trees.

For the sake of simplicity, imagine that we have a set of regions that tile our map (e.g. counties). Then the region_surcharge parameter will look at the base graph and assign a surcharge weight, say 0.3 for the sake of example, to edges between distinct regions of the same type. So if there is an edge between a node in County A and a node in County B, then, when we roll the random weight for that edge for Kruskal's algorithm (a value between 0 and 1), we add the surcharge weight to that edge after the fact. This makes edges between regions less likely to be selected early on by Kruskal's algorithm.
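The weighting step described above can be sketched in a few lines. This is a minimal illustration, not the GerryChain implementation: the function name `surcharged_weights`, the `region_of` mapping, and the four-node path are all hypothetical, and only a single region type is handled.

```python
import random


def surcharged_weights(edges, region_of, surcharge, rng):
    """Roll a uniform weight in [0, 1) for each edge, then add the
    surcharge to every edge whose endpoints lie in different regions."""
    weights = {}
    for u, v in edges:
        weight = rng.random()
        if region_of[u] != region_of[v]:
            weight += surcharge  # edge crosses a region boundary
        weights[(u, v)] = weight
    return weights


# Hypothetical example: a path of four nodes split between two counties.
edges = [(0, 1), (1, 2), (2, 3)]
region_of = {0: "A", 1: "A", 2: "B", 3: "B"}
weights = surcharged_weights(edges, region_of, surcharge=0.3, rng=random.Random(0))
```

Only the edge `(1, 2)` crosses from County A to County B, so its weight lands in the interval [0.3, 1.3) while the other two stay in [0, 1).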

Pushing this to the extreme, it is not too hard to see that if you assign a surcharge weight of 1.0, then at some point in Kruskal's algorithm you are forced into a spanning forest where each tree in the forest spans its own region (assuming each region is connected). In fact, if there are $n$ nodes in the graph and $k$ regions, then this forest appears on step $n-k$ of Kruskal's algorithm. With the 1.0 surcharge, that means that the remaining $k-1$ edges that need to be selected for the spanning tree can only be between your chosen regions.
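To make the step count concrete, here is a small self-contained sketch, assuming a hypothetical 4x4 grid tiled by four connected 2x2 regions. It uses a plain union-find Kruskal's written for this example (not GerryChain's code) and checks that the first $n-k$ accepted edges stay inside regions and the last $k-1$ run between them.

```python
import random


def kruskal_order(nodes, weights):
    """Return the tree edges in the order Kruskal's algorithm accepts them."""
    parent = {v: v for v in nodes}

    def find(v):
        # Walk to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    accepted = []
    for u, v in sorted(weights, key=weights.get):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            accepted.append((u, v))
    return accepted


side = 4
nodes = [(r, c) for r in range(side) for c in range(side)]
edges = [((r, c), (r, c + 1)) for r in range(side) for c in range(side - 1)]
edges += [((r, c), (r + 1, c)) for r in range(side - 1) for c in range(side)]
# Four hypothetical regions: the 2x2 quadrants of the grid.
region_of = {(r, c): (r // 2, c // 2) for r, c in nodes}

rng = random.Random(1)
weights = {
    (u, v): rng.random() + (1.0 if region_of[u] != region_of[v] else 0.0)
    for u, v in edges
}

tree = kruskal_order(nodes, weights)
n, k = len(nodes), len(set(region_of.values()))
within = tree[: n - k]    # edges accepted up to step n - k
between = tree[n - k:]    # the remaining k - 1 edges
```

Because every within-region weight is below 1 and every cross-region weight is at least 1, the result does not depend on the seed as long as each region is connected.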

Then comes the population balancing step. When the surcharge is 1.0, only $k-1$ edges of the tree run between regions of interest and every other edge lies inside a single region. So when you bisect the tree, you are FORCED to leave at least $k-1$ of the regions of interest whole: the cut removes one edge, and that edge splits at most one region. Note that this is not at all guaranteed when there is no surcharge on edges between regions, since trees in the spanning forest are then free to weave in and out of regions at their leisure.
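The counting argument can be checked on a toy case. The sketch below assumes a hand-built tree of the kind a 1.0 surcharge produces (a six-node path with three two-node regions), tries every possible single-edge cut, and counts how many regions end up entirely on one side. Population balance is ignored here, and the helper name `whole_regions_after_cut` is hypothetical.

```python
def whole_regions_after_cut(tree_edges, region_of, cut):
    """Count regions that lie entirely on one side after removing `cut`."""
    adjacency = {v: set() for v in region_of}
    for u, v in tree_edges:
        if (u, v) != cut:
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Flood-fill the side of the cut that contains cut[0].
    side, stack = {cut[0]}, [cut[0]]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in side:
                side.add(neighbor)
                stack.append(neighbor)

    # For each region, record which sides of the cut its nodes fall on.
    sides_seen = {}
    for node, region in region_of.items():
        sides_seen.setdefault(region, set()).add(node in side)
    return sum(1 for seen in sides_seen.values() if len(seen) == 1)


# Hypothetical tree: a path whose restriction to each region is a tree.
tree_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
region_of = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}
counts = {
    cut: whole_regions_after_cut(tree_edges, region_of, cut)
    for cut in tree_edges
}
```

Cutting a between-region edge such as `(1, 2)` keeps all three regions whole, while cutting a within-region edge such as `(0, 1)` splits exactly one, so no cut leaves fewer than $k-1 = 2$ regions whole.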

In this way, you can think of region_surcharge as a parameter that modifies the graph partition that appears at step $n-k$ of Kruskal's algorithm, where a higher value increases the probability of seeing a spanning forest on the regions of interest. As the value decreases from 1 to 0, you increase the amount of "weaving" between regions that is allowed at step $n-k$ of Kruskal's algorithm, and therefore decrease the number of regions that you expect to keep whole.
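One rough way to quantify this, offered as an illustration rather than a statement about the full forest distribution: compare a single cross-region edge against a single within-region edge. Assume both base weights $U_1, U_2$ are independent and uniform on $[0,1]$, and that a surcharge $s \in [0,1]$ is added to the cross-region edge. The chance that the cross-region edge still sorts first is

$$
P(U_1 + s < U_2) \;=\; P(U_2 - U_1 > s) \;=\; \frac{(1-s)^2}{2}.
$$

At $s = 0$ this is $1/2$ (no preference), at $s = 0.3$ it is $0.245$, and at $s = 1$ it is $0$, which matches the picture of weaving shrinking to nothing as the surcharge approaches 1.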

When there are multiple types of region with a surcharge of 1.0, the idea is the same, but the exact math of it gets a little messy since you need to consider intersections. TL;DR is that there will be some step in Kruskal's algorithm where you have $\ell$ pieces of a partition where each piece is a member of the (topological) common refinement of the region types, and then you build a spanning tree on that. In this case, the number of pieces that you are guaranteed to keep whole depends entirely on the way that the region types are laid out, and there are ways of arranging regions that would force you into a situation with no guarantees (e.g. an $n \times m$ grid where the types of region are "rows" and "columns": the common refinement has $n\cdot m$ pieces -- the individual nodes -- and so you are back to basic Kruskal's).
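The rows-and-columns case is easy to verify directly. The sketch below is a simplification: it groups nodes by their tuple of region labels and does not split those groups further into connected components, which a topological common refinement would require. On this grid every group is a single node, so the distinction does not matter. The function name and grid size are hypothetical.

```python
def common_refinement(nodes, *assignments):
    """Group nodes that share the same label under every region type.

    Note: this ignores connectivity, so it is only the label-wise
    refinement, not the topological one.
    """
    pieces = {}
    for node in nodes:
        key = tuple(assignment[node] for assignment in assignments)
        pieces.setdefault(key, set()).add(node)
    return list(pieces.values())


rows, cols = 3, 4
nodes = [(r, c) for r in range(rows) for c in range(cols)]
row_region = {(r, c): r for r, c in nodes}  # region type "rows"
col_region = {(r, c): c for r, c in nodes}  # region type "columns"

pieces = common_refinement(nodes, row_region, col_region)
```

With 3 rows and 4 columns the refinement has 12 pieces, one per node, so surcharging both region types leaves nothing larger than a single node to protect.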

Comment on lines +199 to +209
# frm: TODO: Documentation: What values make sense for region surcharge?
#
# We should make clear what the issues are for different values of
# region_surcharge. The random values are between 0-1, so any region
# surcharge value greater than 1 will dominate. What happens if the
# user provides several region surcharge values for different
# node attributes? I do not know, and I presume users won't know.
#
# Note that the documentation currently existing sometimes says that
# region surcharge values should be between 0-1 and then it has
# example code where the values are 1 or greater than 1...
Collaborator

Noted. Values above 1 can be useful since then you can place an importance ordering on the common refinement, but for most people, once the values get above 1, there is not a whole lot of additional utility.
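One possible reading of "importance ordering", sketched under the assumption that each edge gets a uniform roll in [0, 1) plus the sum of the surcharges for every region type whose boundary it crosses. The helper `surcharge_band` and the specific values are hypothetical; the "muni" and "water" keys are borrowed from the TODO comment earlier in the diff.

```python
def surcharge_band(crossed, region_surcharge):
    """Return the [low, high) interval that an edge's weight falls in,
    given the region types whose boundary the edge crosses."""
    total = sum(region_surcharge[key] for key in crossed)
    return (total, total + 1.0)


# Hypothetical surcharges: "water" is treated as more important than "muni".
region_surcharge = {"muni": 1.0, "water": 2.0}

bands = {
    "neither": surcharge_band([], region_surcharge),
    "muni only": surcharge_band(["muni"], region_surcharge),
    "water only": surcharge_band(["water"], region_surcharge),
    "both": surcharge_band(["muni", "water"], region_surcharge),
}
```

With these values the four bands do not overlap, so Kruskal's algorithm would exhaust edges inside both region types first, then edges crossing only a "muni" boundary, then only a "water" boundary, then both. If both surcharges were 1.0, the "muni only" and "water only" bands would coincide and neither type would take priority.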

Comment on lines +282 to +304
# frm: TODO: BUG? Why do we surcharge when either node is not in the region?
#
# In the paper that Peter sent me titled, Models of Random Spanning Trees, it states:
#
# When users desire to make it more likely that two nodes are placed in
# different pieces of a partition, they can add a positive “surcharge”
# to the weight on the edge between those nodes. When a collection of
# contiguous nodes makes up a region that users prefer to keep in the
# same piece, the same idea can be used to surcharge the boundary edges
# of the region. This has the effect that minimum spanning tree is more
# likely to restrict to a tree on the designated region; in the
# bipartition step, that means the region will be kept whole or split
# at most once.
#
# But the code below surcharges both when the nodes are in different regions
# AND when either of the nodes is not in a region at all. From what the
# article says, we should not surcharge when BOTH nodes are not in a region
# at all.
#
# ...confused...
#
# Ask Peter...
#
Collaborator

This is something that we have gone back and forth on in the lab. There are meaningful differences in whether you should treat members of the "void" (places without a region assignment) as all being members of some "void" region or as individual regions unto themselves (the implementation given here). Currently, this is here because it has worked in the past for the outcomes that we desired, but we are actively working on figuring out the best practice here.

What will probably happen is we will introduce another parameter that will allow you to deal with "void" nodes in a variety of ways.
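The two treatments of "void" nodes can be contrasted with a small predicate. This is only a sketch: the function `crosses_region` and its `void_as_singletons` flag are hypothetical and are not an existing or planned GerryChain parameter. It assumes that a node without a region assignment carries the label `None`.

```python
def crosses_region(label_u, label_v, void_as_singletons=True):
    """Decide whether the edge (u, v) should receive the surcharge.

    void_as_singletons=True:  each unassigned node is its own region, so
        any edge touching an unassigned node is surcharged.
    void_as_singletons=False: all unassigned nodes form one shared "void"
        region, so an edge between two unassigned nodes is not surcharged.
    """
    if void_as_singletons and (label_u is None or label_v is None):
        return True
    return label_u != label_v


both_void_singletons = crosses_region(None, None)
both_void_shared = crosses_region(None, None, void_as_singletons=False)
one_void_shared = crosses_region("A", None, void_as_singletons=False)
```

The two rules disagree only on edges whose endpoints are both unassigned; an edge between an assigned node and an unassigned node is surcharged either way.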

peterrrock2 merged commit 237c242 into mggg:wip/rustworkx-migration on Feb 25, 2026
1 check passed