Add initial BEAST X backend integration#44
Conversation
Adds an install-beast-mcmc.sh helper that downloads the official BEAST X v10.5.0 release tgz and registers the bundled fat beast.jar as dr:beast-mcmc:10.5.0 in the local Maven repository, plus a README documenting the one-time setup. The integrations/beastx/java pom now declares this dependency. Adds a first concrete tile, JC69Tile, mirroring the BEAST 2 pattern. BEAST X has no dedicated JukesCantor class, so the tile builds an HKY with kappa=1.0 and equal nucleotide frequencies, which is equivalent. Registers the tile in BeastXCoreTileLibrary and adds a unit test exercising the BEAST X classpath at runtime.
55528cf to
e63b237
Compare
tochsner
left a comment
There was a problem hiding this comment.
Great work! A first batch of comments. They mainly cover the tiles. I still have to finish the comments for the tiling package.
- Most points are minor. Quite often, a remark applies to multiple places. I tried to find all of them, but you know the code better.
- Feel free to respond first before making changes, I'm happy to discuss my comments :)
| import beast.base.core.*; | ||
| import beast.base.inference.*; | ||
| import org.phylospec.tiling.TypeToken; | ||
| import tiling.TypeToken; |
There was a problem hiding this comment.
What is the reason behind this and all other changes in commit 40f93be ? Is the only change there the duplicated tiling classes in order to get rid of the generic type parameter for the state class? Cause in that case, I would argue that the code duplication is worse than having slightly more verbose types.
There was a problem hiding this comment.
Furthermore, it breaks the git version history of these files.
There was a problem hiding this comment.
I would propose to drop this commit unless there is a reason for it I'm missing, wdyt?
There was a problem hiding this comment.
Thanks for pointing this out. I checked this more carefully.
The related changes seem to appear in three BEAST 3 files:
integrations/beast3/java/src/test/java/tiling/TilingScriptFilesTest.javaintegrations/beast3/java/src/test/java/beaststate/BeastStateScriptFilesTest.javaintegrations/beast3/java/src/main/java/beastconfig/BEASTState.java
These ended up in this PR because I was trying to sync my branch with the latest BEAST 3 backend changes before continuing the BEAST X work. I used the recent fix-example branch as a reference for those BEAST 3 files, and that branch also has BEASTState.java importing tiling.TypeToken.
So this was not intended as a BEAST X design change. I agree that these BEAST 3 tiling changes should not be mixed into this BEAST X PR, especially if they duplicate the core tiling classes just to avoid the generic state type.
I’ll remove/revert this BEAST 3 update commit unless there is a separate reason to keep it here.
|
|
||
| EvaluateTiles<BEASTState> evaluateTiles = new EvaluateTiles<>(new BeastCoreTileLibrary().getTiles(), new ArrayList<>(), variableResolver, stochasticityResolver); | ||
| List<Tile<?, BEASTState>> bestTilings = null; | ||
| EvaluateTiles evaluateTiles = new EvaluateTiles(TileLibrary.loadAll(), OperatorTileLibrary.getTiles(), variableResolver, stochasticityResolver); |
|
|
||
| EvaluateTiles<BEASTState> evaluateTiles = new EvaluateTiles(new BeastCoreTileLibrary().getTiles(), new ArrayList<>(), variableResolver, stochasticityResolver); | ||
| List<Tile<?, BEASTState>> bestTilings = null; | ||
| EvaluateTiles evaluateTiles = new EvaluateTiles(TileLibrary.loadAll(), OperatorTileLibrary.getTiles(), variableResolver, stochasticityResolver); |
There was a problem hiding this comment.
Same as above, I'm happy to keep that
|
|
||
| public class RelaxedClockTile extends GeneratorTile<DiscretizedBranchRates, BeastXState> { | ||
|
|
||
| private static final double MEAN_TOLERANCE = 1.0e-6; |
There was a problem hiding this comment.
We could have a global PhyloSpec-wide threshold for these kind of things (e.g. also for the Simplex sum). Because I also do these checks in the type checker sometimes and right now it's a bit of a wild west with different numbers floating around. We don't have to change anything here, but it might make sense to do a short PR on this in the future.
There was a problem hiding this comment.
(The exception would be when the engine itself does checks with specific thresholds, but as far as I can see this is not the case for BEAST X here)
There was a problem hiding this comment.
Thanks, that makes sense. The check here is meant to ensure that the RelaxedClock base distribution has mean 1.0, so that the overall mean rate is controlled by clockRate.
I agree that the tolerance itself should probably be shared PhyloSpec-wide, together with similar checks like the Simplex sum. I’ll keep it local for now unless you prefer changing it in this PR, and we can move it to a shared helper/constant in a follow-up PR.
| if (numBranches <= 0) { | ||
| throw new TileApplicationError( | ||
| this.getRootNode(), | ||
| "RelaxedClock requires a tree with at least one branch.", | ||
| "Use a tree with at least two taxa.", | ||
| List.of("Tree tree ~ Yule(birthRate=1.0, taxa=taxa)") | ||
| ); | ||
| } |
There was a problem hiding this comment.
I like how thorough you are!
| throw new FailedTilingAttempt.Irrelevant(); | ||
| } | ||
|
|
||
| AutoOperatorConfigTile tile = |
There was a problem hiding this comment.
As this is only relevant in a beastx block, you should add a block check too. Smth like
if (!(assignment.block instanceof Stmt.Block.Custom(String blockName)) || !Objects.equals(blockName, "beastx")) {
throw new FailedTilingAttempt.Irrelevant();
}There was a problem hiding this comment.
This is relevant for all block-level tiles. A different way to fix this is with a template tile (check out ChainLengthTile in the BEAST 2 parser for instance). Notably, template tiles don't match the variable name so this would still work here.
There was a problem hiding this comment.
This then simplifies the logic as well as it eliminates the need to manually parse the literal, you can just specify a PositiveReal input.
There was a problem hiding this comment.
I see what you mean. This tile is meant to be BEAST X-specific, so it should not be able to match assignments outside a beastx block.
I’ll add the block check here, and I’ll also look at the ChainLengthTile pattern to see if this should be rewritten as a template tile instead. That would probably make the matching cleaner and avoid the manual literal parsing.
| import java.util.Map; | ||
| import java.util.Set; | ||
|
|
||
| public class AutoOperatorConfigTile extends Tile<Void, BeastXState> implements CandidateTile<BeastXState> { |
There was a problem hiding this comment.
Generally, the level of control regarding things like operator settings in a PhyloSpec model is an entire discussion in itself. While engine developers tend to prefer to give all the control to users, I personally would err on the cautious side to not end up with feature creep. One thing to keep in mind for example is backwards-compatibility. We can keep this in for now, but it would make sense to define a general strategy when it comes to the level of control.
There was a problem hiding this comment.
That is a fair concern. I added this mainly to make the BEAST X runs configurable while testing different operator setups, especially for the BEAST X vs BEAST 3 comparisons. I agree we should be careful not to turn every backend option into a PhyloSpec language feature. I’ll keep this limited for now, and we can revisit the general strategy for engine-specific controls before expanding this further.
| import java.util.Map; | ||
| import java.util.Set; | ||
|
|
||
| public class FileLoggerTile extends Tile<Void, BeastXState> implements CandidateTile<BeastXState> { |
There was a problem hiding this comment.
Could you use a template tile like in FileLoggerTile of the BEAST 2 parser? I think this would heavily simplify this no?
There was a problem hiding this comment.
The same holds for the other two logger tiles.
There was a problem hiding this comment.
I’ll refactor these using the BEAST 2 FileLoggerTile pattern. The current BEAST X logger tiles do too much manual matching, so a template tile should simplify the file, screen, and tree logger cases.
|
|
||
| import java.util.IdentityHashMap; | ||
|
|
||
| public class ObservedAsIntTile extends TemplateTile<IntScalar<? extends Int>, BeastXState> { |
There was a problem hiding this comment.
What is the reason we need this tile and the ObservedAsNonNegativeIntTile? Shouldn't it be covered by ObservedAsTile?
There was a problem hiding this comment.
Good point. ObservedAsTile currently only handles RealScalar; the name is a bit misleading.
The Int and NonNegativeInt versions exist because BEAST X binds observed values through different backend wrappers (BeastXRealScalarParam vs BeastXIntScalarParam), and BoundDistribution is typed by that concrete wrapper.
I agree the duplicated logic should be cleaned up. I can either rename ObservedAsTile to ObservedAsRealTile, or factor the shared logic into a helper with small typed wrappers.
| import java.util.HashSet; | ||
| import java.util.Set; | ||
|
|
||
| public class BeastXBMTraitLikelihoodSpec extends AbstractModelLikelihood { |
There was a problem hiding this comment.
Do you mind adding short docstrings to these classes explaining why we need them? I try to add docstrings to all public classes and public methods. One exception are the tiles, because most of them are obvious. (I'm sure though that I have forgotten it at some places too tbj)
There was a problem hiding this comment.
Sure, I’ll add short JavaDocs for these helper/model classes. The tiles are usually clear from the class name, but classes like this one do need a bit of context about why they exist in the BEAST X backend.
|
Something else: For me at least, to be able to run anything using BEAGLE in IntelliJ, I have to specify these VM options in the run config: |
|
Personally, I'm a big fan of opinionated code formatters as soon as multiple people work on the same code base to have a unified code style. I mainly know them from other languages (e.g. prettier for JS, black for python, gofmt for go...). For Java, reddit seems to recommend https://github.com/diffplug/spotless . What do you think? We could try it out in a future PR. |
There was a problem hiding this comment.
Generally, the pipeline to build the BEAST X state and model and then execute it seems a bit overcomplicated to me (or at least is a bit overwhelming). Generally, it seems that the builder and runner classes try to cover every possible edge case, even though most of them migh tnever be needed in practice. I would start with the actual use cases in the current code base (i.e. executing a .phylospec file or PhyloSpec string, generating the XML for a .phylospec file or PhyloSpec string, and the tests) and then prune everything else.
What do you think?
|
|
||
| public class BeastXBMTraitLikelihoodSpec extends AbstractModelLikelihood { | ||
|
|
||
| private final Alignment observedTraits; |
There was a problem hiding this comment.
For this and BeastXOUTraitLikelihoodSpec: do I understand it correctly that we need them because BEAST X does not implement the OU or BM models?
Then it's part of the same discussion wrt. e.g. the Bernoulli distribution. I'm fine with leaving them here now, but it's definitely something we should discuss with them directly. Best case, they feel incentivized to implement it.
There was a problem hiding this comment.
Also, while the Bernoulli seems simple enough, it's probably harder to implement something like the OU likelihood which works correctly and is as fast as possible without their help.
There was a problem hiding this comment.
That is my understanding as well. I added these local specs because I could not find directly usable BEAST X BM/OU trait likelihood components for this backend path, so this was a way to keep the PhyloSpec trait models executable for now.
But I agree this is not ideal as a long-term solution, especially for OU where correctness and performance are much less trivial than for something like Bernoulli. I’m happy to leave these in for now if useful, but this should definitely go on the list of things to discuss with the BEAST X team.
| public BeagleTreeLikelihood materializeBeagleTreeLikelihood() { | ||
| SitePatterns patterns = | ||
| new SitePatterns(this.observedAlignment); | ||
|
|
||
| return new BeagleTreeLikelihood( | ||
| patterns, | ||
| this.treeModel, | ||
| this.branchModel, | ||
| this.siteRateModel, | ||
| this.branchRateModel, | ||
| null, | ||
| false, | ||
| PartialsRescalingScheme.DEFAULT, | ||
| false | ||
| ); | ||
| } |
There was a problem hiding this comment.
Other than for consistency with the other two likelihood specs, is there a reason why we could not directly create the BeagleTreeLikelihood in the corresponding tile?
This would also remove the materialization logic in the model builder.
There was a problem hiding this comment.
The main reason for this spec layer was to keep the likelihood construction path consistent with the BM/OU trait likelihood specs, and to delay the actual BEAST X likelihood creation until the model builder had collected the full model state. For PhyloCTMC, though, the required BEAST X objects are already available at that point, so I don’t think there is a strong reason to keep the extra materialization step here. I’ll try moving the BeagleTreeLikelihood creation into the tile and remove the redundant model-builder logic if that keeps the pipeline cleaner.
|
|
||
| import java.util.function.Consumer; | ||
|
|
||
| public class TreeDistribution<L extends AbstractModelLikelihood> { |
There was a problem hiding this comment.
If we loosen the type guard T extends BeastXParam to T in BoundDistribution, can we remove this class?
There was a problem hiding this comment.
I think that was the main reason for introducing TreeDistribution: tree priors did not fit well while BoundDistribution was restricted to T extends BeastXParam. I’ll check whether TreeDistribution has any tree-prior-specific logic left. If it is just a wrapper around the bound object and likelihood, then loosening BoundDistribution should let us remove it.
| TreeModel existingTreeModel = | ||
| this.treeModelsByPhyloSpecName.get(id); | ||
|
|
||
| if (existingTreeModel != null) { | ||
| return existingTreeModel; | ||
| } |
There was a problem hiding this comment.
This is a bit risky. Right now, addTreePriorDistribution is only called in TreeDrawTile and the ID should always be unique (because PhyloSpec var names have to be unique). However, this could (and probably should) be also called for drawn arguments (something like Age rootAge = rootAge(tree~Yule(...))). If we have a line like this twice, we will only register it once presuming the id is tree for both calls.
Can we just drop the if check? Or am I missing a case where this is relevant.
There was a problem hiding this comment.
Good catch. I added this as a defensive reuse path, but I see the risk if tree-producing functions are used more than once with the same default id.
I don’t think we need this check here, since the PhyloSpec variable name should already give us a unique tree registration. I’ll drop the reuse branch so duplicate tree registration fails more explicitly instead of silently reusing the first tree.
| public void addMCMCLogger(Logger logger) { | ||
| this.mcmcLoggers.add(logger); | ||
| } |
There was a problem hiding this comment.
This, addScreenLoggerSpec(long logEvery) and addFileLoggerSpec(long logEvery, String fileName) seem to be dead code. If this is the case, can we remove it?
There was a problem hiding this comment.
Yes, those are leftovers from the earlier logger implementation. addMCMCLogger(...) is the current path now, so I’ll remove the old addScreenLoggerSpec and addFileLoggerSpec methods.
There was a problem hiding this comment.
Generally, this file in combination with BeastXRunPipeline is a bit overwhelming to me. There are many small methods, most of which call like two other functions. Some also seem to never actually been called or only be called by dead code.
How do clients actually use these two classes? Could we boil it down to a handful of slightly longer methods?
There was a problem hiding this comment.
Also, can we somehow merge this with BeastXRunPipeline?
There was a problem hiding this comment.
I see the issue. I split this quite aggressively because I wanted each step of the runner pipeline to be testable while I was wiring the backend together, but it has ended up making the public path harder to follow.
I’ll simplify this around the actual entry points we need — running from a file/string and writing XML from a file/string — and see whether BeastXRunPipeline can be merged back into PhyloSpecRunner or reduced to a private helper.
There was a problem hiding this comment.
This file is never used right?
If we want to document the compatibilities, we could do it the same way as the BEAST 2 parser with this csv file which can be rendered on the website.
There was a problem hiding this comment.
You’re right, this is not part of the actual BEAST X execution path. I added it as an internal summary of the current backend coverage, but a Java class is probably not the right place for that.
I’ll remove this file from the runtime code and move the capability information to a CSV/documentation format, following the BEAST 2 parser approach.
| private static String defaultRunName(Path sourcePath) { | ||
| String fileName = | ||
| sourcePath.getFileName().toString(); | ||
|
|
||
| int extensionStart = | ||
| fileName.lastIndexOf('.'); | ||
|
|
||
| if (extensionStart <= 0) { | ||
| return fileName; | ||
| } | ||
|
|
||
| return fileName.substring(0, extensionStart); | ||
| } |
There was a problem hiding this comment.
This method already exists in PhyloSpecRunner. Do we need both?
There was a problem hiding this comment.
Yes, there is some repetition in this section. I will carefully check these parts to make them clearer.
| public enum RunMode { | ||
| BUILD_STATE, | ||
| BUILD_MODEL, | ||
| BUILD_MCMC, | ||
| EXECUTE_MCMC | ||
| } |
There was a problem hiding this comment.
Do we actually need all these options? E.g. when would we just want to build the state or to build the model?
There was a problem hiding this comment.
Fair question. These modes mostly came from the way I split the pipeline while debugging and testing each stage separately, rather than from clear user-facing use cases.
I’ll review where they are actually used. If only XML generation and MCMC execution are needed in practice, I’ll simplify this and remove the intermediate run modes instead of exposing them as part of the runner API.
There was a problem hiding this comment.
This and the other .phylospec file in this folder, could we move them to the test resources?
There was a problem hiding this comment.
Yes, this section is included here because I encountered a problem while testing mcmc run. Because I didn't add expected state to the .phylospec file, it caused an error every time TilingScriptFilesTest was executed. I will remove it later.
For BEAGLE, yes, I have run into the same kind of issue on Windows. In my case I usually add the BEAGLE library directory to For formatting, I also like the idea. Having a project-wide formatter would help avoid noisy diffs once more people are working on the Java code. I would keep that separate from this PR, but trying Spotless in a small follow-up PR sounds good to me. |
I think that is a fair point. I tried to keep the backend pieces separated so that XML generation, model construction, and execution could be tested independently, but I agree the current pipeline may be more general than we actually need right now. I’ll review this from the current use cases first: running a |
Summary
This change introduces the BEAST X backend integration for PhyloSpec. The current main branch does not yet include BEAST X support, so this adds the BEAST X integration module, backend state/model layer, tile library, runner entry point, MCMC execution support, XML export, XML validation, and representative tests.
Main changes include:
Adds the BEAST X integration module, including Maven setup, README notes, tile library registration, and runtime access to the BEAST X Java API.
Adds BEAST X tiles for input data, distributions, priors, tree priors, substitution models, site models, branch-rate models, PhyloCTMC likelihoods, continuous traits, RPN calculations, MCMC configuration, loggers, and operators.
Registers the BEAST X tile library through
META-INF/services/org.phylospec.tiling.TileLibrary.Adds a structured BEAST X backend state/model layer used by the tiler and runner. This includes parameter wrappers, tree/likelihood/prior model records, operator builders, MCMC builders, result summaries, and validation helpers.
Adds
PhyloSpecRunnersupport for BEAST X so PhyloSpec source strings or.phylospecfiles can be built and run through one backend entry point, without directly exposing the lower-level BEAST X builders.Adds BEAST X MCMC execution support through the BEAST X Java API. Representative tests verify that selected models can run and write parameter/tree logs.
Adds BEAST X XML export from the structured BEAST X backend state. XML generation is implemented with explicit XML component builders for parameters, priors, tree priors, alignments, tree models, substitution models, site models, branch-rate models, PhyloCTMC likelihoods, operators, and loggers.
Adds BEAST X XML parse/run validation using the BEAST X XML parser. This covers prior-only models, tree-prior models, nucleotide/protein/trait PhyloCTMC models, showcase models, and runner-level XML entry points.
Adds and reorganizes BEAST X tests and
.phylospecfixtures around backend capability areas, including tiling scripts, expected BEAST X state, runner behavior, MCMC construction, XML component builders, XML execution, representative models, and showcase models.Test plan
Checked whitespace and diff formatting:
Ran core tests, excluding two pre-existing converter tests:
Ran BEAST 3 integration tests:
mvn -pl integrations/beast3/java testRan BEAST X integration tests:
mvn -pl integrations/beastx/java testChecked working tree status:
Notes
A plain top-level
mvn testcurrently fails because of two pre-existing converter tests inphylospec-core:LPhyConverterTestandRevConverterTest. These are in the LPhy/Rev converter layer and are unrelated to the BEAST X backend changes in this PR.BEAST X XML export is generated from the structured BEAST X backend state rather than from arbitrary runtime BEAST X objects. I did not find a general BEAST X runtime-object-to-XML serializer, so this follows a BEAUti-style approach: keep a structured model/configuration representation and write XML from that representation.
Full PhyloCTMC XML execution requires a working BEAGLE native library installation. Without BEAGLE, some XML models can still be generated and parsed, but full tree-likelihood execution may fail.
Follow-up work includes more general dimension/type constraints such as
Simplex(4), data-type-aware observed alignment matching.