-
Notifications
You must be signed in to change notification settings - Fork 639
Makes string length configurable and consistent across backends. Closes #1303 #2678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…lso added condition based metadata extraction.
…d arguments to all file handlers in elffile.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Shajal-Kumar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request by @Shajal-Kumar. This PR addresses issue #1303 by making the minimum string length used during feature extraction configurable and ensuring this setting is applied consistently across different backends (IDA Pro, Vivisect, pefile, dnfile, elffile, and vmray). The changes involve introducing a min_str_len parameter, setting a default value of 4, and refactoring extractor functions to pass configuration details via a context dictionary instead of using **kwargs. Comprehensive tests have been added to validate the new functionality and ensure consistency.
Highlights
- Configurable String Length: Introduced a
min_str_lenparameter to control the minimum length of strings extracted by capa. This value is now configurable instead of being hardcoded. - Default String Length: Set a default minimum string length of 4 characters in
capa/features/extractors/strings.py. - Context Dictionary Refactor: Migrated from using
**kwargsto passing a structuredcontextdictionary in extractor functions. This improves maintainability and explicitly defines the data available to extractors. - Backend Consistency: Ensured that the
min_str_lensetting is respected and applied consistently across all supported static analysis backends (IDA, Vivisect, pefile, dnfile, elffile) and the VMRay dynamic extractor. - Testing: Added new tests in
tests/test_strings.pyto specifically validate themin_str_lenfunctionality, including boundary conditions and different length values for both ASCII and Unicode strings. - Changelog Update: Updated the
CHANGELOG.mdto reflect the addition of the configurable string length feature.
Changelog
Click here to see the changelog
- CHANGELOG.md
- Added an entry for the configurable string length feature (#1303).
- capa/features/extractors/common.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toextract_file_strings. - Passed
min_str_lentoextract_ascii_stringsandextract_unicode_stringscalls.
- Imported
- capa/features/extractors/dnfile/extractor.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toDnfileFeatureExtractor.__init__and stored it. - Updated calls to file and function feature extraction functions to pass
min_str_lenvia actxdictionary.
- Imported
- capa/features/extractors/dnfile/file.py
- Modified file feature extraction functions to accept a
ctxdictionary instead of direct parameters likepeand**kwargs. - Extracted
peandmin_str_lenfrom thectxdictionary where needed. - Updated internal calls within
extract_featuresto pass thectxdictionary.
- Modified file feature extraction functions to accept a
- capa/features/extractors/dnfile/insn.py
- Changed the string length check in
extract_insn_string_featuresto usefh.ctx["min_str_len"]instead of a hardcoded4.
- Changed the string length check in
- capa/features/extractors/dotnetfile.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toDotnetFileFeatureExtractor.__init__and stored it. - Modified file feature extraction functions to accept a
ctxdictionary instead of direct parameters likepeand**kwargs. - Extracted
peandmin_str_lenfrom thectxdictionary where needed. - Updated internal calls within
extract_file_featuresto pass thectxdictionary.
- Imported
- capa/features/extractors/elffile.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toElfFeatureExtractor.__init__and stored it. - Modified file feature extraction functions to accept a
ctxdictionary instead of direct parameters likeelf,buf, and**kwargs. - Extracted
elf,buf, andmin_str_lenfrom thectxdictionary where needed. - Updated internal calls within
extract_file_featuresto pass thectxdictionary.
- Imported
- capa/features/extractors/ida/extractor.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toIdaFeatureExtractor.__init__and stored it. - Updated calls to file feature extraction functions and the
get_functionmethod to passmin_str_lenvia actxdictionary.
- Imported
- capa/features/extractors/ida/file.py
- Modified file feature extraction functions to accept a
ctxdictionary instead of no parameters or specific ones. - Extracted
min_str_lenfrom thectxdictionary for string extraction functions. - Updated internal calls within
extract_featuresto pass thectxdictionary.
- Modified file feature extraction functions to accept a
- capa/features/extractors/ida/helpers.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default tofind_string_at.
- Imported
- capa/features/extractors/ida/insn.py
- Changed the string length check in
extract_insn_string_featuresto usefh.ctx["min_str_len"]instead of a hardcoded4. - Passed
fh.ctx["min_str_len"]to thefind_string_atcall.
- Changed the string length check in
- capa/features/extractors/pefile.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toPefileFeatureExtractor.__init__and stored it. - Modified file feature extraction functions to accept a
ctxdictionary instead of direct parameters likepe,buf, and**kwargs. - Extracted
pe,buf, andmin_str_lenfrom thectxdictionary where needed. - Updated internal calls within
extract_file_featuresto pass thectxdictionary.
- Imported
- capa/features/extractors/strings.py
- Defined
DEFAULT_STRING_LENGTH = 4. - Renamed internal regex constants (
ASCII_RE_4,UNICODE_RE_4) to use_DEFAULTsuffix. - Updated
extract_ascii_stringsandextract_unicode_stringsto accept amin_str_lenparameter with default and use it to build the regex.
- Defined
- capa/features/extractors/viv/extractor.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toVivisectFeatureExtractor.__init__and stored it. - Updated calls to file feature extraction functions and the
FunctionHandleconstructor to passmin_str_lenvia actxdictionary.
- Imported
- capa/features/extractors/viv/file.py
- Modified file feature extraction functions to accept a
ctxdictionary instead of direct parameters likevw,buf, and**kwargs. - Extracted
vw,buf, andmin_str_lenfrom thectxdictionary where needed. - Updated internal calls within
extract_featuresto pass thectxdictionary.
- Modified file feature extraction functions to accept a
- capa/features/extractors/viv/insn.py
- Changed the string length check in
extract_op_string_featuresto usefh.ctx["min_str_len"]instead of a hardcoded4.
- Changed the string length check in
- capa/features/extractors/vmray/file.py
- Imported
DEFAULT_STRING_LENGTH. - Passed
min_str_len=DEFAULT_STRING_LENGTHto the call tocapa.features.extractors.common.extract_file_strings.
- Imported
- capa/ida/plugin/form.py
- Changed the call to
self.rulegen_feature_extractor.get_functionto use the keyword argumentea.
- Changed the call to
- capa/loader.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter with default toget_extractorandget_file_extractors. - Passed
min_str_lento the constructors of various feature extractors (DnfileFeatureExtractor,PefileFeatureExtractor,VivisectFeatureExtractor,ElfFeatureExtractor,DotnetFileFeatureExtractor).
- Imported
- capa/main.py
- Imported
DEFAULT_STRING_LENGTH. - Added
min_str_lenparameter toget_file_extractors_from_cli. - Passed
min_str_lentocapa.loader.get_file_extractors. - Passed
DEFAULT_STRING_LENGTHtocapa.loader.get_extractorandcapa.features.extractors.ida.extractor.IdaFeatureExtractor. - Added a check for
FORMAT_RESULTbefore callingfind_capabilities.
- Imported
- rules
- Updated the subproject commit hash for the
rulessubmodule.
- Updated the subproject commit hash for the
- tests/fixtures.py
- Imported
DEFAULT_STRING_LENGTH. - Passed
DEFAULT_STRING_LENGTHto the constructors of various feature extractors used in test fixtures (VivisectFeatureExtractor,PefileFeatureExtractor,DnfileFeatureExtractor,DotnetFileFeatureExtractor).
- Imported
- tests/test_strings.py
- Imported
pytest. - Updated calls to
extract_ascii_stringsandextract_unicode_stringsto use themin_str_lenkeyword argument instead ofn. - Added a new test function
test_min_str_lento verify the functionality and error handling of themin_str_lenparameter for both ASCII and Unicode string extraction.
- Imported
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Strings, long or short,
Configurable now, you see,
Consistent across,
Backends, wild and free.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a configurable minimum string length for feature extraction, which is a valuable enhancement for flexibility and consistency across different backends. The refactoring to use a context dictionary in many places also improves code clarity and maintainability.
Overall, the changes are well-structured and the addition of tests for the new string length configuration is commendable. I've found a couple of areas where the new configurable string length isn't fully applied, which I've detailed in the comments.
Good work on this!
Summary of Findings
- Inconsistent
min_str_lenusage inDotnetFileFeatureExtractor:DotnetFileFeatureExtractor.extract_file_featuresusesDEFAULT_STRING_LENGTHinstead of the configuredself.min_str_len. - Hardcoded string length in VMRay backend: The VMRay backend's string extraction (
extract_file_stringsincapa/features/extractors/vmray/file.py) uses the hardcodedDEFAULT_STRING_LENGTHand does not seem to support the new configurablemin_str_len. - Docstring parameter name inconsistency (low severity, not commented): In
capa/features/extractors/strings.py, the docstring forextract_ascii_stringsandextract_unicode_stringsusesmin_lenfor the parameter description, while the parameter itself is namedmin_str_len. This was not commented on due to review settings. - CLI configuration for
min_str_len(low severity, not commented): The command-line interface incapa/main.pydoes not yet expose a flag to setmin_str_len. It currently usesDEFAULT_STRING_LENGTH. This is likely a planned future enhancement and not an issue with the current PR's scope of internal configurability. This was not commented on due to review settings.
Merge Readiness
The pull request is a significant step towards configurable string length handling and better code structure. However, there are a couple of issues (one high severity, one medium severity) identified in the review comments that should be addressed to ensure consistency and correctness across all backends. Once these are resolved, the PR should be in good shape for merging. As an AI, I am not authorized to approve pull requests.
|
|
||
| def extract_file_features(self): | ||
| yield from extract_file_features(self.pe) | ||
| yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like DEFAULT_STRING_LENGTH is used here, but the DotnetFileFeatureExtractor is initialized with min_str_len (which is stored as self.min_str_len).
Should this be using self.min_str_len to ensure the configured minimum string length is respected for file feature extraction in this extractor, similar to how it's handled in other extractors like PefileFeatureExtractor or ElfFeatureExtractor?
| yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH}) | |
| yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": self.min_str_len}) |
| yield from capa.features.extractors.common.extract_file_strings( | ||
| analysis.submission_bytes, min_str_len=DEFAULT_STRING_LENGTH | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The min_str_len here is hardcoded to DEFAULT_STRING_LENGTH. To make the string length configurable for the VMRay backend consistent with other backends, should min_str_len be passed down to this function?
This might involve updating VMRayExtractor or how VMRayAnalysis is handled to include the min_str_len configuration. Currently, the VMRay backend won't respect a custom minimum string length.
closes #1303
Makes string length configurable and consistent across backends
Key Changes
min_str_lenargument to relevant functions infeatures/extractorsDEFAULT_STRING_LENGTH = 4incapa/features/extractors/strings.pykwargsto structuredcontextdictionary for better maintainability wherever applicable.CHANGELOG.md.Motivation
Testing
test_min_str_lentocapa/tests/test_strings.pyvalidating:Checklist