33 changes: 20 additions & 13 deletions README.txt → README.md
@@ -1,28 +1,36 @@
tRECS
# tRECS

- [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/4QuantOSS/tRECS/master) [![Binder]( https://img.shields.io/badge/launch-jupyterlab-red.svg)](https://mybinder.org/v2/gh/4QuantOSS/tRECS/master?urlpath=lab)

## Demos

- [![Binder](https://img.shields.io/badge/launch-spooky%20demo-green.svg)](https://mybinder.org/v2/gh/4QuantOSS/tRECS/master?urlpath=%2Fapps%2Fdemo_spooky.ipynb) from the [Kaggle Spooky Author Competition](https://www.kaggle.com/c/spooky-author-identification/data)



Text recommendation system developer built in Python and Dash by Plotly.

Authors: Taylor Olson, Janie Neal, Christiana Prater-Lee, Eshita Nandini

This recommendation system developer guides the user through cleaning their data, building models, and ultimately creating a recommendation system (housed within the interface). The user can also visualize some of the models and other features of their data.

INSTALLATION
## INSTALLATION

The user should be using python 2.7+. Most packages required for the interface can be installed in the command line by navigating to the directory containing the reqs.txt file and running "pip install -r reqs.txt". Some packages require further installation. Links to their installation documentation are provided below.

Spacy: https://spacy.io/docs/usage/
Follow the instructions for downloading the English model.

NLTK: http://www.nltk.org/data.html#
Follow the instructions to open the interactive installer and install the following corpora: wordnet, wordnet_ic, words, and stopwords.


HOW TO START THE DASH INTERFACE
## HOW TO START THE DASH INTERFACE

In order to start the interface, the user should navigate to the "interface" folder in the command line and call "python interface.py". The command window will provide a link that the user can then copy into the address bar of a browser of their choice. The interface should open in that window, and the user can begin using the features.


FEATURES OF THE INTERFACE AND HOW TO USE THEM
## FEATURES OF THE INTERFACE AND HOW TO USE THEM

Upload your own data:
User can copy custom data (see sample data for desired format) into the interface folder or use some of the sets included in the interface folder.
@@ -34,28 +42,27 @@ User is required to remove non-ascii characters and punctuation. There are also
stemming vs. lemmatizing: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Word cut off:
Numbers on the slider indicate the percentage of documents that the words appear in. Use the slider to remove words that occur too frequently or not frequently enough.
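
The idea behind the cut-off can be sketched in plain Python. This is an illustration only, not the code used in tRECS: the function name `filter_by_document_frequency`, its parameters, and the toy documents are all assumptions made for the example.

```python
def filter_by_document_frequency(docs, min_pct, max_pct):
    """Keep only words that appear in between min_pct and max_pct percent of documents.

    `docs` is a list of token lists. Hypothetical helper, not part of tRECS.
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    keep = {
        word
        for word, count in doc_freq.items()
        if min_pct <= 100.0 * count / n_docs <= max_pct
    }
    return [[word for word in tokens if word in keep] for tokens in docs]


# Toy corpus: "the" appears in every document, several words appear in only one.
docs = [
    ["the", "raven", "spoke"],
    ["the", "house", "fell"],
    ["the", "raven", "fell"],
    ["the", "cat", "slept"],
]
filtered = filter_by_document_frequency(docs, min_pct=30, max_pct=90)
print(filtered)
```

With these bounds, "the" (present in 100% of documents) and the words that occur in a single document (25%) are dropped, while "raven" and "fell" (50% each) are kept.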

Choose and Build your models:
Options are TF-IDF, LSA, LDA, and spaCy's built-in similarity function.

Recommendation:
Uses the models built in the previous step to provide the selected number of most similar objects to the selected object.
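
As a rough sketch of how a similarity-based recommendation can work, the example below ranks documents by the cosine similarity of their TF-IDF vectors using only the standard library. It is not the tRECS implementation: the function names (`tfidf_vectors`, `cosine`, `recommend`), the particular IDF formula, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of word -> weight) per document."""
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    # One common IDF variant; the choice of formula is an assumption here.
    idf = {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}
    return [
        {word: tf * idf[word] for word, tf in Counter(tokens).items()}
        for tokens in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(docs, selected, k):
    """Return the indices of the k documents most similar to docs[selected]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[selected], vector), index)
        for index, vector in enumerate(vectors)
        if index != selected
    ]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [index for _, index in scores[:k]]


docs = [
    ["raven", "night", "door"],
    ["raven", "night", "window"],
    ["cat", "garden", "sun"],
    ["raven", "sea", "ship"],
]
print(recommend(docs, selected=0, k=2))
```

Here document 1 shares two words with the selected document and document 3 shares one, so they are returned in that order; document 2 shares none and is left out.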

Visualizations:
The following visualizations have been connected to the interface: entity visualization and LDA clustering. See the description of visualizations.py for details.


CONTENTS OF MASTER FOLDER
## CONTENTS OF MASTER FOLDER

Interface.py: contains the code needed for front end of application, including: HTML layout, Dash components and event handlers, calls to backend. Contains iqss_interface object (below).

Iqss_interface.py: contains all code needed for back end of application, including: user loaded data frame, cleaned data frame, and built models. Contains two objects, the iqss_interface object (holds the data frames) and the model_object (holds the necessary df, tf-idf, lda, and spacy models).

Utils.py: contains all methods used for manipulating data frames, including: converting to term and document frequency matrix, cleaning the data frame, extracting entities from the data frame, and determining the number of topics based on the contents of the data frame.

Visualizations.py: contains all code necessary to build the various visualizations, including:
the LDA cluster graph: http://brandonrose.org/clustering
the similarity graph: plots each object's similarity score on the x axis and the group the object belongs to on the y axis
LDA topic distribution: https://pyldavis.readthedocs.io/en/latest/readme.html#installation

Binary file removed __pycache__/iqss_interface.cpython-36.pyc
Binary file not shown.
Binary file removed __pycache__/utils.cpython-36.pyc
Binary file not shown.
Binary file removed __pycache__/visualizations.cpython-36.pyc
Binary file not shown.
24 changes: 24 additions & 0 deletions binder/environment.yml
@@ -0,0 +1,24 @@

name: trecs
channels:
- conda-forge
- defaults
dependencies:
- python=3.6
- numpy
- matplotlib
- pandas
- scipy
- scikit-learn
- spacy
- nltk
- nbserverproxy
- appmode
- pip:
- ipywidgets==7.0.1 # for interactive plots
- dash==0.21.0 # The core dash backend
- dash-renderer==0.11.3 # The dash front-end
- dash-html-components==0.9.0 # HTML components
- dash-core-components==0.18.1 # Supercharged components
- plotly==2.4.1 # Plotly graphing library used in examples
- lda
3 changes: 3 additions & 0 deletions binder/postBuild
@@ -0,0 +1,3 @@
source activate kernel # ensure we are in the current environment
python -m nltk.downloader all
python -m spacy download en