import os
import logging
import spacy
import subjectVerbObjectExtractUtil as svoExtract
import experimentUtil as expUtil
import mergeGraphUtil as mergeUtil
logging.basicConfig(format='%(asctime)s %(message)s', level=logging.INFO)
INTRO_MD_DIR = "introMDFiles"
CLEAN_INTRO_MD_DIR = "cleanedTexts"
SUBJECT_OBJECT_VERB_CSV = "svo.csv"
# Clean and extract text from the intro.md files
for file in os.listdir(INTRO_MD_DIR):
    with open(os.path.join(INTRO_MD_DIR, file)) as input_file:
        input_file_string = input_file.read()
    processedText = svoExtract.extract(input_file_string)
    outputFileName, _ = os.path.splitext(file)
    outputFileName = os.path.join(CLEAN_INTRO_MD_DIR, outputFileName) + ".txt"
    with open(outputFileName, 'w') as filetowrite:
        filetowrite.write(processedText)
    logging.info(outputFileName + " file created")
#Extract Subject Verb Object
#Creates subject verb object csv file
svoExtract.extractSVO(CLEAN_INTRO_MD_DIR)
# Creates CSV of UMAP and DBSCAN Clusters and InSample Vs OutSample Similarities
# Outputs clusteringLabels.csv
#expUtil.createCluster(SUBJECT_OBJECT_VERB_CSV)
# Get Kmeans Experiment Accuracy
# Assigns Cluster Labels to Variable Inputs to KMeansPredicted.csv and computes accuracy by taking the difference
# with the ground truth labels in JuliaDataVariable.csv
expUtil.runKMeansExp()
# Uses cluster that is most "similar" to the variable for cluster assignment
expUtil.runUMapSimilarityExp()
# Uses KMeans for cluster assignment, computes the similarity of the assignment with nodes
# that also belong to the cluster assignment
averageSimArray = expUtil.runCombinationExp()
#Creates Final CSV to create metaGraph
# Minimum similarity threshold to merge 2 nodes from the extracted subject-verb-objects
ClusterThreshold = 0.85
# Minimum similarity threshold to assign a variable to a cluster
VariableThreshold = 0.5
mergeUtil.createFinalGraph(ClusterThreshold, VariableThreshold, averageSimArray)
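# How a similarity threshold can gate cluster assignment, sketched with plain
# cosine similarity (the measure behind spaCy's similarity()) over NumPy
# vectors. The function and its mean-similarity rule are illustrative
# assumptions, not mergeUtil's actual logic:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_if_similar(variable_vec, cluster_vecs, threshold):
    """Assign only when the mean similarity to the cluster's members
    clears the threshold; otherwise leave the variable unassigned."""
    sims = [cosine_similarity(variable_vec, v) for v in cluster_vecs]
    mean_sim = sum(sims) / len(sims)
    return mean_sim if mean_sim >= threshold else None
```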
'''
The idea behind this experiment is to create a set of clusters using the training data extracted
from the cleaned versions of the intro.md files, and then assign the variables extracted from the
code associated with those intro.md files to one of these clusters. To achieve this, we first
transformed the subjects, verbs, and objects into embeddings and used DBSCAN clustering to
determine which SVOs are "unclusterable" and how many clusters to use for KMeans. We then ran
KMeans on the SVOs that DBSCAN deemed "clusterable" and used the resulting clusters to group the
variables extracted from code. Finally, the model is evaluated by comparing the cluster labels
KMeans assigns to the variables against the hand-labeled dataset.

The accuracy achieved with KMeans is around 22%, which is significantly better than random given
that there are over 30 possible cluster assignments. However, this method of variable
classification is limited by the quality of the clusters produced from the training data. For
example, KMeans will assign a variable to a cluster simply because that cluster is the best of
those available, even though the variable may be completely irrelevant to the other values in its
assigned cluster. To address this, we apply a threshold prior to assignment based on how "similar"
the variable is to the other phrases that belong to that cluster, using the similarity comparator
from spaCy. As the generated graph shows, the accuracy of the retained assignments increases as
the threshold is raised, while the number of assignments decreases.

Several improvements could potentially increase accuracy. Adding more training data would likely
improve the results, since the current intro.md files are terse in their model descriptions; a
larger dataset would also increase the number of clusters and improve the quality of assignment,
which in turn would affect the threshold and require further experimentation. In addition, some
variables extracted from code are too simple (e.g. only one character long), which makes it hard
to determine their semantic meaning. Other methods of determining "similarity" should also be
explored: the current implementation using spaCy's similarity() is most useful for detecting
near-duplicates and is weaker at comparing semantic meaning.
'''
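# The DBSCAN-then-KMeans pipeline described above can be sketched as follows.
# The toy embeddings and the eps/min_samples values are illustrative, not the
# experiment's actual settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def cluster_svo_embeddings(embeddings, eps=0.5, min_samples=2):
    """DBSCAN flags noise points (label -1) as "unclusterable" and its
    cluster count sets k; KMeans then clusters the remaining points."""
    db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    clusterable = embeddings[db_labels != -1]
    k = len(set(db_labels)) - (1 if -1 in db_labels else 0)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(clusterable)
    return kmeans, db_labels

embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # tight group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # tight group B
    [20.0, -20.0],                        # isolated point -> "unclusterable"
])
kmeans, db_labels = cluster_svo_embeddings(embeddings)
```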