ClusterSuite – term-clumping macro toolset

ClusterSuite is a macro that runs a series of thesauri, macros, and other term-cleaning and clustering programs to perform dimension reduction on a list, making it more approachable and manageable. It aims to minimize noise and bring prominent topics to the fore, so that the user can extract meaning from large amounts of text more quickly. More specifically, term-clumping macros indicate (i) how closely related two or more terms are, (ii) a good name for a group that includes common terms, and (iii) the nature of the relationship between terms (e.g. parent-child or sibling).

At present, ClusterSuite organizes its parts into three phases. Phase I executes five thesauri. Phase II removes extremely common and extremely uncommon items, then iteratively runs a fuzzy list-cleaning macro. Phase III again removes extremely uncommon items and then runs a basic clustering macro. After these three phases complete, the user has the option to run an additional program to perform more advanced clustering.
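The removal of extremely common or uncommon items can be pictured with a small sketch. This is not ClusterSuite's actual code (VantagePoint macros are not written in Python); the function name `remove_extremes` and the thresholds `min_records` and `max_fraction` are hypothetical and chosen only for illustration. The sketch counts how many records each term appears in and drops terms outside the chosen range.

```python
from collections import Counter


def remove_extremes(records, min_records=2, max_fraction=0.5):
    """Drop terms that appear in too few or too many records.

    records      -- list of sets of terms, one set per record
    min_records  -- hypothetical lower bound on record count
    max_fraction -- hypothetical upper bound on the share of records
    """
    n = len(records)
    # Record frequency: in how many records does each term occur?
    doc_freq = Counter(term for terms in records for term in set(terms))
    keep = {
        term
        for term, df in doc_freq.items()
        if df >= min_records and df / n <= max_fraction
    }
    return [set(terms) & keep for terms in records]


# Made-up example data, not taken from any real dataset.
records = [
    {"solar cell", "efficiency", "study"},
    {"solar cell", "thin film", "study"},
    {"thin film", "deposition", "study"},
    {"efficiency", "study"},
]
cleaned = remove_extremes(records, min_records=2, max_fraction=0.75)
```

Here "study" occurs in every record and "deposition" in only one, so both are dropped; the remaining terms are the ones a later cleaning or clustering step would work on.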

THE FILES BELOW ARE TEMPORARILY UNAVAILABLE

1. Copy the ‘ClusterSuite’ folder to C:\Program Files (x86)\VantagePoint
2. Copy Login to view.  to C:\Program Files (x86)\VantagePoint\Fuzzy
3. Copy Login to view. to C:\Program Files (x86)\VantagePoint\Macros
4. Open a VantagePoint file and run ClusterSuite.vpm (from C:\Program Files (x86)\VantagePoint\Macros)

ClusterSuite Tutorial PowerPoint

To watch an instructional video, go to VP Resources > VP How-To > Advanced Analytics
ClusterSuite Tutorial Video

Rough Guide to Dataflow in ClusterSuite

1) ClusterSuite.vpm opens the Tutorial.html window (poorly named, I know).
2) ClusterSuite runs a while loop that waits for Tutorial.html to send a window.status back to VantagePoint.
3) Tutorial.html has multiple checkboxes, each with its own ID.
4) When a button such as “About ClusterSuite” or “Start” is pressed, Tutorial.html updates the window.status. If “About ClusterSuite” is selected, the while loop keeps running while ClusterSuite launches another window. If “Start” is pressed, the while loop stops and ClusterSuite records the ID of every checkbox that is marked as checked. Additionally, the numbers from the Remove Extremes boxes are stored in an array.
5) The ID of every checked box is stored in a single large array.
6) A for loop reads through this array and executes the items in order. This prevents unchecked items from being executed.
7) Each component is executed in its own function, and the checklist is updated along the way.
8) At the end, the user is given the option of running TermCluster. If the user selects yes, a term-document matrix is created in ClusterSuite. This is then converted to Excel by a combination of .vpm and .xlsm Matrix_to_columns macros.
9) Next, a combination of runTermCluster.bat and an .xlsm file launches termCluster.jar.
10) TermCluster imports the Excel data, uses MySQL to store and manipulate it, and exports the results back to Excel.
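The wait-then-dispatch pattern in steps 2 through 7 can be sketched roughly as follows. This is an illustration in Python, not the actual .vpm macro; the names `wait_for_start`, `run_checked`, `poll_status`, and the checkbox IDs are all hypothetical, and the status strings are simulated rather than read from a real window.status.

```python
import time


def wait_for_start(poll_status, on_about, delay=0.0):
    """Loop until the dialog reports "Start".

    An "About ClusterSuite" status opens the About window,
    and the loop keeps running afterwards.
    """
    while True:
        status = poll_status()
        if status == "Start":
            return
        if status == "About ClusterSuite":
            on_about()
        time.sleep(delay)


def run_checked(checked_ids, components):
    """Run the component for each checked ID, in order.

    Returns the list of completed IDs, standing in for the checklist update.
    """
    completed = []
    for box_id in checked_ids:
        components[box_id]()  # each component lives in its own function
        completed.append(box_id)
    return completed


# Simulated sequence of statuses sent back by the dialog.
statuses = iter(["", "About ClusterSuite", "", "Start"])
about_opened = []
wait_for_start(lambda: next(statuses), lambda: about_opened.append(True))

log = []
components = {
    "thesaurus": lambda: log.append("ran thesaurus"),
    "fuzzy": lambda: log.append("ran fuzzy cleanup"),
    "cluster": lambda: log.append("ran clustering"),
}
checked_ids = ["thesaurus", "cluster"]  # "fuzzy" was left unchecked
completed = run_checked(checked_ids, components)
```

Because only the IDs in the array are looked up and called, the unchecked "fuzzy" component never runs, which is the point of step 6.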

To use TermCluster, you must first install and configure MySQL, a free database available from http://www.mysql.com/. A setup tutorial is available here:
MySQL-Setup

Additional resources:
TermCluster data flow
Acronym_Eliminator_Instructions