The relationship between words can be described by relationships between vectors of real numbers used to represent those words. One simple way to represent a word as a vector is to compute how often it co-occurrs with other words – for each other word, how often it appears in the same context (perhaps sentence, paragraph, or document) as that other word.
For example, suppose we are interested in computing the co-occurrence matrix for the words ball, cold, play, sun, and wet. If our text is
the sun did not shine it was too wet to play so we sat in the house all that cold cold wet day i sat there with sally we sat there we two and i said how i wish we had something to do too wet to go out and too cold to play ball so we sat in the house we did nothing at allThe Cat in the Hat by Dr. Seuss
[1.000000, 1.000000, 1.000000, 0.000000, 1.000000] [0.500000, 1.000000, 0.500000, 0.000000, 1.000000] [0.500000, 0.500000, 1.000000, 0.000000, 1.000000] [0.000000, 0.000000, 0.000000, 1.000000, 0.000000] [0.333333, 0.666667, 0.666667, 0.000000, 1.000000]where each row gives the vector for one word. The last row then corresponds to "wet" and tells us that one third of the sentences that contained "wet" also contained "ball", two-thirds contained "cold", two-thirds contained "play", and none contained "sun" (and, of course, all contained "wet").
Explore further with Vector Representation of Words from TensorFlow.org or "King-Man+Woman=Queen: The Marvelous Mathematics of Computational Linguistics" in MIT Technology Review. And, for a similar idea in a different context, Dissecting Trump's Most Rabid Online Following from FiveThirtyEight.
First, complete the implementation of the class defined
in cooccur.h
(527 only: implement the class as a template
where the template parameter is the type of the keys; you may assume
that the type has a default constructor, copy constructor, assignment
operator, and stream extraction operator, and
that a suitable specialization of std::hash
exists
for that type).
The methods should run in the following time bounds:
CooccurrenceMatrix
constructor: $O(n^2)$ where $n$ is the number of keywords
update
: $O(k^2)$ where $k$ is the number of keywords in the context
readContext
: $O(m+c)$ where $m$ is the number of words in the context and $c$ is the number of characters read
getVector
: $O(n)$ where $n$ is the number of keywords in the matrix
~CooccurrenceMatrix
destructor (which may be implicit): $O(n)$ where $n$ is the number of keywords in the matrix
int
.
Finally, write a program called Cooccur that takes a non-empty list of unique keywords as command-line arguments (keywords will be non-empty and will have no embedded whitespace), reads a text from standard input with one context per line and words in the text separated by one or more whitespace characters, and writes to standard output the cooccurrence matrix with the row for each keyword written in the order the keywords appeared on the command line and with the columns also in the order they keywords appeared on the command line. Each row should be output preceded by the corresponding keyword, a colon, and a single space. The values in each row should be enclosed in square brackets, separated by a comma and a single space, and output with 6 places after the decimal point. If a keyword does not appear in the text then its row should contain all zeros.
For example,
[jrg94@cobra Cooccur]$ ./Cooccur ball cold play sun wet < cat_in_hat.txt ball: [1.000000, 1.000000, 1.000000, 0.000000, 1.000000] cold: [0.500000, 1.000000, 0.500000, 0.000000, 1.000000] play: [0.500000, 0.500000, 1.000000, 0.000000, 1.000000] sun: [0.000000, 0.000000, 0.000000, 1.000000, 0.000000] wet: [0.333333, 0.666667, 0.666667, 0.000000, 1.000000]where
cat_in_hat.txt
contains the text given above.
Your program must use only $O(n^2)$ space, where $n$ is the number of keywords.
Your program and any program that uses your CooccurrenceMatrix
class correctly must not produce any error messages when run with
valgrind --tool=memcheck --leak-check=yes -q
.
We reserve the right to deduct points from submissions for violations of the following guidelines.
#include
d regardless of what other files have
been #include
d -- if a declaration is required by
your header files, then #include
the appropriate
header file in your header file (or write a forward declaration in
your header) rather than relying on your #include
rs
to have already #include
d those header files before
they #include
yours.
cin
and cout
with the I/O operators <<
and >>
rather than printf
and scanf
).
const
should be declared as const
.
-Wall
and -pedantic
options.
/home/httpd/html/zoo/classes/cs427/f2017/Assignments/Cooccur
: