In our setting, the vocabulary size is $V$, and the hidden layer size is $N$.
The input is a one-hot representation vector, which means for a given input context word, only one out of $V$ units, $\{x_1,\cdots,x_V\}$, will be 1, and all other units are 0.
The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$. Each row of $W$ is the $N$-dimensional vector representation $v_w$ of the corresponding word of the input layer.
Given a context (a single word), and assuming $x_k=1$ and $x_{k'}=0$ for $k'\neq k$, we have
\[h=x^T W=W_{(k,\cdot)}:=v_{w_I},\]
which amounts to copying the $k$-th row of $W$ into $h$; here $v_{w_I}$ is the vector representation of the input word $w_I$. This implies that the link (activation) function of the hidden-layer units is simply linear (i.e., each unit passes its weighted sum of inputs directly to the next layer).
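To make the row-copy concrete, here is a minimal NumPy sketch (the sizes, the random weights, and the index `k` are made-up toy values; only the symbols $V$, $N$, $W$, $x$, and $h$ come from the text above):

```python
import numpy as np

V, N = 10, 4                     # toy vocabulary size and hidden layer size
rng = np.random.default_rng(0)
W = rng.standard_normal((V, N))  # input->hidden weights; row i is the N-dim vector of word i

k = 3                            # index of the input word w_I
x = np.zeros(V)
x[k] = 1.0                       # one-hot input vector

h = x @ W                        # h = x^T W
assert np.allclose(h, W[k])      # identical to simply copying the k-th row of W (= v_{w_I})
```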
From the hidden layer to the output layer, there is a different weight matrix $W'=\{w'_{ij}\}$, which is an $N \times V$ matrix. Using these weights, we can compute a score $u_j$ for each word in the vocabulary,
\[ u_j={v'_{w_j}}^T h \]
where $v'_{w_j}$ is the $j$-th column of the matrix $W'$. Then we can use the softmax classification model to obtain the posterior distribution over words, which is a multinomial distribution:
\[p(w_j|w_I)=y_j=\frac{\exp(u_j)}{\sum_{j'=1}^V \exp(u_{j'})}\]
where $y_j$ is the output of the $j$-th unit in the output layer.
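As a self-contained sketch of this step (again with toy sizes and random values standing in for $W'$ and $h$; nothing here is prescribed by the text beyond the two formulas above):

```python
import numpy as np

V, N = 10, 4                            # toy vocabulary size and hidden layer size
rng = np.random.default_rng(0)
W_prime = rng.standard_normal((N, V))   # hidden->output weights W'; column j is v'_{w_j}
h = rng.standard_normal(N)              # hidden layer vector from the previous step

u = W_prime.T @ h                       # scores: u_j = v'_{w_j}^T h
y = np.exp(u) / np.exp(u).sum()         # softmax: y_j = p(w_j | w_I)
assert np.isclose(y.sum(), 1.0)         # the outputs form a probability distribution
```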
Finally, we obtain:
\[p(w_j | w_I) = y_j = \frac{\exp\left({v'_{w_j}}^T v_{w_I}\right)}{\sum_{j'=1}^V \exp\left({v'_{w_{j'}}}^T v_{w_I}\right)}\]
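Putting the two steps together, the whole forward pass is a row lookup followed by a softmax over the scores ${v'_{w_j}}^T v_{w_I}$. The sketch below illustrates this end to end; `forward` is a hypothetical helper written for this example, and the max-subtraction is only a standard numerical-stability trick that leaves the softmax unchanged:

```python
import numpy as np

def forward(W, W_prime, k):
    """Return p(w_j | w_I) for all j, given the index k of the input word w_I."""
    h = W[k]                      # hidden layer: h = v_{w_I} (row lookup)
    u = W_prime.T @ h             # scores: u_j = v'_{w_j}^T v_{w_I}
    e = np.exp(u - u.max())       # shift by the max for numerical stability
    return e / e.sum()            # softmax posterior y

V, N = 10, 4                               # toy sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((V, N))            # input->hidden weights
W_prime = rng.standard_normal((N, V))      # hidden->output weights

y = forward(W, W_prime, k=3)
print(y.argmax(), y.max())                 # most probable output word index and its probability
```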