Massive MIMO, i.e. deploying large number of antennas at the base station, is a key technology for 5G [1]. Although its benefits are by now well understood and documented, the critical bottleneck for massive MIMO deployment is still the acquisition of channel state information (CSI), with designs attempting to balance the conflicting requirements of training overhead and co-pilot contamination reduction, in addition to the standard requirement of computational efficiency. This problem is even more important in massive machine type communications (MTC) where accurate CSI with low-training overhead is of critical importance. Towards addressing these issues, one major line of works applies compressed sensing (CS) techniques in order to account for and exploit the sparsity properties of the wireless channel. The typical approach comes in two stages: first the spatial covariance matrix is estimated while, second, the dedicated (wideband) user channels are estimated within the estimated spatial subspaces. This approach has been mostly applied to the narrowband signaling case, exploiting the channel sparsity in the, so called, angle domain (see [2] for an overview). Extensions of CS techniques to the wideband (OFDM) massive MIMO channel have been recently considered in [3], [4], [5]. One intuitive motivation for such an approach is that there is more "space" in this larger structure so that pilot contamination induced by co-pilots in other cells is effectively combated. Initial work has been provided in [3], arguing that with large enough resolution in angular and time domain controlled by the number of antennas and subcarriers, the channels become approximately orthogonal so that pilot contamination diminishes. Notably, the subspace estimates are obtained through pilot coordination over multiple slots, but not primarily through sparse channel estimation. Moreover, the claimed orthogonality is not rigorous and crucially depends on the parameter setting, let alone that bandwidth (i.e. subcarriers) is not a free design parameter. In [4], a computationally efficient CS-based CSI acquisition approach is proposed, utilizing observations from only a limited number of subcarriers and antennas. Moreover, a fine resolution of angle and time domain together with a sparse subspace tracking algorithm is proposed. The resulting algorithm shows good performance and is exploited for pilot decontamination purposes. A similar setting is also considered in [5], where a minimum mean squared error (MMSE) channel estimator is proposed, however, requiring prior knowledge of the channel second order statistics. Generally, a two stage approach may require excessive observations in time in order to obtain an accurate estimate of the received signal covariance matrix, which is critical, e.g., in massive MTC setting. Another major problem is that the proposed CS algorithms lack rigorous performance analysis in terms of estimation error and, equally important, number of utilized subcarriers in order to achieve it. This is mainly due to the specific Kronecker-like measurement structure of the equivalent CS problem, where, as we will show, the classical CS assumptions fail to hold entailing convergence issues and leakage effects. However, targeting for low complexity algorithms as well as small training overhead say in future MTC applications, it is imperative to understand such structures and to derive a tailored algorithmic framework. Contributions: We propose an efficient "one stage" uplink massive MIMO wideband (OFDM) channel estimation taking into account only the sparsity of the wideband channel into account without any additional prior knowledge. In particular, we identify that the channel estimation problem can be posed as the identification of a vector that is hierarchically sparse, i.e., it is not only sparse but its support possesses certain structural properties. This property together with the Kronecker-like measurement is taken into account for the design of an efficient CS-inspired algorithm tailored for this particular setup. In addition, by extending the standard CS analysis methodology to hierarchical sparse vectors, we provide a rigorous analysis of the algorithm performance in terms of estimation error as well as number of pilot subcarriers required to achieve it. We believe, that this is the first paper that draws a rigorous connection between the hierarchical framework and Kronecker measurements. Preliminary simulations show that exploitation of the hierarchical sparsity property leads to improved estimation performance compared to standard CS algorithms that ignore this property. Even worse, standard algorithms may completely fail in some specific parameter settings. Obviously, this might have some profound impact on system parameters such as pilot signal design, user capacity per cell etc.