HW designers manage to describe systems of millions of gates, all running in parallel, almost without hassle. They describe their systems in HDLs. SW programmers have difficulty with even two threads.
I used to think that this was a matter of experience: HW designers are taught to think in parallel. After doing some asynchronous HDL work, I now think that this is a misconception.
In HW processing, almost everything is done synchronously, in one clock domain. Yes, you must think in parallel to understand what is going on. SW operations are too complex to support this mode of operation (they have variable length, are much more complex than CISC instructions, etc.), so their execution times are unpredictable. Some architectures do attempt to execute SW operations in a lockstep-parallel way (SIMD, for example). Within one clock domain, though, synchronization is trivial and the communication cost is zero, so the parallelism comes at almost no cost. Thinking in parallel there is very much like writing a sequential program. There is no nondeterminism.
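To make the determinism point concrete, here is a minimal Python sketch of a single clock domain. It is not an HDL simulator; the names `clock_tick`, `rules`, and `run`, and the three-register design, are made up for illustration. Every register computes its next value from the old state, and all of them commit together at the clock edge, so the order in which the "parallel" rules are evaluated cannot change the result.

```python
def clock_tick(state, rules):
    """Advance one clock cycle for every register at once."""
    # Phase 1: every rule reads only the OLD state
    # (similar in spirit to an HDL non-blocking assignment).
    next_values = {name: rule(state) for name, rule in rules.items()}
    # Phase 2: all registers commit together at the clock edge.
    return {**state, **next_values}


# Hypothetical three-register design: a and b swap every cycle, count increments.
rules = {
    "a": lambda s: s["b"],
    "b": lambda s: s["a"],
    "count": lambda s: s["count"] + 1,
}


def run(rules, cycles):
    """Run the design for a number of cycles from a fixed reset state."""
    state = {"a": 1, "b": 2, "count": 0}
    for _ in range(cycles):
        state = clock_tick(state, rules)
    return state


forward = run(rules, 3)
# Evaluate the same rules in the opposite order: the result must not change.
backward = run(dict(reversed(list(rules.items()))), 3)
print(forward == backward)  # True
```

The swap of `a` and `b` is the telling part: written as two ordinary sequential assignments it would need a temporary variable, but with a shared clock edge it just works, and nobody has to think about who runs first.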
But once you need to cross into another clock domain, HW communication becomes very complex, with a lot of overhead. The mechanisms (4-way handshaking, for example) are very similar to those you find in SW programming.
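As a rough illustration of that overhead, the sketch below simulates a four-phase request/acknowledge handshake between two domains that tick at different rates. I am assuming this is the kind of protocol meant by "4-way handshaking". The function `transfer`, the `wires` dictionary, and the clock periods are hypothetical, and the model is simplified: a real clock-domain crossing would also pass `req` and `ack` through synchronizer flip-flops, which are omitted here.

```python
def transfer(items, sender_period=3, receiver_period=5, max_time=10_000):
    """Move items across a simulated clock-domain boundary, one per handshake."""
    wires = {"req": False, "ack": False, "data": None}
    pending = list(items)
    received = []

    for t in range(max_time):
        if t % sender_period == 0:  # sender clock edge
            if not wires["req"] and not wires["ack"] and pending:
                wires["data"] = pending.pop(0)  # phase 1: present data, raise req
                wires["req"] = True
            elif wires["req"] and wires["ack"]:
                wires["req"] = False            # phase 3: ack seen, drop req

        if t % receiver_period == 0:  # receiver clock edge
            if wires["req"] and not wires["ack"]:
                received.append(wires["data"])  # phase 2: capture data, raise ack
                wires["ack"] = True
            elif not wires["req"] and wires["ack"]:
                wires["ack"] = False            # phase 4: drop ack, ready again

        # Done once everything is sent and both wires are back to idle.
        if not pending and not wires["req"] and not wires["ack"]:
            return received, t

    raise RuntimeError("handshake did not complete")


received, elapsed = transfer([1, 2, 3])
print(received, elapsed)
```

Each item needs four signal transitions, and each transition waits for the other domain's next clock edge. So a transfer that would cost nothing inside one clock domain now takes many time steps, and the data is only safe to read because `req` and `ack` bracket it.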
True parallelism is asynchronous. Writing truly parallel programs in an HDL is as difficult as it is in typical programming, if not more so.