Multi-FPGA Interconnection: latest techniques...

partha sarathy wrote:
Hi Experts,

In FPGA prototyping/emulation flows, multi-FPGA partitioning limits performance because of the limited number of I/O pins.
What are the latest multi-FPGA interconnection techniques available today? How much performance improvement can be expected from using multi-gigabit transceivers?

Thanks in Advance
Parth
 

How much performance do you want? There are transceivers upwards of 56 Gbps
these days. Questions:

How many transceivers can you get at that speed?
How to route an nn Gbps signal from one place to another?
How many transceivers can you successfully route and at what speed?
How to make that reliable in the face of bit errors, packet loss and other errors?
What end-to-end bandwidth can you actually achieve?
What latency impact does all that extra processing have?
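
As a rough way to put numbers on those questions, here is a back-of-envelope budget sketch; every figure in it (lane count, line rate, protocol efficiency, per-hop latency) is an assumption for illustration, not a measurement:

```python
# Back-of-envelope bandwidth/latency budget for a transceiver-based
# inter-FPGA link.  All figures below are illustrative assumptions.

lanes              = 8        # transceivers you manage to route per FPGA pair
line_rate_gbps     = 25.0     # per-lane serial rate
encoding_eff       = 64 / 66  # 64b/66b line-coding efficiency
protocol_eff       = 0.90     # assumed framing/CRC/retransmission overhead
per_hop_latency_ns = 150.0    # assumed SERDES + reliability-layer latency

usable_gbps = lanes * line_rate_gbps * encoding_eff * protocol_eff
cycles_at_100mhz = per_hop_latency_ns * 100e6 / 1e9

print(f"usable end-to-end bandwidth ~ {usable_gbps:.0f} Gbps")
print(f"per-hop latency ~ {per_hop_latency_ns:.0f} ns "
      f"(~{cycles_at_100mhz:.0f} cycles of a 100 MHz system clock)")
```

The point being that the raw line rate is only the starting point; coding, protocol and routing losses all take their cut, and latency is a separate budget entirely.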

Relevant paper of mine:
https://www.cl.cam.ac.uk/~atm26/pubs/FPL2014-ClusterInterconnect.pdf

Theo
 

Hi Theo,

Thanks a lot for the reply.

On a Xilinx UltraScale board with 8 FPGAs, using automatic FPGA partitioning tools (which insert muxes for pin multiplexing, i.e. HSTDM multiplexing), the maximum system performance achieved is only 10-15 MHz.
An individual FPGA may run at up to 100 MHz, but overall performance is limited to 10-15 MHz because the tool inserts pin muxes with ratios of 8:1, 16:1 and so on.

Is there any interconnect technology that can achieve 70-100 MHz on a 4-8 FPGA board?
Whether partitioning is done manually or by an auto-partitioning tool, can the Bluelink interconnect or GTX transceivers achieve 70-100 MHz? The interconnect logic area overhead can be tolerated.
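
For reference, a minimal sketch of why pin TDM caps the system clock; the I/O bit rate and per-frame overhead are assumptions for illustration, and real flows land lower still because of multi-hop paths and timing margins:

```python
# Rough model of how pin time-division multiplexing (TDM) caps the
# emulation system clock.  The I/O bit rate and per-frame overhead are
# illustrative assumptions; real flows come in lower than this because of
# multi-hop paths, synchronisation and timing margins.

io_bit_rate_mbps = 400   # assumed effective bit rate per multiplexed pin
overhead_slots   = 2     # assumed sync/handshake slots per TDM frame

for tdm_ratio in (4, 8, 16, 32):
    frame_slots = tdm_ratio + overhead_slots   # one slot per design signal
    system_clk_mhz = io_bit_rate_mbps / frame_slots
    print(f"TDM {tdm_ratio:>2}:1 -> system clock <= {system_clk_mhz:5.1f} MHz")
```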
 

Are you sure you aren't doing something wrong? The purpose of pin muxing would seem to be to increase the data rate. But I assume this will incur pipeline delays. Or do I not understand how this is being used?

--

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
 
Hi Rick,
Thanks for the reply with details.
Does the pipeline delay inserted by the gigabit transceiver amount to more than 20 ns, say for a 50 MHz FPGA clock?


Best Regards
Parth
 

Sorry, I'm not at all clear about what you are doing.

Maybe I misunderstood what you meant by pin muxing. Are they using fewer pins and sending data for multiple signals over each pin? That would definitely slow things down.

Using SERDES (the gigabit transceiver you mention) should speed that up, but might include some pipeline delay. I'm not that familiar with their operation, but I assume you have to parallel load a register that is shifted out at high speed and loaded into a shift register on the receiving end, then parallel loaded into another register to be presented to the rest of the circuitry. If that is how they are working, it would indeed take a full clock cycle of latency.
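
As a toy model of that parallel-load/serialize/deserialize path (the word width, line rate and extra hard-macro latency here are assumptions, not measured GTX figures):

```python
# Toy latency model of one SERDES hop: parallel load -> serialize ->
# deserialize -> parallel present.  All numbers are illustrative assumptions,
# not datasheet values.

word_bits        = 32     # parallel word loaded into the serializer
line_rate_gbps   = 10.0   # assumed serial line rate
extra_latency_ns = 100.0  # assumed PCS/alignment/buffering inside the macro

serialize_ns   = word_bits / line_rate_gbps   # shift the word out
deserialize_ns = word_bits / line_rate_gbps   # shift it back in
total_ns = serialize_ns + deserialize_ns + extra_latency_ns

print(f"one-way word latency ~ {total_ns:.1f} ns")
print(f"that is ~{total_ns * 50e6 / 1e9:.1f} cycles of a 50 MHz system clock")
```

So even before any reliability protocol is added, the hop costs several system-clock cycles rather than the sub-20 ns asked about above.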

--

Rick C.

+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
 

That's right - you get a parallel FIFO interface. There's no guarantee what
you put in will get to the other end reliably (if BER is 10^-9 say and your
bit rate is 10 Gbps, that's one error every 0.1 s). So on these kinds of
links to be reliable you need some kind of error correction or
retransmission. In the Bluelink case, that was hundreds of ns.

Basically you end up with something approaching a full radio stack, just
over wires.
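
To put rough numbers on that (the BER and line rate match the figures above; the frame size and CRC width are assumptions):

```python
# Back-of-envelope reliability math for a raw multi-gigabit link.
# BER and line rate follow the figures above; frame and CRC sizes are
# illustrative assumptions.

ber        = 1e-9    # raw bit error rate
line_gbps  = 10.0    # serial line rate
frame_bits = 512     # assumed payload bits per frame
crc_bits   = 32      # assumed CRC-32 per frame for error detection

errors_per_s   = ber * line_gbps * 1e9
frame_err_prob = 1 - (1 - ber) ** (frame_bits + crc_bits)
goodput_gbps   = line_gbps * frame_bits / (frame_bits + crc_bits) * (1 - frame_err_prob)

print(f"raw bit errors: ~{errors_per_s:.0f}/s, i.e. one every {1 / errors_per_s:.2f} s")
print(f"frame error probability: {frame_err_prob:.1e}")
print(f"goodput after CRC framing and retransmission: ~{goodput_gbps:.2f} Gbps")
```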

Theo
 
Hi Rick,
Thanks for the clarifications. It is now clear that SERDES is not suitable for pin muxing.

Regards
Parth
 

Multi-Gigabit Transceiver (MGT): MGTs are configurable hard macros implemented for inter-FPGA communication. The data rate can be as high as ~10 Gbps [MGT, 2014]. Nevertheless, the MGT has a high latency (~30 fast clock cycles) that limits the system clock frequency, and only a few are available. When the TDM ratio is 4, the system clock frequency is ~7 MHz [Tang et al., 2014]. In addition, the communication between MGTs is not error-free; they come with a non-zero bit error rate (BER). Therefore, at this moment, the MGT is not used as the inter-FPGA communication architecture in multi-FPGA prototyping.
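
A rough reconstruction of that ~7 MHz figure (only the ~30-cycle latency and the TDM ratio of 4 come from the excerpt above; the word width and the derived fast clock are assumptions):

```python
# Rough reconstruction of how MGT latency caps the system clock.
# The ~30-cycle latency and TDM ratio of 4 come from the excerpt above;
# the word width and derived fast clock are assumptions.

line_rate_gbps = 10.0
word_bits      = 32
fast_clk_mhz   = line_rate_gbps * 1e3 / word_bits   # ~312.5 MHz word clock
latency_cycles = 30                                 # MGT hard-macro latency
tdm_ratio      = 4                                  # signals per system cycle

period_ns = (latency_cycles + tdm_ratio) / fast_clk_mhz * 1e3
print(f"system period ~ {period_ns:.0f} ns -> ~{1e3 / period_ns:.1f} MHz")
# Protocol and synchronisation overhead on top of this pushes the result
# down towards the ~7 MHz reported by Tang et al.
```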
 

It really depends on what you mean by 'prototyping'. If you have
interconnect which is tolerant of latency, such that the system doesn't mind
that messages take several cycles to get from one place to another (typical
of a network-on-chip implementing, say, AXI), then using MGT with a
reliability layer is fine for functional verification.

If you mean dumping a hairball of an RTL netlist across multiple FPGAs and
slowing the clock until everything works in a single cycle, then they're
not right for that job.

They're both prototyping, but at different levels of abstraction.
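
A minimal sketch of the latency-tolerant style described above, modelled as a valid/ready-style channel; the class name and the 12-cycle delay are purely illustrative assumptions:

```python
# Minimal model of a latency-insensitive channel: producer and consumer only
# handshake on valid/ready, so extra link latency (e.g. an MGT hop plus a
# reliability layer) changes timing but not functional behaviour.
# The class name and the 12-cycle delay are illustrative assumptions.

from collections import deque

class DelayedChannel:
    def __init__(self, latency_cycles):
        self.latency = latency_cycles
        self.pipe = deque()                 # (arrival_cycle, payload) in flight

    def send(self, cycle, payload):         # producer asserts valid
        self.pipe.append((cycle + self.latency, payload))

    def recv(self, cycle):                  # consumer pops when data has arrived
        if self.pipe and self.pipe[0][0] <= cycle:
            return self.pipe.popleft()[1]
        return None                         # nothing valid yet; just wait

link = DelayedChannel(latency_cycles=12)
received = []
for cycle in range(40):
    if cycle < 8:
        link.send(cycle, f"beat{cycle}")    # AXI-style transfer beats
    data = link.recv(cycle)
    if data is not None:
        received.append(data)

print(received)  # same beats, same order, just later: the design still works
```

A cycle-accurate netlist split across FPGAs has no such tolerance: every cross-chip signal must settle within one system clock, which is why the clock has to be slowed instead.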

Theo
 
