Message Boards » Tech Talk » Data I/O and Matlab
darkone
(\/) (;,,,;) (\/)
11611 Posts

Random question for anyone who might have some ideas on a mystery I have:

Context:
I've mounted some NFS shares on Windows. The performance in reading data from these shares seems acceptable for everything I can think of except Matlab. If I copy files from the shares via FTP or robocopy from my Windows machine, I average about 40-45 MB/s.

Matlab behaves differently:
I wrote a Matlab script to read data from 10 different HDF files to use as a benchmark. If the files are on the local drive, the benchmark runs in just over 6 seconds. If the files are on the NFS share, the benchmark takes 330 seconds. Using robocopy to copy the benchmark files from the NFS share to a local directory took 22 seconds. Something strange is happening in Matlab that doesn't make sense to me.
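For reference, the benchmark is roughly this shape (the file names and dataset path are just placeholders, and h5read stands in for whatever HDF reader these files actually need):

% rough shape of the benchmark -- names below are placeholders
files = arrayfun(@(k) sprintf('granule_%02d.h5', k), 1:10, 'UniformOutput', false);
tic
for k = 1:numel(files)
    data = h5read(files{k}, '/some_dataset');   % one full-variable read per file
    % ...light post-processing on data...
end
toc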

Anyone have any ideas?

6/29/2015 5:07:10 PM

clalias
All American
1580 Posts

Reading/writing data line by line or record by record across networks is always slower for us. I'm not sure why, but it's true in other languages as well, like C, Python, Java, etc. I've always chalked it up to some magic that OSes do to move things in bulk quickly. IDK.
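You can see it even in a toy comparison like this (hypothetical file, just to show one bulk read vs. lots of record-sized reads):

% toy illustration only -- 'records.dat' is a made-up file of doubles
fid = fopen('records.dat', 'r');
tic; allData = fread(fid, Inf, 'double'); toc    % one bulk read of the whole file
frewind(fid);
tic
while ~feof(fid)
    rec = fread(fid, 100, 'double');             % many small, record-sized reads
end
toc
fclose(fid);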

We always copy files over to a local server and do our runs then move the data back.

If you find anything else out, I'd be interested to hear.

[Edited on June 29, 2015 at 7:25 PM. Reason : .]

6/29/2015 7:24:53 PM

darkone
(\/) (;,,,;) (\/)
11611 Posts

I haven't tried to benchmark this for other file types and I don't know anything about the under-the-hood workings of the HDF libraries. Obviously, a bunch of small reads is going to be really slow because of overhead.

My workflow usually prohibits moving the files between machines. I constructed my benchmark to read 10 files, but for actual tasks I'm reading tens of thousands of files with a combined size of dozens of terabytes. Usually I execute code on the local machines where the data lives, but I've been exploring alternatives for how we interact with our data servers. I'd like to drag my lab out of the '90s, technology-wise.

We have 5 CentOS servers with a combined 14 RAID 5 & 6 data volumes and 300+ TB total writable capacity. The servers all share the volumes with each other via NIS/NFS. Most of the users (10-15 max) have windows workstations and do their work directly on the data servers via SSH + X-Win (usually Matlab). I know this is an archaic setup, but I was trained as a scientist, not a sys admin. I suppose we really need a consultant but I'm pretty sure we can't afford one.

6/30/2015 11:16:54 AM

clalias
All American
1580 Posts

Executing the code on the file server is as fast as you're going to get with the current process. I'd look into new ways of handling the data. Do you really need to read it all in for a single run -- tens of terabytes? Can you load a smaller chunk and then fork a process to do something while you read in more data for the next run, i.e. parallel reading/tasking?
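Something like this, just a sketch assuming the Parallel Computing Toolbox, where read_chunk, process_chunk, and nChunks stand in for whatever your read and analysis steps actually are:

% sketch: overlap reading the next chunk with processing the current one
p = gcp();                                          % grab or start a parallel pool
results = cell(1, nChunks);
pending = parfeval(p, @process_chunk, 1, read_chunk(1));
for k = 2:nChunks
    nextChunk = read_chunk(k);                      % read while a worker crunches the previous chunk
    results{k-1} = fetchOutputs(pending);           % wait for the previous chunk's result
    pending = parfeval(p, @process_chunk, 1, nextChunk);
end
results{nChunks} = fetchOutputs(pending);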

Are you just looking for a subset of the data? Would an SQL server handle the data better, through queries that return just what you need to your local machine to run on? Since you can't afford a consultant, I'm guessing you're not getting new hardware either. So living within your paradigm, I think you can only optimize the ways you use the data.
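For the SQL route, from the Matlab side it could look something like this (just a sketch, assuming the Database Toolbox; the data source, table, and columns are made up):

% sketch: pull only the subset you need instead of reading whole files
conn = database('snowdata', 'someuser', 'somepassword');   % ODBC/JDBC data source (hypothetical)
sql  = ['SELECT lat, lon, obs_value FROM observations ' ...
        'WHERE obs_time BETWEEN ''2015-01-01'' AND ''2015-01-31'''];
subset = fetch(conn, sql);                                 % only the matching rows cross the network
close(conn);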

6/30/2015 12:38:29 PM

darkone
(\/) (;,,,;) (\/)
11611 Posts

I'm aggregating the data to get at various spatial statistics, so I do need to read every file. However, the reading can be parallelized. I do this kind of analysis just infrequently enough that I haven't wanted to rework all my data structures to handle being split and sent to different nodes (cores). I let this chart be my guide for this sort of thing: https://xkcd.com/1205/
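If/when I do parallelize it, the shape would be something like this sketch (the file list, variable name, and per-file statistic are all placeholders):

% sketch: independent per-file reads in parallel, combined afterward
nFiles   = numel(fileList);
partials = cell(1, nFiles);
parfor k = 1:nFiles
    data        = h5read(fileList{k}, '/some_variable');   % each worker reads its own files
    partials{k} = mean(data(:));                            % stand-in per-file statistic
end
spatialStat = mean(cell2mat(partials));                      % combine the per-file results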

We don't often run into a situation where we need a database, but we are building one to catalog data from our Multi-Angle Snowflake Camera. Usually I design my output structures so that I can make subsets of the data post-aggregation, since I usually don't know what kind of subsets I want until I start wading through the data.

6/30/2015 1:14:15 PM

darkone
(\/) (;,,,;) (\/)
11611 Posts

Fun fact: Not that this is surprising, but my benchmark runs almost twice as fast on a local SSD versus a local spinning disk drive.

7/22/2015 3:54:56 PM

BigMan157
no u
103354 Posts

i'm surprised it's only twice as fast

7/22/2015 4:08:04 PM

darkone
(\/) (;,,,;) (\/)
11611 Posts

It's not a 100% I/O benchmark.

7/22/2015 10:48:27 PM
