darkone (\/) (;,,,;) (\/) 11611 Posts |
I thought I would get the opinions of some of the folks around here on a problem I've been debating how to manage.
I work with satellite data. The data sets I use are multi-terabyte and can't be stored on a typical desktop workstation. The analysis I usually perform involves aggregating data from all the files in a data set. In my lab, the files are stored on several servers accessed across the LAN over 100 Mbps links. For my work, processing time is always IO limited.
What suggestions would you have for removing the IO bottleneck? Should I be considering something like fiber interlinks? Expensive networking hardware? How would you do this on a really small budget? I'm looking at situations where reducing the IO bottleneck by 20% would save days or even weeks of processing time per project. 3/9/2010 1:16:33 PM |
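(For concreteness, the processing is roughly the shape of the Python sketch below: every pass walks every file in the data set over the network share, so runtime tracks raw read speed. The paths and the flat-float32 layout are made up; the real code hands the decompressed files to an HDF reader.)

import glob
import gzip

import numpy as np

total = 0.0
count = 0

# every pass touches every file in the data set, so runtime is dominated by
# how fast bytes come off the network share
for path in glob.glob("/mnt/satellite/2009/*.hdf.gz"):   # network-mounted share (hypothetical)
    with gzip.open(path, "rb") as f:
        raw = f.read()                                   # the IO-bound part
    # pretend the payload is a flat array of float32 samples; the real code
    # would hand the decompressed file to an HDF reader instead
    samples = np.frombuffer(raw[: len(raw) // 4 * 4], dtype=np.float32)
    total += float(samples.sum())
    count += samples.size

print("mean over data set:", total / count if count else float("nan"))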
qntmfred retired 40817 Posts |
are you saying the IO bottleneck is network IO or disk IO?
as for network, it should be very easy to upgrade to gigabit. but even then, your disk IO is most likely going to be the limiting factor anyway. you probably aren't even maxing out the 100 Mbps. if you're able to upgrade your hard drives, look for drives with high IOPS numbers. there's not typically a lot of room to upgrade unless you go to SSD though, which is $texas
[Edited on March 9, 2010 at 1:58 PM. Reason : pick up a few of these http://www.ramsan.com/products/ramsan-20.htm] 3/9/2010 1:26:09 PM |
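(A quick way to check whether that 100 Mbps link is actually being maxed out: time a sequential read of one big file off the network share and compare against the ~12 MB/s wire-speed ceiling. A rough sketch with a made-up path; use a file bigger than RAM, or one that isn't already cached:)

import time

PATH = "/mnt/satellite/sample_granule.hdf.gz"   # hypothetical file on the network share
CHUNK = 8 * 1024 * 1024                         # 8 MB reads

start = time.time()
nbytes = 0
with open(PATH, "rb") as f:
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        nbytes += len(buf)
elapsed = time.time() - start

print("%.1f MB in %.1f s -> %.1f MB/s" % (nbytes / 1e6, elapsed, (nbytes / 1e6) / elapsed))
# ~11-12 MB/s means the 100 Mbps link is saturated (network-bound);
# well below that points at the server's disks or protocol overhead instead.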
FroshKiller All American 51913 Posts |
RAMDISK 3/9/2010 1:31:17 PM |
darkone (\/) (;,,,;) (\/) 11611 Posts |
^^ The data on the servers is on 12-disk RAID 5 volumes (7200 RPM SATA drives). In theory, the data never touches the HDD on the local workstation since it stays in memory the whole time it's being used.
Sadly, my lab is at the mercy of NCSU and departmental IT for the network hardware between the workstations and the servers. I'd love to install gigabit switches, but it will be a cold day in hell before they put up the money for new network hardware.
I wish we could afford SSDs. I shudder to think what 100 TB of SSDs would cost. 3/9/2010 3:22:27 PM |
greeches Symbolic Grunge 2604 Posts |
I would say upgrade to at least gigabit or add more disks. You are either network or disk bound; throw more spindles into the mix. RAID-5 isn't a very fast array either; you may wanna try some sort of RAID-0 config for performance if you are indeed disk bound.
Have you figured out where your bottleneck is? (network/storage) 3/9/2010 4:12:34 PM |
darkone (\/) (;,,,;) (\/) 11611 Posts |
^ I don't know if I'm network or disk limited. That's a test I'll have to run. 3/9/2010 5:45:08 PM |
Shaggy All American 17820 Posts |
RAID 5 offers good read performance and poor write performance. Given your limited budget it's probably the best choice. If you can get more disks, maybe do RAID 10 (if it is disk limited).
don't ever use RAID 0.
[Edited on March 9, 2010 at 6:08 PM. Reason : aaa] 3/9/2010 6:07:49 PM |
Shadowrunner All American 18332 Posts |
This is a shot in the dark since you haven't described your work in any detail, but "satellite" + "multi-terabyte" suggests to me that you're working with sparse, uncompressed data. You might be able to dramatically reduce the amount of data you need to touch if you rework some of your analysis to take advantage of compressive sensing algorithms.
This has been my "I realize this thread is about reducing your IO bottleneck... which my solution is not" suggestion. 3/9/2010 6:36:12 PM |
smoothcrim Universal Magnetic! 18968 Posts |
what are the chances that this data is deduped? if you could get a decent filer that would dedupe the data and cache what's left, you could increase your IO significantly since many blocks wouldn't have to be read from disk. I would set up OpenSolaris and a ZFS pool. you can get a lot of this functionality with essentially a JBOD and a cheap box to run it. I'd also run gigabit links since any modern SATA disk can fully saturate a 100 Mbps connection
[Edited on March 9, 2010 at 6:51 PM. Reason : and if you can't run 1GbE, then run 802.11n MIMO. even that is more than 100 Mbps] 3/9/2010 6:50:38 PM |
Opstand All American 9256 Posts |
Or buy a networked storage system from me 3/9/2010 6:53:21 PM |
smoothcrim Universal Magnetic! 18968 Posts |
I got fucking cold called by netapp yesterday while their stuff was shitting the bed. the conversation that ensued was epic 3/9/2010 7:16:29 PM |
gs7 All American 2354 Posts |
^^I came here to suggest that.
^You can't say that without giving details
[Edited on March 9, 2010 at 7:23 PM. Reason : .] 3/9/2010 7:22:20 PM |
smoothcrim Universal Magnetic! 18968 Posts |
basically there's a virtually undocumented 16TB limit on deduped data within a volume. when you hit that limit (whether it's actually 16TB or 16 thin-provisioned TB - I guess ONTAP is protecting itself since the volume could balloon to that size), instead of instant or dedupe clones it starts making deep copies. in the middle of the shitstorm created by this, which I'm trying to clean up, some guy from NetApp calls me and asks if I'm interested in NetApp and if I plan on buying any in the future. hilarity followed. 3/10/2010 10:03:06 AM |
darkone (\/) (;,,,;) (\/) 11611 Posts |
Quote: ""satellite" + "multi-terabyte" suggests to me that you're working with sparse, uncompressed data"
The data are neither sparse nor uncompressed.
The data in question are satellite HDF files. We keep them gzipped since they compress by about 40-60% from their binary form. http://en.wikipedia.org/wiki/Hierarchical_Data_Format 3/10/2010 7:24:23 PM |
Shadowrunner All American 18332 Posts |
Carry on, then. Like I said, that was a shot in the dark on my part. 3/10/2010 9:40:39 PM |
Shaggy All American 17820 Posts |
Just to make sure, is the data decompressed client side (after the network transfer) or server side (prior to the network transfer)? 3/10/2010 9:43:25 PM |
darkone (\/) (;,,,;) (\/) 11611 Posts |
^ Client side. The files traverse the network in their compressed state. 3/10/2010 9:55:01 PM |
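(A minimal sketch of that client-side step: the compressed bytes cross the network, then get gunzipped to a temp file and handed to the HDF library. h5py here is an assumption for illustration - HDF4 granules would need pyhdf instead - and the path is hypothetical.)

import gzip
import shutil
import tempfile

import h5py  # assumption: HDF5 granules; HDF4 would need pyhdf instead

def open_granule(gz_path):
    """Gunzip a compressed granule to a temp file and open it with the HDF library.

    Only the compressed bytes cross the network; decompression happens here on
    the client. Caller is responsible for deleting the temp file when done.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".hdf", delete=False)
    with gzip.open(gz_path, "rb") as src:
        shutil.copyfileobj(src, tmp)
    tmp.close()
    return h5py.File(tmp.name, "r")

# usage (hypothetical path on the network share):
# f = open_granule("/mnt/satellite/2009/granule_001.hdf.gz")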
Shaggy All American 17820 Posts |
well then! Sounds like the next step is to actually figure out if it's disk or network.
If it is network, I wonder if you could buy some (relatively) cheap sata disks and create your own local replica of the data. How often is it updated? 3/10/2010 9:59:42 PM |
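(If it does turn out to be the network, a dumb one-way mirror onto a cheap local SATA disk is only a few lines. A sketch, with made-up paths, that only copies whatever is new or changed since the last sync:)

import os
import shutil

# One-way mirror of the network share onto a local disk: copy anything that's
# missing locally or whose size has changed. Paths are hypothetical.
SRC = "/mnt/satellite/2009"       # network share
DST = "/data/local_mirror/2009"   # cheap local SATA disk

if not os.path.isdir(DST):
    os.makedirs(DST)

for name in sorted(os.listdir(SRC)):
    src = os.path.join(SRC, name)
    dst = os.path.join(DST, name)
    if not os.path.isfile(src):
        continue                  # skip subdirectories in this sketch
    if not os.path.exists(dst) or os.path.getsize(dst) != os.path.getsize(src):
        print("copying", name)
        shutil.copy2(src, dst)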
darkone (\/) (;,,,;) (\/) 11611 Posts |
^ The data sets are fairly static. I tend to update them monthly in one-month chunks.
It's hard to justify the expense of purchasing hardware to store the data locally on users' workstations, considering that we already spend a lot maintaining and expanding our primary network storage machines, and the fact that it's very difficult to get funding agencies to authorize spending on computer hardware. It's one of those things they expect departments to provide out of grant overhead, and departments that will spend money on that sort of thing are very rare. In our department we get grief about printer paper; I'm sure you can imagine how a request for workstations with 10 TB of local storage would go.
Given an unlimited budget, we would have moved to fiber and SSDs, and I could chew my way through a 10 TB dataset in less than half a day.
[Edited on March 11, 2010 at 5:25 PM. Reason : typing FTL] 3/11/2010 5:24:28 PM |