joe17669 All American 22728 Posts user info edit post |
I'm trying to do some analysis on some data and am having some trouble trying to understand how Excel calculates data. Every time I make a change, Excel recalculates my entire workbook, which takes about 5 minutes on my dual core machine. I'm using Excel 2007.
The dataset is fairly large, about 100,000 rows and 13 columns. This data is fixed and doesn't change. Each row is identified by a date and time (time interval is 5 minutes) on column A, and I need to do a lot of conditional array equations to extract the data I need from columns B-M. I'm using array equations to keep me from having to create a lot of dummy data and rows, and generally helps keep the whole process automated and lets me change things without having to do hours of work.
For example, if I want to get the average of column B (line voltage) for each day, I would do something simple like:
10/25/2008 {=average( if( (month(Data!$A$2:$A$100000)=10) * (day(Data!$A$2:$A$100000)=25) , $B$2:$B$100000) )} 10/26/2008 {=average( if( (month(Data!$A$2:$A$100000)=10) * (day(Data!$A$2:$A$100000)=26) , $B$2:$B$100000) )} 10/27/2008 {=average( if( (month(Data!$A$2:$A$100000)=10) * (day(Data!$A$2:$A$100000)=27) , $B$2:$B$100000) )}
I can understand that Excel takes a while to calculate this, but what I don't seem to get is why Excel must recalculate these cells any time I make any type of change to the worksheet that has no dependencies on the cells. If I were to type some text in a blank cell, it would recalculate the whole workbook. The same would happen if I bold or italicize a label. My conditional statements are more complex than what I posted above, and have several thousand of them.
Is there any way to improve the efficiency of how Excel determines its dependencies for recalculating data? I'm sure my equations aren't the most efficient, but I know that it shouldn't be *this* slow. I know I could turn off automatic calculation alltogether but I'm trying to avoid that so I can get immediate feedback on any errors. I've been digging through the options without much success.
I'm sure there's a way to put all this data into a database and run queries off it to make it faster, but there shouldn't be any reason Excel should fuss about me asking it to give me the average value of column B where the date is 10/25/2008 between the hours of 00:00 and 12:00. I'm also limited to Excel for this effort; Matlab unfortunately is not an option.
Thanks!]1/28/2009 9:13:47 AM |
agentlion All American 13936 Posts user info edit post |
you can go to Tools > Options > Calculation and change it from "automatic" to "manual". Then it only re-evaluates all the formulas when you press F9 or save the file. 1/28/2009 9:53:36 AM |
ScHpEnXeL Suspended 32613 Posts user info edit post |
Quote : | "I know I could turn off automatic calculation alltogether but I'm trying to avoid that so I can get immediate feedback on any errors." |
1/28/2009 9:55:09 AM |
agentlion All American 13936 Posts user info edit post |
oh, i didn't read the whole thing.
in that case, you're stuck. If you want to get immediate feedback, then Excel must reevaluate all the data. If Excel must reevaluate all the data, then you must wait for it. It does suck, or it is curious, that Excel recalculates everything with almost every keystroke, but I've never found a way around that. 1/28/2009 10:03:21 AM |
CalledToArms All American 22025 Posts user info edit post |
^^i guess im confused, if he is making a change to the worksheet that has no dependencies on other cells I am assuming its very simple input if not just text or bolding or italicizing and that is causing the problem, then why would he not want to turn off autocalculation and then just press F9 everytime he needs it like agentlion suggested?
sure its not that convenient but keep pressing it but it seems like itd still be easier than having it on automatic.
[Edited on January 28, 2009 at 10:07 AM. Reason : ] 1/28/2009 10:06:47 AM |
ScHpEnXeL Suspended 32613 Posts user info edit post |
I think he's wanting a way to only do the calculations on the cell or row he's actually changed instead of recalculating 100,000 rows of stuff that hasn't changed.
edit: it'd be a hell of a lot faster if it just updated the changes automatically then he could use F9 if he wanted to refresh the whole sheet just to be sure..
[Edited on January 28, 2009 at 10:24 AM. Reason : asf] 1/28/2009 10:24:02 AM |
CalledToArms All American 22025 Posts user info edit post |
def 1/28/2009 11:48:34 AM |
darkone (\/) (;,,,;) (\/) 11611 Posts user info edit post |
I think that this is a case where Excel isn't the optimal tool to use. 1/28/2009 1:19:12 PM |
ScHpEnXeL Suspended 32613 Posts user info edit post |
I think he knows that
I would think you could write a macro to run every time you change cells and make it update whatever math is in the previous cell you were in. That + turning off automatic updates would do it. Hopefully that made sense..
Just an idea--no idea if it's feasible. 1/28/2009 1:22:02 PM |
agentlion All American 13936 Posts user info edit post |
also, if you have calculations set to Manual, you can update single cells/formulas by activating those specific cells by double-clicking then Enter, or F2 then Enter. That will update one cell only.
I have previously worked on workbooks that are about 20MB with ~20 worksheets and thousands of formulas, and I had to do the Manual Calculation option and just use F9 ever once ina while, because otherwise it was just unusable.
^ i found that running Macros that forced calculations switches the calculation mode from Manual to Automatic, unfortunately, so every time you run a macro you ahve to go back into Options and change it back. Maybe it's possible, though, to set Manual mode inside the macro before it exits.
[Edited on January 28, 2009 at 1:26 PM. Reason : .] 1/28/2009 1:24:16 PM |
joe17669 All American 22728 Posts user info edit post |
Quote : | "I think he's wanting a way to only do the calculations on the cell or row he's actually changed instead of recalculating 100,000 rows of stuff that hasn't changed." |
Yes. I don't quite understand why all of the formulas are recalculated when changing a cell that is not referenced in any of the formulas. I'm sure there's a reason why it's doing it, but it isn't too apparent to me. And I only have about 500 formulas that are "querying" the dataset of 100,000 rows. If I had 100k formulas it would take years for it to refresh them all
Quote : | "i guess im confused, if he is making a change to the worksheet that has no dependencies on other cells I am assuming its very simple input if not just text or bolding or italicizing and that is causing the problem, then why would he not want to turn off autocalculation and then just press F9 everytime he needs it like agentlion suggested?" |
Turning it off is certainly a possibility, and is probably what I will end up doing. I was just wondering if there were any other way around it, because in addition to just adding text and what not, I'm also adding more and more formulas, and I like the automatic error checking you get with the auto-calculate.
Quote : | "also, if you have calculations set to Manual, you can update single cells/formulas by activating those specific cells by double-clicking then Enter, or F2 then Enter. That will update one cell only." |
Nice. I'll check that out. Thanks man.
Quote : | "I have previously worked on workbooks that are about 20MB" |
My current file is about 94MB]1/28/2009 1:24:22 PM |
nattrngnabob Suspended 1038 Posts user info edit post |
It's about time you add a new skill set to your arsenal. Excel wasn't cut out for that kind of work. 1/28/2009 2:03:07 PM |
joe17669 All American 22728 Posts user info edit post |
True. But like I said Excel is the only resource I have to use for this project. Matlab would be great.
I haven't explored the possibility of putting all the data into an Access database and having Excel execute the query. I'm assuming it's possible. And I'm sure it would be a lot quicker. 1/28/2009 2:12:30 PM |
nattrngnabob Suspended 1038 Posts user info edit post |
Where do you work that you can't request a serious statistical package like JMP or SAS when you're working with that much data?
There are probably quite a few ways to optimize what you have if this is something you're going to be dealing with on an ongoing basis. For instance, you can add extra columns to do the "if" part of your average statement so when you add data or a cell changes the average part of the equation doesn't have to evaluated everything over again. 1/28/2009 2:39:32 PM |
disco_stu All American 7436 Posts user info edit post |
Quote : | "It's about time you add a new skill set to your arsenal. Excel wasn't cut out for that kind of work." |
[citation needed]1/28/2009 2:46:48 PM |
HaLo All American 14278 Posts user info edit post |
it is possible to have excel reference itself when executing DB queries, I have limited experience with it, however I think that may work pretty well, don't know how fast it is. 1/28/2009 8:19:50 PM |
A Tanzarian drip drip boom 10996 Posts user info edit post |
Quote : | "Excel Recalculation
The user can trigger recalculation in Microsoft Office Excel in several ways, for example:
- Entering new data (if Excel is in Automatic recalculation mode, described later in this topic).
- Explicitly telling Excel to recalculate all or part of a workbook.
- Deleting or inserting a row or column.
- Saving a workbook while the Recalculate before save option is set.
- Performing certain Autofilter actions.
- Double-clicking a row or column divider (in Automatic calculation mode).
- Adding, editing, or deleting a defined name.
- Renaming a worksheet.
- Changing the position of a worksheet in relation to other worksheets.
- Hiding or unhiding rows (but not columns).
[...]
Dependence, Dirty Cells, and Recalculated Cells
The calculation of worksheets in Excel can be viewed as a three-stage process:
1. Construction of a dependency tree
2. Construction of a calculation chain
3. Recalculation of cells
The dependency tree tells Excel which cells depend on which others, or equivalently, which cells are precedents for which others. From this tree, Excel constructs a calculation chain. The calculation chain lists all the cells that contain formulas in the order that they should be calculated. During recalculation, Excel revises this chain if it comes across a formula that depends on a cell that has not yet been calculated. In this case, the cell that is being calculated and its dependents are moved down the chain. For this reason, calculation times can often improve in a worksheet that has just been opened in the first few calculation cycles.
When a structural change is made to a workbook, for example, when a new formula is entered, Excel reconstructs the dependency tree and calculation chain. When new data or new formulas are entered, Excel marks all the cells that depend on that new data as needing recalculation. Cells that are marked in this way are referred to as dirty. All direct and indirect dependents are marked as dirty, so if B1 depends on A1, and C1 depends on B1, when A1 is changed, both B1 and C1 are marked as dirty.
If a cell depends, directly or indirectly, on itself, Excel detects the circular reference and warns the user. This is usually an error condition that the user must fix, and Excel 2007 provides very helpful graphical and navigational tools to help the user track down the source of the circular dependency. In some cases however, you might deliberately want this condition to exist. For example, you might want to run an iterative calculation where the starting point for the next iteration is the result of the previous iteration. Excel supports control of iterative calculations through the calculation options dialog box.
After marking cells as dirty, when a recalculation is next done, Excel reevaluates the contents of each dirty cell in the order dictated by the calculation chain. In the example given earlier, this means B1 is first, and then C1. This recalculation occurs immediately after Excel finishes marking cells as dirty if the recalculation mode is automatic; otherwise, it occurs later.
Since Microsoft Office XP (Excel version 10), the Range object in Microsoft Visual Basic for Applications (VBA) supports a method, Range.Dirty, which marks cells as needing calculation. When it is used together with the Range.Calculate method (see next section), it enables brute force recalculation of cells in a given range. This is useful when you are performing a limited calculation during a macro, where the calculation mode is set to manual, to avoid the overhead of calculating cells unrelated to the macro’s function. Range calculation methods are not available through the C API.
Before Excel 2002, Excel built a calculation chain for each worksheet in each open workbook. This resulted in some complexity in the way links between worksheets were handled, and required some care to ensure efficient recalculation. In particular, in Excel 2000, you should minimize cross-worksheet dependencies and name worksheets in alphabetical order, so that sheets that depend on other sheets come alphabetically after the sheets they depend on.
In Excel 2007, the logic is enhanced to enable recalculation on multiple threads, so that sections of the calculation chain are not interdependent and can be calculated concurrently. You can configure Excel to use multiple threads on a single processor computer, or a single thread on a multi-processor or multi-core computer.
Volatile and Non-Volatile Functions
Excel supports the concept of a volatile function, that is, one whose value cannot be assumed to be the same from one moment to the next even if none of its arguments (if it takes any) has changed. Excel reevaluates cells that contain volatile functions, along with all dependents, every time it recalculates. For this reason, too much reliance on volatile functions can make recalculation times slow, so you should use them very sparingly.
The following Excel functions are volatile.
- NOW
- TODAY
- RAND
- OFFSET
- INDIRECT
- INFO (depending on its arguments)
- CELL (depending on its arguments)
[...]
Calculation Modes, Commands, Selective Recalculation, and Data Tables
Excel has three calculation modes:
- Automatic
- Automatic Except Tables
- Manual
When calculation is set to automatic, recalculation occurs after every data input and after certain events such as the examples given in the previous section. For very large workbooks, recalculation time might be so long that users need to limit when this happens, that is, only recalculating when they need to. To enable this, Excel supports the manual mode. The user can select the mode through the Excel menu system, or programmatically using VBA, COM, or the C API.
Data tables are special structures within a worksheet. First, the user sets up the calculation of a result on a worksheet. This depends on one or two key changeable inputs and other parameters. The user can then create a table of results for a set of values for one or both of the key inputs. The table is created using the Data Table Wizard. After the table is set up, Excel plugs the inputs one-by-one into the calculation and copies the resulting value into the table. As one or two inputs can be used, data tables can be one- or two-dimensional.
Recalculation of data tables is handled slightly differently:
- Recalculation is handled asynchronously to regular workbook recalculation, so that large tables might take longer to recalculate than the rest of the workbook.
- Circular references are tolerated. If the calculation that is used to get the result depends on one or more values from the data table, Excel does not complain about the circular dependency.
Given the different way that Excel handles recalculation of data tables, and the fact that large tables that depend on complex or lengthy calculations can take a very long time to calculate, Excel lets you disable the automatic calculation of data tables. You do this by setting the calculation mode to Automatic except Data Tables. When calculation is in this mode, the user recalculates the data tables by pressing F9 or some equivalent programmatic operation.
Excel exposes methods through which you can alter the recalculation mode and control recalculation. These methods have been enhanced from version to version to allow for finer control. The capabilities of the C API in this regard reflect those that were available in Excel version 5, and so do not give you the same control that you have using VBA in more recent versions.
Most frequently used when Excel is in manual calculation mode, these methods allow selective calculation of workbooks, worksheets, and ranges, complete recalculation of all open workbooks, and even complete rebuild of the dependency tree and calculation chain. " |
1/28/2009 8:58:48 PM |
LimpyNuts All American 16859 Posts user info edit post |
The problem is you're going about this all wrong.
First of all, it would help if you knew how date codes work.
1 = January 1 1900 0:00 hours. 1.5 = January 1, 1900 12:00 hours.
In fact every date and time is stored in Excel as "the number of days since January 0 1900".
10/25/2008 is 39746. Thereby, to test if the day of a given date is 10/25/2008, you use "Int(Cell)=39746". Or if you don't like that, you can use Date(2008,10,25), which returns 39746.
I created a random dataset covering every 5 minutes for a year (just over 100,000 rows). The problem with your formula is that it scans the length of the dataset 4 times per evaluation. Once to evaluate Month(), once to evaluate Day(), once for the result of IF() and once (through the pared dataset) to calculate the average. Well if you've got one of these formulae for every day of the year, then yeah it's going to take an eternity to calculate.
Replacing your formula with: {=AVERAGE( IF( INT(DateTimeRange)=CellWithDateOfInterest, RangeWithResultsToAverage) )}
Speeds up the calculation by probably a factor of 2 or more. But it's still scanning the dataset for every single date I want an average for.
But guess what! The results you want (sum, count, average, max, min, product, stdev, variance) can be calculated with a SINGLE pass through the data. And guess what else! Excel has a built in tool to do just that. It's called a PivotTable. Using a PivotTable to calculate the average value in B2 for each individual day in 10 years (over 1,000,000 rows using 5 minute interval data) takes a mere fraction of a second.
Your "thousands" of "complex" formulae are the problem. They are calculated individually. If you have to use your method {function ( if ( condition , result ) )}, then "condition" should be replaced with a user function that performs all the necessary comparisons in a single pass through the data. But even then you'd be passing through the data many thousands of times. It may get 10 times as fast, but it will still be slow.
In the future, remember excel is not a database. Even though there happens to be an OLEDB driver for it that allows you to query a spreadsheet like a database table. 1/29/2009 9:11:51 AM |
|