
Processing a poorly formatted large text file with SSIS February 28, 2012

Posted by furrukhbaig in SQL Server 2005, SSIS.

I have been working on a project where I process large comma-separated text files (2 GB, containing 20 million rows) whose last column is an open-format value that can itself contain commas. The Flat File Source alone cannot handle this, so I had to use a Script Component to deal with the free-text value in the last column.

First you need to define a connection, as in the following connection manager dialog.

[Image: Flat File Connection Manager – General page]

Then click Columns in the left pane. The trick here is to select anything other than CRLF as the column delimiter, go back to the first screen by clicking General, and then come back to this screen.

You will now see only one column containing the whole row. Note that you do need to move away from this screen after selecting another delimiter in order to see one column containing the whole row.

[Image: Flat File Connection Manager – Columns page showing a single column]

Now click Advanced and name your column anything you like; I am using the column name Line, as it contains the whole line from the text file. Make sure you select the String (DT_STR) or Unicode String (DT_WSTR) data type with the maximum possible row length.

[Image: Flat File Connection Manager – Advanced page with the Line column]

Now drag a Data Flow Task onto the Control Flow and double-click it.

Next, drag a Flat File Source component onto the Data Flow designer.

Select the connection we created in the steps above for the Flat File Source component.

The Flat File Source should show only one column, as defined in the connection manager.

Now drag a Script Component onto the Data Flow designer; a dialog will ask how the component is going to be used. Select Transformation here.

Click Inputs and Outputs. The input should already have the one column defined in the Flat File Source. Add columns to the output. I have added 11 columns and, for demonstration purposes, changed their names. Note the "Untranslatable" column, which will contain the open-format value. Be careful with the column data types and allow enough length for the values you expect.

[Image: Script Transformation Editor – Inputs and Outputs page]

Now it's time to write some code in the Script Component.

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Split on commas, but produce at most 11 tokens: the 11th token
        // keeps the remainder of the line, embedded commas and all.
        char[] delimiter = ",".ToCharArray();

        // Assumes every line has at least 10 commas (11 fields).
        string[] tokenArray = Row.Line.Split(delimiter, 11);

        Row.Column1 = tokenArray[0];
        Row.Column2 = tokenArray[1];
        Row.Column3 = tokenArray[2];
        Row.Column4 = tokenArray[3];
        Row.Column5 = tokenArray[4];
        Row.Column6 = tokenArray[5];
        Row.Column7 = tokenArray[6];
        Row.Description1 = tokenArray[7];
        Row.Description2 = tokenArray[8];
        Row.Description3 = tokenArray[9];

        // Everything after the 10th comma, including any embedded commas, lands here.
        Row.Untranslatable = tokenArray[10];
    }

Now, let me say that I am not an expert in C# or .NET, but I know how to get the work done 😉

The tricky part is the Split function, which does the magic. Note the second parameter, the maximum number of array elements expected (the number of columns). This is vital; otherwise Split will create a variable number of array elements depending on how many times the delimiter occurs in the line.

Our target is 11 columns, so if the delimiter occurs more than 10 times in one row, Split appends the rest of the line to the last element.

Now you can add the OLE DB Destination. Your data flow should look like the screen below.

[Image: Data flow – Flat File Source, Script Component, OLE DB Destination]

I would appreciate any comments, but this code processed the 2 GB text file in less than a minute, loading 20 million rows into the table.

Thanks

GO – Run a batch multiple times March 26, 2009

Posted by furrukhbaig in SQL Server 2005.

Today I came across a post on Kalen Delaney's blog that uses an undocumented feature of the GO statement (the batch terminator).

You can run a batch multiple times without re-running it manually or using a WHILE loop.

The following code is taken from the original post; see the link below.

So, to run the INSERT 5 times, I would do this:

INSERT INTO details
(SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID,
UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
        SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID,
             UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate
        FROM Sales.SalesOrderDetail
GO 5

http://sqlblog.com/blogs/kalen_delaney/archive/2007/07/22/did-you-know-run-a-batch-multiple-times.aspx
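
As a minimal, self-contained illustration of the same trick (the temp table here is made up purely for the demo):

-- Create the demo table in its own batch
CREATE TABLE #demo (id INT IDENTITY(1,1), created DATETIME NOT NULL DEFAULT GETDATE());
GO

-- The batch below is executed 5 times by GO 5
INSERT INTO #demo DEFAULT VALUES;
GO 5

SELECT COUNT(*) AS rows_inserted FROM #demo;  -- returns 5
DROP TABLE #demo;
GO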

Rowcount for Large Tables July 24, 2008

Posted by furrukhbaig in DMV's, Performance, SQL Server 2005, SQL Server 2005 features, Tips, TSQL.

Ever wondered why a simple statement like SELECT COUNT(*) FROM [Table_Name] takes forever to return the row count on a large table? It's because it uses a full table scan to count the rows. The instant way to get the row count for any table is to query the new Dynamic Management View (DMV) in SQL Server 2005, sys.dm_db_partition_stats. The DMV contains row counts and page counts for every table across all of its partitions. Note that even if you did not partition your table, it is still created on a single default partition, so this covers all tables.

Following are some useful queries that return the row count almost instantly. In SQL Server 2000, the sysindexes system table is used to get the row count for a large table and avoid a full table scan; a sketch of that approach appears after the queries below.

Comments are welcome.

-- All the tables having ZERO rows
-- Compatible with partitioned tables
SELECT
    Table_Name = OBJECT_NAME(object_id),
    Total_Rows = SUM(st.row_count)
FROM sys.dm_db_partition_stats st
GROUP BY OBJECT_NAME(object_id)
HAVING SUM(st.row_count) = 0
ORDER BY OBJECT_NAME(object_id)

-- Row count without doing a full table scan
-- This will return the total number of rows across all partitions (if the table is partitioned)
SELECT
    Total_Rows = SUM(st.row_count)
FROM sys.dm_db_partition_stats st
WHERE OBJECT_NAME(object_id) = 'Your_Table_Name_Goes_Here'
AND (index_id < 2) -- only the heap or the clustered index
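
For comparison, the SQL Server 2000 style lookup against sysindexes mentioned above looks roughly like this (table name is a placeholder; the count is approximate, since sysindexes is not always perfectly up to date):

-- SQL Server 2000: approximate row count from sysindexes
SELECT [rows] AS Total_Rows
FROM sysindexes
WHERE id = OBJECT_ID('Your_Table_Name_Goes_Here')
AND indid < 2 -- heap (0) or clustered index (1)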

SSMS slow startup March 7, 2008

Posted by furrukhbaig in Performance, slow start, SQL Server 2005, SSMS, Tips.

A friend of mine asked me why SSMS was taking up to a minute to start on his office machine, and I was not able to figure out the problem until I found this post on Euan Garden's blog.

http://blogs.msdn.com/euanga/archive/2006/07/11/662053.aspx

The quick fix is to add the following lines to the hosts file:

# entry to get around diabolical Microsoft certificate checks
# which slow down non internet connected computers
127.0.0.1 crl.microsoft.com

This is to do with a certificate validity check. It fixed the problem on my friend's machine. Hope it helps others too.

Enjoy !!

Is your code really SET based ? January 30, 2008

Posted by furrukhbaig in BEST PRACTICE, CROSS JOIN, Execution Plan, Performance, Performance Tuning, RBAR, SET BASED, SQL Server 2005, SQL Server 2005 features, Tips, Triangular Join, TSQL.

It's a well-known fact that the best practice is to write set-based code to get better performance. But there is no absolute definition of "set based", and many people think that set-based code is anything except CURSORs and LOOPs. Believe me, that is not true.

I have been meaning to write about this topic for a while, and just today I came across a very useful article that explains exactly what I was trying to say. It also explains triangular and cross joins and a (for me at least) new term, "RBAR". It's interesting.

http://www.sqlservercentral.com/articles/T-SQL/61539/
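
To illustrate the kind of hidden row-by-row work the article talks about, here is a sketch of a triangular join: a running total computed with a self-join on a hypothetical Orders table. It contains no cursor or loop, yet it is RBAR in disguise.

-- Hypothetical table: dbo.Orders(OrderID INT, Amount MONEY)
-- For each order, join to every earlier order: roughly N*N/2 row combinations,
-- so 10,000 rows means about 50 million rows read.
SELECT o1.OrderID,
       RunningTotal = SUM(o2.Amount)
FROM dbo.Orders o1
JOIN dbo.Orders o2 ON o2.OrderID <= o1.OrderID
GROUP BY o1.OrderID;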

Enjoy !!

Find out what's running on SQL Server 2005 December 13, 2007

Posted by furrukhbaig in DMV's, Execution Plan, Optimize, Performance, Performance Tuning, Profiler, Queries running on the server, SQL Server 2005, SQL Server 2005 features, XML Execution Plan.

Everyone wants to know what's running on the box. Especially if your job is to stabilise a server, you are always concerned about what is killing the box.

A friend of mine has published a very useful post with scripts to find out what's running on SQL Server, along with the execution plans.

http://www.proteanit.com/b/2007/01/22/a-useful-script-to-analyse-current-activity-on-your-box/

The same information can also be captured by running Profiler and collecting the XML execution plan event, which is not always possible due to security issues and the overhead of Profiler itself.
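
For reference, here is a minimal sketch along the same lines (not the script from the linked post) that lists currently executing requests with their statement text and XML plan; it assumes you have VIEW SERVER STATE permission:

SELECT r.session_id,
       r.status,
       r.wait_type,
       t.text       AS statement_text,
       p.query_plan AS xml_plan
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) p
WHERE r.session_id > 50; -- skip most system sessions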

Index Fragmentation and dm_db_index_physical_stats December 12, 2007

Posted by furrukhbaig in DMV's, dm_db_index_physical_stats, Fragmentation, Index Fragmentation, Index tuning, Index usefulness, Indexes, Optimize, Performance, Performance Tuning, SQL Server 2005, SQL Server 2005 features, Tips.

Fragmentation can cause serious performance issues, and it's vital to understand how it impacts performance before I tell you how to fix it.

Fragmentation can cause slow-running queries due to more disk IO than usual, which is itself caused by page splitting. Fragmentation does not only hurt query performance; it can also slow down write operations such as INSERT, DELETE and UPDATE.

SQL Server writes data on 8 KB data pages. This limit can be reached very quickly for wide tables storing lots of varchar/nvarchar columns (nvarchar takes 2 bytes to store 1 character), which causes requests for more data pages. When SQL Server fills a whole page, it allocates another page to store more data. If the table has a clustered index, the data must be stored in sorted order, because the clustered index enforces the physical order of data according to the clustering key; that is why you can have only one clustered index per table.

Now assume you have a compound clustered index on the CountryID and CustomerNo columns of a transactional table that receives 10 transactions per minute for an online car rental portal. The clustered index will make sure the data is stored in sorted order on CountryID and CustomerNo. With millions of rows in the table, consider what happens when you receive a transaction row that needs to be inserted on page 1 (CountryID = 1, CustomerNo = 100). The insert requires some room on the first data page (depending on the FILLFACTOR; the default is 100%). If the target data page does not have enough room for the row to fit, rows on the page must be moved forward (called a page split) to make room for the new row in sorted order. This can cause a lot of page splits and fragmentation, because the new pages allocated by page splits on existing data pages are not guaranteed to be in sequence with the other pages. This can seriously hurt performance for data modifications (due to page splits) and data reads (due to out-of-order pages).

dm_db_index_physical_stats can be used to check fragmentation.

SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(N'YourDB'), OBJECT_ID(N'dbo.CarRental'), NULL, NULL, 'DETAILED');

The above SQL returns fragmentation information for the given database name (parameter 1) and table name (parameter 2). The last parameter specifies the mode of analysis.
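
If you only want the headline numbers, a narrower query along these lines (same hypothetical database and table as above) is easier to read than SELECT *:

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(N'YourDB'), OBJECT_ID(N'dbo.CarRental'), NULL, NULL, 'LIMITED') ps
JOIN sys.indexes i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;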

It is good practice to create clustered indexes on an ever-increasing value (e.g. an IDENTITY column) to avoid page splits and out-of-order pages.
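
For instance, a minimal sketch of that practice (the table is made up):

-- An ever-increasing clustering key: new rows always go at the end of the table,
-- so pages fill in order and mid-table page splits are avoided.
CREATE TABLE dbo.CarRentalDemo
(
    RentalID   INT IDENTITY(1,1) NOT NULL,
    CountryID  INT NOT NULL,
    CustomerNo INT NOT NULL,
    RentalDate DATETIME NOT NULL,
    CONSTRAINT PK_CarRentalDemo PRIMARY KEY CLUSTERED (RentalID)
);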

Statement Level Recompile August 22, 2007

Posted by furrukhbaig in Performance, RECOMPILE hint, SQL Server 2005, stored procedure, TSQL.

In SQL Server 2000, the unit of compilation was the whole stored procedure. So even if you wanted just one particular query to be recompiled, you couldn't request it. If you created the stored procedure with the RECOMPILE option, the whole procedure went through recompilation every time you invoked it.

SQL Server 2005 supports statement-level recompile. Instead of having all queries in the stored procedure recompiled, SQL Server can now recompile individual statements. You're provided with a new RECOMPILE query hint that allows you to explicitly request a recompilation of a particular query. This way, other queries can benefit from reusing previously cached execution plans if there's no reason to recompile them every time the stored procedure is invoked.
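
As a quick illustration, the hint goes at the end of the individual statement that should be recompiled on every execution; the procedure and table names below are made up.

CREATE PROCEDURE dbo.uspGetRecentOrders (@CustomerID INT)
AS
-- Only this statement is recompiled on each call; any other statements in the
-- procedure keep reusing their cached plans.
SELECT OrderID, OrderDate, TotalDue
FROM dbo.Orders
WHERE CustomerID = @CustomerID
OPTION (RECOMPILE);
GO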

Parameter Sniffing and OPTIMIZE FOR August 22, 2007

Posted by furrukhbaig in Optimize, Parameter Sniffing, Performance, Query hint, SQL Server 2005, stored procedure, TSQL.

Parameter sniffing refers to a process whereby SQL Server's execution environment "sniffs" the parameter values during the first invocation and passes them along to the query optimizer so that they can be used to generate optimal query execution plans.

"First invocation" also refers to the first invocation after a plan was removed from cache for lack of reuse or for any other reason. The optimizer "knows" what the values of the input parameters are, and it generates an adequate plan for those input parameters. SQL Server internally maintains the statistics and distribution of the values in the columns used for filtering.

While parameter sniffing is certainly a powerful feature, it can cause problems when a procedure’s plan happens to have been kicked out of the procedure cache (or was never in it) just prior to the procedure being called with atypical parameter values. This can result in a plan that is skewed toward atypical use, one that is not optimal when called with typical values. Since, once cached, a query plan can be reused for parameter values that vary widely, the ideal situation is to have a plan in the cache that covers the typical usage of the procedure as much as possible. If a plan makes it into the cache that is oriented toward atypical parameter values, it can have a devastating effect on performance when executed with typical values.

An example would probably help here. Suppose we had a stored procedure that returns sales data by country. In our case, three-fourths of our sales is in the UK. The procedure takes a single parameter, @country, indicating the country for which to return sales info. It uses this parameter to filter a simple SELECT statement that returns the requested sales data.

CREATE PROCEDURE uspGetCountrySale
(@Country Varchar(50))
AS 
SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.SaleOrders
WHERE Country = @Country 
GO

The optimizer would most likely choose to do a clustered index scan when creating the execution plan for this query because (given that "UK" would normally be passed in for @Country) so much of the table would be traversed anyway that scanning it would require less I/O and be faster than repeated nonclustered index lookups. However, what happens if the plan happens to have been kicked out of the cache (let's say due to an auto-statistics update) just prior to a user calling it with, say, "Spain", where we have almost no sales? Assuming a suitable index exists, the optimizer may decide to use a nonclustered index seek in the new query plan. Subsequent executions of the procedure would reuse this plan, even if they passed in "UK" for @Country. This could result in performance that is much slower than the scan-based plan.

As a workaround prior to SQL Server 2005, local variables can be used instead of stored procedure parameters. Note that SQL Server cannot sniff the value of a local variable, which leads it to use the statistics on the filter column and create a plan that suits the average values in that column. This can still cause serious performance problems when the procedure is called with an atypical value, but it will do its best for typical values.

CREATE PROCEDURE uspGetCountrySale
(@Country Varchar(50))
AS 
DECLARE @_Country Varchar(50)
SET @_Country = @Country 
SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.SaleOrders
WHERE Country = @_Country 
GO
 
 

There's a new query hint provided in SQL Server 2005 to tackle the problem: the OPTIMIZE FOR query hint. This hint allows you to provide SQL Server with a literal that reflects the selectivity of the variable, in case the input is typical. For example, if you know that the variable will typically end up with a highly selective value, you can provide a literal that reflects that typical value:

 
CREATE PROCEDURE uspGetCountrySale (@Country Varchar(50))
AS 
DECLARE @_Country Varchar(50)
SET @_Country = @Country 
SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.SaleOrders
WHERE Country = @_Country
OPTION (OPTIMIZE FOR(@_Country = 'UK')); 
GO

 

 

Stored Procedures !! Some Facts you should know August 22, 2007

Posted by furrukhbaig in Performance, SQL Server 2005, TSQL.

Most of us are familiar with stored procedures and use them on a daily basis. But there are some facts that can cause performance issues, and here I just want to shed some light on them.

  • Always use a qualified name when calling sprocs, for example EXEC dbo.Your_Proc

This is a very common mistake that causes an extra trip to the procedure cache to get the execution plan. SQL Server compiles a stored procedure on its first execution and stores the execution plan in the procedure cache so it can be reused by subsequent calls to the same sproc. To find the plan it needs the qualified stored procedure name, e.g. dbo.My_Proc (the fully qualified name is Server.Database.Owner.My_Proc). When the owner name is not specified, the initial cache lookup by name fails. SQL Server then acquires an exclusive compile lock on the stored procedure and all referenced objects, including tables, for recompilation. Next, SQL Server resolves the object name to an object ID and, before compiling, makes another trip to the procedure cache using that object ID, which can find the previously cached execution plan. But as you will have noticed, this can cause blocking where many SPIDs are calling the same sproc frequently: while it is locked for compilation, all callers have to wait until SQL Server either finds the execution plan in the cache or recompiles the sproc.

So the rule of thumb is: always qualify your objects (sprocs, tables, views, functions) with the owner name.

This behaviour can be captured with Profiler by tracing the following events.

The SP:CacheMiss event occurs when the cache lookup by name fails. The subsequent SP:ExecContextHit indicates that a matching cached plan was ultimately found in cache once the ambiguous object name had been resolved to an object ID. Depending on the circumstances, SP:CacheHit may appear in place of SP:ExecContextHit.

For more detail, see http://support.microsoft.com/kb/263889

  • SET NOCOUNT ON

Not setting NOCOUNT ON causes extra network traffic, because SQL Server returns a "rows affected" message for every statement, and this can have a serious impact on performance when the sproc is called frequently. A minimal sketch of where the setting goes appears at the end of this post.

  • Don't use the sp_ prefix in stored procedure names

The sp_ prefix is reserved for system stored procedures, so any stored procedure whose name starts with sp_ causes an extra lookup in the master database. Note also that if a stored procedure in a user database has the same name as a system stored procedure in master, the one in the user database will never be executed, because SQL Server always looks in master first and will execute that one rather than the one in the user database.

  • Avoid using temp tables and DDL statements

These can cause a stored procedure recompile when a temp table created inside the stored procedure is referenced for the first time within it. Because statistics for the temp table do not yet exist, the optimizer cannot reuse or create an execution plan for queries that use a temp table created within the stored procedure. The same applies to DDL statements, as they also force the stored procedure to recompile.

When a batch is recompiled in SQL Server 2000, all of the statements in the batch are recompiled, not just the one that triggered the recompilation. SQL Server 2005 improves upon this behavior by compiling only the statement that caused the recompilation, not the entire batch. This “statement-level recompilation” feature will improve SQL Server 2005’s recompilation behavior when compared to that of SQL Server 2000. In particular, SQL Server 2005 spends less CPU time and memory during batch recompilations, and obtains fewer compile locks.
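
Going back to the SET NOCOUNT ON point above, here is a minimal sketch of where the setting goes (the procedure and tables are made up):

CREATE PROCEDURE dbo.uspTouchOrder (@OrderID INT)
AS
SET NOCOUNT ON; -- suppress the per-statement "rows affected" messages

UPDATE dbo.Orders     SET LastTouched = GETDATE() WHERE OrderID = @OrderID;
UPDATE dbo.OrderLines SET LastTouched = GETDATE() WHERE OrderID = @OrderID;
GO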