<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: memcopy, memmove, and Speed over Safety</title>
	<atom:link href="http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/feed/" rel="self" type="application/rss+xml" />
	<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/</link>
	<description>my weblog.</description>
	<lastBuildDate>Wed, 08 Feb 2012 13:49:22 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: David</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-1354</link>
		<dc:creator>David</dc:creator>
		<pubDate>Wed, 08 Feb 2012 13:49:22 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-1354</guid>
		<description>I disagree: it is usually obvious when using memcpy that the memory will never overlap, and thus there is no danger.  The function is commonly used by others, so it is not making code harder to follow for the sake of efficiency.

If you are going to add inefficiency, then do it where mistakes actually commonly happen.  (E.g. buffer overruns, including strings, and pointers to memory that has not been allocated or has already been freed, including from bad error handling.)

Speed:  there are many different ways to do both memcpy and memmove in the real world, so one situation doesn&#039;t mean much.  Intel processors, for example, have dedicated memcpy-type instructions, and sometimes the SSE instruction set.  Most modern processors can move 4 bytes at a time much faster than 1 byte at a time.</description>
		<content:encoded><![CDATA[<p>I disagree: it is usually obvious when using memcpy that the memory will never overlap, and thus there is no danger.  The function is commonly used by others, so it is not making code harder to follow for the sake of efficiency.</p>
<p>If you are going to add inefficiency, then do it where mistakes actually commonly happen.  (E.g. buffer overruns, including strings, and pointers to memory that has not been allocated or has already been freed, including from bad error handling.)</p>
<p>Speed:  there are many different ways to do both memcpy and memmove in the real world, so one situation doesn&#8217;t mean much.  Intel processors, for example, have dedicated memcpy-type instructions, and sometimes the SSE instruction set.  Most modern processors can move 4 bytes at a time much faster than 1 byte at a time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Santiago</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-964</link>
		<dc:creator>Santiago</dc:creator>
		<pubDate>Thu, 21 Jan 2010 04:10:10 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-964</guid>
		<description>If memory areas don&#039;t overlap, memmove should be just a call to memcpy, so both calls would have the same performance. Otherwise the compiler / standard libraries aren&#039;t optimized. But if both memory areas overlap, memmove may have to copy the memory backwards, which is always muuuch slower (memory caches and other parts of the hardware aren&#039;t optimized for that, and there is nothing we can do about that). However, as you said, memcpy could destroy the data, and that&#039;s not a viable option.

In summary, memmove and memcpy should be equally fast (otherwise, use another implementation), and the overhead of testing if memory overlaps is negligible (but memmove can never be faster in real scenarios). So I agree, memmove should always be preferred over memcpy.</description>
		<content:encoded><![CDATA[<p>If memory areas don&#8217;t overlap, memmove should be just a call to memcpy, so both calls would have the same performance. Otherwise the compiler / standard libraries aren&#8217;t optimized. But if both memory areas overlap, memmove may have to copy the memory backwards, which is always muuuch slower (memory caches and other parts of the hardware aren&#8217;t optimized for that, and there is nothing we can do about that). However, as you said, memcpy could destroy the data, and that&#8217;s not a viable option.</p>
<p>In summary, memmove and memcpy should be equally fast (otherwise, use another implementation), and the overhead of testing if memory overlaps is negligible (but memmove can never be faster in real scenarios). So I agree, memmove should always be preferred over memcpy.</p>
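<p>A minimal sketch of the overlap test described above, not any real library&#8217;s implementation. The name <code>sketch_memmove</code> is hypothetical, and comparing pointers through <code>uintptr_t</code> assumes a flat address space. It shows the check costing a couple of integer compares before falling through to plain <code>memcpy</code> when the regions are disjoint.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a libc function): copies n bytes, tolerating
   overlap, and dispatches to memcpy when the regions cannot overlap. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* Assumption: a flat address space, so integer comparison of
       addresses reflects the layout of the two regions. */
    uintptr_t da = (uintptr_t)dst;
    uintptr_t sa = (uintptr_t)src;

    if (n == 0 || da == sa)
        return dst;                       /* nothing to move */

    if (da < sa) {
        if (sa - da >= n)
            return memcpy(dst, src, n);   /* disjoint: plain memcpy */
        /* Destination starts first: a forward copy is safe. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        if (da - sa >= n)
            return memcpy(dst, src, n);   /* disjoint: plain memcpy */
        /* Destination starts later: copy backwards so each source
           byte is read before it is overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

<p>With this shape, the non-overlapping case pays only for the compares on top of <code>memcpy</code>; the byte loops here are deliberately naive and stand in for whatever tuned copy a real library would use.</p>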
]]></content:encoded>
	</item>
	<item>
		<title>By: Vincent Gable</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-889</link>
		<dc:creator>Vincent Gable</dc:creator>
		<pubDate>Wed, 18 Nov 2009 10:44:39 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-889</guid>
		<description>&lt;blockquote&gt;In your example you run &lt;code&gt;memcpy()&lt;/code&gt; on uncached memory, &lt;code&gt;memmove()&lt;/code&gt; afterwards so it can run on cached memory.&lt;/blockquote&gt;
Yes. I guess I didn&#039;t point that out clearly enough in comment #2 and the article. &lt;b&gt;The benchmark isn&#039;t a measure of real-world &lt;code&gt;memmove()&lt;/code&gt; vs. &lt;code&gt;memcpy()&lt;/code&gt; performance.&lt;/b&gt;

What is interesting about that benchmark, and why I left it in the article, is that it shows how &lt;i&gt;small&lt;/i&gt; (and unpredictable) the performance gains from using the less-safe &lt;code&gt;memcpy()&lt;/code&gt; generally are. Here I am copying 4194304 bytes of data a hundred times, and changing functions has less of an impact on performance than changing the order they are called in. (The term &quot;premature optimization&quot; comes to mind…)

In some cases using &lt;code&gt;memcpy()&lt;/code&gt; matters, but not, I think, in most cases.

&lt;blockquote&gt;
For fairness you should iterate at least once over both of the memory arrays (and write something to every byte in it) before you start any benchmarking.&lt;/blockquote&gt;
Yes, I need to warm up the caches. Or make sure they both start cold. But that&#039;s not as easy as just touching every byte once! Each array is twice the size of my L2 cache, so if I wrote to every byte of A, then ditto for B, I&#039;d only have the last half of B in cache. That&#039;s still not fair!</description>
		<content:encoded><![CDATA[<blockquote><p>In your example you run <code>memcpy()</code> on uncached memory, <code>memmove()</code> afterwards so it can run on cached memory.</p></blockquote>
<p>Yes. I guess I didn&#8217;t point that out clearly enough in comment #2 and the article. <b>The benchmark isn&#8217;t a measure of real-world <code>memmove()</code> vs. <code>memcpy()</code> performance.</b></p>
<p>What is interesting about that benchmark, and why I left it in the article, is that it shows how <i>small</i> (and unpredictable) the performance gains from using the less-safe <code>memcpy()</code> generally are. Here I am copying 4194304 bytes of data a hundred times, and changing functions has less of an impact on performance than changing the order they are called in. (The term &#8220;premature optimization&#8221; comes to mind…)</p>
<p>In some cases using <code>memcpy()</code> matters, but not, I think, in most cases.</p>
<blockquote><p>
For fairness you should iterate at least once over both of the memory arrays (and write something to every byte in it) before you start any benchmarking.</p></blockquote>
<p>Yes, I need to warm up the caches. Or make sure they both start cold. But that&#8217;s not as easy as just touching every byte once! Each array is twice the size of my L2 cache, so if I wrote to every byte of A, then ditto for B, I&#8217;d only have the last half of B in cache. That&#8217;s still not fair!</p>
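<p>A rough sketch of what a fairer timing harness might look like, not the benchmark from the article. The names <code>touch_buffers</code>, <code>time_copy</code>, and <code>copy_fn</code> are made up for illustration. Touching both buffers before each timed run only evens out first-touch costs; as noted above, it cannot keep buffers larger than the cache fully warm.</p>

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature of memcpy and memmove, so either can be passed in. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical helper: write to every byte of both buffers so that
   neither copy pays the first-touch cost inside the timed region. */
static void touch_buffers(unsigned char *a, unsigned char *b, size_t n)
{
    memset(a, 0xA5, n);
    memset(b, 0x5A, n);
}

/* Hypothetical helper: times `reps` copies of n bytes from src to dst
   and returns the elapsed CPU time in seconds, as reported by clock(). */
double time_copy(copy_fn fn, unsigned char *dst, unsigned char *src,
                 size_t n, int reps)
{
    touch_buffers(dst, src, n);   /* same starting state for every run */

    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

<p>One way to use it would be to call <code>time_copy(memcpy, ...)</code> and <code>time_copy(memmove, ...)</code> in both orders and compare the pairs, since the ordering effect is exactly what skewed the original numbers.</p>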
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Strainer</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-888</link>
		<dc:creator>Tom Strainer</dc:creator>
		<pubDate>Wed, 18 Nov 2009 10:19:25 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-888</guid>
		<description>In your example you run memcpy() on uncached memory, memmove() afterwards so it can run on cached memory.
For fairness you should iterate at least once over both of the memory arrays (and write something to every byte in it) before you start any benchmarking.

Tom S.</description>
		<content:encoded><![CDATA[<p>In your example you run memcpy() on uncached memory, memmove() afterwards so it can run on cached memory.<br />
For fairness you should iterate at least once over both of the memory arrays (and write something to every byte in it) before you start any benchmarking.</p>
<p>Tom S.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vincent Gable</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-854</link>
		<dc:creator>Vincent Gable</dc:creator>
		<pubDate>Sun, 18 Oct 2009 06:50:06 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-854</guid>
		<description>&lt;blockquote&gt;I can’t think of any possible circumstances where memmove could be faster. It should degenerate into memcpy when the blocks don’t overlap. &lt;/blockquote&gt; Here&#039;s a hypothetical one. &lt;code&gt;memcpy&lt;/code&gt; is implemented in hand-tuned assembly, and optimized for large chunks of memory. &lt;code&gt;memmove&lt;/code&gt; is very simple, less than 6 lines of C (something like my implementation, but with &lt;code&gt;memcpy_{backwards,forward}&lt;/code&gt; inlined). &lt;i&gt;But&lt;/i&gt; the &lt;code&gt;memmove&lt;/code&gt; function is put &lt;code&gt;inline&lt;/code&gt; in a header.

Now imagine a loop that copies lots of little data structures.

Invoking &lt;code&gt;memcpy&lt;/code&gt; will incur the overhead of a function call, but &lt;code&gt;memmove&lt;/code&gt; won&#039;t, because it&#039;s inlined by the compiler. The result: the loop goes faster with &lt;code&gt;memmove&lt;/code&gt;. Even if we did inline &lt;code&gt;memcpy&lt;/code&gt;, its use of naked assembly would prevent the compiler from optimizing the function it was put in -- and that would probably be a bigger performance hit.

But I haven&#039;t actually &lt;i&gt;tested&lt;/i&gt; this scenario. And I believe that if you don&#039;t test, you don&#039;t know, when it comes to hardware-level optimizations.</description>
		<content:encoded><![CDATA[<blockquote><p>I can’t think of any possible circumstances where memmove could be faster. It should degenerate into memcpy when the blocks don’t overlap. </p></blockquote>
<p> Here&#8217;s a hypothetical one. <code>memcpy</code> is implemented in hand-tuned assembly, and optimized for large chunks of memory. <code>memmove</code> is very simple, less than 6 lines of C (something like my implementation, but with <code>memcpy_{backwards,forward}</code> inlined). <i>But</i> the <code>memmove</code> function is put <code>inline</code> in a header.</p>
<p>Now imagine a loop that copies lots of little data structures.</p>
<p>Invoking <code>memcpy</code> will incur the overhead of a function call, but <code>memmove</code> won&#8217;t, because it&#8217;s inlined by the compiler. The result: the loop goes faster with <code>memmove</code>. Even if we did inline <code>memcpy</code>, its use of naked assembly would prevent the compiler from optimizing the function it was put in &#8212; and that would probably be a bigger performance hit.</p>
<p>But I haven&#8217;t actually <i>tested</i> this scenario. And I believe that if you don&#8217;t test, you don&#8217;t know, when it comes to hardware-level optimizations.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-853</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Sat, 17 Oct 2009 15:51:36 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-853</guid>
		<description>A few months back, I spent a few days doing extensive benchmarks on various assembler versions of memcpy (x86, SSE w/ non-temporal writes, prefetching, etc), and found that it&#039;s not very difficult to beat the compiler if you&#039;re doing memcpy&#039;s of more than a few bytes at a time (especially if you use non-temporal writes). On an Intel P8400 w/ 3GB DDR2-1066, I&#039;d easily double the memory bandwidth (in the average case) over the best inlined-code that the MSVC compiler could generate, and in certain cases, the performance was an order of magnitude higher (especially if the working set is resident entirely in L1 cache).

memcpy is slow because compilers generally don&#039;t generate great code. If memmove is faster (especially 40% faster!), it&#039;s because the compiler probably isn&#039;t doing its job (excluding the aforementioned point about warming up the caches). I can&#039;t think of any possible circumstances where memmove could be faster. It should degenerate into memcpy when the blocks don&#039;t overlap. :\

memcpy makes perfect sense to choose as the &quot;natural&quot; method of copying memory: in my experience, most memcpy&#039;s generally occur on two buffers that can&#039;t possibly overlap, and so memcpy is the natural, most sensible, and definitely the best choice for doing the copy. We can all agree that more constraints == more possible optimisation, even if the compiler doesn&#039;t always do it this way. :) Using memmove in these cases wouldn&#039;t be wrong, but it also wouldn&#039;t be good practice. Not much point in adding code to handle situations that can&#039;t possibly occur (excluding, of course, cosmic ray interference).

However! After working at Ericsson in embedded development, I tend to agree with at least considering memmove. We have a lot of subtle problems with memory corruption that could be solved by this, and in a worryingly high number of places, we also have _really_ dubious reinventions of the memmove wheel:

&gt; if ((uint32)buf1 + size  memcpy()
&gt; else    
&gt; strange_memmove_style_code

I guess the obvious bottom line if you&#039;re using C/C++ and using memcpy/memmove is: if you can&#039;t use them correctly, then it&#039;s time to move to an easier language. Or time to start having less coffee and more sleep. ;)</description>
		<content:encoded><![CDATA[<p>A few months back, I spent a few days doing extensive benchmarks on various assembler versions of memcpy (x86, SSE w/ non-temporal writes, prefetching, etc), and found that it&#8217;s not very difficult to beat the compiler if you&#8217;re doing memcpy&#8217;s of more than a few bytes at a time (especially if you use non-temporal writes). On an Intel P8400 w/ 3GB DDR2-1066, I&#8217;d easily double the memory bandwidth (in the average case) over the best inlined-code that the MSVC compiler could generate, and in certain cases, the performance was an order of magnitude higher (especially if the working set is resident entirely in L1 cache).</p>
<p>memcpy is slow because compilers generally don&#8217;t generate great code. If memmove is faster (especially 40% faster!), it&#8217;s because the compiler probably isn&#8217;t doing its job (excluding the aforementioned point about warming up the caches). I can&#8217;t think of any possible circumstances where memmove could be faster. It should degenerate into memcpy when the blocks don&#8217;t overlap. :\</p>
<p>memcpy makes perfect sense to choose as the &#8220;natural&#8221; method of copying memory: in my experience, most memcpy&#8217;s generally occur on two buffers that can&#8217;t possibly overlap, and so memcpy is the natural, most sensible, and definitely the best choice for doing the copy. We can all agree that more constraints == more possible optimisation, even if the compiler doesn&#8217;t always do it this way. :) Using memmove in these cases wouldn&#8217;t be wrong, but it also wouldn&#8217;t be good practice. Not much point in adding code to handle situations that can&#8217;t possibly occur (excluding, of course, cosmic ray interference).</p>
<p>However! After working at Ericsson in embedded development, I tend to agree with at least considering memmove. We have a lot of subtle problems with memory corruption that could be solved by this, and in a worryingly high number of places, we also have _really_ dubious reinventions of the memmove wheel:</p>
<p>&gt; if ((uint32)buf1 + size  memcpy()<br />
&gt; else<br />
&gt; strange_memmove_style_code</p>
<p>I guess the obvious bottom line if you&#8217;re using C/C++ and using memcpy/memmove is: if you can&#8217;t use them correctly, then it&#8217;s time to move to an easier language. Or time to start having less coffee and more sleep. ;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vincent Gable</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-429</link>
		<dc:creator>Vincent Gable</dc:creator>
		<pubDate>Sat, 17 Jan 2009 08:28:10 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-429</guid>
		<description>Thanks Eric, that really shows how variable low-level optimizations can be.  I would not have guessed that something would be faster with no optimizations and full optimizations, but slower with &lt;i&gt;some&lt;/i&gt; optimizations.

The for-loop ordering thing sounds like bad benchmarking on my part.  I should have touched the data first to warm up any caches.  I suspect the first iteration of the first loop to run may be taking much longer while data is fetched into caches. That&#039;s why timing &lt;code&gt;memmove&lt;/code&gt; first made it slower than &lt;code&gt;memcpy&lt;/code&gt;.  Mea culpa.

Also, it occurred to me that since &lt;code&gt;restrict&lt;/code&gt; is now a keyword in C99, and the compiler might be able to help out a little more with making sure &lt;code&gt;memcpy&lt;/code&gt; is being used correctly, &lt;code&gt;memcpy&lt;/code&gt; is a less dangerous default choice today.

But it&#039;s still telling, and perhaps disturbing, that &lt;code&gt;memcpy&lt;/code&gt; is the cultural choice.</description>
		<content:encoded><![CDATA[<p>Thanks Eric, that really shows how variable low-level optimizations can be.  I would not have guessed that something would be faster with no optimizations and full optimizations, but slower with <i>some</i> optimizations.</p>
<p>The for-loop ordering thing sounds like bad benchmarking on my part.  I should have touched the data first to warm up any caches.  I suspect the first iteration of the first loop to run may be taking much longer while data is fetched into caches. That&#8217;s why timing <code>memmove</code> first made it slower than <code>memcpy</code>.  Mea culpa.</p>
<p>Also, it occurred to me that since <code>restrict</code> is now a keyword in C99, and the compiler might be able to help out a little more with making sure <code>memcpy</code> is being used correctly, <code>memcpy</code> is a less dangerous default choice today.</p>
<p>But it&#8217;s still telling, and perhaps disturbing, that <code>memcpy</code> is the cultural choice.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Wing</title>
		<link>http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/comment-page-1/#comment-426</link>
		<dc:creator>Eric Wing</dc:creator>
		<pubDate>Fri, 16 Jan 2009 21:16:58 +0000</pubDate>
		<guid isPermaLink="false">http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety/#comment-426</guid>
		<description>I played with your benchmark a little. If I turn on basic optimization (e.g. -O1 or -O2, etc) I can get different results.

Also, if I swap the order of the for-loops (i.e. do memmove first and memcpy second), I can also get different results.

Combining the two things, I always get memcpy as the faster run.

(Tested on Intel Core 2 Duo iMac).</description>
		<content:encoded><![CDATA[<p>I played with your benchmark a little. If I turn on basic optimization (e.g. -O1 or -O2, etc) I can get different results.</p>
<p>Also, if I swap the order of the for-loops (i.e. do memmove first and memcpy second), I can also get different results.</p>
<p>Combining the two things, I always get memcpy as the faster run.</p>
<p>(Tested on Intel Core 2 Duo iMac).</p>
]]></content:encoded>
	</item>
</channel>
</rss>

